Clean invalid characters from data held in a Spark RDD

Dave Poole

I have a PySpark RDD imported from JSON files. The data elements contain a number of values with undesirable characters. For the sake of argument, assume that only characters in string.printable should appear in those JSON files.

Given that a large number of elements contain text information, I have been trying to find a way of mapping the incoming RDD through a function that cleans the data and returns a cleansed RDD as output. I can find ways of printing a cleansed element from the RDD, but not of cleaning the entire collection of elements and returning them as an RDD.

An example document is shown below. Undesirable characters might creep into the userAgent, marketingReference and pageTags elements, or indeed any of the text elements.

{
    "documentId": "abcdef12-1234-5678-fedc-cba9876543210",
    "documentType": "contentSummary",
    "dateTimeCreated": "2017-01-01T03:00:22.478Z",
    "body": {
        "requestUrl": "http://www.our-web-site.com/en-gb/line-of-business/product-category/irritating-guid/",
        "requestMethod": "GET",
        "responseCode": "200",
        "userAgent": "Mozilla/5.0 etc",
        "requestHeaders": {
            "connection": "close",
            "host": "www.our-web-site.com",
            "accept-language": "en-gb",
            "via": "1.1 www.our-web-site.com",
            "user-agent": "Mozilla/5.0 etc",
            "x-forwarded-proto": "https",
            "clientIp": "99.99.99.99",
            "referer": "http://www.our-web-site.com/en-gb/line-of-business/product-category/irritating-guid/",
            "accept-encoding": "gzip, deflate",
            "incap-client-ip": "99.99.99.99"
        },
        "body": {
            "pageId": "/content/our-web-site/en-gb/holidays/interstitial",
            "pageVersion": "1.0",
            "pageClassification": "product-page",
            "pageTags": "spark, python, rdd, other words",
            "MarketingReference": "BUYMEPLEASE",
            "referrer": "http://www.our-web-site.com/en-gb/line-of-business/product-category/irritating-guid/",
            "webSessionId": "abcdef12-1234-5678-fedc-cba9876543210"
        }
    }
}
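One way to approach this (a minimal sketch, not the author's actual solution): write a pure Python function that recursively walks a parsed document, stripping any character not in string.printable from every string it finds, and then map that function over the RDD. The file path and the `sc` SparkContext in the commented usage lines are hypothetical.

```python
import json
import string

# Build the whitelist once; membership tests on a set are O(1).
PRINTABLE = set(string.printable)


def clean_value(value):
    """Recursively remove characters not in string.printable.

    Strings are filtered; dicts and lists are walked so that nested
    elements (requestHeaders, body.body, etc.) are cleaned too.
    Other types (numbers, booleans, None) pass through unchanged.
    """
    if isinstance(value, str):
        return "".join(ch for ch in value if ch in PRINTABLE)
    if isinstance(value, dict):
        return {clean_value(k): clean_value(v) for k, v in value.items()}
    if isinstance(value, list):
        return [clean_value(v) for v in value]
    return value


# Hypothetical Spark usage: parse each JSON line, then clean each document.
# raw = sc.textFile("hdfs:///logs/content-summaries/*.json")
# cleansed = raw.map(json.loads).map(clean_value)
```

Because `clean_value` is a plain function with no driver-side state, Spark can serialise it and apply it on each partition, so the whole collection comes back as a cleansed RDD rather than one printed element at a time.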
Dave Poole

The underlying problem was trying to clean up data downstream when poor (or totally absent) data quality practices existed upstream.

Eventually it was accepted that we were attempting to address a symptom and not the cause. The cost of retrospectively fixing data was proven to be massively more than the cost of handling data properly in the first place.
