PySpark: read CSV, combine date and time columns, and filter based on them

Ehs4n

I have about 10,000 CSV files with 14 columns each. They contain data about a financial organization: trade values, a date, and a time.

Some of the CSV files contain only headers and no data. I managed to load all the CSV files onto my local Hadoop file system. What I want to achieve is to filter the data so that it includes only records occurring between 9 am and 6 pm.

How would I achieve this? I'm confused by lambda, filter, and everything else that exists in PySpark.

Could you show me how I can filter this and use the filtered data to do other analyses?

P.S. Winter time and summer time also need to be considered; I was thinking I should have some function to convert the times to UTC, perhaps?
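For reference, a minimal sketch of that UTC idea using Spark's built-in to_timestamp and to_utc_timestamp functions; the Europe/Berlin zone and the one-row example are assumptions, so substitute the zone the trades were actually recorded in:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical one-row example in the same Date/Time layout as the CSVs
df = spark.createDataFrame([('2018-05-08', '17:00')], ['Date', 'Time'])

# combine Date and Time into one timestamp, then shift from local time
# (the summer or winter offset is applied automatically) to UTC
df = df.withColumn('ts', F.to_timestamp(F.concat_ws(' ', 'Date', 'Time'), 'yyyy-MM-dd HH:mm'))\
.withColumn('ts_utc', F.to_utc_timestamp('ts', 'Europe/Berlin'))

df.show(truncate=False)  # 17:00 CEST on 2018-05-08 becomes 15:00 UTC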

Since my concern is filtering the data based on the Time column in my CSV files, I have simplified the CSVs. Let's say:

CSV 1 (Filter.csv):

ISIN,Currency,Date,Time
"1","EUR",2018-05-08,07:00
"2","EUR",2018-05-08,17:00
"3","EUR",2018-05-08,06:59
"4","EUR",2018-05-08,17:01

CSV 2 (NoFilter.csv):

ISIN,Currency,Date,Time
"1","EUR",2018-05-08,07:01
"2","EUR",2018-05-08,16:59
"3","EUR",2018-05-08,10:59
"4","EUR",2018-05-08,15:01

and my code is:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

ehsanLocationFiltered = 'hdfs://localhost:54310/user/oxclo/ehsanDbs/Filter.csv'
ehsanLocationNonFiltered = 'hdfs://localhost:54310/user/oxclo/ehsanDbs/NoFilter.csv'

df = sqlContext.read.format('com.databricks.spark.csv')\
.options(header='true', inferschema='true')\
.load(ehsanLocationNonFiltered)

dfFilter = sqlContext.read.format('com.databricks.spark.csv')\
.options(header='true', inferschema='true')\
.load(ehsanLocationFiltered)

data = df.rdd
dataFilter = dfFilter.rdd

# filter() returns a new RDD, so the result must be reassigned
data = data.filter(lambda row: row.Time > '07:00' and row.Time < '17:00')
dataFilter = dataFilter.filter(lambda row: row.Time > '07:00' and row.Time < '17:00')

print(data.count())
print(dataFilter.count())

I am expecting data.count() to return 4, since every time in NoFilter.csv falls inside the range, and dataFilter.count() to return 0, since no time in Filter.csv matches. (Comparing zero-padded HH:mm strings lexicographically does agree with chronological order, so the string comparison itself is sound.)

Thanks!

devesh

In your code you can use just 'csv' as the format; the CSV reader has been built into Spark since 2.0, so the com.databricks.spark.csv package is no longer needed:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

ehsanLocationFiltered = '/FileStore/tables/stackoverflow.csv'

data = sqlContext.read.format('csv')\
.options(header='true', inferschema='true')\
.load(ehsanLocationFiltered).rdd

# use filter(), not map(): map() yields a boolean per row, so count() would not change
result = data.filter(lambda row: row.Time > '07:00' and row.Time < '17:00')
print(result.count())
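If you prefer to stay in the DataFrame API instead of dropping to an RDD, the same filter can be written with column expressions. A minimal sketch under the same assumptions, reusing sqlContext and ehsanLocationFiltered from the snippet above:

from pyspark.sql import functions as F

df = sqlContext.read.format('csv')\
.options(header='true', inferschema='true')\
.load(ehsanLocationFiltered)

# zero-padded HH:mm strings compare correctly as plain strings
result = df.filter((F.col('Time') > '07:00') & (F.col('Time') < '17:00'))
print(result.count())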
