PySpark: read CSV, combine date and time columns, and filter based on them

Ehs4n

I have about 10,000 CSV files with 14 columns each. They contain data about a financial organization: trade values, a date, and a time.

Some of the CSV files contain only headers and no data. I managed to load all the CSV files onto my local Hadoop file system. What I want to achieve is to filter the data so that it includes only records occurring between 9 am and 6 pm.

How would I achieve this? I'm confused by lambda, filter, and everything else that exists in PySpark.

Could you show me how I can filter this and use the filtered data to do other analyses?

P.S. Winter time and summer time also need to be considered; I was thinking I should have some function to convert the times to UTC, perhaps?
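For reference, a minimal sketch of that UTC idea using Spark's built-in to_timestamp and to_utc_timestamp functions; the Europe/Berlin zone and the one-row example are assumptions, so substitute the zone the trades were actually recorded in:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical one-row example in the same Date/Time layout as the CSVs
df = spark.createDataFrame([('2018-05-08', '17:00')], ['Date', 'Time'])

# combine Date and Time into one timestamp, then shift from local time
# (the summer or winter offset is applied automatically) to UTC
df = df.withColumn('ts', F.to_timestamp(F.concat_ws(' ', 'Date', 'Time'), 'yyyy-MM-dd HH:mm'))\
.withColumn('ts_utc', F.to_utc_timestamp('ts', 'Europe/Berlin'))

df.show(truncate=False)  # 17:00 CEST on 2018-05-08 becomes 15:00 UTC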

Since my concern is filtering the data based on the Time column in my CSV files, I have simplified the CSVs. Let's say:

CSV 1 (Filter.csv):

ISIN,Currency,Date,Time
"1","EUR",2018-05-08,07:00
"2","EUR",2018-05-08,17:00
"3","EUR",2018-05-08,06:59
"4","EUR",2018-05-08,17:01

CSV 2 (NoFilter.csv):

ISIN,Currency,Date,Time
"1","EUR",2018-05-08,07:01
"2","EUR",2018-05-08,16:59
"3","EUR",2018-05-08,10:59
"4","EUR",2018-05-08,15:01

and my code is:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

ehsanLocationFiltered = 'hdfs://localhost:54310/user/oxclo/ehsanDbs/Filter.csv'
ehsanLocationNonFiltered = 'hdfs://localhost:54310/user/oxclo/ehsanDbs/NoFilter.csv'

df = sqlContext.read.format('com.databricks.spark.csv')\
.options(header='true', inferschema='true')\
.load(ehsanLocationNonFiltered)

dfFilter = sqlContext.read.format('com.databricks.spark.csv')\
.options(header='true', inferschema='true')\
.load(ehsanLocationFiltered)

data = df.rdd
dataFilter = dfFilter.rdd

# filter() returns a new RDD, so the result must be reassigned
data = data.filter(lambda row: row.Time > '07:00' and row.Time < '17:00')
dataFilter = dataFilter.filter(lambda row: row.Time > '07:00' and row.Time < '17:00')

print(data.count())
print(dataFilter.count())

I am expecting data.count() to return 4, since every time in NoFilter.csv falls inside the range, and dataFilter.count() to return 0, since no time in Filter.csv matches. (Comparing zero-padded HH:mm strings lexicographically does agree with chronological order, so the string comparison itself is sound.)

Thanks!

devesh

In your code you can use just 'csv' as the format; the CSV reader has been built into Spark since 2.0, so the com.databricks.spark.csv package is no longer needed:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

ehsanLocationFiltered = '/FileStore/tables/stackoverflow.csv'

data = sqlContext.read.format('csv')\
.options(header='true', inferschema='true')\
.load(ehsanLocationFiltered).rdd

# use filter(), not map(): map() yields a boolean per row, so count() would not change
result = data.filter(lambda row: row.Time > '07:00' and row.Time < '17:00')
print(result.count())
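If you prefer to stay in the DataFrame API instead of dropping to an RDD, the same filter can be written with column expressions. A minimal sketch under the same assumptions, reusing sqlContext and ehsanLocationFiltered from the snippet above:

from pyspark.sql import functions as F

df = sqlContext.read.format('csv')\
.options(header='true', inferschema='true')\
.load(ehsanLocationFiltered)

# zero-padded HH:mm strings compare correctly as plain strings
result = df.filter((F.col('Time') > '07:00') & (F.col('Time') < '17:00'))
print(result.count())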
