Load multiple files from multiple folders in Spark

Asif

I have a data set consisting of a main folder that contains multiple folders, each of which contains multiple CSV files. Every CSV file has three columns named X, Y and Z. I want to create a dataframe whose first three columns are X, Y and Z, with two more columns: the fourth should contain the name of the folder the CSV file was read from, and the fifth the name of the CSV file. How can I create this dataframe in Scala and Spark?

Shu

You can read the files with spark.read.csv, use input_file_name to get the full path of the file each row came from, and then extract the directory from that path.

Example:

1. Extracting the directory from the filename:

// Assume a directory `tmp2` whose subfolders contain CSV files
tmp2
|-folder1
|-folder2

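// Extract the directory from the full file path
import org.apache.spark.sql.functions._

spark.read.option("header",true).
csv("tmp2/*").
withColumn("file_name",input_file_name()).
// reverse the path segments; the 2nd one is the parent folder
withColumn("directory",element_at(reverse(split(col("file_name"),"/")),2)).
show(false)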

//+----+---+---------------------------+---------+
//|name|id |file_name                  |directory|
//+----+---+---------------------------+---------+
//|2   |b  |file:///tmp2/folder2/t1.csv|folder2  |
//|1   |a  |file:///tmp2/folder1/t.csv |folder1  |
//+----+---+---------------------------+---------+
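To get exactly the five columns asked for in the question, the same idea can be wrapped up as below. This is only a minimal sketch: the helper name readWithFolderAndFile and the column names folder and file are made up for illustration, and it assumes Spark 2.4 or later (for element_at), CSV files sitting exactly one level below the root folder, and a header row naming the columns X, Y and Z.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{element_at, input_file_name, split}

// Hypothetical helper (not part of Spark): reads every CSV one level below
// `root` and appends the parent folder name and the CSV file name.
def readWithFolderAndFile(spark: SparkSession, root: String): DataFrame = {
  // Split the full path (e.g. file:///tmp2/folder1/t.csv) on "/".
  val parts = split(input_file_name(), "/")
  spark.read
    .option("header", "true")
    .csv(s"$root/*/*.csv")
    .withColumn("folder", element_at(parts, -2)) // second-to-last segment
    .withColumn("file", element_at(parts, -1))   // last segment
    .select("X", "Y", "Z", "folder", "file")     // assumes headers X, Y, Z
}
```

In spark-shell this could be called as readWithFolderAndFile(spark, "tmp2").show(false). A negative index in element_at counts from the end of the array, so no reverse is needed here.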

2. Getting the folder name while reading the files:

If your folders are named in the form folder=<val>, Spark treats each folder as a partition and automatically adds folder as a partition column to the dataframe.

//folder structure

tmp3
|-folder=1
|-folder=2

spark.read.
option("header",true).
csv("tmp3").
withColumn("file_name",input_file_name()).
show(false)

//+----+---+------+---------------------------+
//|name|id |folder|file_name                  |
//+----+---+------+---------------------------+
//|a   |1  |2     |file:///tmp3/folder=2/t.txt|
//|a   |1  |1     |file:///tmp3/folder=1/t.txt|
//+----+---+------+---------------------------+
