Repartitioning a PySpark DataFrame by multiple columns

user7298979

EDIT: adding more context to the question after rereading my post:

Let's say I have a PySpark DataFrame that I am working with, and currently I repartition it like this:

dataframe.repartition(200, col_name)

I then write that partitioned DataFrame out to Parquet. When I inspect the output, the directory in the warehouse is partitioned the way I want:

/apps/hive/warehouse/db/DATE/col_name=1
/apps/hive/warehouse/db/DATE/col_name=2

I want to understand how I can partition this in multiple layers, meaning one column for the top-level partition, a second column for the second-level partition, and a third column for the third-level partition. Is it as easy as adding a partitionBy() to the write?

dataframe.write.mode("overwrite").partitionBy("col_name1", "col_name2", "col_name3").parquet("/apps/hive/warehouse/db/DATE")

Thus creating directories nested like this?

/apps/hive/warehouse/db/DATE/col_name1=1/
    col_name2=1/
        col_name3=1/

If so, can I use partitionBy() to cap the maximum number of files written per partition?

Ramdev Sharma

Repartition

The repartition function controls the in-memory partitioning of the data. If you specify repartition(200), the DataFrame will have 200 partitions in memory.
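As a minimal sketch (the SparkSession setup and sample data here are illustrative assumptions, not from the original post), you can verify the in-memory partition count with getNumPartitions():

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data standing in for the real DataFrame
dataframe = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c")],
    ["col_name", "value"],
)

print(dataframe.rdd.getNumPartitions())  # whatever the source produced

repartitioned = dataframe.repartition(200, "col_name")
print(repartitioned.rdd.getNumPartitions())  # 200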

Physical partitioning on the file system

The partitionBy function, given a list of columns, controls the directory structure: a physical partition (directory) is created for each column name and column value. Each partition directory can contain up to as many files as there are in-memory partitions (200 in this example; the default, spark.sql.shuffle.partitions, is also 200), provided there is enough data to write.

Here is a sample based on your question:

(dataframe
    .repartition(200)
    .write
    .mode("overwrite")
    .partitionBy("col_name1", "col_name2", "col_name3")
    .parquet("/apps/hive/warehouse/db/DATE"))

This can write up to 200 files into each partition directory, and the partition directories are nested in the given column order.
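To the follow-up question about a maximum number of files per partition: partitionBy itself does not take a file-count argument, but two common workarounds exist. A hedged sketch, reusing the column names and output path from the question (both assumptions):

# Option 1: repartition by the partition columns first, so all rows for a
# given (col_name1, col_name2, col_name3) combination land in one task,
# which typically yields a single file per output directory.
(dataframe
    .repartition("col_name1", "col_name2", "col_name3")
    .write
    .mode("overwrite")
    .partitionBy("col_name1", "col_name2", "col_name3")
    .parquet("/apps/hive/warehouse/db/DATE"))

# Option 2: cap file size indirectly by limiting records per file
# (a DataFrameWriter option available since Spark 2.2).
(dataframe
    .write
    .option("maxRecordsPerFile", 1000000)
    .mode("overwrite")
    .partitionBy("col_name1", "col_name2", "col_name3")
    .parquet("/apps/hive/warehouse/db/DATE"))

Note that Option 1 trades file count for skew: a heavily populated partition value still lands in a single task, which can be slow or memory-hungry for very large partitions.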
