Repartitioning a PySpark DataFrame by multiple columns

user7298979

EDIT: adding more context to the question after rereading my post:

Let's say I have a PySpark DataFrame that I am working with, and currently I repartition it like this:

dataframe.repartition(200, col_name)

I then write that partitioned DataFrame out to Parquet. When I inspect the output, the directory in the warehouse is partitioned the way I want:

/apps/hive/warehouse/db/DATE/col_name=1
/apps/hive/warehouse/db/DATE/col_name=2

I want to understand how I can partition this in multiple layers, meaning one column for the top-level partition, a second column for the second-level partition, and a third column for the third-level partition. Is it as easy as adding a partitionBy() to the write?

dataframe.write.mode("overwrite").partitionBy("col_name1", "col_name2", "col_name3").parquet("/apps/hive/warehouse/db/DATE")

Thus creating directories nested like this?

/apps/hive/warehouse/db/DATE/col_name1=1/
    col_name2=1/
        col_name3=1/

If so, can I use partitionBy() to cap the maximum number of files written per partition?

Ramdev Sharma

Repartition

The repartition function controls the in-memory partitioning of the data. If you specify repartition(200), the DataFrame will have 200 partitions in memory.
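As a minimal sketch (the SparkSession setup and sample data here are illustrative assumptions, not from the original post), you can verify the in-memory partition count with getNumPartitions():

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data standing in for the real DataFrame
dataframe = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c")],
    ["col_name", "value"],
)

print(dataframe.rdd.getNumPartitions())  # whatever the source produced

repartitioned = dataframe.repartition(200, "col_name")
print(repartitioned.rdd.getNumPartitions())  # 200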

Physical partitioning on the file system

The partitionBy function, given a list of columns, controls the directory structure: a physical partition (directory) is created for each column name and column value. Each partition directory can contain up to as many files as there are in-memory partitions (200 in this example; the default, spark.sql.shuffle.partitions, is also 200), provided there is enough data to write.

Here is a sample based on your question:

(dataframe
    .repartition(200)
    .write
    .mode("overwrite")
    .partitionBy("col_name1", "col_name2", "col_name3")
    .parquet("/apps/hive/warehouse/db/DATE"))

This can write up to 200 files into each partition directory, and the partition directories are nested in the given column order.
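To the follow-up question about a maximum number of files per partition: partitionBy itself does not take a file-count argument, but two common workarounds exist. A hedged sketch, reusing the column names and output path from the question (both assumptions):

# Option 1: repartition by the partition columns first, so all rows for a
# given (col_name1, col_name2, col_name3) combination land in one task,
# which typically yields a single file per output directory.
(dataframe
    .repartition("col_name1", "col_name2", "col_name3")
    .write
    .mode("overwrite")
    .partitionBy("col_name1", "col_name2", "col_name3")
    .parquet("/apps/hive/warehouse/db/DATE"))

# Option 2: cap file size indirectly by limiting records per file
# (a DataFrameWriter option available since Spark 2.2).
(dataframe
    .write
    .option("maxRecordsPerFile", 1000000)
    .mode("overwrite")
    .partitionBy("col_name1", "col_name2", "col_name3")
    .parquet("/apps/hive/warehouse/db/DATE"))

Note that Option 1 trades file count for skew: a heavily populated partition value still lands in a single task, which can be slow or memory-hungry for very large partitions.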
