How to drop columns based on multiple filters in a dataframe using PySpark?

Aviral Srivastava

I have a list of valid values that a cell can contain. If even one cell in a column holds an invalid value, I need to drop the whole column. I understand there are answers about dropping rows based on a column's values, but here I want to drop the entire column instead, even if only one cell in it is invalid. A cell is valid only if it contains one of three values: ['Messi', 'Ronaldo', 'Virgil'].

I tried reading about filtering, but everything I found was about filtering on columns and dropping rows, for instance in this question. I also read that one should avoid too much scanning and shuffling in Spark, which I agree with.

I am not only looking for a code solution, but also for whatever off-the-shelf functionality PySpark already provides. I hope this doesn't go beyond the scope of a SO answer.

For the following input dataframe:

| Column 1      | Column 2      | Column 3      | Column 4      | Column 5      |
| --------------| --------------| --------------| --------------| --------------|
|  Ronaldo      | Salah         |  Messi        |               |Salah          |
|  Ronaldo      | Messi         |  Virgil       |  Messi        | null          |
|  Ronaldo      | Ronaldo       |  Messi        |  Ronaldo      | null          |

I expect the following output:

| Column 1      | Column 3      |
| --------------| --------------|
|  Ronaldo      | Messi         |
|  Ronaldo      | Virgil        |
|  Ronaldo      | Messi         |
pault

> I am not only looking for a code solution, but also for whatever off-the-shelf functionality PySpark already provides.

Unfortunately, Spark is designed to operate in parallel on a row-by-row basis. Filtering out columns is not something for which there will be an "off-the-shelf code" solution.

Nevertheless, here is one approach you can take:
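(To make the snippets below runnable, assume `df` holds the sample data from the question; the SparkSession handle `spark` and the use of `None` for both the blank and the null cells are assumptions on my part.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data mirroring the input table in the question; None stands in for blank/null cells
df = spark.createDataFrame(
    [
        ("Ronaldo", "Salah", "Messi", None, "Salah"),
        ("Ronaldo", "Messi", "Virgil", "Messi", None),
        ("Ronaldo", "Ronaldo", "Messi", "Ronaldo", None),
    ],
    ["Column 1", "Column 2", "Column 3", "Column 4", "Column 5"],
)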

First collect the counts of the invalid elements in each column.

from pyspark.sql.functions import col, lit, sum as _sum, when

valid = ['Messi', 'Ronaldo', 'Virgil']
# For each column, sum a flag that is 1 when the value is not in `valid` and 0 otherwise
invalid_counts = df.select(
    *[_sum(when(col(c).isin(valid), lit(0)).otherwise(lit(1))).alias(c) for c in df.columns]
).collect()
print(invalid_counts)
#[Row(Column 1=0, Column 2=1, Column 3=0, Column 4=1, Column 5=3)]
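Note that null and blank cells are counted as invalid here: for a null value, isin returns null, so the when condition is not satisfied and the otherwise(lit(1)) branch is used.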

The result is a list containing a single Row. You can iterate over its items to find the columns to keep.

valid_columns = [k for k,v in invalid_counts[0].asDict().items() if v == 0]
print(valid_columns)
#['Column 3', 'Column 1']

Now just select these columns from your original DataFrame. You can first sort valid_columns using list.index if you want to maintain the original column order.

valid_columns = sorted(valid_columns, key=df.columns.index)
df.select(valid_columns).show()
#+--------+--------+
#|Column 1|Column 3|
#+--------+--------+
#| Ronaldo|   Messi|
#| Ronaldo|  Virgil|
#| Ronaldo|   Messi|
#+--------+--------+
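If you need this in more than one place, the same steps can be wrapped in a small helper. This is just an illustrative sketch (the function name drop_invalid_columns is not part of the PySpark API); iterating over df.columns keeps the original column order, so no extra sort is needed.

from pyspark.sql.functions import col, lit, sum as _sum, when

def drop_invalid_columns(df, valid):
    # Count invalid values per column, then keep only the columns with zero invalid values
    counts = df.select(
        *[_sum(when(col(c).isin(valid), lit(0)).otherwise(lit(1))).alias(c) for c in df.columns]
    ).collect()[0].asDict()
    keep = [c for c in df.columns if counts[c] == 0]
    return df.select(keep)

df_clean = drop_invalid_columns(df, ['Messi', 'Ronaldo', 'Virgil'])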
