PySpark DataFrame filtering on multiple columns

anurag

I have a PySpark DataFrame, df, which looks like this:

num11  num21
10     10
20     30
5      25
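
For reproducibility, the sample DataFrame can be built like this (a minimal sketch; it assumes a local SparkSession bound to the name spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # reuse or create a session
df = spark.createDataFrame(
    [(10, 10), (20, 30), (5, 25)],           # rows from the table above
    ["num11", "num21"],                      # column names from the table above
)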

I am filtering the above DataFrame on all of its columns, selecting rows where the value in every column is greater than or equal to 10 (the number of columns can be more than two):

from pyspark.sql.functions import col

col_list = df.schema.names
df_filtered = df.where(col(c) >= 10 for c in col_list)

The desired output is:

num11  num21
10     10
20     30

How can I achieve filtering on multiple columns by iterating over the column list as above? All efforts are appreciated.

The error I receive is: condition should be string or Column

Psidom

DataFrame.where expects a single Column (or SQL string) as the condition, so passing it a generator of conditions raises the error you see. You can use functools.reduce to combine the per-column conditions into one Column, simulating an all condition; for instance, reduce(lambda x, y: x & y, ...):

import pyspark.sql.functions as F
from functools import reduce

df.where(reduce(lambda x, y: x & y, (F.col(c) >= 10 for c in df.columns))).show()
+-----+-----+
|num11|num21|
+-----+-----+
|   10|   10|
|   20|   30|
+-----+-----+
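
As a variation, the same combined condition can be built with operator.and_ instead of a lambda. This is only a sketch against the df above; swapping in operator.or_ (or | in the lambda form) would keep rows where any column is >= 10 instead of all of them:

import operator
from functools import reduce
import pyspark.sql.functions as F

# Build one ">= 10" condition per column, then AND them into a single Column
all_cond = reduce(operator.and_, [F.col(c) >= 10 for c in df.columns])
df.where(all_cond).show()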
