Check content of multiple columns of one row and add new column with value depending on contents

user2811630

I have a Dataset that contains channel information. I want to aggregate, for example, all channels starting with X_: if any of their status values is "not okay", the value in the new column should also be "not okay"; otherwise it should be "okay".

+-----------------+-----------------+-----------------+-----------------+----------------+
|X_ChannelA_status|Y_ChannelB_status|X_ChannelC_status|X_ChannelD_status|X_channel_status|
+-----------------+-----------------+-----------------+-----------------+----------------+
|         not okay|             okay|             okay|         not okay|            true|
|         not okay|         not okay|         not okay|         not okay|            true|
+-----------------+-----------------+-----------------+-----------------+----------------+

I already achieved something like this by mapping the strings to integers, with "not okay" = 1 and "okay" = 0. I then summed all the columns into a new one; if the value in the new column was > 0, at least one of the columns had to contain "not okay".

val df_grouped = df_filtered.select(list_groupX.map(col).reduce((c1, c2) => c1 + c2) as "sum")

I would love to get rid of the string-to-int mapping, since I think it slows down the calculation.

Ramesh Maharjan

You can do this with the built-in array and array_contains functions together with withColumn. First, find the column names starting with X:

val xStartingCols = df.columns.filter(_.startsWith("X"))

Then use those column names to build the condition with when and otherwise:

import org.apache.spark.sql.functions._
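// array(...) packs the X columns into one array column; array_contains already returns a boolean, so no comparison with lit(true) is needed
df.withColumn("new_col", when(array_contains(array(xStartingCols.map(col): _*), "not okay"), "not okay").otherwise("okay"))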

This should give you the desired output DataFrame:

+-----------------+-----------------+-----------------+-----------------+----------------+--------+
|X_ChannelA_status|Y_ChannelB_status|X_ChannelC_status|X_ChannelD_status|X_channel_status|new_col |
+-----------------+-----------------+-----------------+-----------------+----------------+--------+
|okay             |okay             |okay             |okay             |true            |okay    |
|not okay         |not okay         |not okay         |not okay         |true            |not okay|
+-----------------+-----------------+-----------------+-----------------+----------------+--------+
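
As a possible alternative that also avoids the string-to-int mapping from the question, the per-column comparisons can be OR-ed together directly instead of going through an array. The following is only a minimal sketch: the local SparkSession, the sample rows, the object name ChannelStatusSketch and the output column name are assumptions made for illustration, and it assumes at least one column matches the X_ prefix.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Hypothetical wrapper object, used only to make the sketch runnable
object ChannelStatusSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("channel-status-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample data mirroring the column names in the question
    val df = Seq(
      ("not okay", "okay", "okay", "not okay"),
      ("okay", "not okay", "okay", "okay")
    ).toDF("X_ChannelA_status", "Y_ChannelB_status", "X_ChannelC_status", "X_ChannelD_status")

    // Names of all columns in the X_ group
    val xCols = df.columns.filter(_.startsWith("X_"))

    // One boolean Column per status column, combined with logical OR.
    // reduce throws on an empty collection, so xCols must not be empty.
    val anyNotOkay = xCols.map(c => col(c) === "not okay").reduce(_ || _)

    val result = df.withColumn(
      "X_channel_status",
      when(anyNotOkay, "not okay").otherwise("okay")
    )

    result.show(false)
    spark.stop()
  }
}
```

With this sample data the first row should end up as "not okay" and the second as "okay", since the only "not okay" in the second row sits in a Y_ column. Whether this is faster than the array_contains version would need to be measured on the real data.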
