Update a column value in a Spark DataFrame based on another column

Anandha Geethan

I have a Spark DataFrame, created as shown below.

import spark.implicits._   // needed for .toDF when not running in spark-shell

val data = spark.sparkContext.parallelize(Seq(
    (1, "", "SNACKS", "BISCUITS - AMBIENT", "BISCUITS - AMBIENT", "", "REFLETS DE FRANCE CROQUANT", "UNCOATED  BISCUIT", "NO PROMOTION", "", "", "400G", "", ""),
    (2, "GROCERY", "BISCUITS", "SWEET BISCUITS ", "BISCUITS - AMBIENT", "", "", "AMBIENT BISCUIT", "NO PROMOTION", "", "", "400G", "", "CHOCOS")
  ))
  .toDF("id", "c4", "c1001", "c1002", "c1003", "c1008", "c1008_unmasked", "c1009", "c1011", "c1012", "c1013", "c1015", "c1016", "c1016_unmasked")

data.show(false)

SAMPLE INPUT:

+---+-------+--------+------------------+------------------+-----+--------------------------+-----------------+------------+-----+-----+-----+-----+--------------+
|id |c4     |c1001   |c1002             |c1003             |c1008|c1008_unmasked            |c1009            |c1011       |c1012|c1013|c1015|c1016|c1016_unmasked|
+---+-------+--------+------------------+------------------+-----+--------------------------+-----------------+------------+-----+-----+-----+-----+--------------+
|1  |       |SNACKS  |BISCUITS - AMBIENT|BISCUITS - AMBIENT|     |REFLETS DE FRANCE CROQUANT|UNCOATED  BISCUIT|NO PROMOTION|     |     |400G |     |              |
|2  |GROCERY|BISCUITS|SWEET BISCUITS    |BISCUITS - AMBIENT|     |                          |AMBIENT BISCUIT  |NO PROMOTION|     |     |400G |     |CHOCOS        |
+---+-------+--------+------------------+------------------+-----+--------------------------+-----------------+------------+-----+-----+-----+-----+--------------+

I need to populate each column cXXXX with the value "MASKED" only if the corresponding cXXXX_unmasked column has a value in it. Please check the sample output for a better understanding.

SAMPLE OUTPUT:

+---+-------+--------+------------------+------------------+------+--------------------------+-----------------+------------+-----+-----+-----+------+--------------+
|id |c4     |c1001   |c1002             |c1003             |c1008 |c1008_unmasked            |c1009            |c1011       |c1012|c1013|c1015|c1016 |c1016_unmasked|
+---+-------+--------+------------------+------------------+------+--------------------------+-----------------+------------+-----+-----+-----+------+--------------+
|1  |       |SNACKS  |BISCUITS - AMBIENT|BISCUITS - AMBIENT|MASKED|REFLETS DE FRANCE CROQUANT|UNCOATED  BISCUIT|NO PROMOTION|     |     |400G |      |              |
|2  |GROCERY|BISCUITS|SWEET BISCUITS    |BISCUITS - AMBIENT|      |                          |AMBIENT BISCUIT  |NO PROMOTION|     |     |400G |MASKED|CHOCOS        |
+---+-------+--------+------------------+------------------+------+--------------------------+-----------------+------------+-----+-----+-----+------+--------------+

Thanks in advance

Lamanus

Here is my attempt:

import org.apache.spark.sql.functions.{col, lit, when}

// All of the *_unmasked columns.
val cols = data.columns.filter(_.endsWith("_unmasked"))

// For each unmasked column, overwrite its base column (e.g. c1008 for
// c1008_unmasked) with "MASKED" when the unmasked value is non-empty.
val new_data = cols.foldLeft(data) { (df, c) =>
  df.withColumn(c.split("_").head, when(col(c) =!= "" && col(c).isNotNull, lit("MASKED")).otherwise(col(c)))
}
new_data.show

+---+-------+--------+------------------+------------------+------+--------------------+-----------------+------------+-----+-----+-----+------+--------------+
| id|     c4|   c1001|             c1002|             c1003| c1008|      c1008_unmasked|            c1009|       c1011|c1012|c1013|c1015| c1016|c1016_unmasked|
+---+-------+--------+------------------+------------------+------+--------------------+-----------------+------------+-----+-----+-----+------+--------------+
|  1|       |  SNACKS|BISCUITS - AMBIENT|BISCUITS - AMBIENT|MASKED|REFLETS DE FRANCE...|UNCOATED  BISCUIT|NO PROMOTION|     |     | 400G|      |              |
|  2|GROCERY|BISCUITS|   SWEET BISCUITS |BISCUITS - AMBIENT|      |                    |  AMBIENT BISCUIT|NO PROMOTION|     |     | 400G|MASKED|        CHOCOS|
+---+-------+--------+------------------+------------------+------+--------------------+-----------------+------------+-----+-----+-----+------+--------------+
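If there are many cXXXX/cXXXX_unmasked pairs, a single select built from a mapped column list avoids chaining a long series of withColumn calls (each of which adds another projection to the query plan). Below is a minimal sketch of an equivalent approach; the names maskedFor, projected and new_data2 are only illustrative. It keeps the original cXXXX value when the unmasked column is empty, which matches the sample output above.

import org.apache.spark.sql.functions.{col, lit, when}

// Map each base column name (e.g. c1008) to its *_unmasked counterpart.
val maskedFor = data.columns
  .filter(_.endsWith("_unmasked"))
  .map(c => c.stripSuffix("_unmasked") -> c)
  .toMap

// Rewrite the base columns; pass everything else through unchanged.
val projected = data.columns.map { name =>
  maskedFor.get(name) match {
    case Some(src) =>
      when(col(src).isNotNull && col(src) =!= "", lit("MASKED"))
        .otherwise(col(name))
        .as(name)
    case None => col(name)
  }
}

val new_data2 = data.select(projected: _*)
new_data2.show(false)

Both versions produce the output shown above; the foldLeft/withColumn form is shorter, while the single select keeps the plan flat when there are many masked columns.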

