Add a new column to a DataFrame depending on another DataFrame's row values

Sonali Sharma

I need to add a new column to dataframe DF1, but the new column's value should be calculated from the values of other columns in that same dataframe. Which columns to use is specified in another dataframe, DF2.
E.g. DF1:

+----------+---------+------------+------------+
|protocolNo|serialNum|testMethod  |testProperty|
+----------+---------+------------+------------+
|Product1  |AB       |testMethod1 |TP1         |
|Product2  |CD       |testMethod2 |TP2         |
+----------+---------+------------+------------+

DF2:

+------+-----+------------------------+------------+
|action|type |value                   |exploded    |
+------+-----+------------------------+------------+
|append|hash |[protocolNo]            |protocolNo  |
|append|text |_                       |_           |
|append|hash |[serialNum,testProperty]|serialNum   |
|append|hash |[serialNum,testProperty]|testProperty|
+------+-----+------------------------+------------+

When the type column in DF2 is hash, the value in the exploded column is the name of a column in DF1.

Required: a new column should be created in DF1, with its value calculated as follows:

hash[protocolNo]_hash[serialNumTestProperty], where each column name is replaced by the corresponding value from that row.

E.g. for row 1 of DF1, the column value should be

hash[Product1]_hash[ABTP1]

After hashing, this will result in something like abc-df_egh-45e.

The above procedure should be followed for each and every row of DF1.

I've tried using map and withColumn with a UDF on DF1. But inside the UDF the outer dataframe is not accessible (it throws a NullPointerException), and I'm also not able to pass a DataFrame as input to a UDF.

Input DFs would be DF1 and DF2 as mentioned above.

Desired output DF:

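+----------+---------+------------+------------+--------------+
|protocolNo|serialNum|testMethod  |testProperty|newColumn     |
+----------+---------+------------+------------+--------------+
|Product1  |AB       |testMethod1 |TP1         |abc-df_egh-4je|
|Product2  |CD       |testMethod2 |TP2         |dfg-df_ijk-r56|
+----------+---------+------------+------------+--------------+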

The newColumn values shown are the values after hashing.

Ganesh

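Instead of using DF2 directly, you can translate it into instances of a case class that act as specifications, e.g.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat, md5}

// `type` is a reserved word in Scala, so the varargs field is named `types`.
case class Spec(columnName: String, inputColumns: Seq[String], action: String, types: String*)

Create instances of the above class:

val specifications = Seq(
  Spec("new_col_name", Seq("serialNum", "testProperty"), "hash", "append")
)

Then define how a single specification is applied to a dataframe:

def transformColumn(spec: Spec)(df: DataFrame): DataFrame =
  spec.types.foldLeft(df) { (acc, tpe) =>
    tpe match {
      case "append" =>
        // Match on the action, build the value, then add it with withColumn.
        spec.action match {
          // md5 is only a placeholder; the question does not name a hash function.
          case "hash" => acc.withColumn(spec.columnName, md5(concat(spec.inputColumns.map(col): _*)))
          case _      => acc
        }
      case _ => acc
    }
  }

and fold the specifications over DF1 (called dtFrm here):

val transformed = specifications
  .foldLeft(dtFrm)((df, spec) => df.transform(transformColumn(spec)))

The syntax may not be exact, so treat this as a sketch.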
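On the NullPointerException: a DataFrame only exists on the driver, so it cannot be referenced from inside a UDF that runs on the executors. Since DF2 is small configuration data, one way around this is to collect it to the driver and build a single Column expression from it. The following is a minimal sketch under some assumptions that are not in the question: DF2 is used in its pre-exploded form (one row per rule, with value as an array of strings and the text rule holding Seq("_")), collect() returns the rules in the intended order, md5 stands in for the unspecified hash, and the names Rule, ruleToColumn and addNewColumn are made up for illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat, lit, md5}

object CompositeKeySketch {
  // One rule from DF2 before it was exploded; `kind` holds DF2's `type` column.
  // (Hypothetical helper type, not part of the question.)
  final case class Rule(kind: String, value: Seq[String])

  // Turn a single rule into a Column expression over DF1's columns.
  def ruleToColumn(rule: Rule): Column = rule.kind match {
    // Assumption: md5 is a stand-in for whatever hash is actually required.
    case "hash" => md5(concat(rule.value.map(col): _*))
    case "text" => lit(rule.value.mkString)
    case other  => throw new IllegalArgumentException(s"unsupported type: $other")
  }

  // Concatenate the per-rule expressions into one new column on DF1.
  def addNewColumn(df1: DataFrame, rules: Seq[Rule]): DataFrame =
    df1.withColumn("newColumn", concat(rules.map(ruleToColumn): _*))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._

    val df1 = Seq(
      ("Product1", "AB", "testMethod1", "TP1"),
      ("Product2", "CD", "testMethod2", "TP2")
    ).toDF("protocolNo", "serialNum", "testMethod", "testProperty")

    // Assumed pre-exploded shape of DF2: one row per rule.
    val df2 = Seq(
      ("append", "hash", Seq("protocolNo")),
      ("append", "text", Seq("_")),
      ("append", "hash", Seq("serialNum", "testProperty"))
    ).toDF("action", "type", "value")

    // DF2 is small, so bring the rules to the driver before touching DF1.
    val rules = df2.select("type", "value").collect().toSeq
      .map(r => Rule(r.getString(0), r.getSeq[String](1)))

    addNewColumn(df1, rules).show(false)
    spark.stop()
  }
}
```

Because the result is an ordinary Column expression, no UDF is needed and Spark evaluates it for every row of DF1. If the order of the rules matters, an explicit ordering column in DF2 would be safer than relying on the order in which collect() returns rows.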
