PySpark: create a new data frame by updating a few columns of an old data frame

Spark user

I want to create a new data frame in PySpark by updating the data in a few columns of an old data frame.

I have the data frame below, stored in Parquet format, with the columns uid, name, start_dt, addr, extid:

df = spark.read.parquet("s3a://testdata?src=ggl")
df1 = df.select("uid")

I have to create a new data frame in Parquet with uid and extid hashed, keeping the remaining columns as they are. Please suggest how to do this. I am new :(

Sample input:

uid, name, start_dt, addr, extid
1124569-2, abc, 12/02/2018, 343 Beach Dr Newyork NY, 889

Sample output:

uid, name, start_dt, addr, extid
a8ghshd345698cd, abc, 12/02/2018, 343 Beach Dr Newyork NY, shhj676ssdhghje

Here uid and extid are sha256 hashed.

Thanks in advance.

Manoj Singh

You can create a UDF that calls hashlib.sha256() on the column value, then use withColumn to replace the column with its hashed version.

import hashlib

import pyspark.sql.functions as F
import pyspark.sql.types as T

df = spark.read.parquet("s3a://testdata?src=ggl")

# UDF that returns the SHA-256 hex digest of a value's string form
sha256_udf = F.udf(lambda x: hashlib.sha256(str(x).encode('utf-8')).hexdigest(), T.StringType())

# Overwrite uid and extid; withColumn leaves every other column unchanged
df1 = df.withColumn('uid', sha256_udf('uid')).withColumn('extid', sha256_udf('extid'))
df1.show()
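
One thing to watch with the UDF above: it passes every value through str(), so a null uid or extid would be hashed as the literal string "None". Below is a minimal sketch of a null-safe variant, assuming you want nulls to stay null. The names sha256_hex and hash_columns are made up for this illustration and do not come from Spark.

```python
import hashlib


def sha256_hex(value):
    """Return the SHA-256 hex digest of value's string form, or None for nulls."""
    if value is None:
        return None  # keep nulls as nulls instead of hashing the text "None"
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def hash_columns(df, columns):
    """Return a new data frame with each named column replaced by its hash.

    Hypothetical helper: df is an existing Spark data frame and columns is a
    list of column names. All other columns are carried over unchanged.
    """
    # Imported here so sha256_hex can be tried without a Spark installation.
    import pyspark.sql.functions as F
    import pyspark.sql.types as T

    sha256_udf = F.udf(sha256_hex, T.StringType())
    for name in columns:
        df = df.withColumn(name, sha256_udf(F.col(name)))
    return df
```

With the data frame from the question this would be called as df1 = hash_columns(df, ["uid", "extid"]), and df1.write.parquet(...) with an output location of your choice would save the result as Parquet. Spark also has a built-in pyspark.sql.functions.sha2(col, 256) that avoids the Python UDF overhead; for string columns it should give the same hex digest, but check it on your own data before relying on that.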
