I want to create a new data frame in PySpark by updating the data in a few columns of an existing data frame.
I have the data frame below, stored in Parquet format, with the columns uid, name, start_dt, addr, and extid:
df = spark.read.parquet("s3a://testdata?src=ggl")
df1 = df.select("uid")
I need to create a new data frame in Parquet in which uid and extid are hashed and the remaining columns are included unchanged. How can I do this? I am new to PySpark.
Sample input:
uid, name, start_dt, addr, extid
1124569-2, abc, 12/02/2018, 343 Beach Dr Newyork NY, 889
Sample output:
uid, name, start_dt, addr, extid
a8ghshd345698cd, abc, 12/02/2018, 343 Beach Dr Newyork NY, shhj676ssdhghje
Here uid and extid are sha256 hashed.
Thanks in advance.
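You can create a UDF that calls hashlib.sha256() on each value and apply it with withColumn to transform the columns. Because withColumn replaces a column when given an existing column name, the remaining columns are carried over unchanged.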
import pyspark.sql.functions as F
import pyspark.sql.types as T
import hashlib
df = spark.read.parquet("s3a://testdata?src=ggl")
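# Wrap hashlib.sha256 in a UDF; str(x) lets non-string values be hashed too (a null becomes the string 'None')
sha256_udf = F.udf(lambda x: hashlib.sha256(str(x).encode('utf-8')).hexdigest(), T.StringType())
# Reusing the names 'uid' and 'extid' overwrites those columns; all other columns are kept as-is
df1 = df.withColumn('uid', sha256_udf('uid')).withColumn('extid', sha256_udf('extid'))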
df1.show()
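To see what the UDF computes for each value, here is a minimal plain-Python sketch of the same hashing step, runnable without Spark. The function name sha256_hex is hypothetical, and returning None for missing values is an assumption about the desired behavior; the lambda above would instead hash the string 'None'.

```python
import hashlib
from typing import Optional


def sha256_hex(value: object) -> Optional[str]:
    """Return the SHA-256 hex digest of str(value), or None if value is None."""
    if value is None:
        # Assumed choice: keep missing values missing instead of hashing "None".
        return None
    # Same steps as the lambda in the answer: stringify, UTF-8 encode, hash, hex-encode.
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    digest = sha256_hex("1124569-2")
    print(digest)
    print(len(digest))  # 64 (a SHA-256 hex digest is always 64 characters)
    # An integer and its string form hash identically because of str(value).
    print(sha256_hex(889) == sha256_hex("889"))  # True
    print(sha256_hex(None))  # None
```

If the null handling is wanted in Spark, passing this function to F.udf(sha256_hex, T.StringType()) in place of the lambda should work. PySpark also provides a built-in pyspark.sql.functions.sha2(col, 256), which may be worth trying on string columns because it avoids the overhead of a Python UDF.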