How to zip two columns, explode them, and finally pivot in PySpark

rd90080

I have two array columns (names, score). I need to explode both of them and use the values in names as the column names for the corresponding score values (similar to a pivot).

+----+-------------------------+-----------------------------------------+
|id  |names                    |score                                    |
+----+-------------------------+-----------------------------------------+
|ab01|[F1 , F2, F3, F4, F5]    |[00123, 000.001, 00127, 00.0123, 111]    |
|ab02|[F1 , F2, F3, F4, F5, F6]|[00124, 000.003, 00156, 00.067, 156, 254]|
|ab03|[F1 , F2, F3, F4, F5]    |[00234, 000.078, 00188, 00.0144, 188]    |
|ab04|[F1 , F2, F3, F4, F5]    |[00345, 000.01112, 001567, 00.0186, 555] |
+----+-------------------------+-----------------------------------------+

Expected output:

 id       F1      F2        F3        F4    F5  F6
ab01    00123   000.001    00127    00.0123 111 null
ab02    00124   000.003    00156    00.067  156 254
ab03    00234   000.078    00188    00.0144 188 null
ab04    00345   000.01112  001567   00.0186 555 null

I tried zipping names and score with a UDF and then exploding the result:

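import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# UDF that pairs each name with its score as an array of (names, score) structs
combine = F.udf(
    lambda x, y: list(zip(x, y)),
    ArrayType(
        StructType(
            [
                StructField("names", StringType()),
                StructField("score", StringType()),
            ]
        )
    ),
)

# Argument order matches the struct fields: names first, then score
df2 = (
    df.withColumn("new", combine("names", "score"))
    .withColumn("new", F.explode("new"))
    .select(
        "id",
        F.col("new.names").alias("names"),
        F.col("new.score").alias("score"),
    )
)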

I'm getting an error:

TypeError: zip argument #1 must support iteration

I also tried exploding using rdd flatMap() and I still get the same error.

Is there an alternative way to achieve this?

Thanks in advance.

Pygirl

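Try this with pandas (a Spark DataFrame would first need converting, e.g. with df.toPandas()):

import pandas as pd

# Explode both list columns in parallel, then pivot names into columns
df2 = df.set_index('id').apply(pd.Series.explode).reset_index()
df3 = df2.pivot(columns='names', values='score', index='id')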

df3:

names   F1       F2         F3      F4      F5  F6
id                      
ab01    00123   000.001     00127   00.0123 111 NaN
ab02    00124   000.003     00156   00.067  156 254
ab03    00234   000.078     00188   00.0144 188 NaN
ab04    00345   000.01112   001567  00.0186 555 NaN
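
For anyone who wants to run the explode-then-pivot idea end to end, here is a minimal sketch. It assumes pandas 1.3 or later and swaps in the multi-column form of DataFrame.explode in place of apply(pd.Series.explode); the two sample rows are copied from the question, and the names long_df and wide_df are just illustrative.

```python
import pandas as pd

# Two sample rows from the question; scores stay strings to keep leading zeros.
df = pd.DataFrame(
    {
        "id": ["ab01", "ab02"],
        "names": [
            ["F1", "F2", "F3", "F4", "F5"],
            ["F1", "F2", "F3", "F4", "F5", "F6"],
        ],
        "score": [
            ["00123", "000.001", "00127", "00.0123", "111"],
            ["00124", "000.003", "00156", "00.067", "156", "254"],
        ],
    }
)

# Explode both list columns in lockstep: one row per (id, name, score).
# Passing a list of columns requires pandas 1.3+ and equal list lengths per row.
long_df = df.explode(["names", "score"], ignore_index=True)

# Turn each distinct name into its own column; missing names become NaN.
wide_df = long_df.pivot(index="id", columns="names", values="score")
print(wide_df)
```

Rows that lack a name (F6 for ab01 here) come out as NaN, which matches the null cells in the expected output.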

Edit: alternatively, build one dict per row from the zipped names and scores:

x = (df.apply(lambda x: dict(zip(x['names'], x['score'])), axis=1))
y = pd.DataFrame(x.values.tolist(), index=x.index).fillna("null").join(df.id)

or

x = (df.apply(lambda x: dict(zip(x['names'], x['score'])), axis=1))
z = pd.DataFrame(x.values.tolist(), index=x.index).fillna("null")
y = pd.concat([df.id , z], axis=1)

y:

    F1      F2         F3       F4      F5  F6      id
0   00123   000.001    00127    00.0123 111 null    ab01
1   00124   000.003    00156    00.067  156 254     ab02
2   00234   000.078    00188    00.0144 188 null    ab03
3   00345   000.01112  001567   00.0186 555 null    ab04
