How to zip two columns, explode them and finally pivot in Pyspark

rd90080 Published at Dev

I have two array columns (names, score). I need to explode both of them and use the values in names as the column names for the corresponding score values (similar to a pivot).

+----+-------------------------+-----------------------------------------+
|id  |names                    |score                                    |
+----+-------------------------+-----------------------------------------+
|ab01|[F1 , F2, F3, F4, F5]    |[00123, 000.001, 00127, 00.0123, 111]    |
|ab02|[F1 , F2, F3, F4, F5, F6]|[00124, 000.003, 00156, 00.067, 156, 254]|
|ab03|[F1 , F2, F3, F4, F5]    |[00234, 000.078, 00188, 00.0144, 188]    |
|ab04|[F1 , F2, F3, F4, F5]    |[00345, 000.01112, 001567, 00.0186, 555] |
+----+-------------------------+-----------------------------------------+

Expected output:

id    F1     F2         F3      F4       F5   F6
ab01  00123  000.001    00127   00.0123  111  null
ab02  00124  000.003    00156   00.067   156  254
ab03  00234  000.078    00188   00.0144  188  null
ab04  00345  000.01112  001567  00.0186  555  null

I tried zipping names and score and then exploding the result:

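from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# UDF that pairs each name with its score as an array of (names, score) structs.
# Note: zip() raises a TypeError if either argument is None (a null array).
combine = F.udf(
    lambda names, scores: list(zip(names, scores)),
    ArrayType(
        StructType(
            [StructField("names", StringType()),
             StructField("score", StringType())]
        )
    ),
)

# Arguments are passed in the same order as the struct fields (names, then score).
# The chain is wrapped in parentheses so it can span several lines.
df2 = (
    df.withColumn("new", combine("names", "score"))
      .withColumn("new", F.explode("new"))
      .select("id",
              F.col("new.names").alias("names"),
              F.col("new.score").alias("score"))
)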

I'm getting an error:

TypeError: zip argument #1 must support iteration

I also tried exploding using rdd flatMap() and I still get the same error.
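
For context, this kind of TypeError is what Python's zip() raises when one of its arguments is None, which would happen inside the UDF if some row holds a null array. That cause is an assumption here, since the full data is not shown, and the exact wording of the message differs between Python versions. A minimal plain-Python sketch of the failure and of a null-safe pairing function (the helper name zip_pairs is hypothetical):

```python
def zip_pairs(names, scores):
    """Pair names with scores; return [] when either array is missing (None)."""
    if names is None or scores is None:
        return []
    return list(zip(names, scores))


# zip() fails as soon as one argument is None, mirroring a null array column.
try:
    list(zip(None, ["00123"]))
    error_type = None
except TypeError as exc:
    error_type = type(exc).__name__

# The guarded helper pairs values normally and tolerates a missing array.
pairs = zip_pairs(["F1", "F2"], ["00123", "000.001"])
empty = zip_pairs(None, ["00123"])
```

A function like zip_pairs could be wrapped in the UDF in place of the bare lambda; exploding an empty array then simply produces no rows for that id.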

Is there an alternate way to achieve this?

Thanks in advance.

Pygirl

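Try this in pandas (a Spark DataFrame would first need to be converted, for example with df.toPandas()):

import pandas as pd

# Explode both list columns row by row, then pivot names into columns.
df2 = df.set_index('id').apply(pd.Series.explode).reset_index()
df3 = df2.pivot(columns='names', values='score', index='id')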

df3:

names   F1       F2         F3      F4      F5  F6
id                      
ab01    00123   000.001     00127   00.0123 111 NaN
ab02    00124   000.003     00156   00.067  156 254
ab03    00234   000.078     00188   00.0144 188 NaN
ab04    00345   000.01112   001567  00.0186 555 NaN

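Edit: alternatively, map each row to a dict of name -> score and expand the dicts into columns. Either variant below works.

# Variant 1: join id back on; id ends up as the last column.
x = df.apply(lambda row: dict(zip(row['names'], row['score'])), axis=1)
y = pd.DataFrame(x.values.tolist(), index=x.index).fillna("null").join(df.id)

# Variant 2: same data via concat; id ends up as the first column.
x = df.apply(lambda row: dict(zip(row['names'], row['score'])), axis=1)
z = pd.DataFrame(x.values.tolist(), index=x.index).fillna("null")
y = pd.concat([df.id, z], axis=1)

Output of the first variant:
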
    F1      F2         F3       F4      F5  F6      id
0   00123   000.001    00127    00.0123 111 null    ab01
1   00124   000.003    00156    00.067  156 254     ab02
2   00234   000.078    00188    00.0144 188 null    ab03
3   00345   000.01112  001567   00.0186 555 null    ab04


Edited at 2021-01-26
