Pyspark: How to return a list of tuples of existing non-null columns as one of the column values in a dataframe

Mia21

I'm working with the following PySpark dataframe:

+----+----+---+---+---+----+
|   a|   b|  c|  d|  e|   f|
+----+----+---+---+---+----+
|   2|12.3|  5|5.6|  6|44.7|
|null|null|  9|9.3| 19|23.5|
|   8| 4.3|  7|0.5| 21| 8.2|
|   9| 3.8|  3|6.5| 45| 4.9|
|   3| 8.7|  2|2.8| 32| 2.9|
+----+----+---+---+---+----+

To create the above dataframe:

rdd = sc.parallelize([(2, 12.3, 5, 5.6, 6, 44.7),
                      (None, None, 9, 9.3, 19, 23.5),
                      (8, 4.3, 7, 0.5, 21, 8.2),
                      (9, 3.8, 3, 6.5, 45, 4.9),
                      (3, 8.7, 2, 2.8, 32, 2.9)])
df = sqlContext.createDataFrame(rdd, ('a', 'b', 'c', 'd', 'e', 'f'))
df.show()

I want to create another column 'g' whose values are lists of tuples built from the existing non-null columns. Each list of tuples has the form:

((column a, column b),(column c, column d),(column e, column f))

Requirements for the output column: 1) only consider the non-null columns while creating the list of tuples; 2) return the list of tuples.

So the final dataframe with column 'g' would be:

+----+----+---+---+---+----+---------------------------+
|   a|   b|  c|  d|  e|   f|                          g|
+----+----+---+---+---+----+---------------------------+
|   2|12.3|  5|5.6|  6|44.7|[[2,12.3],[5,5.6],[6,44.7]]|
|null|null|  9|9.3| 19|23.5|        [[9,9.3],[19,23.5]]|
|   8| 4.3|  7|0.5| 21| 8.2| [[8,4.3],[7,0.5],[21,8.2]]|
|   9| 3.8|  3|6.5| 45| 4.9| [[9,3.8],[3,6.5],[45,4.9]]|
|   3| 8.7|  2|2.8| 32| 2.9| [[3,8.7],[2,2.8],[32,2.9]]|
+----+----+---+---+---+----+---------------------------+

In column "g", the second row tuple has only two pairs as opposed to three, because for second row, we omit column 'a' and 'b' values since they are nulls.

I'm not sure how to dynamically omit the null columns and form the list of tuples.
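To make the logic concrete, in plain Python the per-row transformation I'm after would be something like:

# Plain-Python illustration of the desired per-row logic (not Spark code)
row = (None, None, 9, 9.3, 19, 23.5)  # the second row above
pairs = [(row[i], row[i + 1]) for i in range(0, len(row), 2)]
g = [p for p in pairs if None not in p]
print(g)  # [(9, 9.3), (19, 23.5)]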

I tried to partially build the final column with a UDF:

from pyspark.sql.functions import array, udf

l1 = ['a', 'c', 'e']
l2 = ['b', 'd', 'f']

def func1(r1, r2):
    # r1/r2 hold a row's values for the columns listed in l1/l2
    l = []
    for i in range(len(l1)):
        l.append((r1[i], r2[i]))
    return l

func1_udf = udf(func1)  # no returnType given, so it defaults to StringType
df = df.withColumn('g', func1_udf(array(l1), array(l2)))
df.show()

I tried declaring the UDF with an ArrayType return type, but it did not work. Any help would be much appreciated. I'm working with PySpark 1.6. Thank you!

mayank agrawal

I think UDFs should work just fine. Note that udf defaults to a StringType return type when none is given, which is why your version produces a string; for a list of pairs, declare the return type as ArrayType(ArrayType(FloatType())):

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, FloatType

rdd = sc.parallelize([(2, 12.3, 5, 5.6, 6, 44.7),
                      (None, None, 9, 9.3, 19, 23.5),
                      (8, 4.3, 7, 0.5, 21, 8.2),
                      (9, 3.8, 3, 6.5, 45, 4.9),
                      (3, 8.7, 2, 2.8, 32, 2.9)])
df = sqlContext.createDataFrame(rdd, ('a', 'b', 'c', 'd', 'e', 'f'))

# cast every column to float so the nested arrays have a uniform element type
df = df.select(*(F.col(c).cast("float").alias(c) for c in df.columns))

def combine(a, b, c, d, e, f):
    # keep an [x, y] pair only if both values are non-null
    combine_ = []
    if None not in [a, b]:
        combine_.append([a, b])
    if None not in [c, d]:
        combine_.append([c, d])
    if None not in [e, f]:
        combine_.append([e, f])
    return combine_

combine_udf = F.udf(combine, ArrayType(ArrayType(FloatType())))
df = df.withColumn('combined', combine_udf(F.col('a'), F.col('b'), F.col('c'),
                                           F.col('d'), F.col('e'), F.col('f')))
df.show()
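If you want to avoid hard-coding the six column names, here is a more generic variant of the same UDF, just a sketch, reusing the a/b, c/d, e/f pairing from the question:

# Sketch: build the pairs dynamically instead of hard-coding six arguments.
# The column pairing below is an assumption taken from the question's l1/l2 lists.
pairs = [('a', 'b'), ('c', 'd'), ('e', 'f')]

def combine_pairs(*vals):
    # vals arrives flattened as (a, b, c, d, e, f); walk it two at a time
    out = []
    for i in range(0, len(vals), 2):
        if vals[i] is not None and vals[i + 1] is not None:
            out.append([vals[i], vals[i + 1]])
    return out

combine_pairs_udf = F.udf(combine_pairs, ArrayType(ArrayType(FloatType())))
flat_cols = [F.col(c) for pair in pairs for c in pair]
df = df.withColumn('combined', combine_pairs_udf(*flat_cols))
df.show(truncate=False)

This keeps the null check in one place and works for any list of paired columns, at the cost of a slightly less readable UDF signature.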

