How to concatenate the strings of one column of a DataFrame to form another column that holds the incremental value of the original column

Chaitanya Kirty

I have a DataFrame whose data I am pasting below:

+---------------+--------------+----------+------------+----------+
|name           |      DateTime|       Seq|sessionCount|row_number|
+---------------+--------------+----------+------------+----------+
|            abc| 1521572913344|        17|           5|         1|
|            xyz| 1521572916109|        17|           5|         2|
|           rafa| 1521572916118|        17|           5|         3|
|             {}| 1521572916129|        17|           5|         4|
|     experience| 1521572917816|        17|           5|         5|
+---------------+--------------+----------+------------+----------+
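
For reference, here is a minimal snippet that reproduces this DataFrame (the column types are assumed; DateTime is an epoch-millisecond long):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data matching the table above; types are assumed
data = [
    ("abc", 1521572913344, 17, 5, 1),
    ("xyz", 1521572916109, 17, 5, 2),
    ("rafa", 1521572916118, 17, 5, 3),
    ("{}", 1521572916129, 17, 5, 4),
    ("experience", 1521572917816, 17, 5, 5),
]
df = spark.createDataFrame(data, ["name", "DateTime", "Seq", "sessionCount", "row_number"])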

The column 'name' is of type string. I want a new column "effective_name" that contains the incremental values of "name", as shown below:

+---------------+--------------+----------+------------+----------+-------------------------+
|name           |      DateTime|       Seq|sessionCount|row_number|effective_name           |
+---------------+--------------+----------+------------+----------+-------------------------+
|            abc| 1521572913344|        17|           5|         1|abc                      |
|            xyz| 1521572916109|        17|           5|         2|abcxyz                   |
|           rafa| 1521572916118|        17|           5|         3|abcxyzrafa               |
|             {}| 1521572916129|        17|           5|         4|abcxyzrafa{}             |
|     experience| 1521572917816|        17|           5|         5|abcxyzrafa{}experience   |
+---------------+--------------+----------+------------+----------+-------------------------+

The new column should contain the running concatenation of all previous values of the name column, up to and including the current row.

pault

You can achieve this with a pyspark.sql.Window ordered by DateTime, combined with pyspark.sql.functions.concat_ws and pyspark.sql.functions.collect_list:

import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.orderBy("DateTime")  # define Window for ordering

df.drop("Seq", "sessionCount", "row_number").select(
    "*",
    f.concat_ws(
        "",
        f.collect_list(f.col("name")).over(w)
    ).alias("effective_name")
).show(truncate=False)
#+---------------+--------------+-------------------------+
#|name           |      DateTime|effective_name           |
#+---------------+--------------+-------------------------+
#|abc            |1521572913344 |abc                      |
#|xyz            |1521572916109 |abcxyz                   |
#|rafa           |1521572916118 |abcxyzrafa               |
#|{}             |1521572916129 |abcxyzrafa{}             |
#|experience     |1521572917816 |abcxyzrafa{}experience   |
#+---------------+--------------+-------------------------+

I dropped "Seq", "sessionCount", and "row_number" to keep the displayed output compact.

If you need to do this per group, you can add a partitionBy to the Window. Say you want to group by Seq; then you can do the following:

w = Window.partitionBy("Seq").orderBy("DateTime")

df.drop("sessionCount", "row_number").select(
    "*",
    f.concat_ws(
        "",
        f.collect_list(f.col("name")).over(w)
    ).alias("effective_name")
).show(truncate=False)
#+---------------+--------------+----------+-------------------------+
#|name           |      DateTime|Seq       |effective_name           |
#+---------------+--------------+----------+-------------------------+
#|abc            |1521572913344 |17        |abc                      |
#|xyz            |1521572916109 |17        |abcxyz                   |
#|rafa           |1521572916118 |17        |abcxyzrafa               |
#|{}             |1521572916129 |17        |abcxyzrafa{}             |
#|experience     |1521572917816 |17        |abcxyzrafa{}experience   |
#+---------------+--------------+----------+-------------------------+

If you prefer to use withColumn, the above is equivalent to:

df.drop("sessionCount", "row_number").withColumn(
    "effective_name",
    f.concat_ws(
        "",
        f.collect_list(f.col("name")).over(w)
    )
).show(truncate=False)
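
The same window aggregation can also be expressed in Spark SQL if you prefer that syntax. Here is a sketch, assuming the DataFrame is registered as a temporary view named df and spark is the active SparkSession:

df.createOrReplaceTempView("df")

spark.sql("""
    SELECT name, DateTime, Seq,
           concat_ws('', collected) AS effective_name
    FROM (
        -- collect the running list of names per Seq, ordered by DateTime
        SELECT *,
               collect_list(name) OVER (PARTITION BY Seq ORDER BY DateTime) AS collected
        FROM df
    ) t
""").show(truncate=False)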

Explanation

You want to apply a function over multiple rows, which is called an aggregation. With any aggregation, you need to define which rows to aggregate over and, for a running aggregation like this one, the order in which they are processed. We do this using a Window. In this case, w = Window.partitionBy("Seq").orderBy("DateTime") partitions the data by Seq and orders it by DateTime within each partition.

We first apply the aggregate function collect_list("name") over the window. Because the window has an orderBy, the default frame runs from the start of the partition up to the current row, so each row collects the name values of all rows up to and including itself, in DateTime order.

For example, the intermediate output of this step would be:

df.select(
    f.collect_list("name").over(w).alias("collected")
).show(truncate=False)
#+--------------------------------+
#|collected                       |
#+--------------------------------+
#|[abc]                           |
#|[abc, xyz]                      |
#|[abc, xyz, rafa]                |
#|[abc, xyz, rafa, {}]            |
#|[abc, xyz, rafa, {}, experience]|
#+--------------------------------+

Now that the appropriate values are in the list, we can concatenate them together with an empty string as the separator.

df.select(
    f.concat_ws(
        "",
        f.collect_list("name").over(w)
    ).alias("concatenated")
).show(truncate=False)
#+-----------------------+
#|concatenated           |
#+-----------------------+
#|abc                    |
#|abcxyz                 |
#|abcxyzrafa             |
#|abcxyzrafa{}           |
#|abcxyzrafa{}experience |
#+-----------------------+
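
As noted above, the list grows row by row because of the default window frame: when a Window has an orderBy but no explicit frame, Spark uses RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. If you want to make the frame explicit (for example, to be precise about how rows with duplicate DateTime values are handled), you can spell it out with rowsBetween. A sketch:

import pyspark.sql.functions as f
from pyspark.sql import Window

# explicit frame: everything from the start of the partition up to the current row
w = (
    Window.partitionBy("Seq")
    .orderBy("DateTime")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

df.withColumn(
    "effective_name",
    f.concat_ws("", f.collect_list("name").over(w))
).show(truncate=False)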
