How to add the strings of one column of a DataFrame to form another column that holds the incremental value of the original column

Chaitanya Kirty

I have a DataFrame whose data I am pasting below:

+---------------+--------------+----------+------------+----------+
|name           |      DateTime|       Seq|sessionCount|row_number|
+---------------+--------------+----------+------------+----------+
|            abc| 1521572913344|        17|           5|         1|
|            xyz| 1521572916109|        17|           5|         2|
|           rafa| 1521572916118|        17|           5|         3|
|             {}| 1521572916129|        17|           5|         4|
|     experience| 1521572917816|        17|           5|         5|
+---------------+--------------+----------+------------+----------+

The column 'name' is of type string. I want a new column "effective_name" that contains the incremental values of "name", as shown below:

+---------------+--------------+----------+------------+----------+-------------------------+
|name           |DateTime      |Seq       |sessionCount|row_number|effective_name           |
+---------------+--------------+----------+------------+----------+-------------------------+
|abc            |1521572913344 |17        |5           |1         |abc                      |
|xyz            |1521572916109 |17        |5           |2         |abcxyz                   |
|rafa           |1521572916118 |17        |5           |3         |abcxyzrafa               |
|{}             |1521572916129 |17        |5           |4         |abcxyzrafa{}             |
|experience     |1521572917816 |17        |5           |5         |abcxyzrafa{}experience   |
+---------------+--------------+----------+------------+----------+-------------------------+

The new column contains the incremental concatenation of the name column: each row holds its own name appended to the names of all previous rows.

pault

You can achieve this by using a pyspark.sql.Window ordered by DateTime, together with pyspark.sql.functions.concat_ws and pyspark.sql.functions.collect_list:

import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.orderBy("DateTime")  # define Window for ordering

df.drop("Seq", "sessionCount", "row_number").select(
    "*",
    f.concat_ws(
        "",
        f.collect_list(f.col("name")).over(w)
    ).alias("effective_name")
).show(truncate=False)
#+---------------+--------------+-------------------------+
#|name           |      DateTime|effective_name           |
#+---------------+--------------+-------------------------+
#|abc            |1521572913344 |abc                      |
#|xyz            |1521572916109 |abcxyz                   |
#|rafa           |1521572916118 |abcxyzrafa               |
#|{}             |1521572916129 |abcxyzrafa{}             |
#|experience     |1521572917816 |abcxyzrafa{}experience   |
#+---------------+--------------+-------------------------+

I dropped "Seq", "sessionCount", "row_number" to make the output display friendlier.

If you need to do this per group, you can add a partitionBy to the Window. For example, to group by Seq, you can do the following:

w = Window.partitionBy("Seq").orderBy("DateTime")

df.drop("sessionCount", "row_number").select(
    "*",
    f.concat_ws(
        "",
        f.collect_list(f.col("name")).over(w)
    ).alias("effective_name")
).show(truncate=False)
#+---------------+--------------+----------+-------------------------+
#|name           |      DateTime|Seq       |effective_name           |
#+---------------+--------------+----------+-------------------------+
#|abc            |1521572913344 |17        |abc                      |
#|xyz            |1521572916109 |17        |abcxyz                   |
#|rafa           |1521572916118 |17        |abcxyzrafa               |
#|{}             |1521572916129 |17        |abcxyzrafa{}             |
#|experience     |1521572917816 |17        |abcxyzrafa{}experience   |
#+---------------+--------------+----------+-------------------------+

If you prefer to use withColumn, the above is equivalent to:

df.drop("sessionCount", "row_number").withColumn(
    "effective_name",
    f.concat_ws(
        "",
        f.collect_list(f.col("name")).over(w)
    )
).show(truncate=False)

Explanation

You want to apply a function over multiple rows, which is called an aggregation. Because every row needs its own running result, you also need to define which rows to aggregate over and in what order. We do this using a Window. In this case, w = Window.partitionBy("Seq").orderBy("DateTime") partitions the data by Seq and sorts each partition by DateTime.
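
To make the partition-and-order idea concrete, here is a minimal plain-Python model of what such a window does. It is a sketch of the semantics, not Spark code: the running_concat helper is a made-up name, and the rows with Seq 18 are invented to show that each partition starts from scratch.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Hypothetical sample rows: (name, DateTime, Seq).
# The Seq 18 rows are invented for illustration only.
rows = [
    ("xyz", 1521572916109, 17),
    ("abc", 1521572913344, 17),
    ("rafa", 1521572916118, 17),
    ("foo", 1521572920000, 18),
    ("bar", 1521572921000, 18),
]


def running_concat(rows):
    """Model partitionBy("Seq").orderBy("DateTime") with a running concat."""
    result = []
    # Sort by (Seq, DateTime) so each partition is contiguous and ordered.
    ordered = sorted(rows, key=itemgetter(2, 1))
    for seq, group in groupby(ordered, key=itemgetter(2)):
        group = list(group)
        # accumulate yields "abc", "abcxyz", ... within this partition only.
        prefixes = accumulate(name for name, _, _ in group)
        for (name, ts, _), effective in zip(group, prefixes):
            result.append((name, ts, seq, effective))
    return result


for row in running_concat(rows):
    print(row)
```

The concatenation restarts at "foo" for Seq 18, which mirrors how partitionBy keeps groups independent of each other.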

We first apply the aggregate function collect_list("name") over the window. For each row, this gathers the values of the name column from the start of the partition up to the current row and puts them in a list. The order of insertion is defined by the Window's order.

For example, the intermediate output of this step would be:

df.select(
    f.collect_list("name").over(w).alias("collected")
).show()
#+--------------------------------+
#|collected                       |
#+--------------------------------+
#|[abc]                           |
#|[abc, xyz]                      |
#|[abc, xyz, rafa]                |
#|[abc, xyz, rafa, {}]            |
#|[abc, xyz, rafa, {}, experience]|
#+--------------------------------+
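
The list grows row by row because a window that has an ordering but no explicit frame uses, by default, a range frame from the start of the partition to the current row. One consequence is that rows sharing the same DateTime are included in each other's lists. The plain-Python sketch below models that behavior next to a row-based frame; the helper names and the duplicated timestamp are invented for illustration.

```python
# Hypothetical rows, already sorted by DateTime.
# "xyz" and "rafa" deliberately share a timestamp.
rows = [("abc", 100), ("xyz", 200), ("rafa", 200), ("{}", 300)]


def range_frame_lists(rows):
    """Each row collects every name whose DateTime is <= its own (ties included)."""
    return [[n for n, ts in rows if ts <= cur_ts] for _, cur_ts in rows]


def rows_frame_lists(rows):
    """Each row collects the names up to and including its own position."""
    return [[n for n, _ in rows[: i + 1]] for i in range(len(rows))]


print(range_frame_lists(rows))
print(rows_frame_lists(rows))
```

In the question's data every DateTime is distinct, so both models give the same lists. If ties are possible, it may be worth ordering by an additional tie-breaking column, since the relative order of tied rows is otherwise not guaranteed.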

Now that the appropriate values are in the list, we can concatenate them together with an empty string as the separator.

df.select(
    f.concat_ws(
        "",
        f.collect_list("name").over(w)
    ).alias("concatenated")
).show()
#+-----------------------+
#|concatenated           |
#+-----------------------+
#|abc                    |
#|abcxyz                 |
#|abcxyzrafa             |
#|abcxyzrafa{}           |
#|abcxyzrafa{}experience |
#+-----------------------+


Edited at 2020-11-17
