Add a string to a column only when the value matches a condition in PySpark

Mikesama

I have a date column stored as strings in a PySpark DataFrame, in the format below.

| Date |
| -------- |
| 3/28/2023|

and want my output to be:

| Date |
| -------- |
| 2023-03-28|

I replaced the '/' separator with '-' and converted the string to a date:

from pyspark.sql import functions as F 
df = df.withColumn("Date", F.regexp_replace('Date', '/', '-'))\
        .withColumn("Date", F.date_format(F.to_date(F.col("Date"),"MM-dd-yyyy"), 'yyyy-MM-dd'))

But I got the error below:

SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '3-28-2023' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
Caused by: DateTimeParseException: Text '3-28-2023' could not be parsed at index 0

To avoid that, I added the line below, which prepends '0' to the string, and the code executed successfully.

df = df.withColumn("Date", F.regexp_replace('Date', '/', '-'))\
       .withColumn('Date', F.concat(F.lit("0"), F.col('Date')))\
       .withColumn("Date", F.date_format(F.to_date(F.col("Date"),"MM-dd-yyyy"), 'yyyy-MM-dd'))

Now I want to add this zero conditionally, only when a single digit precedes the first '-' in the string. I am stuck because I am not sure whether a regular expression can check how many digits appear before a character. Please help.

notNull

With the legacy time parser policy enabled, Spark handles the missing leading zero itself when the month is specified as MM, so no manual padding is needed.

Example:

#sample data
#+----------+
#|      Date|
#+----------+
#| 3/28/2023|
#|04/28/2023|
#+----------+
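#set the legacy time parser policy so that 'MM' also accepts a single-digit month
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
#parse the original 'MM/dd/yyyy' strings directly, then format as 'yyyy-MM-dd'
df.withColumn("Date",F.date_format(F.to_date(F.col("Date"),"MM/dd/yyyy"), 'yyyy-MM-dd')).show(10,False)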

#+----------+
#|Date      |
#+----------+
#|2023-03-28|
#|2023-04-28|
#+----------+
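
If you would rather keep the default parser and pad the month conditionally, as the question asks, a regular expression anchored at the start of the string can do it. Below is a minimal sketch of that idea in plain Python so it runs without a Spark session; the helper names `pad_month` and `to_iso` are hypothetical and exist only for illustration. The trailing comment shows what the equivalent PySpark call would presumably look like, assuming the '/' has already been replaced with '-' as in the question.

```python
import re
from datetime import datetime

# Matches exactly one digit at the start of the string, immediately followed by '-'.
# Two-digit months such as "04-..." or "12-..." do not match and are left untouched.
SINGLE_DIGIT_MONTH = re.compile(r"^(\d)-")


def pad_month(date_str: str) -> str:
    """Prefix '0' only when a single digit precedes the first '-'."""
    return SINGLE_DIGIT_MONTH.sub(r"0\1-", date_str)


def to_iso(date_str: str) -> str:
    """Convert 'M/dd/yyyy' or 'MM/dd/yyyy' to 'yyyy-MM-dd'."""
    padded = pad_month(date_str.replace("/", "-"))
    return datetime.strptime(padded, "%m-%d-%Y").strftime("%Y-%m-%d")


# Assumed PySpark equivalent of pad_month (Java regex, "$1" is the back-reference):
#   df.withColumn("Date", F.regexp_replace("Date", r"^(\d)-", "0$1-"))

if __name__ == "__main__":
    for s in ["3/28/2023", "04/28/2023", "12/28/2023"]:
        print(s, "->", to_iso(s))
```

Another option that may avoid both the padding and the legacy setting is a single-letter month pattern such as "M/d/yyyy" in `F.to_date`; as far as I know the Spark 3 parser accepts one or two digits for it, but check that against your Spark version.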
