I have a date column stored as a string in a PySpark DataFrame, in the following format:
| Date |
| -------- |
| 3/28/2023|
and I want the output to be:
| Date |
| -------- |
| 2023-03-28|
I replaced the '/' symbol with '-' and then converted the string to date format:
from pyspark.sql import functions as F
df = df.withColumn("Date", F.regexp_replace('Date', '/', '-'))\
.withColumn("Date", F.date_format(F.to_date(F.col("Date"),"MM-dd-yyyy"), 'yyyy-MM-dd'))
But I got the following error:
SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '3-28-2023' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
Caused by: DateTimeParseException: Text '3-28-2023' could not be parsed at index 0
To avoid that, I added a line that prepends '0' to the string, and it executed successfully:
df = df.withColumn("Date", F.regexp_replace('Date', '/', '-'))\
.withColumn('Date', F.concat(F.lit("0"), F.col('Date')))\
.withColumn("Date", F.date_format(F.to_date(F.col("Date"),"MM-dd-yyyy"), 'yyyy-MM-dd'))
Now I want to add this zero conditionally, only when a single digit appears before the first '-' in the string. I got stuck here because I am not sure whether a regex can count how many characters occur before a given character. Please help.
Spark handles the missing leading zeros automatically when you specify MM as the month, provided the legacy time parser policy is set.
Example:
#sample data
#+----------+
#| Date|
#+----------+
#| 3/28/2023|
#|04/28/2023|
#+----------+
#set the legacy time parser policy
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df.withColumn("Date", F.date_format(F.to_date(F.col("Date"), "MM/dd/yyyy"), 'yyyy-MM-dd')).show(10, False)
#+----------+
#|Date |
#+----------+
#|2023-03-28|
#|2023-04-28|
#+----------+