I have a date column stored as a string in a PySpark DataFrame, in the following format:
| Date |
| -------- |
| 3/28/2023|
and I want the output to be:
| Date |
| -------- |
| 2023-03-28|
I replaced the '/' symbol with '-' and then converted the string to date format:
from pyspark.sql import functions as F
df = df.withColumn("Date", F.regexp_replace('Date', '/', '-'))\
.withColumn("Date", F.date_format(F.to_date(F.col("Date"),"MM-dd-yyyy"), 'yyyy-MM-dd'))
But I got the following error:
SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '3-28-2023' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
Caused by: DateTimeParseException: Text '3-28-2023' could not be parsed at index 0
To avoid that, I added a line that prepends '0' to the string, and it executed successfully:
df = df.withColumn("Date", F.regexp_replace('Date', '/', '-'))\
.withColumn('Date', F.concat(F.lit("0"), F.col('Date')))\
.withColumn("Date", F.date_format(F.to_date(F.col("Date"),"MM-dd-yyyy"), 'yyyy-MM-dd'))
Now I want to add this zero conditionally, only when a single digit appears before the first '-' in the string. I got stuck here because I am not sure whether a regex can count how many characters occur before a given character. Please help.
Spark handles the missing leading zeros automatically when you specify MM as the month, provided the legacy time parser policy is set.
Example:
#sample data
#+----------+
#| Date|
#+----------+
#| 3/28/2023|
#|04/28/2023|
#+----------+
#set the legacy time parser policy
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df.withColumn("Date", F.date_format(F.to_date(F.col("Date"), "MM/dd/yyyy"), 'yyyy-MM-dd')).show(10, False)
#+----------+
#|Date |
#+----------+
#|2023-03-28|
#|2023-04-28|
#+----------+