In PySpark, you can do almost all the date operations you can think of using built-in functions. Let's jump straight into the examples and look at them one by one.
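The examples below assume a SparkSession is available as spark (as it is in the pyspark shell) and that the date functions have been imported from pyspark.sql.functions. Something like:

>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.functions import (date_format, current_date, date_add,
...     add_months, datediff, year, month, dayofmonth, date_sub, dayofyear,
...     last_day, months_between, round, next_day, quarter, trunc, weekofyear)
>>> spark = SparkSession.builder.getOrCreate()

Note that importing round this way shadows Python's built-in round with the Spark column version, which is the one used later in this post.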
Create a dataframe with sample date values:
>>> df_1 = spark.createDataFrame([('2019-02-20','2019-10-18',)],['start_dt','end_dt'])
Check the dataframe schema:
>>> df_1
DataFrame[start_dt: string, end_dt: string]
The problem here is that columns start_dt & end_dt are of type string, not date. So let's quickly convert them to date.
>>> df_2 = df_1.select(df_1.start_dt.cast('date'),df_1.end_dt.cast('date'))
>>> df_2
DataFrame[start_dt: date, end_dt: date]
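A plain cast only works here because the strings are already in the default yyyy-MM-dd format; for any other format the cast returns null. In that case you can use to_date with an explicit pattern instead. A quick sketch, using a hypothetical dd-MM-yyyy input:

>>> from pyspark.sql.functions import to_date
>>> df_alt = spark.createDataFrame([('20-02-2019',)],['start_dt'])
>>> df_alt.select(to_date("start_dt",'dd-MM-yyyy').alias("start_dt")).show()
+----------+
|  start_dt|
+----------+
|2019-02-20|
+----------+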
Now we are good: we have a dataframe with 2 columns, start_dt & end_dt, both of datatype 'date'. Let's do some date operations on it.
PySpark Change Date Format
>>> df_2.select("start_dt","end_dt",date_format("start_dt",'dd/MM/yyyy').alias("dt_format")).show() +----------+----------+----------+ | start_dt| end_dt| dt_format| +----------+----------+----------+ |2019-02-20|2019-10-18|20/02/2019| +----------+----------+----------+
PySpark Fetch Current Date
>>> df_2.select("start_dt","end_dt",current_date().alias("cur_dt")).show() +----------+----------+----------+ | start_dt| end_dt| cur_dt| +----------+----------+----------+ |2019-02-20|2019-10-18|2019-03-17| +----------+----------+----------+
PySpark Add Days to date
>>> df_2.select("start_dt","end_dt",date_add("start_dt",2).alias("add_2_dt")).show() +----------+----------+----------+ | start_dt| end_dt| add_2_dt| +----------+----------+----------+ |2019-02-20|2019-10-18|2019-02-22| +----------+----------+----------+
Here we have added 2 days to start_dt. Just change the number to however many days you want to add.
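If the number of days to add lives in another column rather than being a constant, one option is to fall back to the SQL expression form via expr. A minimal sketch, using a hypothetical n_days column:

>>> from pyspark.sql.functions import expr, lit
>>> df_2.withColumn("n_days", lit(2)).select("start_dt",expr("date_add(start_dt, n_days)").alias("new_dt")).show()
+----------+----------+
|  start_dt|    new_dt|
+----------+----------+
|2019-02-20|2019-02-22|
+----------+----------+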
PySpark Add Months to date
>>> df_2.select("start_dt","end_dt",add_months("start_dt",2).alias("add_2_mth")).show() +----------+----------+----------+ | start_dt| end_dt| add_2_mth| +----------+----------+----------+ |2019-02-20|2019-10-18|2019-04-20| +----------+----------+----------+
Here we have added 2 months to start_dt. Just change the number to however many months you want to add.
PySpark Add Years to date
>>> df_2.select("start_dt","end_dt",add_months("start_dt",2*12).alias("add_2_yrs")).show() +----------+----------+----------+ | start_dt| end_dt| add_2_yrs| +----------+----------+----------+ |2019-02-20|2019-10-18|2021-02-20| +----------+----------+----------+
Here we have added 24 (2*12) months to start_dt, since there is no separate add-years function; just multiply the number of years you want to add by 12.
PySpark Subtract 2 dates
>>> df_2.select("start_dt","end_dt",datediff("end_dt","start_dt").alias("sub_2_dt")).show() +----------+----------+--------+ | start_dt| end_dt|sub_2_dt| +----------+----------+--------+ |2019-02-20|2019-10-18| 240| +----------+----------+--------+
datediff returns the number of days between 2 dates.
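Note the argument order: it is datediff(end, start). Swapping the arguments gives a negative result:

>>> df_2.select(datediff("start_dt","end_dt").alias("diff_rev")).show()
+--------+
|diff_rev|
+--------+
|    -240|
+--------+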
PySpark Extract Year from Date
>>> df_2.select("start_dt","end_dt",year("start_dt").alias("ext_year")).show() +----------+----------+--------+ | start_dt| end_dt|ext_year| +----------+----------+--------+ |2019-02-20|2019-10-18| 2019| +----------+----------+--------+
PySpark Extract Month from Date
>>> df_2.select("start_dt","end_dt",month("start_dt").alias("ext_month")).show() +----------+----------+--------+ | start_dt| end_dt|ext_month| +----------+----------+--------+ |2019-02-20|2019-10-18| 2| +----------+----------+--------+
PySpark Extract Day from Date
>>> df_2.select("start_dt","end_dt",dayofmonth("start_dt").alias("ext_day")).show() +----------+----------+-------+ | start_dt| end_dt|ext_day| +----------+----------+-------+ |2019-02-20|2019-10-18| 20| +----------+----------+-------+
PySpark Subtract days from date
>>> df_2.select("start_dt","end_dt",date_sub("start_dt",2).alias("sub_2_dt")).show() +----------+----------+----------+ | start_dt| end_dt| sub_2_dt| +----------+----------+----------+ |2019-02-20|2019-10-18|2019-02-18| +----------+----------+----------+
Here we have subtracted 2 days from start_dt. Equivalently, date_add with a negative number gives the same result.
PySpark Fetch Day of the year
>>> df_2.select("start_dt","end_dt",dayofyear("start_dt").alias("day_of_year")).show() +----------+----------+-----------+ | start_dt| end_dt|day_of_year| +----------+----------+-----------+ |2019-02-20|2019-10-18| 51| +----------+----------+-----------+
This is useful if you want the “Julian date” (the ordinal day of the year).
PySpark Last Day of Month
>>> df_2.select("start_dt","end_dt",last_day("start_dt").alias("last_day")).show() +----------+----------+----------+ | start_dt| end_dt| last_day| +----------+----------+----------+ |2019-02-20|2019-10-18|2019-02-28| +----------+----------+----------+
PySpark Determine how many months between 2 Dates
>>> df_2.select("start_dt","end_dt",months_between("end_dt","start_dt").alias("mth_btw")).show() +----------+----------+----------+ | start_dt| end_dt| mth_btw| +----------+----------+----------+ |2019-02-20|2019-10-18|7.93548387| +----------+----------+----------+
If you want months between to be a whole number only, you can use the transformation below:
>>> df_2.select("start_dt","end_dt",round(months_between("end_dt","start_dt")).cast('int').alias("mth_btw")).show() +----------+----------+-------+ | start_dt| end_dt|mth_btw| +----------+----------+-------+ |2019-02-20|2019-10-18| 8| +----------+----------+-------+
PySpark Identify date of next Monday
>>> df_2.select("start_dt","end_dt",next_day("start_dt","Mon").alias("nxt_Mon")).show() +----------+----------+----------+ | start_dt| end_dt| nxt_Mon| +----------+----------+----------+ |2019-02-20|2019-10-18|2019-02-25| +----------+----------+----------+
Identify date of next Tuesday
>>> df_2.select("start_dt","end_dt",next_day("start_dt","Tue").alias("nxt_Tue")).show() +----------+----------+----------+ | start_dt| end_dt| nxt_Tue| +----------+----------+----------+ |2019-02-20|2019-10-18|2019-02-26| +----------+----------+----------+
Identify date of next Wednesday
>>> df_2.select("start_dt","end_dt",next_day("start_dt","Wed").alias("nxt_Wed")).show() +----------+----------+----------+ | start_dt| end_dt| nxt_Wed| +----------+----------+----------+ |2019-02-20|2019-10-18|2019-02-27| +----------+----------+----------+
Identify date of next Thursday
>>> df_2.select("start_dt","end_dt",next_day("start_dt","Thu").alias("nxt_Thu")).show() +----------+----------+----------+ | start_dt| end_dt| nxt_Thu| +----------+----------+----------+ |2019-02-20|2019-10-18|2019-02-21| +----------+----------+----------+
Identify date of next Friday
>>> df_2.select("start_dt","end_dt",next_day("start_dt","Fri").alias("nxt_Fri")).show() +----------+----------+----------+ | start_dt| end_dt| nxt_Fri| +----------+----------+----------+ |2019-02-20|2019-10-18|2019-02-22| +----------+----------+----------+
Identify date of next Saturday
>>> df_2.select("start_dt","end_dt",next_day("start_dt","Sat").alias("nxt_Sat")).show() +----------+----------+----------+ | start_dt| end_dt| nxt_Sat| +----------+----------+----------+ |2019-02-20|2019-10-18|2019-02-23| +----------+----------+----------+
Identify date of next Sunday
>>> df_2.select("start_dt","end_dt",next_day("start_dt","Sun").alias("nxt_Sun")).show() +----------+----------+----------+ | start_dt| end_dt| nxt_Sun| +----------+----------+----------+ |2019-02-20|2019-10-18|2019-02-24| +----------+----------+----------+
PySpark Fetch quarter of the year
>>> df_2.select("start_dt","end_dt",quarter("start_dt").alias("Quarter_of_Year")).show() +----------+----------+---------------+ | start_dt| end_dt|Quarter_of_Year| +----------+----------+---------------+ |2019-02-20|2019-10-18| 1| +----------+----------+---------------+
PySpark Truncate Date to Year
>>> df_2.select("start_dt","end_dt",trunc("start_dt","year").alias("trunc_Year")).show() +----------+----------+----------+ | start_dt| end_dt|trunc_Year| +----------+----------+----------+ |2019-02-20|2019-10-18|2019-01-01| +----------+----------+----------+
PySpark Truncate Date to Month
>>> df_2.select("start_dt","end_dt",trunc("start_dt","month").alias("trunc_Month")).show() +----------+----------+-----------+ | start_dt| end_dt|trunc_Month| +----------+----------+-----------+ |2019-02-20|2019-10-18| 2019-02-01| +----------+----------+-----------+
PySpark Fetch week of the Year
>>> df_2.select("start_dt","end_dt",weekofyear("start_dt").alias("week_of_year")).show() +----------+----------+------------+ | start_dt| end_dt|week_of_year| +----------+----------+------------+ |2019-02-20|2019-10-18| 8| +----------+----------+------------+
If there is any other date operation you use that is missing here, let us know in the comments.