PySpark Date Functions

In PySpark, almost every date operation you can think of is covered by a built-in function. Let's jump straight into examples and walk through them one by one.

Create a DataFrame with sample date values (the date functions used below all live in pyspark.sql.functions, so import them first):

>>> from pyspark.sql.functions import (add_months, current_date, date_add,
...     date_format, date_sub, datediff, dayofmonth, dayofyear, last_day,
...     month, months_between, next_day, quarter, round, trunc, weekofyear, year)
>>> df_1 = spark.createDataFrame([('2019-02-20','2019-10-18',)],['start_dt','end_dt'])

Check the DataFrame schema:

>>> df_1
DataFrame[start_dt: string, end_dt: string]

The problem here is that the columns start_dt & end_dt are of type string, not date. So let's quickly convert them to date.

>>> df_2 = df_1.select(df_1.start_dt.cast('date'),df_1.end_dt.cast('date'))
>>> df_2
DataFrame[start_dt: date, end_dt: date]

Now we are good: we have a DataFrame with two columns, start_dt & end_dt, both of datatype ‘date’. Let's run some date operations on it.

PySpark Change Date Format

>>> df_2.select("start_dt","end_dt",date_format("start_dt",'dd/MM/yyyy').alias("dt_format")).show()
+----------+----------+----------+
|  start_dt|    end_dt| dt_format|
+----------+----------+----------+
|2019-02-20|2019-10-18|20/02/2019|
+----------+----------+----------+

PySpark Fetch Current Date

>>> df_2.select("start_dt","end_dt",current_date().alias("cur_dt")).show()
+----------+----------+----------+
|  start_dt|    end_dt|    cur_dt|
+----------+----------+----------+
|2019-02-20|2019-10-18|2019-03-17|
+----------+----------+----------+

PySpark Add Days to date

>>> df_2.select("start_dt","end_dt",date_add("start_dt",2).alias("add_2_dt")).show()
+----------+----------+----------+
|  start_dt|    end_dt|  add_2_dt|
+----------+----------+----------+
|2019-02-20|2019-10-18|2019-02-22|
+----------+----------+----------+

Here we have added 2 days to start_dt. Just replace the number with the number of days you want to add to any date.

PySpark Add Months to date

>>> df_2.select("start_dt","end_dt",add_months("start_dt",2).alias("add_2_mth")).show()
+----------+----------+----------+
|  start_dt|    end_dt| add_2_mth|
+----------+----------+----------+
|2019-02-20|2019-10-18|2019-04-20|
+----------+----------+----------+

Here we have added 2 months to start_dt. Just replace the number with the number of months you want to add to any date.

PySpark Add Years to date

>>> df_2.select("start_dt","end_dt",add_months("start_dt",2*12).alias("add_2_yrs")).show()
+----------+----------+----------+
|  start_dt|    end_dt| add_2_yrs|
+----------+----------+----------+
|2019-02-20|2019-10-18|2021-02-20|
+----------+----------+----------+

Here we have added 24 (2 × 12) months to start_dt. There is no dedicated add_years function, so just multiply the number of years you want to add by 12.

PySpark Subtract 2 dates

>>> df_2.select("start_dt","end_dt",datediff("end_dt","start_dt").alias("sub_2_dt")).show()
+----------+----------+--------+
|  start_dt|    end_dt|sub_2_dt|
+----------+----------+--------+
|2019-02-20|2019-10-18|     240|
+----------+----------+--------+

datediff returns the number of days between 2 dates.

PySpark Extract Year from Date

>>> df_2.select("start_dt","end_dt",year("start_dt").alias("ext_year")).show()
+----------+----------+--------+
|  start_dt|    end_dt|ext_year|
+----------+----------+--------+
|2019-02-20|2019-10-18|    2019|
+----------+----------+--------+

PySpark Extract Month from Date

>>> df_2.select("start_dt","end_dt",month("start_dt").alias("ext_month")).show()
+----------+----------+---------+
|  start_dt|    end_dt|ext_month|
+----------+----------+---------+
|2019-02-20|2019-10-18|        2|
+----------+----------+---------+

PySpark Extract Day from Date

>>> df_2.select("start_dt","end_dt",dayofmonth("start_dt").alias("ext_day")).show()
+----------+----------+-------+
|  start_dt|    end_dt|ext_day|
+----------+----------+-------+
|2019-02-20|2019-10-18|     20|
+----------+----------+-------+

PySpark Subtract days from date

>>> df_2.select("start_dt","end_dt",date_sub("start_dt",2).alias("sub_2_dt")).show()
+----------+----------+----------+
|  start_dt|    end_dt|  sub_2_dt|
+----------+----------+----------+
|2019-02-20|2019-10-18|2019-02-18|
+----------+----------+----------+

Here we have subtracted 2 days from start_dt.

PySpark Fetch Day of the year

>>> df_2.select("start_dt","end_dt",dayofyear("start_dt").alias("day_of_year")).show()
+----------+----------+-----------+
|  start_dt|    end_dt|day_of_year|
+----------+----------+-----------+
|2019-02-20|2019-10-18|         51|
+----------+----------+-----------+

This is useful if you want the ordinal (“Julian”) day of the year.

PySpark Last Day of Month

>>> df_2.select("start_dt","end_dt",last_day("start_dt").alias("last_day")).show()
+----------+----------+----------+
|  start_dt|    end_dt|  last_day|
+----------+----------+----------+
|2019-02-20|2019-10-18|2019-02-28|
+----------+----------+----------+

PySpark Determine how many months between 2 Dates

>>> df_2.select("start_dt","end_dt",months_between("end_dt","start_dt").alias("mth_btw")).show()
+----------+----------+----------+
|  start_dt|    end_dt|   mth_btw|
+----------+----------+----------+
|2019-02-20|2019-10-18|7.93548387|
+----------+----------+----------+

If you want the months-between value as a whole number, you can use the transformation below:

>>> df_2.select("start_dt","end_dt",round(months_between("end_dt","start_dt")).cast('int').alias("mth_btw")).show()
+----------+----------+-------+
|  start_dt|    end_dt|mth_btw|
+----------+----------+-------+
|2019-02-20|2019-10-18|      8|
+----------+----------+-------+

PySpark Identify date of next Monday

>>> df_2.select("start_dt","end_dt",next_day("start_dt","Mon").alias("nxt_Mon")).show()
+----------+----------+----------+
|  start_dt|    end_dt|   nxt_Mon|
+----------+----------+----------+
|2019-02-20|2019-10-18|2019-02-25|
+----------+----------+----------+

Identify date of next Tuesday

>>> df_2.select("start_dt","end_dt",next_day("start_dt","Tue").alias("nxt_Tue")).show()
+----------+----------+----------+
|  start_dt|    end_dt|   nxt_Tue|
+----------+----------+----------+
|2019-02-20|2019-10-18|2019-02-26|
+----------+----------+----------+

Identify date of next Wednesday

>>> df_2.select("start_dt","end_dt",next_day("start_dt","Wed").alias("nxt_Wed")).show()
+----------+----------+----------+
|  start_dt|    end_dt|   nxt_Wed|
+----------+----------+----------+
|2019-02-20|2019-10-18|2019-02-27|
+----------+----------+----------+

Identify date of next Thursday

>>> df_2.select("start_dt","end_dt",next_day("start_dt","Thu").alias("nxt_Thu")).show()
+----------+----------+----------+
|  start_dt|    end_dt|   nxt_Thu|
+----------+----------+----------+
|2019-02-20|2019-10-18|2019-02-21|
+----------+----------+----------+

Identify date of next Friday

>>> df_2.select("start_dt","end_dt",next_day("start_dt","Fri").alias("nxt_Fri")).show()
+----------+----------+----------+
|  start_dt|    end_dt|   nxt_Fri|
+----------+----------+----------+
|2019-02-20|2019-10-18|2019-02-22|
+----------+----------+----------+

Identify date of next Saturday

>>> df_2.select("start_dt","end_dt",next_day("start_dt","Sat").alias("nxt_Sat")).show()
+----------+----------+----------+
|  start_dt|    end_dt|   nxt_Sat|
+----------+----------+----------+
|2019-02-20|2019-10-18|2019-02-23|
+----------+----------+----------+

Identify date of next Sunday

>>> df_2.select("start_dt","end_dt",next_day("start_dt","Sun").alias("nxt_Sun")).show()
+----------+----------+----------+
|  start_dt|    end_dt|   nxt_Sun|
+----------+----------+----------+
|2019-02-20|2019-10-18|2019-02-24|
+----------+----------+----------+

PySpark Fetch quarter of the year

>>> df_2.select("start_dt","end_dt",quarter("start_dt").alias("Quarter_of_Year")).show()
+----------+----------+---------------+
|  start_dt|    end_dt|Quarter_of_Year|
+----------+----------+---------------+
|2019-02-20|2019-10-18|              1|
+----------+----------+---------------+

PySpark Truncate Date to Year

>>> df_2.select("start_dt","end_dt",trunc("start_dt","year").alias("trunc_Year")).show()
+----------+----------+----------+
|  start_dt|    end_dt|trunc_Year|
+----------+----------+----------+
|2019-02-20|2019-10-18|2019-01-01|
+----------+----------+----------+

PySpark Truncate Date to Month

>>> df_2.select("start_dt","end_dt",trunc("start_dt","month").alias("trunc_Month")).show()
+----------+----------+-----------+
|  start_dt|    end_dt|trunc_Month|
+----------+----------+-----------+
|2019-02-20|2019-10-18| 2019-02-01|
+----------+----------+-----------+

PySpark Fetch week of the Year

>>> df_2.select("start_dt","end_dt",weekofyear("start_dt").alias("week_of_year")).show()
+----------+----------+------------+
|  start_dt|    end_dt|week_of_year|
+----------+----------+------------+
|2019-02-20|2019-10-18|           8|
+----------+----------+------------+

If there is any other date operation you use that is missing here, let us know in the comments.
