Spark Dataframe NULL values

In this post, we will see how to Handle NULL values in any given dataframe. Many people confuse it with BLANK or empty string however there is a difference. NULL means unknown where BLANK is empty. Alright now let’s see what all operations are available in Spark Dataframe which can help us in handling NULL values.

Identifying NULL Values in Spark Dataframe
NULL values can be identified in multiple manner. If you know any column which can have NULL value then you can use “isNull” command

scala> df_pres.filter($"pres_out".isNull).show(false)
+-------+------------+----------+-------+--------+----------+--------+
|pres_id|pres_name   |pres_dob  |pres_bp|pres_bs |pres_in   |pres_out|
+-------+------------+----------+-------+--------+----------+--------+
|45     |Donald Trump|1946-06-14|Queens |New York|2017-01-20|null    |
+-------+------------+----------+-------+--------+----------+--------+

Other way of writing same command in more SQL like fashion:

scala> df_pres.filter("pres_out is null").show(false)
+-------+------------+----------+-------+--------+----------+--------+
|pres_id|pres_name   |pres_dob  |pres_bp|pres_bs |pres_in   |pres_out|
+-------+------------+----------+-------+--------+----------+--------+
|45     |Donald Trump|1946-06-14|Queens |New York|2017-01-20|null    |
+-------+------------+----------+-------+--------+----------+--------+

Once you know that rows in your Dataframe contains NULL values you may want to do following actions on it:

  1. Drop rows which has any column as NULL.This is default value.
  2. Drop rows which has all columns as NULL
  3. Drop rows which has any value as NULL for specific column
  4. Drop rows when all the specified column has NULL in it. Default value is any so “all” must be explicitly mention in DROP method with column list.
  5. Drop rows if it does not have “n” number of columns as NOT NULL
scala> df_pres.na.drop.show() //case 1:

scala> df_pres.na.drop("any").show()  // case 1:this is default value and is same as above

scala> df_pres.na.drop("all").show() //case 2

scala> df_pres.na.drop(Array("pres_out")).show() //case 3: pass array with column names for which NULL check is required.

scala> df_pres.na.drop(Seq("pres_out")).show() // case 3: pass Sequence of strings. May not work in PySpark

scala> df_pres.na.drop("all",Array("pres_out","pres_bs")).show() // case 4: When all the columns specified has NULL in it.

scala> df_pres.na.drop(7).show() //case 5: Will drop rows if row does not have 7 columns as NOT NULL

You can use different combination of options mentioned above in a single command. So this was all about identifying the records if row has NULL value in it. Next task could be to replace identified NULL value with other default value.

  1. Fill all the “numeric” columns with default value if NULL
  2. Fill all the “string” columns with default value if NULL
  3. Replace value in specific column with default value. If default value is not of datatype of column then it is ignored.
  4. Fill values for multiple columns with default values for each specific column.
scala> df_pres.na.fill(-1).show() // case 1

scala> df_pres.na.fill("Not Available").show() //case 2

scala> df_pres.na.fill(-1,Array("pres_id")).show() //case 3

scala> df_pres.na.fill(Map("pres_id" -> -1, "pres_out" -> "2999-12-31")).show() //case 4

So now you know how to identify NULL values in Dataframe and also how to replace or fill NULL values with default values.
If you have any confusion, feel free to leave a comment with query.

Leave a Reply

Your email address will not be published. Required fields are marked *