In this post, we will see how to handle NULL values in a Spark DataFrame. Many people confuse NULL with a BLANK or empty string, but there is a difference: NULL means the value is unknown, whereas BLANK is an empty string. Now let's see what operations are available on a Spark DataFrame to help us handle NULL values.
Identifying NULL Values in a Spark DataFrame
NULL values can be identified in multiple ways. If you know which column may contain NULL values, you can use the "isNull" method.
Another way of writing the same filter, in a more SQL-like fashion:
Once you know that rows in your DataFrame contain NULL values, you may want to take one of the following actions:
- Drop rows that have NULL in any column. This is the default behavior.
- Drop rows that have NULL in all columns.
- Drop rows that have NULL in a specific column.
- Drop rows only when all of the specified columns are NULL. Since the default mode is "any", "all" must be explicitly passed to the drop method along with the column list.
- Drop rows that do not have at least "n" columns that are NOT NULL.
You can combine the options mentioned above in a single command. That covers identifying rows that contain NULL values; the next task is to replace the identified NULL values with default values.
- Fill all the "numeric" columns with a default value if NULL.
- Fill all the "string" columns with a default value if NULL.
- Replace NULL in a specific column with a default value. If the default value does not match the column's data type, it is ignored.
- Fill multiple columns, with a separate default value for each column.
So now you know how to identify NULL values in a DataFrame, and also how to replace or fill them with default values.
If you have any questions, feel free to leave a comment with your query.