In this post, we will see how to handle NULL values in any given dataframe. Many people confuse them with BLANK or empty strings, however there is a difference: NULL means unknown, whereas BLANK is empty. Alright, now let's see what operations are available in a Spark Dataframe that can help us handle NULL values. Identifying NULL Values in a Spark Dataframe: NULL values can be identified in multiple ways. If you know any column which can have a NULL value, then you can use the "isNull" command. Another way of writing the same command in a more SQL-like fashion: Once you know that rows in your Dataframe contain… Read More →
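For illustration, here is a minimal spark-shell sketch of both styles of the NULL check described above; the dataframe and column names (df, name, age) are made up for this example and are not the post's actual data:

scala> val df = Seq(("Alice", Some(30)), ("Bob", None)).toDF("name", "age")
scala> df.filter($"age".isNull).show()        // DataFrame API: keep only rows where age is NULL
scala> df.filter("age is null").show()        // same filter written in a more SQL-like fashion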

In Spark, we can use the "explode" method to convert single column values into multiple rows. Recently I was working on a task to convert a Cobol VSAM file, which often has nested columns defined in it. In Spark, my requirement was to convert a single column value (an array of values) into multiple rows. So let's see an example to understand it better: create a sample dataframe with one column as an ARRAY, then run the explode function to split each value in col2 into a new row. So using the explode function, you can split one column into multiple rows. … Read More →
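A hedged sketch of what such an example might look like in spark-shell (the dataframe and column names here are illustrative, not the post's exact data):

scala> import org.apache.spark.sql.functions.explode
scala> val df = Seq((1, Array("a", "b", "c")), (2, Array("d", "e"))).toDF("col1", "col2")
scala> df.select($"col1", explode($"col2").as("col2_value")).show()
// each array element in col2 becomes its own row, paired with its col1 value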

In Spark Dataframe, the SHOW method is used to display Dataframe records in a readable tabular format. This method is used very often to check what the content inside a Dataframe looks like. Let's see it with an example. A few things to observe here: 1) By default, the SHOW function will return only 20 records. This is equivalent to the Sample/Top/Limit 20 we have in other SQL environments. 2) You can see that a string longer than 20 characters is truncated, like "William Henry Har…" in place of "William Henry Harrison". This is equivalent to width/colwidth etc. in a typical SQL environment. This is equivalent to the below syntax: We can… Read More →
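To make the equivalence concrete, a small sketch (df_pres is assumed to be the presidents dataframe used throughout these posts; the exact data is not shown here):

scala> df_pres.show()           // defaults: at most 20 rows, long strings truncated to 20 chars
scala> df_pres.show(20, true)   // the equivalent explicit syntax
scala> df_pres.show(50, false)  // more rows, and no truncation of long values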

Recently I was working on a task where I wanted the Spark Dataframe column list in a variable. This was required to do further processing depending on some technical columns present in the list. We know that you can print the schema of a Dataframe using the printSchema method. It will show a tree hierarchy of columns along with data types and other info. Example: To fetch column details, we can use "columns", which returns all the column names in the dataframe as an array of Strings. Example: The requirement was to get this info into a variable, so we can convert the Array of String to a String using "mkString"… Read More →
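A minimal sketch of the steps described above, again assuming the same df_pres dataframe:

scala> df_pres.printSchema()                       // tree view of columns, data types and nullability
scala> val colArray = df_pres.columns              // Array[String] of column names
scala> val colString = colArray.mkString(",")      // single comma-separated String held in a variable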

The UNION method is used to MERGE data from 2 dataframes into one. The dataframes must have identical schemas. If you are from a SQL background then please be very cautious while using the UNION operator on Spark dataframes. Unlike a typical RDBMS, UNION in Spark does not remove duplicates from the resultant dataframe; it simply MERGEs the data without removing any duplicates. UNION ALL is deprecated and it is recommended to use UNION only. Let's see one example to understand it better. I have 2 dataframes with 5 & 10 records respectively, with the first 5 common to both dataframes.

scala> df_pres1.show(30)
+-------+-----------------+----------+-------------------+-------------+----------+----------+
|pres_id|        pres_name|  pres_dob|            pres_bp|      pres_bs| … Read More →
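A hedged sketch of the behaviour described above, using the two dataframes from the example; the counts follow from the stated record counts (5 and 10 rows, first 5 shared):

scala> df_pres1.union(df_pres2).count()              // 15: duplicates are kept, like SQL UNION ALL
scala> df_pres1.union(df_pres2).distinct().count()   // 10: add distinct() to emulate SQL UNION behaviour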

In Spark, you can perform aggregate operations on a dataframe. This is similar to what we have in SQL, like MAX, MIN, SUM etc. We can also perform aggregation on some specific columns, which is equivalent to the GROUP BY clause we have in typical SQL. Let's see it with some examples. The first method we can use is "agg". To calculate count, max, min and sum we can use the below syntax:

scala> df_pres.agg(count($"pres_id"),min($"pres_id"),max($"pres_id"),sum("pres_id")).show()
+--------------+------------+------------+------------+
|count(pres_id)|min(pres_id)|max(pres_id)|sum(pres_id)|
+--------------+------------+------------+------------+
|            45|           1|          45|        1035|
+--------------+------------+------------+------------+

Let's add alias names to the columns.

scala> df_pres.agg(count($"pres_id").as("count"),min($"pres_id").as("min"),max($"pres_id").as("max"),sum("pres_id").as("sum")).show()
+-----+---+---+----+
|count|min|max| sum|
+-----+---+---+----+
|   45|  1| 45|1035|
+-----+---+---+----+

The second method is to use "agg" with "groupBy"… Read More →
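As a sketch of that second method (the grouping column pres_bs, the birth-state column, is taken from the schema shown in the UNION example above; the exact output depends on the data and is not reproduced here):

scala> import org.apache.spark.sql.functions.{count, min}
scala> df_pres.groupBy($"pres_bs").agg(count($"pres_id").as("cnt"), min($"pres_dob").as("earliest_dob")).show()
// one output row per pres_bs value, with the aggregates computed within each group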