You may need to add new columns to an existing Spark DataFrame as per your requirements. A new column can be initialized with a default value, or you can assign it a dynamic value based on logical conditions. Let's see an example below that adds two new columns with derived values, one for the state abbreviation and one for the century in which each President was born, plus one new column with a default value using the "lit" function. We use the withColumn method to add new columns to a Spark DataFrame.
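Here is a minimal sketch of the idea, assuming a small hypothetical presidents DataFrame (the names df, president, state and birth_year are illustrative, not from the original post):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

val spark = SparkSession.builder.appName("AddColumnsDemo").getOrCreate()
import spark.implicits._

// Illustrative sample data, not the full presidents dataset from the post
val df = Seq(
  ("George Washington", "Virginia", 1732),
  ("Abraham Lincoln", "Kentucky", 1809)
).toDF("president", "state", "birth_year")

val result = df
  // Logical value: abbreviation derived from the state column
  .withColumn("state_abbr",
    when(col("state") === "Virginia", "VA")
      .when(col("state") === "Kentucky", "KY")
      .otherwise("NA"))
  // Logical value: century in which the President was born
  .withColumn("birth_century", ((col("birth_year") - 1) / 100).cast("int") + 1)
  // Default value for every row, via lit
  .withColumn("country", lit("USA"))

result.show()
```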

Repartition is the process of redistributing data, on the basis of some column or expression or at random, into a required number of partitions. How many partitions are created depends on the kind of value(s) you pass. You may want to repartition when you understand your data and know that you can improve the performance of DataFrame operations by repartitioning on some key columns. Also understand that repartition is a costly operation, because it requires shuffling all the data across nodes. You can increase or decrease the number of partitions using repartition.
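As a sketch, repartition accepts a target partition count, one or more key columns, or both (the sample df below is illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("RepartitionDemo").getOrCreate()
import spark.implicits._

val df = Seq(("Alaska", "AK"), ("Texas", "TX"), ("Ohio", "OH")).toDF("state", "abbr")

// Round-robin redistribution into exactly 8 partitions
val byCount = df.repartition(8)

// Hash-partition on a column: equal keys end up in the same partition
val byKey = df.repartition(col("state"))

// Both together: 4 partitions, hash-partitioned on state
val byBoth = df.repartition(4, col("state"))

println(byBoth.rdd.getNumPartitions) // 4
```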

Adding a row number to a Spark DataFrame is a very common requirement, especially if you are working on ELT in Spark. You can use the monotonically_increasing_id function to generate incremental numbers; however, the numbers won't be consecutive if the DataFrame has more than one partition. Let's see a simple example to understand it: we have a DataFrame with information about all 50 states in the USA, and we want to add a new column "state_id" as a sequence number. Before that, we will create three versions of this DataFrame with 1, 2, and 3 partitions respectively. This is required to clearly understand the behavior of monotonically_increasing_id. Now let's generate the id column on each version.
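A minimal sketch of the behavior, using a small stand-in for the 50-state DataFrame (only the 1- and 3-partition versions are shown):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

val spark = SparkSession.builder.appName("RowIdDemo").getOrCreate()
import spark.implicits._

// Stand-in for the 50-state DataFrame described in the post
val states = Seq("Alabama", "Alaska", "Arizona", "Arkansas").toDF("state")

// One partition: ids come out consecutive (0, 1, 2, 3)
states.repartition(1)
  .withColumn("state_id", monotonically_increasing_id())
  .show()

// Three partitions: ids are unique and increasing, but each partition
// draws from its own range, so the sequence is no longer consecutive
states.repartition(3)
  .withColumn("state_id", monotonically_increasing_id())
  .show()
```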

In this post, we will see how to handle NULL values in a given DataFrame. Many people confuse NULL with a BLANK or empty string; however, there is a difference: NULL means unknown, whereas BLANK is empty. Now let's see what operations are available on a Spark DataFrame that can help us handle NULL values. Identifying NULL values in a Spark DataFrame: NULL values can be identified in multiple ways. If you know a column that can have NULL values, you can use the "isNull" command; the same check can also be written in a more SQL-like fashion. Once you know which rows in your DataFrame contain NULLs, you can decide how to handle them.
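A short sketch of both ways of identifying NULLs, plus two common ways of handling them (the two-row sample data and the "N/A" fill value are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("NullDemo").getOrCreate()
import spark.implicits._

val df = Seq(
  ("Alaska", Some("AK")),
  ("Puerto Rico", None) // abbr is NULL here, not a blank string
).toDF("name", "abbr")

// Identify NULLs with isNull on a known column
df.filter(col("abbr").isNull).show()

// The same check written in a more SQL-like fashion
df.filter("abbr IS NULL").show()

// Once identified, handle them: fill with a default or drop the rows
df.na.fill("N/A", Seq("abbr")).show()
df.na.drop().show()
```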

In Spark, we can use the "explode" function to convert single column values into multiple rows. Recently I was working on a task to convert a COBOL VSAM file, which often has nested columns defined in it. My requirement in Spark was to convert a single column value (an array of values) into multiple rows. Let's see an example to understand it better: create a sample DataFrame with one column of type ARRAY, then run the explode function to split each value in col2 into a new row. Using the explode function, you can split one column into multiple rows.
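A sketch of both steps (the two-row sample data is illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder.appName("ExplodeDemo").getOrCreate()
import spark.implicits._

// Sample DataFrame with one ARRAY column
val df = Seq(
  ("a", Array(1, 2, 3)),
  ("b", Array(4, 5))
).toDF("col1", "col2")

// explode emits one output row per array element: 1, 2, 3 each paired
// with "a", and 4, 5 each paired with "b"
df.select(col("col1"), explode(col("col2")).as("col2")).show()
```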

In Spark DataFrames, the SHOW method is used to display DataFrame records in a readable tabular format. This method is used very often to check what the content of a DataFrame looks like. Let's see it with an example. A few things to observe here: 1) By default, the SHOW function returns only 20 records. This is equivalent to the Sample/Top/Limit 20 we have in other SQL environments. 2) You can see that strings longer than 20 characters are truncated, like "William Henry Har…" in place of "William Henry Harrison". This is equivalent to the width/colwidth settings in a typical SQL environment. The default call is equivalent to the explicit syntax shown below.
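A minimal sketch, assuming a hypothetical one-column presidents DataFrame:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ShowDemo").getOrCreate()
import spark.implicits._

val df = Seq("George Washington", "William Henry Harrison").toDF("president")

df.show()                    // default: first 20 rows, strings cut at 20 chars
df.show(20, truncate = true) // explicit equivalent of the default call
df.show(5, truncate = false) // full values, e.g. "William Henry Harrison"
```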