In the last post, we discussed about basic operations on RDD in PySpark. In this post, we will see other common operations one can perform on RDD in PySpark. Let’s quickly see the syntax and examples for various RDD operations: Read a file into RDD Convert record into LIST of elements Remove the header data Check the count of records in RDD Check the first element in RDD Check the partitions for RDD Use custom function in RDD operations Apply custom function to RDD and see the result: If you want to convert all the columns to UPPER case, create another function which accepts LISTRead More →

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. In other words, we can say it is the most common structure that holds data in Spark. RDD is distributed, immutable , fault tolerant, optimized for in-memory computation. Let’s see some basic example of RDD in pyspark. Load file into RDD. Path should be HDFS path and not local. Check count of records in RDD: Check sample records from RDD: Traverse each record in RDD. You cannot directly iterate through records in RDD. To bring all the records to DRIVER you can use collect() action. Apply Map function to convert all columns to upperRead More →

You may need to add new columns in the existing SPARK dataframe as per the requirement. This new column can be initialized with a default value or you can assign some dynamic value to it depending on some logical conditions. Let’s see an example below to add 2 new columns with logical value and 1 column with default value. Let’s add 2 new columns to it. One for State Abbreviation and other for Century to which President was born. Also we will add 1 new column with default value using “lit” function. We can also use withColumn method to add new columns in spark dataframe.Read More →

Repartition is the process of movement of data on the basis of some column or expression or random into required number of partitions. This depends on the kind of value/s you are passing which determines how many partitions will be created. You may want to do Repartition when you have understanding of your data and you know how you can improve the performance of dataframe operations by repartitioning it on the basis of some key columns. Also understand that repartition is a costly operation because it requires shuffling of all the data across nodes. You can increase or decrease the number of partitions using “Repartition”.Read More →

Spark dataframe add row number is very common requirement especially if you are working on ELT in Spark. You can use monotonically_increasing_id method to generate incremental numbers. However the numbers won’t be consecutive if the dataframe has more than 1 partition. Let’s see a simple example to understand it : So I have a dataframe which has information about all 50 states in USA. Now I want to add a new column “state_id” as sequence number. Before that I will create 3 version of this dataframe with 1,2 & 3 partitions respectively. This is required to understand behavior of monotonically_increasing_id clearly. Now let’s generate theRead More →

In this post, we will see how to Handle NULL values in any given dataframe. Many people confuse it with BLANK or empty string however there is a difference. NULL means unknown where BLANK is empty. Alright now let’s see what all operations are available in Spark Dataframe which can help us in handling NULL values. Identifying NULL Values in Spark DataframeNULL values can be identified in multiple manner. If you know any column which can have NULL value then you can use “isNull” command Other way of writing same command in more SQL like fashion: Once you know that rows in your Dataframe containsRead More →