SQL & Hadoop

SQL on Hadoop with Hive, Spark & PySpark on EMR & AWS Glue

PySpark

PySpark RDD operations – Map, Filter, SortBy, reduceByKey, Joins

Leave a Comment / PySpark / Raj

In the last post, we discussed basic operations on RDDs in PySpark. In this post, we will look at other common operations you can perform on an RDD in PySpark. Let’s quickly go through the syntax and examples for various RDD operations: read a file into an RDD, convert each record into a list of elements, remove the header record […]

PySpark RDD operations – Map, Filter, SortBy, reduceByKey, Joins Read More »
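
A minimal sketch of the operations the post covers, assuming a hypothetical comma-separated file and made-up column positions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-ops-sketch").getOrCreate()
sc = spark.sparkContext

# Read a file into an RDD and convert each record into a list of elements
rdd = sc.textFile("/tmp/sales.csv").map(lambda line: line.split(","))

# Remove the header record
header = rdd.first()
data = rdd.filter(lambda row: row != header)

# reduceByKey: total the amount (column 2) per product key (column 0)
totals = data.map(lambda r: (r[0], float(r[2]))).reduceByKey(lambda a, b: a + b)

# sortBy: order products by total, highest first
ranked = totals.sortBy(lambda kv: kv[1], ascending=False)

# join: attach a region to each product from a second pair RDD
regions = sc.parallelize([("p1", "US"), ("p2", "EU")])
print(ranked.join(regions).take(5))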

PySpark

Basic RDD operations in PySpark

2 Comments / PySpark / Raj

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. In other words, it is the most common structure that holds data in Spark. An RDD is distributed, immutable, fault-tolerant, and optimized for in-memory computation. Let’s see some basic examples of RDDs in PySpark. Load a file into an RDD. The path should be

Basic RDD operations in PySpark Read More »
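
A quick sketch of loading a file into an RDD and running a few basic actions; the path below is a placeholder, and the scheme (file://, hdfs://, s3://) depends on where the data lives:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("basic-rdd-sketch").getOrCreate()
sc = spark.sparkContext

# file:// for a local path; hdfs:// or s3:// for distributed storage
rdd = sc.textFile("file:///tmp/input.txt")

print(rdd.count())   # number of records
print(rdd.first())   # first record
print(rdd.take(3))   # first three records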

Apache Spark

Spark Dataframe add multiple columns with value

Leave a Comment / Apache Spark / Raj

You may need to add new columns to an existing Spark dataframe as per your requirements. A new column can be initialized with a default value, or you can assign it a dynamic value based on some logical conditions. Let’s see an example below that adds 2 new columns with a logical value and 1

Spark Dataframe add multiple columns with value Read More »
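
A small sketch of both cases on a made-up dataframe: one column initialized with a default via lit, one assigned conditionally via when/otherwise:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when

spark = SparkSession.builder.appName("add-columns-sketch").getOrCreate()

# Hypothetical input dataframe
df = spark.createDataFrame([(1, 120), (2, 80)], ["id", "amount"])

df2 = (df
       .withColumn("source", lit("batch"))          # default value
       .withColumn("high_value",                    # value from a condition
                   when(col("amount") > 100, lit("Y")).otherwise(lit("N"))))
df2.show()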

Apache Spark

Spark Dataframe Repartition

Leave a Comment / Apache Spark / Raj

What is repartition in Spark? Spark repartition is the process of moving or shuffling data into a given number of logical partitions. Repartitioning is done on the basis of a column or expression, or in a random manner. The default number of shuffle partitions in Spark is 200. Where do I use repartition in Spark

Spark Dataframe Repartition Read More »
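
A short sketch on a synthetic dataframe showing repartitioning to a fixed count, repartitioning by a column expression, and the 200-partition shuffle default:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-sketch").getOrCreate()

df = spark.range(0, 1_000_000)                # synthetic dataframe
print(df.rdd.getNumPartitions())              # partition count before

df_fixed = df.repartition(10)                 # fixed count, full shuffle
df_by_col = df.repartition(10, (df.id % 4).alias("bucket"))   # by expression

print(df_fixed.rdd.getNumPartitions())        # 10

# The default of 200 shuffle partitions is a config, not a constant:
spark.conf.set("spark.sql.shuffle.partitions", "50")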

Apache Spark

Spark Dataframe – monotonically_increasing_id

Leave a Comment / Apache Spark / Raj

Adding a row number to a Spark dataframe is a very common requirement, especially if you are working on ELT in Spark. You can use the monotonically_increasing_id method to generate incremental numbers. However, the numbers won’t be consecutive if the dataframe has more than one partition. Let’s see a simple example to understand it: So I have a dataframe

Spark Dataframe – monotonically_increasing_id Read More »
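
A tiny sketch on a made-up dataframe; the comment notes why the generated IDs are unique and increasing but not consecutive across partitions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.appName("row-id-sketch").getOrCreate()

df = spark.createDataFrame([("a",), ("b",), ("c",)], ["val"])

# IDs are consecutive only within a partition: each partition's IDs start
# at partition_id * 2**33, so a multi-partition dataframe shows gaps.
df.withColumn("row_id", monotonically_increasing_id()).show()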

← Previous 1 … 5 6 7 … 15 Next →

Topics

  • Amazon EMR
  • Apache HIVE
  • Apache Spark
  • AWS Glue
  • PySpark
  • SQL on Hadoop

Recent Posts

  • AWS Glue create dynamic frame
  • AWS Glue read files from S3
  • How to check Spark run logs in EMR
  • PySpark apply function to column
  • Run Spark Job in existing EMR using AIRFLOW


Join the discussion

  1. Ramkumar on Spark Performance Tuning with help of Spark UI (February 3, 2025)

    Great. Keep writing more articles.

  2. Raj on Free Online SQL to PySpark Converter (August 9, 2022)

    Thank you for sharing this. I will give it a try as well.

  3. John K-W on Free Online SQL to PySpark Converter (August 8, 2022)

    Might be interesting to add a PySpark dialect to SQLglot https://github.com/tobymao/sqlglot https://github.com/tobymao/sqlglot/tree/main/sqlglot/dialects

  4. Meena M on Spark Dataframe WHEN case (July 28, 2022)

    try something like df.withColumn("type", when(col("flag1"), lit("type_1")).when(~col("flag1") & (col("flag2") | col("flag3") | col("flag4") | col("flag5")), lit("type2")).otherwise(lit("other")))

  5. tagu on Free Online SQL to PySpark Converter (July 20, 2022)

    It will be great if you can have a link to the converter. It helps the community for anyone starting…

Copyright © 2025 SQL & Hadoop