A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark: the fundamental structure that holds data. An RDD is distributed, immutable, fault tolerant, and optimized for in-memory computation. Let's walk through some basic RDD operations in PySpark:

- Load a file into an RDD. The path should be an HDFS path, not a local one.
- Check the count of records in the RDD.
- Check sample records from the RDD.
- Traverse each record in the RDD. You cannot directly iterate through the records of an RDD; to bring all the records to the driver, use the collect() action.
- Apply a map function to convert all columns to upper case.