In PySpark, you can run DataFrame commands or, if you are comfortable with SQL, you can run SQL queries too. In this post, we will see how to run different variations of SELECT queries on a table built on Hive, along with the corresponding DataFrame commands that replicate the same output as the SQL query. Let's first create a DataFrame for the table "sample_07", which we will use throughout this post. df_sample_07 = spark.sql("select * from sample_07") So now I have the table "sample_07" and a DataFrame "df_sample_07". Query 1: Select all the columns >>> spark.sql("select * from sample_07").show()
+-------+--------------------+---------+------+
|   code|         description|total_emp|salary|
+-------+--------------------+---------+------+
|00-0000|     All Occupations|134354250| 40690|
|11-0000|Management occupa...|  6003930|… Read More →
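As a quick illustration of the DataFrame side, here is a minimal sketch of the equivalent commands, assuming the Hive table sample_07 is visible to the active SparkSession named spark (as in the post); the column choices are taken from the output shown above:

# Sketch: DataFrame equivalents of the SQL query above
df_sample_07 = spark.sql("select * from sample_07")

# Query 1 equivalent: select all columns via the DataFrame API
df_sample_07.select("*").show()

# Selecting specific columns works the same way
df_sample_07.select("code", "description", "salary").show(5)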

If you migrate from one RDBMS platform to another, one technical challenge you may face is different decimal rounding on the two platforms. I was recently working for a client where we migrated a Teradata application to Spark on EMR, and many measures, such as Amount, were not matching. On analysis, we realized the issue was decimal rounding. Example: Teradata: select cast(20.12345 as decimal(18,4)); -- Output: 20.1234 Spark: select cast(20.12345 as decimal(18,4)); -- Output: 20.1235 Now you may ask how a difference at a scale of 0.0001 can add up to anything considerable. Initially we also thought so, however it was an amount field and… Read More →
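One way to see the difference in Spark, and to control it explicitly, is to pick the rounding mode yourself. The sketch below is an illustration (not from the original post) using pyspark.sql.functions.round, which rounds half up, and bround, which rounds half even (banker's rounding) and is closer to what Teradata produced in this example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, round as sql_round, bround

spark = SparkSession.builder.getOrCreate()

df = spark.sql("select cast(20.12345 as decimal(18,5)) as amt")

# A plain cast in Spark rounds half up: 20.12345 -> 20.1235
df.selectExpr("cast(amt as decimal(18,4)) as cast_half_up").show()

# round() rounds half up, bround() rounds half even: 20.12345 -> 20.1234
df.select(sql_round(col("amt"), 4).alias("half_up"),
          bround(col("amt"), 4).alias("half_even")).show()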

Recently, I was working on a project where the ETL requirement was to keep a daily snapshot of the table. The data warehouse was designed on a 15+ year old data model, and the client wanted to replicate it on Hadoop. You can convert the ETL to Spark SQL, however not everything works as-is on Spark, and not everything that works as-is is efficient on Spark. Similarly, another requirement was to create a daily back-up of the target table at the end of the ETL chain. The SQL business logic itself used the back-up table to refer to one-day-old data and the current table for today's data. There… Read More →
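As a hedged sketch of one common approach (not taken from the original post), the end of the ETL chain can simply copy the target table into a back-up table, optionally keeping a date-stamped copy as the daily snapshot; all table names below are hypothetical:

from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical table names, for illustration only.
snapshot_dt = date.today().strftime("%Y%m%d")
backup_table = "edw.target_table_bkp"                 # referenced by the business logic as one-day-old data
dated_backup = f"edw.target_table_{snapshot_dt}"      # optional date-stamped daily snapshot

# At the end of the ETL chain, copy the current target table into the back-up tables.
target_df = spark.table("edw.target_table")
target_df.write.mode("overwrite").saveAsTable(backup_table)
target_df.write.mode("overwrite").saveAsTable(dated_backup)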

If you are working on Hadoop or any other platform and storing structured data, I am sure you have heard about columnar storage formats. The popularity that "columnar" has gained in the past 7-8 years confirms that the buzz is not a bubble and that this is the future of data analytics from a storage perspective. What is Columnar Storage? For simplicity, we will restrict our discussion to RDBMS only. Data is stored in "blocks". Blocks are simply physical storage space, measured in bytes, which is occupied when data is written. A typical block may range from a few bytes to even MB depending on… Read More →
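To connect this to everyday Spark work, the short sketch below (an illustration, not from the post) writes the same DataFrame in a row-oriented format (CSV) and in columnar formats (Parquet and ORC); the output paths are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1000).withColumnRenamed("id", "code")

# Row-oriented CSV: all columns of a row are stored together.
df.write.mode("overwrite").csv("/tmp/demo_csv")

# Columnar formats: values of each column are stored together, which helps
# compression and lets queries read only the columns they need.
df.write.mode("overwrite").parquet("/tmp/demo_parquet")
df.write.mode("overwrite").orc("/tmp/demo_orc")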

In the last post, we discussed basic operations on RDDs in PySpark. In this post, we will see other common operations one can perform on an RDD in PySpark. Let's quickly see the syntax and examples for various RDD operations: read a file into an RDD, convert each record into a LIST of elements, remove the header record, check the count of records in the RDD, check the first element in the RDD, check the partitions of the RDD, and use a custom function in RDD operations. Apply the custom function to the RDD and see the result: if you want to convert all the columns to UPPER case, create another function which accepts a LIST… Read More →
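A minimal sketch of those operations, assuming a comma-delimited file with a header row at a hypothetical HDFS path, could look like this:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Read a file into an RDD (hypothetical HDFS path)
rdd = sc.textFile("hdfs:///data/sample_07.csv")

# Convert each record into a LIST of elements
rdd_list = rdd.map(lambda line: line.split(","))

# Remove the header record
header = rdd_list.first()
rdd_data = rdd_list.filter(lambda row: row != header)

# Check the count of records, the first element, and the partitions
print(rdd_data.count())
print(rdd_data.first())
print(rdd_data.getNumPartitions())

# Use a custom function in RDD operations: convert all columns to UPPER case
def to_upper(row):
    return [col.upper() for col in row]

print(rdd_data.map(to_upper).take(5))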

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. In other words, it is the most common structure that holds data in Spark. An RDD is distributed, immutable, fault-tolerant, and optimized for in-memory computation. Let's see some basic examples of RDDs in PySpark: load a file into an RDD (the path should be an HDFS path, not a local one), check the count of records in the RDD, check sample records from the RDD, and traverse each record in the RDD. You cannot directly iterate through records in an RDD; to bring all the records to the DRIVER you can use the collect() action. Apply a map function to convert all columns to upper… Read More →
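A minimal sketch of those basics, assuming a hypothetical HDFS file path, might be:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Load a file into an RDD; the path should be an HDFS path, not a local one.
rdd = sc.textFile("hdfs:///data/sample.txt")

# Check the count of records and look at a few sample records.
print(rdd.count())
print(rdd.take(3))

# You cannot iterate an RDD directly; collect() brings all records to the
# DRIVER (use with care on large datasets).
for record in rdd.collect():
    print(record)

# Apply a map transformation to convert each record to upper case.
rdd_upper = rdd.map(lambda line: line.upper())
print(rdd_upper.take(3))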