The Spark SQL DataFrame is the closest thing a SQL developer will find in Apache Spark. I come from a SQL background, with over 10 years of work on traditional RDBMSs like Teradata, Oracle, Netezza, and Sybase, so when I moved from traditional RDBMSs to Hadoop for my new projects, I was eagerly looking for the SQL options available there. I must admit Hive is the most relevant one, and it made my life much simpler on my new project. Next comes Apache Spark.

Apache Spark includes Spark SQL as one of its components, which is a blessing for people like me: we prefer not to write Java applications, because SQL is our forte. Since Apache Spark is perhaps the loudest buzzword in the market today, working on Spark SQL is exciting too. One of the core objects in Spark SQL is the DataFrame, and it is as good as any table in an RDBMS. You can apply all sorts of SQL operations to a DataFrame, directly or indirectly.

Below are the posts I would like to share with you; I hope they help you in transitioning from SQL to Spark SQL.

Spark Dataframe Update Column Value

We all know that updating a column value in a table is a pain in Hive or Spark SQL, especially if you are dealing with non-ACID tables. In a DataFrame, however, you can easily update column values. In the example below we will replace the state name with its abbreviation. Original DataFrame:

scala> df_pres.show(45)
+-------+--------------------+----------+-------------------+--------------+----------+----------+
|pres_id|           pres_name|  pres_dob|            pres_bp|       pres_bs|   pres_in|  pres_out|
+-------+--------------------+----------+-------------------+--------------+----------+----------+
|      1|   George Washington|1732-02-22|Westmoreland County|      Virginia|1789-04-30|1797-03-04|
|      2|          John Adams|1735-10-30|          Braintree| Massachusetts|1797-03-04|1801-03-04|
|      3|    Thomas Jefferson|1743-04-13|           Shadwell|      Virginia|1801-03-04|1809-03-04|
|      4|…
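
While the full walkthrough is in the post, here is a minimal sketch of the idea using withColumn and when (assuming a SparkSession with spark.implicits._ imported and the df_pres schema shown above):

// Minimal sketch: overwrite pres_bs with an abbreviation for a few known
// states, leaving every other value untouched. Assumes df_pres as above.
import org.apache.spark.sql.functions.when

val df_updated = df_pres.withColumn("pres_bs",
  when($"pres_bs" === "Virginia", "VA")
    .when($"pres_bs" === "Massachusetts", "MA")
    .otherwise($"pres_bs"))

df_updated.show(45)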


Spark Dataframe JOINS – Only post you need to read

JOINs are used to retrieve data from more than one table or DataFrame. You can replicate almost all of the join types available in a typical SQL environment using Spark DataFrames. We will discuss the following join types in this post: INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, LEFT SEMI JOIN, LEFT ANTI JOIN, and CROSS JOIN. DataFrame INNER JOIN: INNER JOINs are used to fetch the data common to 2 tables, or in this case, 2 DataFrames. You…
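
As a quick taste of the syntax, here is a minimal sketch of an inner join; df_states is a hypothetical lookup DataFrame with a state_name column, introduced here only for illustration:

// Minimal sketch: inner join df_pres to a hypothetical state lookup table.
// The third argument picks the join type: "inner", "left_outer",
// "right_outer", "full_outer", "left_semi", "left_anti" or "cross".
val joined = df_pres.join(df_states,
  df_pres("pres_bs") === df_states("state_name"), "inner")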


Spark Dataframe – Distinct or Drop Duplicates

DISTINCT or dropDuplicates is used to remove duplicate rows from a DataFrame. A row consists of columns; if you select only one column, then after DISTINCT the output will be the unique values for that specific column. DISTINCT is very commonly used to see which values exist in the DataFrame for any given column. Example:

scala> df_pres.select($"pres_bs").show(45)
+--------------------+
|             pres_bs|
+--------------------+
|            Virginia|
|       Massachusetts|
|            Virginia|
|            Virginia|
|            Virginia|
|       Massachusetts|
|South/North Carolina|
|            New York|
|            Virginia|
|            Virginia|
|…
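
A minimal sketch of both forms, assuming the df_pres schema above; distinct works on the selected columns, while dropDuplicates can also take a subset of column names:

// Both lines produce the unique birth states for a single column.
val states1 = df_pres.select($"pres_bs").distinct()
val states2 = df_pres.dropDuplicates("pres_bs").select($"pres_bs")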


Spark Dataframe concatenate strings

In many scenarios, you may want to concatenate multiple strings into one. For example, you may want to concatenate the “FIRST NAME” and “LAST NAME” of a customer to show the “FULL NAME”. In a Spark SQL DataFrame, we can use the concat function to join multiple strings into one string. Let’s look at the example below:

scala> df_pres.select(concat($"pres_id",$"pres_name")).show()
+-------------------------+
|concat(pres_id,pres_name)|
+-------------------------+
|       1George Washington|
|              2John Adams|
|        3Thomas Jefferson|
|           4James Madison|
|            5James Monroe|
|       6John Quincy Adams|
| 7Andrew…
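
If you also need a separator between the values, concat_ws takes it as the first argument; a minimal sketch assuming the same df_pres schema:

// Minimal sketch: concat_ws inserts the given separator between columns.
import org.apache.spark.sql.functions.concat_ws

df_pres.select(concat_ws(" - ", $"pres_id", $"pres_name").alias("id_name")).show()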


Spark Dataframe Replace String

It is a very common SQL operation to replace a character in a string with another character, or to replace one substring with another. This is easily done in a Spark SQL DataFrame using the regexp_replace or translate function. Let’s see how translate can help us when we want to replace a given character in a string with some other character.

scala> df_pres.select($"pres_name",translate($"pres_name","J","Z").as("new_name")).show()
+--------------------+--------------------+
|           pres_name|            new_name|
+--------------------+--------------------+
|   George Washington|   George Washington|
|          John Adams|          Zohn Adams|
|…
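
For replacing whole substrings rather than mapping single characters, regexp_replace is the usual choice; a minimal sketch assuming the df_pres schema above:

// Minimal sketch: replace the substring "James" with "Jim" in pres_name.
import org.apache.spark.sql.functions.regexp_replace

df_pres.select($"pres_name",
  regexp_replace($"pres_name", "James", "Jim").as("new_name")).show()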


Spark Dataframe orderBy Sort

SORT is used to order the result set on the basis of the values in any selected column. The syntax is to call the sort function with the column name inside it. We can also specify ascending or descending order for sorting; the default is ascending. In our DataFrame, if we want to order the result set on the basis of the state in which each President was born, we will use the query below:

scala> df_pres.select($"pres_id",$"pres_dob",$"pres_bs").sort($"pres_bs".asc).show()
+-------+----------+-------------+
|pres_id|  pres_dob|      pres_bs|
+-------+----------+-------------+
|     42|1946-08-19|     Arkansas|
|     37|1913-01-09|   California|
|…
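
Descending and multi-column sorts follow the same pattern; a minimal sketch assuming the df_pres schema above (orderBy is an alias for sort):

// Minimal sketch: sort by state descending, then by date of birth ascending.
df_pres.select($"pres_id", $"pres_dob", $"pres_bs")
  .orderBy($"pres_bs".desc, $"pres_dob".asc)
  .show()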


Spark Dataframe WHEN case

In SQL, if we have to check multiple conditions for any column value, we use a CASE statement. In Spark SQL DataFrames we can replicate the same functionality by using the WHEN clause multiple times, once for each conditional check; there is no need for the CASE keyword, though. So let’s see an example of how to check multiple conditions and replicate a SQL CASE statement in Spark SQL.

scala> df_pres.select($"pres_name",$"pres_dob",$"pres_bs",
         when($"pres_bs"==="Virginia","VA").when($"pres_bs"==="Massachusetts","MA")
         .when($"pres_bs"==="Ohio","OH").otherwise("Others").alias("state_abbr")).show()
+--------------------+----------+--------------------+----------+
|           pres_name|  pres_dob|             pres_bs|state_abbr|
+--------------------+----------+--------------------+----------+
|   George Washington|1732-02-22|…
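
If you prefer to keep the familiar CASE syntax itself, the same logic can also be written as a SQL expression through the expr function; this alternative is not from the original post, just a minimal sketch assuming the df_pres schema above:

// Minimal sketch: the WHEN chain above expressed as a SQL CASE via expr().
import org.apache.spark.sql.functions.expr

val withAbbr = df_pres.withColumn("state_abbr",
  expr("CASE WHEN pres_bs = 'Virginia' THEN 'VA' " +
       "WHEN pres_bs = 'Massachusetts' THEN 'MA' " +
       "WHEN pres_bs = 'Ohio' THEN 'OH' ELSE 'Others' END"))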


Spark Dataframe LIKE NOT LIKE RLIKE

The LIKE condition is used when you don’t know the exact value you are looking for, or when you are looking for some specific pattern in the output. LIKE works as it does in SQL and can be used to specify a pattern in WHERE/FILTER or even in JOIN conditions. Let’s see an example that finds all the Presidents whose names start with James.

scala> df_pres.filter($"pres_name".like("James%")).select($"pres_name",$"pres_dob",$"pres_bs").show()
+-----------------+----------+--------------+
|        pres_name|  pres_dob|       pres_bs|
+-----------------+----------+--------------+
|    James Madison|1751-03-16|      Virginia|
|     James Monroe|1758-04-28|      Virginia|
|    James K. Polk|1795-11-02|North…
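
NOT LIKE and RLIKE follow naturally from the same API; a minimal sketch assuming the df_pres schema above:

// Minimal sketch: negate like() with ! for NOT LIKE, and use rlike() for
// Java regular expressions, e.g. names ending in "son".
val notJames  = df_pres.filter(!$"pres_name".like("James%"))
val endsInSon = df_pres.filter($"pres_name".rlike("son$"))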


Spark Dataframe IN-NOT IN

IN and NOT IN conditions are used in FILTER/WHERE, or even in JOINs, when we have to specify multiple possible values for a column. If the value is one of the values mentioned inside the IN clause, the row qualifies. It is the opposite for NOT IN, where the value must not be among the values listed inside the NOT IN clause. So let’s look at an example of the IN condition:

scala> df_pres.filter($"pres_bs" in ("New York","Ohio","Texas")).select($"pres_name",$"pres_dob",$"pres_bs").show()
+--------------------+----------+--------+
|           pres_name|  pres_dob| pres_bs|
+--------------------+----------+--------+
|…
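
In more recent Spark versions the method is spelled isin, and NOT IN is just its negation; a minimal sketch assuming the df_pres schema above:

// Minimal sketch: IN via isin, NOT IN via ! on the same condition.
val inList = df_pres.filter($"pres_bs".isin("New York", "Ohio", "Texas"))
val notIn  = df_pres.filter(!$"pres_bs".isin("New York", "Ohio", "Texas"))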


Spark Dataframe WHERE Filter

As the name suggests, FILTER is used in Spark SQL to filter out records as per the requirement. If you do not want the complete data set and just wish to fetch the few records which satisfy some condition, you can use the FILTER function. It is equivalent to the SQL WHERE clause and is commonly used in Spark SQL. Let’s fetch all the Presidents who were born in New York.

scala> df_pres.filter($"pres_bs" === "New York").select($"pres_name",$"pres_dob".alias("Date Of Birth"),$"pres_bs").show()
+--------------------+-------------+--------+
|           pres_name|Date Of Birth|…
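
filter and where are interchangeable, and conditions combine with && and ||; a minimal sketch assuming the df_pres schema above:

// Minimal sketch: where() is an alias of filter(), with combined conditions.
val nyOrOhio = df_pres.where($"pres_bs" === "New York" || $"pres_bs" === "Ohio")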
