Spark Dataframe Update Column Value

We all know that updating a column value in a table is a pain in Hive or Spark SQL, especially if you are dealing with non-ACID tables. In a Dataframe, however, you can easily update column values. In the example below we will update the state name with the state abbreviation. Original Dataframe:

scala> df_pres.show(45)
+-------+--------------------+----------+-------------------+--------------+----------+----------+
|pres_id|           pres_name|  pres_dob|            pres_bp|       pres_bs|   pres_in|  pres_out|
+-------+--------------------+----------+-------------------+--------------+----------+----------+
|      1|   George Washington|1732-02-22|Westmoreland County|      Virginia|1789-04-30|1797-03-04|
|      2|          John Adams|1735-10-30|          Braintree| Massachusetts|1797-03-04|1801-03-04|
|      3|    Thomas Jefferson|1743-04-13|           Shadwell|      Virginia|1801-03-04|1809-03-04|
|      4|…
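The update itself is not shown in this excerpt; a minimal sketch of the idea, assuming the df_pres Dataframe shown above and the usual import spark.implicits._ for column syntax (the abbreviation mapping here covers only two states for illustration):

```scala
import org.apache.spark.sql.functions.{col, when}

// Overwrite pres_bs with the state abbreviation,
// leaving any unmapped state value untouched.
val df_updated = df_pres.withColumn("pres_bs",
  when(col("pres_bs") === "Virginia", "VA")
    .when(col("pres_bs") === "Massachusetts", "MA")
    .otherwise(col("pres_bs")))
```

Calling withColumn with an existing column name replaces that column, which is what makes this behave like an UPDATE.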

Continue Reading

Spark Dataframe JOINS – Only post you need to read

JOINS are used to retrieve data from more than one table or dataframe. You can replicate almost all join types possible in a typical SQL environment using Spark Dataframes. We will discuss the following join types in this post: INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, LEFT SEMI JOIN, LEFT ANTI JOIN, CROSS JOIN. Dataframe INNER JOIN INNER JOINs are used to fetch the data common between 2 tables, or in this case, 2 dataframes. You…
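As a sketch of the inner-join syntax, assuming df_pres from the other posts on this page and a hypothetical second dataframe df_states keyed on a state_name column:

```scala
// Inner join: keep only presidents whose birth state
// has a matching row in df_states.
val df_joined = df_pres.join(
  df_states,
  df_pres("pres_bs") === df_states("state_name"),
  "inner")
```

The third argument selects the join type; "left_outer", "right_outer", "full_outer", "left_semi", "left_anti" and "cross" can be passed the same way.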

Continue Reading

Spark Dataframe – Distinct or Drop Duplicates

DISTINCT or dropDuplicates is used to remove duplicate rows in a Dataframe. A row consists of columns; if you select only one column, the output will be the unique values for that specific column. DISTINCT is very commonly used to find the possible values that exist in the dataframe for any given column. Example:

scala> df_pres.select($"pres_bs").show(45)
+--------------------+
|             pres_bs|
+--------------------+
|            Virginia|
|       Massachusetts|
|            Virginia|
|            Virginia|
|            Virginia|
|       Massachusetts|
|South/North Carolina|
|            New York|
|            Virginia|
|            Virginia|
|…
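Both forms can be sketched as follows, assuming the df_pres dataframe shown above:

```scala
// Unique values of a single column.
val df_states_distinct = df_pres.select($"pres_bs").distinct()

// dropDuplicates can also deduplicate on a subset of columns
// while keeping the other columns of the surviving row.
val df_dedup = df_pres.dropDuplicates("pres_bs")
```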

Continue Reading

Spark Dataframe concatenate strings

In many scenarios, you may want to concatenate multiple strings into one. For example, you may want to concatenate the "FIRST NAME" and "LAST NAME" of a customer to show the "FULL NAME". In a Spark SQL Dataframe, we can use the concat function to join multiple strings into one string. Let's look at the example below:

scala> df_pres.select(concat($"pres_id",$"pres_name")).show()
+-------------------------+
|concat(pres_id,pres_name)|
+-------------------------+
|        1George Washington|
|              2John Adams|
|        3Thomas Jefferson|
|           4James Madison|
|            5James Monroe|
|       6John Quincy Adams|
|                  7Andrew…
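When a separator is needed between the concatenated parts, concat_ws is the usual companion to concat; a sketch assuming df_pres as above:

```scala
import org.apache.spark.sql.functions.concat_ws

// concat_ws("-", ...) joins the columns with "-" between them,
// e.g. "1-George Washington".
val df_id_name = df_pres.select(
  concat_ws("-", $"pres_id", $"pres_name").alias("id_name"))
```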

Continue Reading

Spark Dataframe Replace String

It is a very common SQL operation to replace a character in a string with another character, or to replace one string with another string. This is easily done in a Spark SQL Dataframe using the regexp_replace or translate function. Let's see how translate can help us when we want to replace any given character in a string with some other character.

scala> df_pres.select($"pres_name",translate($"pres_name","J","Z").as("new_name")).show()
+--------------------+--------------------+
|           pres_name|            new_name|
+--------------------+--------------------+
|   George Washington|   George Washington|
|          John Adams|          Zohn Adams|
|…
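For pattern-based replacement (rather than translate's character-by-character mapping), regexp_replace works on a regular expression; a sketch assuming df_pres as above:

```scala
import org.apache.spark.sql.functions.regexp_replace

// Replace every occurrence of the substring "James" with "Jim".
val df_renamed = df_pres.select(
  $"pres_name",
  regexp_replace($"pres_name", "James", "Jim").alias("new_name"))
```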

Continue Reading

Spark Dataframe orderBy Sort

SORT is used to order the resultset on the basis of values in any selected column. The syntax is to use the sort function with the column name inside it. We can also specify ascending or descending order for sorting; the default is ascending. In our dataframe, if we want to order the resultset on the basis of the state in which the President was born, we will use the query below:

scala> df_pres.select($"pres_id",$"pres_dob",$"pres_bs").sort($"pres_bs".asc).show()
+-------+----------+-------------+
|pres_id|  pres_dob|      pres_bs|
+-------+----------+-------------+
|     42|1946-08-19|     Arkansas|
|     37|1913-01-09|   California|
|…
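Descending order and multi-column sorts follow the same pattern; a sketch assuming df_pres as above:

```scala
// Sort by birth state descending, breaking ties
// by date of birth ascending.
val df_sorted = df_pres
  .select($"pres_id", $"pres_dob", $"pres_bs")
  .sort($"pres_bs".desc, $"pres_dob".asc)
```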

Continue Reading

Spark Dataframe WHEN case

In SQL, if we have to check multiple conditions for any column value, we use a CASE statement. In Spark SQL dataframes we can replicate the same functionality by using the WHEN clause multiple times, once for each conditional check. No CASE keyword is required though. So let's see an example of how to check multiple conditions and replicate a SQL CASE statement in Spark SQL.

scala> df_pres.select($"pres_name",$"pres_dob",$"pres_bs", when($"pres_bs"==="Virginia","VA").when($"pres_bs"==="Massachusetts","MA").when($"pres_bs"==="Ohio","OH").otherwise("Others").alias("state_abbr")).show()
+--------------------+----------+--------------------+----------+
|           pres_name|  pres_dob|             pres_bs|state_abbr|
+--------------------+----------+--------------------+----------+
|   George Washington|1732-02-22|…
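The same logic can also be written as a literal SQL CASE expression via expr, which some readers may find closer to the SQL original; a sketch assuming df_pres as above:

```scala
import org.apache.spark.sql.functions.expr

// A SQL CASE expression embedded as a string.
val df_abbr = df_pres.select($"pres_name", $"pres_bs",
  expr("""CASE WHEN pres_bs = 'Virginia'      THEN 'VA'
               WHEN pres_bs = 'Massachusetts' THEN 'MA'
               WHEN pres_bs = 'Ohio'          THEN 'OH'
               ELSE 'Others' END""").alias("state_abbr"))
```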

Continue Reading

Spark Dataframe LIKE NOT LIKE RLIKE

The LIKE condition is used when you don't know the exact value, or when you are looking for a specific pattern in the output. LIKE works just as in SQL and can be used to specify any pattern in WHERE/FILTER or even in JOIN conditions. Let's see an example to find all the presidents whose name starts with James.

scala> df_pres.filter($"pres_name".like("James%")).select($"pres_name",$"pres_dob",$"pres_bs").show()
+-----------------+----------+--------------+
|        pres_name|  pres_dob|       pres_bs|
+-----------------+----------+--------------+
|    James Madison|1751-03-16|      Virginia|
|     James Monroe|1758-04-28|      Virginia|
|    James K. Polk|1795-11-02|North…
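rlike is the regular-expression counterpart of like; a sketch assuming df_pres as above:

```scala
// rlike matches a regex against the value:
// presidents whose name starts with James or John.
val df_j = df_pres
  .filter($"pres_name".rlike("^(James|John)"))
  .select($"pres_name", $"pres_bs")
```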

Continue Reading

Spark Dataframe IN-NOT IN

IN or NOT IN conditions are used in FILTER/WHERE, or even in JOINs, when we have to specify multiple possible values for any column. A value qualifies if it is one of the values mentioned inside the IN clause. It is the opposite for NOT IN, where the value must not be among the values listed inside the NOT IN clause. So let's look at an example of the IN condition:

scala> df_pres.filter($"pres_bs" in ("New York","Ohio","Texas")).select($"pres_name",$"pres_dob",$"pres_bs").show()
+--------------------+----------+--------+
|           pres_name|  pres_dob| pres_bs|
+--------------------+----------+--------+
|…
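isin is the column-method spelling of the same condition, and negating it gives NOT IN; a sketch assuming df_pres as above:

```scala
// IN: birth state is one of the listed values.
val df_in = df_pres.filter($"pres_bs".isin("New York", "Ohio", "Texas"))

// NOT IN: negate the same predicate.
val df_not_in = df_pres.filter(!$"pres_bs".isin("New York", "Ohio", "Texas"))
```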

Continue Reading

Spark Dataframe WHERE Filter

As the name suggests, FILTER is used in Spark SQL to filter out records as per the requirement. If you do not want the complete data set and just wish to fetch the few records which satisfy some condition, then you can use the FILTER function. It is equivalent to the SQL WHERE clause and is more commonly used in Spark-SQL. Let's fetch all the presidents who were born in New York.

scala> df_pres.filter($"pres_bs" === "New York").select($"pres_name",$"pres_dob".alias("Date Of Birth"),$"pres_bs").show()
+--------------------+-------------+--------+
|           pres_name|Date Of Birth|…
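Multiple conditions can be combined inside one filter with && and ||; a sketch assuming df_pres as above (the date cutoff is an arbitrary illustrative value):

```scala
// Presidents born in New York who took office after 1900.
val df_ny = df_pres.filter(
  $"pres_bs" === "New York" && $"pres_in" > "1900-01-01")
```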

Continue Reading