IN or NOT IN conditions are used in FILTER/WHERE or even in JOINS when we have to specify multiple possible values for any column. If the value is one of the values mentioned inside “IN” clause then it will qualify. It is opposite for “NOT IN” where the value must not be among any one present inside NOT IN clause. So let’s look at the example for IN condition scala> df_pres.filter($”pres_bs” in (“New York”,”Ohio”,”Texas”)).select($”pres_name”,$”pres_dob”,$”pres_bs”).show() +——————–+———-+——–+ | pres_name| pres_dob| pres_bs| +——————–+———-+——–+ | Martin Van Buren|1782-12-05|New York| | Millard Fillmore|1800-01-07|New York| | Ulysses S. Grant|1822-04-27| Ohio| | Rutherford B. Hayes|1822-10-04| Ohio| | James A. Garfield|1831-11-19| Ohio| |Read More →

As the name suggests, FILTER is used in Spark SQL to filter out records as per the requirement. If you do not want complete data set and just wish to fetch few records which satisfy some condition then you can use FILTER function. It is equivalent to SQL “WHERE” clause and is more commonly used in Spark-SQL. Let’s fetch all the presidents who were born in New York. scala> df_pres.filter($”pres_bs” === “New York”).select($”pres_name”,$”pres_dob”.alias(“Date Of Birth”),$”pres_bs”).show() +——————–+————-+——–+ | pres_name|Date Of Birth| pres_bs| +——————–+————-+——–+ | Martin Van Buren| 1782-12-05|New York| | Millard Fillmore| 1800-01-07|New York| | Theodore Roosevelt| 1858-10-27|New York| |Franklin D. Roose…| 1882-01-30|New York| | DonaldRead More →

ALIAS is defined in order to make columns or tables more readable or even shorter. If you wish to rename your columns while displaying it to the user or if you are using tables in joins then you may need to have alias for table names. Other than making column names or table names more readable, alias also helps in making developer life better by writing smaller table names in join conditions. You may have to give alias name to DERIVED table as well in SQL. Now let’s see how to give alias names to columns or tables in Spark SQL. We will use alias()Read More →

In this post, we will see how to fetch data from HIVE table into SPARK DataFrame and perform few SQL like “SELECT” operations on it. I have a table in HIVE database which has details of all the US Presidents (src: https://en.wikipedia.org/) .If you don’t know how to create table in hive or load data into hive table kindly read this post:Create Table in HiveThe data looks like : 0: jdbc:hive2://localhost:10000> select * from pres_orc as tbl; +————–+————————-+—————+———————-+———————–+————–+—————+–+ | tbl.pres_id | tbl.pres_name | tbl.pres_dob | tbl.pres_bp | tbl.pres_bs | tbl.pres_in | tbl.pres_out | +————–+————————-+—————+———————-+———————–+————–+—————+–+ | 1 | George Washington | 1732-02-22 | Westmoreland County |Read More →

While working on HIVE, you may want to use variables in the query to get results. A good coding practice is not to hardcode values in the query itself so we should know how to use variables in the HIVE query. Hive variables can be referred using “hivevar” keyword. We can set value of HIVE variable using below command: SET hivevar:VARIABLE_NAME=’VARIABLE_VALUE’; We can use this value by using ${hivevar:VARIABLE_NAME} in the query. Example: set hivevar:desc=’Legislators’; select * from sample_07 where description=${hivevar:desc};Read More →

We may want to subtract two timestamps in order to find out the difference between occurence of two events. This is a very common operation which we perform on any TIMESTAMP or DATE or TIME data type. Now the question is how to find out the difference or subtract two timestamp ? In HIVE we can directly subtract two timestamp columns and get the result. Let’s see this with an example: TIMESTAMP 1: 2017-09-22 17:22:38 TIMESTAMP 2: 2017-09-12 14:53:19 Now I want to find the difference between these 2 timestamp values. I can directly subtract it. select CAST(‘2017-09-22 17:22:38’ as timestamp) – CAST(‘2017-09-12 14:53:19’ asRead More →