PySpark – zipWithIndex Example

One of the most common operations in any data analytics environment is to generate sequences. There are multiple ways of generating sequence numbers, however I find zipWithIndex to be the best one in terms of simplicity and performance combined, especially when the requirement is to generate consecutive numbers without any gap. Below is the detailed code which will help in generating surrogate keys/natural keys/sequence numbers.

Step 1: Create a dataframe with all the required columns from the table.

df_0 = sqlContext.sql("select pres_name,pres_dob,pres_bp,pres_bs,pres_in,pres_out from…
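For reference, here is a minimal runnable sketch of the approach (df_0 follows the snippet above; seq_id and the surrounding SparkSession/sqlContext setup are illustrative assumptions):

# Minimal sketch: gap-free sequence numbers via zipWithIndex.
# Assumes a DataFrame df_0 as created in Step 1; seq_id is an
# illustrative name for the generated key column.
from pyspark.sql import Row

rdd_indexed = df_0.rdd.zipWithIndex()  # (Row, 0-based index) pairs

# Shift the index by 1 so the sequence starts at 1, then fold it
# back into each row before converting to a DataFrame again.
df_seq = rdd_indexed.map(
    lambda pair: Row(seq_id=pair[1] + 1, **pair[0].asDict())
).toDF()

df_seq.show()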

Continue Reading

Hive – WHERE

WHERE clause in HIVE

The WHERE clause is used to filter records out of the final result set. When you run any query in HIVE and you want to filter records so that all the rows in the output abide by some condition, use the WHERE clause. The conditions you mention in WHERE must be true for a row in order for it to be in the output result set. Example:

select pres_name,pres_dob,pres_bs from usa_president where pres_bs='New York';

+-------------------+------------+-----------+--+
| pres_name         | pres_dob   |…
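If you run the same data through Spark instead of the HIVE CLI, the filter looks like this; a small sketch, assuming a SparkSession named spark and that usa_president from the example is available as a table:

from pyspark.sql import functions as F

# The HiveQL example above, run through Spark SQL.
spark.sql(
    "select pres_name, pres_dob, pres_bs "
    "from usa_president where pres_bs = 'New York'"
).show()

# The same condition through the DataFrame API.
spark.table("usa_president").where(F.col("pres_bs") == "New York").show()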

Continue Reading

Hive – DISTINCT

select DISTINCT in HIVE

The DISTINCT keyword is used in a SELECT statement in HIVE to fetch only unique rows. Here “row” does not mean the entire row in the table but the combination of columns listed in the SELECT statement. If the SELECT has 3 columns listed, then SELECT DISTINCT will fetch unique rows for those 3 column values only; it will not fetch distinct values for just 1 of the columns. The row comprising the 3 columns will be…
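A quick sketch of that distinction, assuming a SparkSession named spark and the usa_president table used elsewhere on this page:

# DISTINCT applies to the combination of the listed columns...
spark.sql("select distinct pres_name, pres_bs from usa_president").show()

# ...so distinct values of a single column are a separate query,
# usually with fewer rows.
spark.sql("select distinct pres_bs from usa_president").show()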

Continue Reading

Hive – SELECT

SELECT statement in HIVE

The SELECT statement is used to fetch records from a table. A table may consist of many columns, and in the SELECT statement you can specify the names of the columns you want to fetch. You can mention one or more column names, or simply use “*” to represent all columns in the table. Example:

SELECT pres_name,pres_dob,pres_bs from usa_president;

+-------------------+------------+----------+--+
| pres_name         | pres_dob   | pres_bs  |
+-------------------+------------+----------+--+
| George Washington | 1732-02-22 | Virginia |
| John Adams…
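For comparison, the same projection via Spark; a sketch assuming a SparkSession named spark with usa_president registered as a table:

df = spark.table("usa_president")

# Named columns, mirroring the HiveQL example above.
df.select("pres_name", "pres_dob", "pres_bs").show()

# "*" fetches every column, just as in SQL.
df.select("*").show()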

Continue Reading

Hive Date Functions – all possible Date operations

Handling DATE in HIVE is not as easy as in many traditional RDBMS on the market. One of the reasons I feel so is the inability of HIVE to throw a proper error when dealing with a wrong format. If any function or date format is not compatible, it will give NULL as output in place of an error. This makes identification of wrong date manipulation in HIVE trickier. We have used DATE_FORMAT, which is available from Hive 1.2 onwards.…
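The NULL-instead-of-error behaviour is easy to demonstrate; a sketch run through Spark SQL (which mirrors HIVE's date_format here), with made-up date literals and an assumed SparkSession named spark:

# A compatible input: reformat an ISO date string.
spark.sql("select date_format('2017-05-25', 'yyyy/MM/dd')").show()
# -> 2017/05/25

# An incompatible input does not raise an error; it silently
# returns NULL, which is what makes bad dates hard to spot.
spark.sql("select date_format('25-05-2017', 'yyyy/MM/dd')").show()
# -> null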

Continue Reading

How to implement recursive queries in Spark?

Recently I was working on a project in which the client’s data warehouse was in Teradata. The requirement was to have something similar on Hadoop for a specific business application. At a high level, the requirement was to have the same data and to run similar SQL on that data to produce exactly the same report on Hadoop too. I don’t see any challenge in migrating data from Teradata to Hadoop. Also, transforming SQL into equivalent HIVE/SPARK is not that difficult now. The…
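Since Spark has no WITH RECURSIVE, the usual workaround is a loop that keeps joining and unioning until no new rows appear; a sketch with an illustrative employees/manager_id hierarchy (not the client’s actual schema):

from pyspark.sql import functions as F

# Seed: the anchor rows of the recursion (here, top-level employees).
frontier = spark.table("employees").where(F.col("manager_id").isNull())
result = frontier

# Each pass finds the children of the current frontier; the loop is
# the recursive step and stops at a fixed point (no new rows).
while frontier.count() > 0:
    frontier = (
        spark.table("employees").alias("e")
        .join(frontier.alias("f"), F.col("e.manager_id") == F.col("f.emp_id"))
        .select("e.*")
    )
    result = result.union(frontier)

result.show()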

Continue Reading

Hive Partitions – Everything you must know

What are partitions in HIVE? How are partitions implemented in HIVE? What is the benefit of partitions in HIVE? If you have these or similar questions about partitions in HIVE, then we have answered such questions and many more in this post. What is HIVE? HIVE is software built on top of Hadoop to provide an environment where users can run SQL queries rather than map reduce programs on data. It was developed primarily for people like me who…
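To make the idea concrete, here is a small sketch (run from Spark; the table and columns are invented for illustration) of declaring a partitioned table and listing its partitions:

# Each distinct sale_date value becomes its own directory on HDFS,
# so queries that filter on sale_date can skip whole partitions.
spark.sql("""
    create table if not exists sales (
        item string,
        amount double
    )
    partitioned by (sale_date string)
""")

spark.sql("show partitions sales").show()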

Continue Reading

Max iterations (100) reached for batch Resolution – Spark Error

The Max Iterations error is not a very common error in Spark; however, if you are working with Spark SQL you may encounter it. The error mostly comes while running queries which generate very long query plans. I was recently working on one such query, which involved many joins, derived tables, CTEs etc. In short, it was a pretty complex query which actually runs on Netezza every day. We were checking the feasibility and also comparing the query performance in Netezza…
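A commonly suggested workaround is to raise the fixed-point iteration cap that the error message refers to; a sketch, assuming Spark 2.x, where spark.sql.optimizer.maxIterations also bounds the analyzer’s Resolution batch (Spark 3 adds a separate spark.sql.analyzer.maxIterations, so verify against your version):

# Raise the iteration cap from the default of 100.
spark.sql("SET spark.sql.optimizer.maxIterations = 500")

# Often the more robust fix: materialize part of the huge query so
# no single plan needs that many resolution passes.
part = spark.sql("select 1 as x")  # stand-in for half of the big query
part.createOrReplaceTempView("intermediate_result")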

Continue Reading

How to select SPARK2 as default spark version

Hi Guys. I have been using HDP 2.5 for some time now, and a few of my friends asked me how they can select SPARK2 by default. In HDP 2.5 we have both Spark 1.x & Spark 2 available. However, when you start SPARK-SHELL, it will show you a prompt and will select Spark 1.x as the default. The answer to the question is present in the prompt itself. You can see it displays on screen that SPARK_MAJOR_VERSION is not set, hence it takes SPARK1 as the default.…
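So the fix is simply to set that variable before launching the shell; for example (a shell snippet, since this is an environment variable; where you persist it, e.g. in a profile, varies by setup):

# Set the major version for this shell session...
export SPARK_MAJOR_VERSION=2

# ...and spark-shell (or pyspark) now starts Spark 2 instead of 1.x.
spark-shell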

Continue Reading

Spark Dataframe concatenate strings

In many scenarios, you may want to concatenate multiple strings into one. For example, you may want to concatenate the “FIRST NAME” & “LAST NAME” of a customer to show the “FULL NAME”. In a Spark SQL Dataframe, we can use the concat function to join multiple strings into one string. Let’s look at the example below:

scala> df_pres.select(concat($"pres_id",$"pres_name")).show()

+-------------------------+
|concat(pres_id,pres_name)|
+-------------------------+
|       1George Washington|
|              2John Adams|
|        3Thomas Jefferson|
|           4James Madison|
|            5James Monroe|
|       6John Quincy Adams|
|                  7Andrew…
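Note that concat joins the values with no separator, as the output shows. If you want one, concat_ws takes a separator as its first argument; a PySpark sketch (the post’s example is Scala; this assumes the same df_pres DataFrame):

from pyspark.sql import functions as F

# No separator: "1George Washington", as in the output above.
df_pres.select(F.concat("pres_id", "pres_name")).show()

# concat_ws inserts the separator between values: "1 - George Washington".
df_pres.select(F.concat_ws(" - ", "pres_id", "pres_name")).show()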

Continue Reading