Recently I received a few questions about the query we pass to the “load” function when using a JDBC connection to connect to an RDBMS. The question is whether that query should be Spark SQL compliant or RDBMS specific. It is a valid question, because Spark SQL does not support all the SQL constructs supported by a typical RDBMS such as Teradata or Netezza. The answer: the query must be RDBMS specific. When we use a JDBC connection, the query you pass is executed on the RDBMS, and the result set is then pushed to a DataFrame…
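As a rough illustration of the point above, the sketch below wraps a query written in the database's own dialect so that it can be handed to Spark's JDBC source through the "dbtable" option. The table, columns, host name, and the helper asDbtable are all hypothetical and not taken from the original post. The Spark call itself is left as a comment because it needs a live database.

```scala
// Sketch only: table, column and host names below are hypothetical.

// A query written in Teradata's dialect. QUALIFY is Teradata syntax, which is
// fine here because the database, not Spark, parses this text.
val teradataQuery =
  """SELECT cust_id, order_dt, amount
    |FROM sales.orders
    |QUALIFY ROW_NUMBER() OVER (PARTITION BY cust_id ORDER BY order_dt DESC) = 1""".stripMargin

// The JDBC source's "dbtable" option accepts anything valid in a FROM clause,
// so the query is wrapped as an aliased subquery (hypothetical helper).
def asDbtable(query: String, alias: String = "src"): String = {
  val trimmed = query.trim.stripSuffix(";").trim
  s"($trimmed) $alias"
}

val readOptions: Map[String, String] = Map(
  "url"      -> "jdbc:teradata://td-host.example.com/DATABASE=sales", // hypothetical host
  "user"     -> "admin",
  "password" -> "password",
  "dbtable"  -> asDbtable(teradataQuery)
)

// In spark-shell (Spark 2.x, where `spark` is the SparkSession):
//   val df = spark.read.format("jdbc").options(readOptions).load()
```

Because the wrapped text is sent to the database as-is, any filtering or ranking written in the source dialect runs there, and Spark only receives the result set.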

In this post, we will see how to connect to three very popular RDBMS from Spark. We will create a connection and fetch some records through Spark; the resulting DataFrame holds the data and can be used as required. We will cover the JAR files required for each connection and the JDBC connection string used to fetch data and load the DataFrame.

Connect to Netezza from Spark
RDBMS: Netezza
Jar Required: nzjdbc.jar
Step 1: Open Spark shell and add the jar
spark-shell --jars /tmp/nz/nzjdbc.jar
Step 2: Pass the required parameters and create a DataFrame with data from Netezza
val df_nz = ….format("jdbc").options(Map("url" -> "jdbc:netezza://", "user" -> "admin", "password" -> "password", …
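Since the excerpt above is cut off, here is a minimal sketch of what a complete set of Netezza read options could look like. The host, port, database, table name, and the helper netezzaUrl are assumptions made for illustration; only the jar path, user, and password values come from the excerpt. The read call is shown as a comment because it needs a running Netezza instance.

```scala
// Sketch only: host, port, database and table are hypothetical.
def netezzaUrl(host: String, port: Int, database: String): String =
  s"jdbc:netezza://$host:$port/$database"

val nzOptions: Map[String, String] = Map(
  "url"      -> netezzaUrl("nz-host.example.com", 5480, "SALESDB"),
  "driver"   -> "org.netezza.Driver", // driver class shipped in nzjdbc.jar
  "user"     -> "admin",
  "password" -> "password",
  "dbtable"  -> "ADMIN.ORDERS"
)

// After starting the shell with: spark-shell --jars /tmp/nz/nzjdbc.jar
//   Spark 2.x: val df_nz = spark.read.format("jdbc").options(nzOptions).load()
//   Spark 1.x: val df_nz = sqlContext.read.format("jdbc").options(nzOptions).load()
//   df_nz.show(5)
```

Keeping the options in a plain Map makes it easy to reuse the same settings for several tables by changing only the "dbtable" entry.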

Recently I was working on a project in which the client data warehouse was in Teradata. The requirement was to have something similar on Hadoop for a specific business application. At a high level, the requirement was to have the same data and run similar SQL on it to produce exactly the same report on Hadoop. I did not see any challenge in migrating data from Teradata to Hadoop, and transforming SQL into equivalent Hive/Spark SQL is not that difficult now. The only challenge I saw was converting Teradata recursive queries to Spark, since Spark does not support recursive queries. I searched for various options online…

The max iterations error is not very common in Spark; however, if you are working with Spark SQL you may encounter it. It mostly occurs when running a query that generates a very long query plan. I was recently working on such a query, which involved many joins, derived tables, CTEs, and so on. In short, it was a pretty complex query that runs on Netezza every day. We were checking feasibility and comparing the query performance in Netezza against Apache Spark 2. We know that Spark takes advantage of the Catalyst optimizer when using DataFrames. Spark runs the same algorithm iteratively across multiple executors…

Spark1 in HDP

Hi guys. I have been using HDP 2.5 for some time now, and a few of my friends asked me how they can select Spark 2 by default. In HDP 2.5 both Spark 1.x and Spark 2 are available. However, when you start spark-shell, it shows a prompt and selects Spark 1.x by default. The answer to the question is in the prompt itself: it reports that SPARK_MAJOR_VERSION is not set and that Spark 1 is therefore taken as the default. All you have to do is set this parameter before calling spark-shell, and it will select the proper Spark version. Run the command below before calling spark-shell…