In this post, we will see how to connect to three very popular RDBMS using Spark. We will create a connection and fetch some records via Spark. The dataframe will hold the data, and we can use it as per requirement. We will talk about the JAR files required for the connection and the JDBC connection string used to fetch data and load the dataframe. Connect to Netezza from Spark RDBMS: Netezza Jar required: nzjdbc.jar Step 1: Open Spark shell and add the jar: spark-shell --jars /tmp/nz/nzjdbc.jar Step 2: Pass the required parameters and create a dataframe with data from Netezza: val df_nz = sqlContext.read.format("jdbc").options(Map("url" -> "jdbc:netezza://192.168.200.201:5480/DBNAME", "user" -> "admin", "password" -> "password", Read More →
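In PySpark, the same Netezza read can be sketched as a map of JDBC options. This is a minimal sketch, not the post's exact code: the host, port, database, user, and password are the placeholder values from the post, and "dbtable" is a hypothetical table name added for illustration.

```python
# A minimal sketch of the JDBC options Spark needs to read from Netezza.
# Host, credentials, and database are the placeholder values from the post;
# "dbtable" is a hypothetical table name used only for illustration.
jdbc_options = {
    "url": "jdbc:netezza://192.168.200.201:5480/DBNAME",
    "user": "admin",
    "password": "password",
    "dbtable": "ADMIN.USA_PREZ",      # hypothetical table
    "driver": "org.netezza.Driver",   # driver class shipped in nzjdbc.jar
}

# With a SparkSession `spark` and nzjdbc.jar on the classpath, the dataframe
# would be created like this:
# df_nz = spark.read.format("jdbc").options(**jdbc_options).load()
```

The actual `spark.read` call is left commented out, since it requires a running Spark session and a reachable Netezza host.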

Recently I was working on a project to convert Teradata BTEQ scripts to PySpark code. Since it was mostly SQL queries, we were asked to transform them into Spark SQL and run them using PySpark. We used sqlContext mostly for SQL queries; however, Teradata has constructs like ACTIVITYCOUNT which can help in deciding whether to run subsequent queries. These conditional constructs cannot be directly converted to equivalent Spark SQL. So in PySpark, we converted a dataframe to an array and then checked the array index value to implement the IF condition. Be very careful while converting a dataframe to an array. It Read More →
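The ACTIVITYCOUNT replacement described above can be sketched in plain Python: `collect()` returns a list of Row objects, and indexing into it gives a local value to branch on. The list below simulates a `collect()` result rather than calling Spark, so the control flow is runnable as-is.

```python
# Simulated result of something like:
#   sqlContext.sql("select count(*) from stage_table").collect()
# collect() returns a list of Row objects; a plain one-element tuple
# stands in for a Row here so the example runs without Spark.
rows = [(5,)]

activity_count = rows[0][0]   # the equivalent of BTEQ's ACTIVITYCOUNT

# The conditional construct that Spark SQL alone cannot express:
if activity_count > 0:
    decision = "run subsequent queries"
else:
    decision = "skip subsequent queries"
```

The caution in the post applies here too: collecting a large dataframe pulls all its rows to the driver, so collect only small results such as counts.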

BETWEEN operator in HIVE

When we have to check whether a column value lies within a range of values, we use the BETWEEN operator in the query. You have to specify the MIN and MAX values for the range when using the BETWEEN operator. Example:

select pres_name,pres_bs,pres_dob from usa_president where pres_dob between '1850-01-01' and '1900-01-01';

+----------------------+-----------+-------------+
| pres_name            | pres_bs   | pres_dob    |
+----------------------+-----------+-------------+
| Theodore Roosevelt   | New York  | 1858-10-27  |
| William Howard Taft  | Ohio      | 1857-09-15  |
| Woodrow Wilson       | Virginia  | 1856-12-28  |
| Warren G. Harding    | Ohio      | 1865-11-02  |
| Calvin Coolidge      | Vermont   | 1872-07-04  |
| Read More →
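The same inclusive BETWEEN semantics can be checked locally with SQLite (used here only because it ships with Python), loading a few of the rows from the output above. ISO-formatted date strings compare correctly as text, which is what makes the range check work.

```python
import sqlite3

# In-memory table with a subset of the usa_president rows from the post.
conn = sqlite3.connect(":memory:")
conn.execute("create table usa_president (pres_name text, pres_bs text, pres_dob text)")
conn.executemany(
    "insert into usa_president values (?, ?, ?)",
    [
        ("George Washington", "Virginia", "1732-02-22"),   # outside the range
        ("Theodore Roosevelt", "New York", "1858-10-27"),
        ("William Howard Taft", "Ohio", "1857-09-15"),
    ],
)

# BETWEEN is inclusive of both endpoints, exactly as in Hive.
rows = conn.execute(
    "select pres_name from usa_president "
    "where pres_dob between '1850-01-01' and '1900-01-01'"
).fetchall()
```

Only the two presidents born in the range come back; George Washington is filtered out.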

The IN operator is used to compare a column value in the table with multiple possible values. It returns TRUE if the value matches any value listed inside the IN operator. Example:

select pres_name,pres_dob,pres_bs from usa_president where pres_bs IN ('Virginia','New York','New Jersey');

+-------------------------+-------------+-----------+
| pres_name               | pres_dob    | pres_bs   |
+-------------------------+-------------+-----------+
| George Washington       | 1732-02-22  | Virginia  |
| Thomas Jefferson        | 1743-04-13  | Virginia  |
| James Madison           | 1751-03-16  | Virginia  |
| James Monroe            | 1758-04-28  | Virginia  |
| Martin Van Buren        | 1782-12-05  | New York  |
| William Henry Harrison  | 1773-02-09  | Virginia  |
| John Tyler              | Read More →
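The IN comparison behaves the same way outside Hive; a quick SQLite check with two matching rows from the output above and one deliberately non-matching row (John Adams, an assumed extra row for contrast) demonstrates it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table usa_president (pres_name text, pres_dob text, pres_bs text)")
conn.executemany(
    "insert into usa_president values (?, ?, ?)",
    [
        ("George Washington", "1732-02-22", "Virginia"),
        ("Martin Van Buren", "1782-12-05", "New York"),
        ("John Adams", "1735-10-30", "Massachusetts"),  # not in the IN list
    ],
)

# IN returns TRUE when the column matches any of the listed values.
matches = conn.execute(
    "select pres_name from usa_president "
    "where pres_bs in ('Virginia', 'New York', 'New Jersey')"
).fetchall()
```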

One of the most common operations in any data analytics environment is to generate sequences. There are multiple ways of generating sequence numbers; however, I find zipWithIndex the best one in terms of simplicity and performance combined, especially when the requirement is to generate consecutive numbers without any gap. Below is the detailed code which shall help in generating surrogate keys/natural keys/sequence numbers.

Step 1: Create a dataframe with all the required columns from the table.

df_0 = sqlContext.sql("select pres_name,pres_dob,pres_bp,pres_bs,pres_in,pres_out from usa_prez")
df_0.printSchema()

root
 |-- pres_name: string (nullable = true)
 |-- pres_dob: date (nullable = true)
 |-- pres_bp: string (nullable = true)
 |-- pres_bs: string Read More →
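zipWithIndex pairs every RDD element with a consecutive, gap-free index. As a local sketch without a Spark cluster, Python's enumerate shows the same pairing; the names below are sample values, and in the real code the input would be df_0.rdd.

```python
# Stand-in for df_0.rdd: a few sample rows (name column only, for brevity).
rows = ["George Washington", "John Adams", "Thomas Jefferson"]

# rdd.zipWithIndex() yields (element, index) pairs with indices starting
# at 0; adding 1 turns them into 1-based, gap-free surrogate keys.
with_keys = [(name, idx + 1) for idx, name in enumerate(rows)]
```

Because zipWithIndex assigns indices by partition order, the resulting keys are consecutive with no gaps, which is exactly what the requirement above calls for.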

WHERE clause in HIVE

The WHERE clause is used to FILTER out records from the final result set. When you run any query in HIVE and want to FILTER records such that all rows in the output abide by some condition, use the WHERE clause. You can mention conditions in WHERE which must be true for a row in order to be in the output result set. Example:

select pres_name,pres_dob,pres_bs from usa_president where pres_bs='New York';

+--------------------+-------------+-----------+
| pres_name          | pres_dob    | pres_bs   |
+--------------------+-------------+-----------+
| Martin Van Buren   | 1782-12-05  | New York  |
| Millard Fillmore   | 1800-01-07  | New York  |
| Theodore Roosevelt Read More →
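The WHERE filter can likewise be reproduced locally; here SQLite stands in for Hive, with two matching rows from the output above and one non-matching row (Woodrow Wilson, assumed for contrast) to show the filtering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table usa_president (pres_name text, pres_dob text, pres_bs text)")
conn.executemany(
    "insert into usa_president values (?, ?, ?)",
    [
        ("Martin Van Buren", "1782-12-05", "New York"),
        ("Millard Fillmore", "1800-01-07", "New York"),
        ("Woodrow Wilson", "1856-12-28", "Virginia"),   # filtered out
    ],
)

# Every row in the result set must satisfy the WHERE condition.
ny_rows = conn.execute(
    "select pres_name from usa_president where pres_bs = 'New York'"
).fetchall()
```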