Spark Dataframe Update Column Value

We all know that updating a column value in a table is a pain in Hive or Spark SQL, especially if you are dealing with non-ACID tables. In a Dataframe, however, you can easily update column values. In the example below we will update the State Name with the State Abbreviation. Original Dataframe:

scala> df_pres.show(45)
+-------+--------------------+----------+-------------------+--------------+----------+----------+
|pres_id|           pres_name|  pres_dob|            pres_bp|       pres_bs|   pres_in|  pres_out|
+-------+--------------------+----------+-------------------+--------------+----------+----------+
|      1|   George Washington|1732-02-22|Westmoreland County|      Virginia|1789-04-30|1797-03-04|
|      2|          John Adams|1735-10-30|          Braintree| Massachusetts|1797-03-04|1801-03-04|
|      3|    Thomas Jefferson|1743-04-13|           Shadwell|      Virginia|1801-03-04|1809-03-04|
|      4|…
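For a quick sense of the technique before clicking through: a minimal sketch using withColumn with when/otherwise (the two state mappings shown are illustrative, not the post's full code):

// Replace a state name in pres_bs with its abbreviation.
// Only two mappings are shown here for brevity.
import org.apache.spark.sql.functions.{col, when}

val df_updated = df_pres.withColumn("pres_bs",
  when(col("pres_bs") === "Virginia", "VA")
    .when(col("pres_bs") === "Massachusetts", "MA")
    .otherwise(col("pres_bs")))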

Continue Reading

Spark Dataframe JOINS – Only post you need to read

JOINs are used to retrieve data from more than one table or dataframe. You can replicate almost all of the join types available in a typical SQL environment using Spark Dataframes. We will discuss the following join types in this post: INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, LEFT SEMI JOIN, LEFT ANTI JOIN and CROSS JOIN. Dataframe INNER JOIN: INNER JOINs are used to fetch the data common to 2 tables, or in this case 2 dataframes. You…
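As a taste of the syntax the post covers, a minimal inner-join sketch (df_states is a hypothetical lookup dataframe, not data from the post):

// Inner join of two dataframes on a key column; the last argument
// also accepts "left_outer", "right_outer", "full_outer",
// "left_semi", "left_anti" and "cross".
val df_joined = df_pres.join(df_states,
  df_pres("pres_bs") === df_states("state_name"),
  "inner")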

Continue Reading

Hive Transactional Tables: Limitations and Considerations (Part 2)

In the previous post, we discussed HIVE transactional tables: how to create them, the properties and configurations required, and example HIVE transactional table DDL with INSERT/UPDATE/DELETE/MERGE statements on a HIVE ACID table. In this post, we will talk about the limitations of HIVE transactional tables and possible solutions, if any. 1) [Error 10302]: Updating values of bucketing columns is not supported. Since an ACID table must be bucketed to enable the transactional property, you cannot run an UPDATE command that sets a bucketing column. In…
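One conceptual workaround for the bucketing-column limitation, sketched here in Spark rather than taken from the post (table and column names are illustrative, and plain Spark may need the Hive Warehouse Connector to read full ACID tables):

// Since Hive refuses to UPDATE a bucketing column, rewrite the rows
// instead: read, transform, write to a staging table, then swap it in.
import org.apache.spark.sql.functions.{col, when}

val df_fixed = spark.table("db.acid_table")
  .withColumn("bucket_col",
    when(col("bucket_col") === "old_value", "new_value")
      .otherwise(col("bucket_col")))

df_fixed.write.mode("overwrite").saveAsTable("db.acid_table_staging")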

Continue Reading

Hive Transactional Tables: Everything you must know (Part 1)

We all know HDFS does not support random updates or deletes. With HIVE ACID properties enabled, we can run UPDATE/DELETE directly on HIVE tables. HIVE ACID tables support the UPDATE, DELETE, INSERT and MERGE query constructs, with some limitations, and we will talk about those too. I am using HDP 2.6 & Hive 1.2 for the examples below. Points to consider:
1) Only the ORC storage format is supported presently.
2) The table must have a CLUSTERED BY column.
3) Table properties must include "transactional"="true"…
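A minimal DDL sketch reflecting the three points above (ORC storage, a CLUSTERED BY column, and the transactional property); the table and column names are illustrative, not the post's exact example:

-- HiveQL: an ACID-ready table definition
CREATE TABLE usa_president_acid (
  pres_id INT,
  pres_name STRING
)
CLUSTERED BY (pres_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');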

Continue Reading

Spark Dataframe – Distinct or Drop Duplicates

DISTINCT or dropDuplicates is used to remove duplicate rows from a Dataframe. A row consists of columns; if you select only one column, the output will be the unique values for that specific column. DISTINCT is very commonly used to find the possible values that exist in the dataframe for any given column. Example:

scala> df_pres.select($"pres_bs").show(45)
+--------------------+
|             pres_bs|
+--------------------+
|            Virginia|
|       Massachusetts|
|            Virginia|
|            Virginia|
|            Virginia|
|       Massachusetts|
|South/North Carolina|
|            New York|
|            Virginia|
|            Virginia|
|…
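A minimal sketch of the two methods, runnable in spark-shell against the df_pres dataframe shown above:

// distinct() works on whatever columns were selected; dropDuplicates
// can also take a subset of columns, which distinct cannot.
val unique_states = df_pres.select($"pres_bs").distinct()
val unique_rows   = df_pres.dropDuplicates()
val unique_by_col = df_pres.dropDuplicates("pres_bs")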

Continue Reading

Spark JDBC connection to RDBMS

Recently I have received a few queries regarding the query we pass to the "load" function when using a JDBC connection to an RDBMS. The question is whether that query should be Spark SQL compliant or RDBMS specific. This is actually a very valid question, because Spark SQL does not support all of the SQL constructs supported by a typical RDBMS like Teradata or Netezza. The answer to this question is: the query must be RDBMS specific. When…
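A sketch of why this matters (connection details are placeholders, and the subquery deliberately uses Teradata's TOP, which is not valid Spark SQL):

// The "dbtable" value is sent to the source database as-is, so it must
// be written in that RDBMS's dialect, not in Spark SQL.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:teradata://td-host/DATABASE=mydb")
  .option("driver", "com.teradata.jdbc.TeraDriver")
  .option("dbtable", "(SELECT TOP 10 * FROM mydb.sales) AS t")
  .option("user", "dbuser")
  .option("password", "dbpass")
  .load()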

Continue Reading

Connect to different RDBMS from Spark

In this post, we will see how to connect to 3 very popular RDBMS using Spark. We will create a connection and fetch some records via Spark. The dataframe will hold the data and we can use it as per the requirement. We will talk about the JAR files required for the connection and the JDBC connection string used to fetch data and load the dataframe. Connect to Netezza from Spark. RDBMS: Netezza. Jar required: nzjdbc.jar. Step 1: Open the Spark shell and add the jar: spark-shell --jars /tmp/nz/nzjdbc.jar…
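A hedged sketch of the load step that typically follows Step 1 (host, database and credentials are placeholders; org.netezza.Driver is the standard driver class inside nzjdbc.jar):

// Read a Netezza table into a dataframe over JDBC.
val df_nz = spark.read
  .format("jdbc")
  .option("url", "jdbc:netezza://nz-host:5480/mydb")
  .option("driver", "org.netezza.Driver")
  .option("dbtable", "mytable")
  .option("user", "admin")
  .option("password", "password")
  .load()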

Continue Reading

Teradata to PySpark – Replicate ACTIVITYCOUNT to Spark

Recently I was working on a project to convert Teradata BTEQ to PySpark code. Since it was mostly SQL queries, we were asked to typically transform them into Spark SQL and run them using PySpark. We used sqlContext mostly for SQL queries; however, Teradata has constructs like ACTIVITYCOUNT which can help in deciding whether to run subsequent queries or not. These conditional constructs cannot be directly converted to equivalent Spark SQL. So in PySpark,…
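The post itself works in PySpark; for consistency with the other snippets on this page, here is the same control-flow idea sketched in the Scala shell (table names are illustrative):

// ACTIVITYCOUNT holds the row count of the previous statement; the
// analogous pattern is to count a query's result and branch on it.
val activityCount = sqlContext.sql("SELECT * FROM mydb.stage_table").count()
if (activityCount > 0) {
  sqlContext.sql("INSERT INTO mydb.target_table SELECT * FROM mydb.stage_table")
}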

Continue Reading

Hive – BETWEEN

BETWEEN operator in HIVE: when we have to check whether a column value lies within a range of values, we use the BETWEEN operator in the query. You have to specify the MIN and MAX values for the range when using BETWEEN. Example:

select pres_name,pres_bs,pres_dob from usa_president where pres_dob between '1850-01-01' and '1900-01-01';

+---------------------+-----------+-------------+--+
|      pres_name      |  pres_bs  |  pres_dob   |
+---------------------+-----------+-------------+--+
| Theodore Roosevelt  | New York  | 1858-10-27  |
| William Howard Taft | Ohio      | 1857-09-15  |
|…
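For comparison, the same inclusive range filter via the Spark Column API used elsewhere on this page (assuming the df_pres dataframe from the Spark posts above):

// Column.between is inclusive on both bounds, like HiveQL BETWEEN.
df_pres.filter($"pres_dob".between("1850-01-01", "1900-01-01"))
  .select($"pres_name", $"pres_bs", $"pres_dob")
  .show()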

Continue Reading

Hive – IN

The IN operator is used to compare any column value in the table with multiple possible values. It returns TRUE if the value matches any of the values listed inside the IN operator. Example:

select pres_name,pres_dob,pres_bs from usa_president where pres_bs IN ('Virginia','New York','New Jersey');

+--------------------+-------------+-----------+--+
|     pres_name      |  pres_dob   |  pres_bs  |
+--------------------+-------------+-----------+--+
| George Washington  | 1732-02-22  | Virginia  |
| Thomas Jefferson   | 1743-04-13  | Virginia  |
| James Madison      | 1751-03-16  | Virginia  |
| James Monroe       | 1758-04-28  |…
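For comparison, the Spark Column API analogue of IN is isin (again assuming the df_pres dataframe from the Spark posts above):

// Returns rows whose pres_bs matches any of the listed values.
df_pres.filter($"pres_bs".isin("Virginia", "New York", "New Jersey"))
  .select($"pres_name", $"pres_dob", $"pres_bs")
  .show()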

Continue Reading