Apache Spark

Connect to different RDBMS from Spark

In this post, we will see how to connect to 3 very popular RDBMS using Spark. We will create connection and will fetch some records via spark. The dataframe will hold data and we can use it as per requirement. We will talk about JAR files required for connection and JDBC connection string to fetch data and load dataframe.

Connect to Netezza from Spark

RDBMS: Netezza
Jar Required: nzjdbc.jar

Load data from Netezza to Dataframe

spark-shell --jars /tmp/nz/nzjdbc.jar

val df_nz = sqlContext.read.format("jdbc").options(Map("url" -> "jdbc:netezza://192.168.200.201:5480/DBNAME", "user" -> "admin", "password" -> "password", "dbtable" -> "nz_usa_prez", "driver" -> "org.netezza.Driver")).load()

df_nz.show(5)
 +--------------------+----------+-------------------+--------------------+----------+----------+
 |           pres_name|  pres_dob|            pres_bp|             pres_bs|   pres_in|  pres_out|
 +--------------------+----------+-------------------+--------------------+----------+----------+
 |     Harry S. Truman|1884-05-08|              Lamar|            Missouri|1945-04-12|1953-01-20|
 |    Grover Cleveland|1837-03-18|           Caldwell|          New Jersey|1885-03-04|1889-03-04|
 |     Abraham Lincoln|1809-02-12|     Sinking spring|            Kentucky|1861-03-04|1865-04-15|
 |    Grover Cleveland|1837-03-18|           Caldwell|          New Jersey|1893-03-04|1897-03-04|
 |     Calvin Coolidge|1872-07-04|           Plymouth|             Vermont|1923-08-02|1929-03-04|
 +--------------------+----------+-------------------+--------------------+----------+----------+

Connect to Teradata from Spark

RDBMS: Teradata
Jar Required: terajdbc4.jar,tdgssconfig.jar,teradata-connector-1.4.4.jar

Load data from Teradata to Dataframe

spark-shell --jars /tmp/td/terajdbc4.jar,/tmp/td/tdgssconfig.jar,/tmp/td/teradata-connector-1.4.4.jar
 
val df_td = sqlContext.load("jdbc", Map("url" -> "jdbc:teradata://192.168.200.202/DBS_PORT=1025, TMODE=TERA, user=admin, password=password","dbtable" -> "dbname.td_usa_prez","driver" -> "com.teradata.jdbc.TeraDriver"))

df_td.show(5)
+--------------------+----------+-------------------+--------------------+----------+----------+
 |           pres_name|  pres_dob|            pres_bp|             pres_bs|   pres_in|  pres_out|
 +--------------------+----------+-------------------+--------------------+----------+----------+
 |     Harry S. Truman|1884-05-08|              Lamar|            Missouri|1945-04-12|1953-01-20|
 |    Grover Cleveland|1837-03-18|           Caldwell|          New Jersey|1885-03-04|1889-03-04|
 |     Abraham Lincoln|1809-02-12|     Sinking spring|            Kentucky|1861-03-04|1865-04-15|
 |    Grover Cleveland|1837-03-18|           Caldwell|          New Jersey|1893-03-04|1897-03-04|
 |     Calvin Coolidge|1872-07-04|           Plymouth|             Vermont|1923-08-02|1929-03-04|
 +--------------------+----------+-------------------+--------------------+----------+----------+

Connect to Oracle from Spark

RDBMS: Oracle
Jar Required: ojdbc7.jar

Load data from Oracle to Dataframe

spark-shell --jars /tmp/ora/ojdbc7.jar

scala> val df_ora = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:username/password@192.168.200.203:1521/SID","dbtable" -> "ora_usa_prez","driver" -> "oracle.jdbc.driver.OracleDriver"));
                                                 
df_ora.show(5)
 +--------------------+--------------------+-------------------+--------------------+--------------------+--------------------+
 |           PRES_NAME|            PRES_DOB|            PRES_BP|             PRES_BS|             PRES_IN|            PRES_OUT|
 +--------------------+--------------------+-------------------+--------------------+--------------------+--------------------+
 |   George Washington|1732-02-22 00:00:…|Westmoreland County|            Virginia|1789-04-30 00:00:…|1797-03-04 00:00:…|
 |          John Adams|1735-10-30 00:00:…|          Braintree|       Massachusetts|1797-03-04 00:00:…|1801-03-04 00:00:…|
 |    Thomas Jefferson|1743-04-13 00:00:…|           Shadwell|            Virginia|1801-03-04 00:00:…|1809-03-04 00:00:…|
 |       James Madison|1751-03-16 00:00:…|        Port Conway|            Virginia|1809-03-04 00:00:…|1817-03-04 00:00:…|
 |        James Monroe|1758-04-28 00:00:…|        Monroe Hall|            Virginia|1817-03-04 00:00:…|1825-03-04 00:00:…|
 +--------------------+--------------------+-------------------+--------------------+--------------------+--------------------+

One thing to observe here is that Dataframe from Oracle table has converted DATE datatype to Timestamp.

Now you know how to read data from different RDBMS using spark. This is very useful when you quickly need some data from EDW to compare it with HIVE table.

Leave a Reply