In this post, we will see how to connect to three popular RDBMSs (Netezza, Teradata, and Oracle) from Spark. For each one, we will create a JDBC connection, fetch a few records, and load them into a DataFrame that you can then use as required. We will cover the JAR files needed for each connection and the JDBC connection string used to fetch data and populate the DataFrame.
Connect to Netezza from Spark
RDBMS: Netezza
JAR Required: nzjdbc.jar
Load data from Netezza to DataFrame
spark-shell --jars /tmp/nz/nzjdbc.jar

val df_nz = sqlContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:netezza://192.168.200.201:5480/DBNAME",
  "user"     -> "admin",
  "password" -> "password",
  "dbtable"  -> "nz_usa_prez",
  "driver"   -> "org.netezza.Driver")).load()

df_nz.show(5)

+--------------------+----------+-------------------+--------------------+----------+----------+
|           pres_name|  pres_dob|            pres_bp|             pres_bs|   pres_in|  pres_out|
+--------------------+----------+-------------------+--------------------+----------+----------+
|     Harry S. Truman|1884-05-08|              Lamar|            Missouri|1945-04-12|1953-01-20|
|    Grover Cleveland|1837-03-18|           Caldwell|          New Jersey|1885-03-04|1889-03-04|
|     Abraham Lincoln|1809-02-12|     Sinking spring|            Kentucky|1861-03-04|1865-04-15|
|    Grover Cleveland|1837-03-18|           Caldwell|          New Jersey|1893-03-04|1897-03-04|
|     Calvin Coolidge|1872-07-04|           Plymouth|             Vermont|1923-08-02|1929-03-04|
+--------------------+----------+-------------------+--------------------+----------+----------+
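Tip: the "dbtable" option does not have to be a physical table. If you only need a subset of rows, you can pass a parenthesized subquery (with an alias) so the filtering runs inside Netezza and only matching rows travel to Spark. A minimal sketch reusing the connection details above:

// Push the filter down to Netezza: only matching rows come back to Spark
val df_nz_va = sqlContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:netezza://192.168.200.201:5480/DBNAME",
  "user"     -> "admin",
  "password" -> "password",
  "dbtable"  -> "(SELECT * FROM nz_usa_prez WHERE pres_bs = 'Virginia') AS t",
  "driver"   -> "org.netezza.Driver")).load()

df_nz_va.show(5)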
Connect to Teradata from Spark
RDBMS: Teradata
JARs Required: terajdbc4.jar, tdgssconfig.jar, teradata-connector-1.4.4.jar
Load data from Teradata to DataFrame
spark-shell --jars /tmp/td/terajdbc4.jar,/tmp/td/tdgssconfig.jar,/tmp/td/teradata-connector-1.4.4.jar

val df_td = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:teradata://192.168.200.202/DBS_PORT=1025,TMODE=TERA,user=admin,password=password",
  "dbtable" -> "dbname.td_usa_prez",
  "driver"  -> "com.teradata.jdbc.TeraDriver")).load()

df_td.show(5)

+--------------------+----------+-------------------+--------------------+----------+----------+
|           pres_name|  pres_dob|            pres_bp|             pres_bs|   pres_in|  pres_out|
+--------------------+----------+-------------------+--------------------+----------+----------+
|     Harry S. Truman|1884-05-08|              Lamar|            Missouri|1945-04-12|1953-01-20|
|    Grover Cleveland|1837-03-18|           Caldwell|          New Jersey|1885-03-04|1889-03-04|
|     Abraham Lincoln|1809-02-12|     Sinking spring|            Kentucky|1861-03-04|1865-04-15|
|    Grover Cleveland|1837-03-18|           Caldwell|          New Jersey|1893-03-04|1897-03-04|
|     Calvin Coolidge|1872-07-04|           Plymouth|             Vermont|1923-08-02|1929-03-04|
+--------------------+----------+-------------------+--------------------+----------+----------+
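By default Spark pulls the whole table through a single JDBC connection. The JDBC source also accepts partitioning options (partitionColumn, lowerBound, upperBound, numPartitions) that split the read across executors. A sketch, assuming a hypothetical numeric column pres_id exists in the table (none of the columns above are numeric, so adjust for your own data):

val df_td_par = sqlContext.read.format("jdbc").options(Map(
  "url"             -> "jdbc:teradata://192.168.200.202/DBS_PORT=1025,TMODE=TERA,user=admin,password=password",
  "dbtable"         -> "dbname.td_usa_prez",
  "driver"          -> "com.teradata.jdbc.TeraDriver",
  "partitionColumn" -> "pres_id", // hypothetical numeric column
  "lowerBound"      -> "1",
  "upperBound"      -> "45",
  "numPartitions"   -> "4")).load()

Each partition issues its own query with a WHERE range on pres_id, so the rows load in parallel instead of through one connection.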
Connect to Oracle from Spark
RDBMS: Oracle
JAR Required: ojdbc7.jar
Load data from Oracle to DataFrame
spark-shell --jars /tmp/ora/ojdbc7.jar

val df_ora = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:oracle:thin:username/password@192.168.200.203:1521/SID",
  "dbtable" -> "ora_usa_prez",
  "driver"  -> "oracle.jdbc.driver.OracleDriver")).load()

df_ora.show(5)

+--------------------+--------------------+-------------------+--------------------+--------------------+--------------------+
|           PRES_NAME|            PRES_DOB|            PRES_BP|             PRES_BS|             PRES_IN|            PRES_OUT|
+--------------------+--------------------+-------------------+--------------------+--------------------+--------------------+
|   George Washington|1732-02-22 00:00:...|Westmoreland County|            Virginia|1789-04-30 00:00:...|1797-03-04 00:00:...|
|          John Adams|1735-10-30 00:00:...|          Braintree|       Massachusetts|1797-03-04 00:00:...|1801-03-04 00:00:...|
|    Thomas Jefferson|1743-04-13 00:00:...|           Shadwell|            Virginia|1801-03-04 00:00:...|1809-03-04 00:00:...|
|       James Madison|1751-03-16 00:00:...|        Port Conway|            Virginia|1809-03-04 00:00:...|1817-03-04 00:00:...|
|        James Monroe|1758-04-28 00:00:...|        Monroe Hall|            Virginia|1817-03-04 00:00:...|1825-03-04 00:00:...|
+--------------------+--------------------+-------------------+--------------------+--------------------+--------------------+
One thing to observe here is that the DataFrame built from the Oracle table has converted the DATE columns to Timestamp. This is expected: an Oracle DATE also stores a time component, and the Oracle JDBC driver maps it to java.sql.Timestamp by default.
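If you would rather work with plain dates, you can cast the Timestamp columns back to DateType after loading. A quick sketch against the DataFrame above:

// Drop the time component the Oracle driver added
val df_ora_dates = df_ora
  .withColumn("PRES_DOB", df_ora("PRES_DOB").cast("date"))
  .withColumn("PRES_IN", df_ora("PRES_IN").cast("date"))
  .withColumn("PRES_OUT", df_ora("PRES_OUT").cast("date"))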
Now you know how to read data from different RDBMSs using Spark. This is very useful when you quickly need some data from an EDW to compare against a Hive table.
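For example, once the DataFrame is loaded, one option is to register it as a temporary table and do the comparison in SQL. A sketch, assuming sqlContext is a HiveContext (the default in most Hadoop distributions' spark-shell) and a hypothetical Hive table named hive_usa_prez:

df_ora.registerTempTable("edw_usa_prez")

// Rows present in the EDW extract but missing from the (hypothetical) Hive table
sqlContext.sql("""SELECT e.PRES_NAME
FROM edw_usa_prez e
LEFT JOIN hive_usa_prez h ON e.PRES_NAME = h.pres_name
WHERE h.pres_name IS NULL""").show()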