In this post, we will see how to fetch data from HIVE table into SPARK DataFrame and perform few SQL like “SELECT” operations on it. I have a table in HIVE database which has details of all the US Presidents (src: https://en.wikipedia.org/) .
If you don’t know how to create table in hive or load data into hive table kindly read this post:
Create Table in Hive
The data looks like :
0: jdbc:hive2://localhost:10000> select * from pres_orc as tbl;
tbl.pres_id | tbl.pres_name | tbl.pres_dob | tbl.pres_bp | tbl.pres_bs | tbl.pres_in | tbl.pres_out |
---|---|---|---|---|---|---|
1 | George Washington | 1732-02-22 | Westmoreland County | Virginia | 1789-04-30 | 1797-03-04 |
2 | John Adams | 1735-10-30 | Braintree | Massachusetts | 1797-03-04 | 1801-03-04 |
3 | Thomas Jefferson | 1743-04-13 | Shadwell | Virginia | 1801-03-04 | 1809-03-04 |
4 | James Madison | 1751-03-16 | Port Conway | Virginia | 1809-03-04 | 1817-03-04 |
5 | James Monroe | 1758-04-28 | Monroe Hall | Virginia | 1817-03-04 | 1825-03-04 |
6 | John Quincy Adams | 1767-07-11 | Braintree | Massachusetts | 1825-03-04 | 1829-03-04 |
7 | Andrew Jackson | 1767-03-15 | Waxhaws Region | South/North Carolina | 1829-03-04 | 1837-03-04 |
8 | Martin Van Buren | 1782-12-05 | Kinderhook | New York | 1837-03-04 | 1841-03-04 |
9 | William Henry Harrison | 1773-02-09 | Charles City County | Virginia | 1841-03-04 | 1841-04-04 |
10 | John Tyler | 1790-03-29 | Charles City County | Virginia | 1841-04-04 | 1845-03-04 |
11 | James K. Polk | 1795-11-02 | Pineville | North Carolina | 1845-03-04 | 1849-03-04 |
12 | Zachary Taylor | 1784-11-24 | Barboursville | Virginia | 1849-03-04 | 1850-07-09 |
13 | Millard Fillmore | 1800-01-07 | Summerhill | New York | 1850-07-09 | 1853-03-04 |
14 | Franklin Pierce | 1804-11-23 | Hillsborough | New Hampshire | 1853-03-04 | 1857-03-04 |
15 | James Buchanan | 1791-04-23 | Cove Gap | Pennsylvania | 1857-03-04 | 1861-03-04 |
16 | Abraham Lincoln | 1809-02-12 | Sinking spring | Kentucky | 1861-03-04 | 1865-04-15 |
17 | Andrew Johnson | 1808-12-29 | Raleigh | North Carolina | 1865-04-15 | 1869-03-04 |
18 | Ulysses S. Grant | 1822-04-27 | Point Pleasant | Ohio | 1869-03-04 | 1877-03-04 |
19 | Rutherford B. Hayes | 1822-10-04 | Delaware | Ohio | 1877-03-04 | 1881-03-04 |
20 | James A. Garfield | 1831-11-19 | Moreland Hills | Ohio | 1881-03-04 | 1881-09-19 |
21 | Chester A. Arthur | 1829-10-05 | Fairfield | Vermont | 1881-09-19 | 1885-03-04 |
22 | Grover Cleveland | 1837-03-18 | Caldwell | New Jersey | 1885-03-04 | 1889-03-04 |
23 | Benjamin Harrison | 1833-08-20 | North Bend | Ohio | 1889-03-04 | 1893-03-04 |
24 | Grover Cleveland | 1837-03-18 | Caldwell | New Jersey | 1893-03-04 | 1897-03-04 |
25 | William McKinley | 1843-01-29 | Niles | Ohio | 1897-03-04 | 1901-09-14 |
26 | Theodore Roosevelt | 1858-10-27 | Manhattan | New York | 1901-09-14 | 1909-03-04 |
27 | William Howard Taft | 1857-09-15 | Cincinnati | Ohio | 1909-03-04 | 1913-03-04 |
28 | Woodrow Wilson | 1856-12-28 | Staunton | Virginia | 1913-03-04 | 1921-03-04 |
29 | Warren G. Harding | 1865-11-02 | Blooming Grove | Ohio | 1921-03-04 | 1923-08-02 |
30 | Calvin Coolidge | 1872-07-04 | Plymouth | Vermont | 1923-08-02 | 1929-03-04 |
31 | Herbert Hoover | 1874-08-10 | West Branch | Iowa | 1929-03-04 | 1933-03-04 |
32 | Franklin D. Roosevelt | 1882-01-30 | Hyde Park | New York | 1933-03-04 | 1945-04-12 |
33 | Harry S. Truman | 1884-05-08 | Lamar | Missouri | 1945-04-12 | 1953-01-20 |
34 | Dwight D. Eisenhower | 1890-10-14 | Denison | Texas | 1953-01-20 | 1961-01-20 |
35 | John F. Kennedy | 1917-05-29 | Brookline | Massachusetts | 1961-01-20 | 1963-11-22 |
36 | Lyndon B. Johnson | 1908-08-27 | Stonewall | Texas | 1963-11-22 | 1969-01-20 |
37 | Richard M. Nixon | 1913-01-09 | Yorba Linda | California | 1969-01-20 | 1974-08-09 |
38 | Gerald R. Ford | 1913-07-14 | Omaha | Nebraska | 1974-08-09 | 1977-01-20 |
39 | Jimmy Carter | 1924-10-01 | Plains | Georgia | 1977-01-20 | 1981-01-20 |
40 | Ronald Reagan | 1911-02-06 | Tampico | Illinois | 1981-01-20 | 1989-01-20 |
41 | George H. W. Bush | 1924-06-12 | Milton | Massachusetts | 1989-01-20 | 1993-01-20 |
42 | Bill Clinton | 1946-08-19 | Hope | Arkansas | 1993-01-20 | 2001-01-20 |
43 | George W. Bush | 1946-07-06 | New Haven | Connecticut | 2001-01-20 | 2009-01-20 |
44 | Barack Obama | 1961-08-04 | Honolulu | Hawaii | 2009-01-20 | 2017-01-20 |
45 | Donald Trump | 1946-06-14 | Queens | New York | 2017-01-20 | NULL |
Now we will create a DataFrame in spark and will load this data into it. For future examples, we will use this DataFrame when required. Now let’s connect to spark. I am using HDP 2.5. So I will open “spark-shell” and few points to note here:
spark version 1.6
Spark context sc
SQL context sqlContext
Now let’s run the statement to create dataframe:
scala> val df_pres = sqlContext.sql(" SELECT * from testdb.pres_orc") 17/09/10 08:32:58 INFO ParseDriver: Parsing command: SELECT * from testdb.pres_orc 17/09/10 08:32:58 INFO ParseDriver: Parse Completed df_pres: org.apache.spark.sql.DataFrame = [pres_id: tinyint, pres_name: string, pres_dob: date, pres_bp: string, pres_bs: string, pres_in: date, pres_out: date]
Now I have a DataFrame with name df_pres which has the HIVE Table data in it. I can perform almost all the SQL operations on it in SPARK-SQL. SELECT primarily has two options:
- You can either SELECT all columns by specifying “*” in the SQL query
- You can mention specific columns in the SQL query to pick only required columns
Now how do we do it in Spark ?
1) Show all columns from DataFrame
scala> df_pres.show() +-------+--------------------+----------+-------------------+--------------------+----------+----------+ |pres_id| pres_name| pres_dob| pres_bp| pres_bs| pres_in| pres_out| +-------+--------------------+----------+-------------------+--------------------+----------+----------+ | 1| George Washington|1732-02-22|Westmoreland County| Virginia|1789-04-30|1797-03-04| | 2| John Adams|1735-10-30| Braintree| Massachusetts|1797-03-04|1801-03-04| | 3| Thomas Jefferson|1743-04-13| Shadwell| Virginia|1801-03-04|1809-03-04| | 4| James Madison|1751-03-16| Port Conway| Virginia|1809-03-04|1817-03-04| | 5| James Monroe|1758-04-28| Monroe Hall| Virginia|1817-03-04|1825-03-04| | 6| John Quincy Adams|1767-07-11| Braintree| Massachusetts|1825-03-04|1829-03-04| | 7| Andrew Jackson|1767-03-15| Waxhaws Region|South/North Carolina|1829-03-04|1837-03-04| | 8| Martin Van Buren|1782-12-05| Kinderhook| New York|1837-03-04|1841-03-04| | 9|William Henry Har...|1773-02-09|Charles City County| Virginia|1841-03-04|1841-04-04| | 10| John Tyler|1790-03-29|Charles City County| Virginia|1841-04-04|1845-03-04| | 11| James K. Polk|1795-11-02| Pineville| North Carolina|1845-03-04|1849-03-04| | 12| Zachary Taylor|1784-11-24| Barboursville| Virginia|1849-03-04|1850-07-09| | 13| Millard Fillmore|1800-01-07| Summerhill| New York|1850-07-09|1853-03-04| | 14| Franklin Pierce|1804-11-23| Hillsborough| New Hampshire|1853-03-04|1857-03-04| | 15| James Buchanan|1791-04-23| Cove Gap| Pennsylvania|1857-03-04|1861-03-04| | 16| Abraham Lincoln|1809-02-12| Sinking spring| Kentucky|1861-03-04|1865-04-15| | 17| Andrew Johnson|1808-12-29| Raleigh| North Carolina|1865-04-15|1869-03-04| | 18| Ulysses S. Grant|1822-04-27| Point Pleasant| Ohio|1869-03-04|1877-03-04| | 19| Rutherford B. Hayes|1822-10-04| Delaware| Ohio|1877-03-04|1881-03-04| | 20| James A. Garfield|1831-11-19| Moreland Hills| Ohio|1881-03-04|1881-09-19| +-------+--------------------+----------+-------------------+--------------------+----------+----------+ only showing top 20 rows
If you wish to show more number of rows, just enter the number in the show function.
scala> df_pres.show(45)
The above command will show all 45 rows from the data frame.
2) Show specific columns from DataFrame
scala> df_pres.select($"pres_id",$"pres_dob",$"pres_bs").show() +-------+----------+--------------------+ |pres_id| pres_dob| pres_bs| +-------+----------+--------------------+ | 1|1732-02-22| Virginia| | 2|1735-10-30| Massachusetts| | 3|1743-04-13| Virginia| | 4|1751-03-16| Virginia| | 5|1758-04-28| Virginia| | 6|1767-07-11| Massachusetts| | 7|1767-03-15|South/North Carolina| | 8|1782-12-05| New York| | 9|1773-02-09| Virginia| | 10|1790-03-29| Virginia| | 11|1795-11-02| North Carolina| | 12|1784-11-24| Virginia| | 13|1800-01-07| New York| | 14|1804-11-23| New Hampshire| | 15|1791-04-23| Pennsylvania| | 16|1809-02-12| Kentucky| | 17|1808-12-29| North Carolina| | 18|1822-04-27| Ohio| | 19|1822-10-04| Ohio| | 20|1831-11-19| Ohio| +-------+----------+--------------------+ only showing top 20 rows
In the next post, we will see how to mention ALIAS names for COLUMNS and TABLES while using DATAFRAMES in SPARK.