In this post, we will see how to fetch data from HIVE table into SPARK DataFrame and perform few SQL like “SELECT” operations on it. I have a table in HIVE database which has details of all the US Presidents (src: https://en.wikipedia.org/) .
If you don’t know how to create table in hive or load data into hive table kindly read this post:
Create Table in Hive
The data looks like :
0: jdbc:hive2://localhost:10000> select * from pres_orc as tbl;
tbl.pres_id | tbl.pres_name | tbl.pres_dob | tbl.pres_bp | tbl.pres_bs | tbl.pres_in | tbl.pres_out |
---|---|---|---|---|---|---|
1 | George Washington | 1732-02-22 | Westmoreland County | Virginia | 1789-04-30 | 1797-03-04 |
2 | John Adams | 1735-10-30 | Braintree | Massachusetts | 1797-03-04 | 1801-03-04 |
3 | Thomas Jefferson | 1743-04-13 | Shadwell | Virginia | 1801-03-04 | 1809-03-04 |
4 | James Madison | 1751-03-16 | Port Conway | Virginia | 1809-03-04 | 1817-03-04 |
5 | James Monroe | 1758-04-28 | Monroe Hall | Virginia | 1817-03-04 | 1825-03-04 |
6 | John Quincy Adams | 1767-07-11 | Braintree | Massachusetts | 1825-03-04 | 1829-03-04 |
7 | Andrew Jackson | 1767-03-15 | Waxhaws Region | South/North Carolina | 1829-03-04 | 1837-03-04 |
8 | Martin Van Buren | 1782-12-05 | Kinderhook | New York | 1837-03-04 | 1841-03-04 |
9 | William Henry Harrison | 1773-02-09 | Charles City County | Virginia | 1841-03-04 | 1841-04-04 |
10 | John Tyler | 1790-03-29 | Charles City County | Virginia | 1841-04-04 | 1845-03-04 |
11 | James K. Polk | 1795-11-02 | Pineville | North Carolina | 1845-03-04 | 1849-03-04 |
12 | Zachary Taylor | 1784-11-24 | Barboursville | Virginia | 1849-03-04 | 1850-07-09 |
13 | Millard Fillmore | 1800-01-07 | Summerhill | New York | 1850-07-09 | 1853-03-04 |
14 | Franklin Pierce | 1804-11-23 | Hillsborough | New Hampshire | 1853-03-04 | 1857-03-04 |
15 | James Buchanan | 1791-04-23 | Cove Gap | Pennsylvania | 1857-03-04 | 1861-03-04 |
16 | Abraham Lincoln | 1809-02-12 | Sinking spring | Kentucky | 1861-03-04 | 1865-04-15 |
17 | Andrew Johnson | 1808-12-29 | Raleigh | North Carolina | 1865-04-15 | 1869-03-04 |
18 | Ulysses S. Grant | 1822-04-27 | Point Pleasant | Ohio | 1869-03-04 | 1877-03-04 |
19 | Rutherford B. Hayes | 1822-10-04 | Delaware | Ohio | 1877-03-04 | 1881-03-04 |
20 | James A. Garfield | 1831-11-19 | Moreland Hills | Ohio | 1881-03-04 | 1881-09-19 |
21 | Chester A. Arthur | 1829-10-05 | Fairfield | Vermont | 1881-09-19 | 1885-03-04 |
22 | Grover Cleveland | 1837-03-18 | Caldwell | New Jersey | 1885-03-04 | 1889-03-04 |
23 | Benjamin Harrison | 1833-08-20 | North Bend | Ohio | 1889-03-04 | 1893-03-04 |
24 | Grover Cleveland | 1837-03-18 | Caldwell | New Jersey | 1893-03-04 | 1897-03-04 |
25 | William McKinley | 1843-01-29 | Niles | Ohio | 1897-03-04 | 1901-09-14 |
26 | Theodore Roosevelt | 1858-10-27 | Manhattan | New York | 1901-09-14 | 1909-03-04 |
27 | William Howard Taft | 1857-09-15 | Cincinnati | Ohio | 1909-03-04 | 1913-03-04 |
28 | Woodrow Wilson | 1856-12-28 | Staunton | Virginia | 1913-03-04 | 1921-03-04 |
29 | Warren G. Harding | 1865-11-02 | Blooming Grove | Ohio | 1921-03-04 | 1923-08-02 |
30 | Calvin Coolidge | 1872-07-04 | Plymouth | Vermont | 1923-08-02 | 1929-03-04 |
31 | Herbert Hoover | 1874-08-10 | West Branch | Iowa | 1929-03-04 | 1933-03-04 |
32 | Franklin D. Roosevelt | 1882-01-30 | Hyde Park | New York | 1933-03-04 | 1945-04-12 |
33 | Harry S. Truman | 1884-05-08 | Lamar | Missouri | 1945-04-12 | 1953-01-20 |
34 | Dwight D. Eisenhower | 1890-10-14 | Denison | Texas | 1953-01-20 | 1961-01-20 |
35 | John F. Kennedy | 1917-05-29 | Brookline | Massachusetts | 1961-01-20 | 1963-11-22 |
36 | Lyndon B. Johnson | 1908-08-27 | Stonewall | Texas | 1963-11-22 | 1969-01-20 |
37 | Richard M. Nixon | 1913-01-09 | Yorba Linda | California | 1969-01-20 | 1974-08-09 |
38 | Gerald R. Ford | 1913-07-14 | Omaha | Nebraska | 1974-08-09 | 1977-01-20 |
39 | Jimmy Carter | 1924-10-01 | Plains | Georgia | 1977-01-20 | 1981-01-20 |
40 | Ronald Reagan | 1911-02-06 | Tampico | Illinois | 1981-01-20 | 1989-01-20 |
41 | George H. W. Bush | 1924-06-12 | Milton | Massachusetts | 1989-01-20 | 1993-01-20 |
42 | Bill Clinton | 1946-08-19 | Hope | Arkansas | 1993-01-20 | 2001-01-20 |
43 | George W. Bush | 1946-07-06 | New Haven | Connecticut | 2001-01-20 | 2009-01-20 |
44 | Barack Obama | 1961-08-04 | Honolulu | Hawaii | 2009-01-20 | 2017-01-20 |
45 | Donald Trump | 1946-06-14 | Queens | New York | 2017-01-20 | NULL |
Now we will create a DataFrame in spark and will load this data into it. For future examples, we will use this DataFrame when required. Now let’s connect to spark. I am using HDP 2.5. So I will open “spark-shell” and few points to note here:
spark version 1.6
Spark context sc
SQL context sqlContext
Now let’s run the statement to create dataframe:
scala> val df_pres = sqlContext.sql(" SELECT * from testdb.pres_orc") 17/09/10 08:32:58 INFO ParseDriver: Parsing command: SELECT * from testdb.pres_orc 17/09/10 08:32:58 INFO ParseDriver: Parse Completed df_pres: org.apache.spark.sql.DataFrame = [pres_id: tinyint, pres_name: string, pres_dob: date, pres_bp: string, pres_bs: string, pres_in: date, pres_out: date]
Now I have a DataFrame with name df_pres which has the HIVE Table data in it. I can perform almost all the SQL operations on it in SPARK-SQL. SELECT primarily has two options:
- You can either SELECT all columns by specifying “*” in the SQL query
- You can mention specific columns in the SQL query to pick only required columns
Now how do we do it in Spark ?
1) Show all columns from DataFrame
scala> df_pres.show() +-------+--------------------+----------+-------------------+--------------------+----------+----------+ |pres_id| pres_name| pres_dob| pres_bp| pres_bs| pres_in| pres_out| +-------+--------------------+----------+-------------------+--------------------+----------+----------+ | 1| George Washington|1732-02-22|Westmoreland County| Virginia|1789-04-30|1797-03-04| | 2| John Adams|1735-10-30| Braintree| Massachusetts|1797-03-04|1801-03-04| | 3| Thomas Jefferson|1743-04-13| Shadwell| Virginia|1801-03-04|1809-03-04| | 4| James Madison|1751-03-16| Port Conway| Virginia|1809-03-04|1817-03-04| | 5| James Monroe|1758-04-28| Monroe Hall| Virginia|1817-03-04|1825-03-04| | 6| John Quincy Adams|1767-07-11| Braintree| Massachusetts|1825-03-04|1829-03-04| | 7| Andrew Jackson|1767-03-15| Waxhaws Region|South/North Carolina|1829-03-04|1837-03-04| | 8| Martin Van Buren|1782-12-05| Kinderhook| New York|1837-03-04|1841-03-04| | 9|William Henry Har...|1773-02-09|Charles City County| Virginia|1841-03-04|1841-04-04| | 10| John Tyler|1790-03-29|Charles City County| Virginia|1841-04-04|1845-03-04| | 11| James K. Polk|1795-11-02| Pineville| North Carolina|1845-03-04|1849-03-04| | 12| Zachary Taylor|1784-11-24| Barboursville| Virginia|1849-03-04|1850-07-09| | 13| Millard Fillmore|1800-01-07| Summerhill| New York|1850-07-09|1853-03-04| | 14| Franklin Pierce|1804-11-23| Hillsborough| New Hampshire|1853-03-04|1857-03-04| | 15| James Buchanan|1791-04-23| Cove Gap| Pennsylvania|1857-03-04|1861-03-04| | 16| Abraham Lincoln|1809-02-12| Sinking spring| Kentucky|1861-03-04|1865-04-15| | 17| Andrew Johnson|1808-12-29| Raleigh| North Carolina|1865-04-15|1869-03-04| | 18| Ulysses S. Grant|1822-04-27| Point Pleasant| Ohio|1869-03-04|1877-03-04| | 19| Rutherford B. Hayes|1822-10-04| Delaware| Ohio|1877-03-04|1881-03-04| | 20| James A. Garfield|1831-11-19| Moreland Hills| Ohio|1881-03-04|1881-09-19| +-------+--------------------+----------+-------------------+--------------------+----------+----------+ only showing top 20 rows
If you wish to show more number of rows, just enter the number in the show function.
scala> df_pres.show(45)
The above command will show all 45 rows from the data frame.
2) Show specific columns from DataFrame
scala> df_pres.select($"pres_id",$"pres_dob",$"pres_bs").show() +-------+----------+--------------------+ |pres_id| pres_dob| pres_bs| +-------+----------+--------------------+ | 1|1732-02-22| Virginia| | 2|1735-10-30| Massachusetts| | 3|1743-04-13| Virginia| | 4|1751-03-16| Virginia| | 5|1758-04-28| Virginia| | 6|1767-07-11| Massachusetts| | 7|1767-03-15|South/North Carolina| | 8|1782-12-05| New York| | 9|1773-02-09| Virginia| | 10|1790-03-29| Virginia| | 11|1795-11-02| North Carolina| | 12|1784-11-24| Virginia| | 13|1800-01-07| New York| | 14|1804-11-23| New Hampshire| | 15|1791-04-23| Pennsylvania| | 16|1809-02-12| Kentucky| | 17|1808-12-29| North Carolina| | 18|1822-04-27| Ohio| | 19|1822-10-04| Ohio| | 20|1831-11-19| Ohio| +-------+----------+--------------------+ only showing top 20 rows
In the next post, we will see how to mention ALIAS names for COLUMNS and TABLES while using DATAFRAMES in SPARK.
Watch Video to see how Spark Dataframe SELECT works: