Spark Dataframe Select

In this post, we will see how to fetch data from HIVE table into SPARK DataFrame and perform few SQL like “SELECT” operations on it. I have a table in HIVE database which has details of all the US Presidents (src: https://en.wikipedia.org/) .
If you don’t know how to create table in hive or load data into hive table kindly read this post:
Create Table in Hive
The data looks like :

0: jdbc:hive2://localhost:10000> select * from pres_orc as tbl;
tbl.pres_id tbl.pres_name tbl.pres_dob tbl.pres_bp tbl.pres_bs tbl.pres_in tbl.pres_out
1 George Washington 1732-02-22 Westmoreland County Virginia 1789-04-30 1797-03-04
2 John Adams 1735-10-30 Braintree Massachusetts 1797-03-04 1801-03-04
3 Thomas Jefferson 1743-04-13 Shadwell Virginia 1801-03-04 1809-03-04
4 James Madison 1751-03-16 Port Conway Virginia 1809-03-04 1817-03-04
5 James Monroe 1758-04-28 Monroe Hall Virginia 1817-03-04 1825-03-04
6 John Quincy Adams 1767-07-11 Braintree Massachusetts 1825-03-04 1829-03-04
7 Andrew Jackson 1767-03-15 Waxhaws Region South/North Carolina 1829-03-04 1837-03-04
8 Martin Van Buren 1782-12-05 Kinderhook New York 1837-03-04 1841-03-04
9 William Henry Harrison 1773-02-09 Charles City County Virginia 1841-03-04 1841-04-04
10 John Tyler 1790-03-29 Charles City County Virginia 1841-04-04 1845-03-04
11 James K. Polk 1795-11-02 Pineville North Carolina 1845-03-04 1849-03-04
12 Zachary Taylor 1784-11-24 Barboursville Virginia 1849-03-04 1850-07-09
13 Millard Fillmore 1800-01-07 Summerhill New York 1850-07-09 1853-03-04
14 Franklin Pierce 1804-11-23 Hillsborough New Hampshire 1853-03-04 1857-03-04
15 James Buchanan 1791-04-23 Cove Gap Pennsylvania 1857-03-04 1861-03-04
16 Abraham Lincoln 1809-02-12 Sinking spring Kentucky 1861-03-04 1865-04-15
17 Andrew Johnson 1808-12-29 Raleigh North Carolina 1865-04-15 1869-03-04
18 Ulysses S. Grant 1822-04-27 Point Pleasant Ohio 1869-03-04 1877-03-04
19 Rutherford B. Hayes 1822-10-04 Delaware Ohio 1877-03-04 1881-03-04
20 James A. Garfield 1831-11-19 Moreland Hills Ohio 1881-03-04 1881-09-19
21 Chester A. Arthur 1829-10-05 Fairfield Vermont 1881-09-19 1885-03-04
22 Grover Cleveland 1837-03-18 Caldwell New Jersey 1885-03-04 1889-03-04
23 Benjamin Harrison 1833-08-20 North Bend Ohio 1889-03-04 1893-03-04
24 Grover Cleveland 1837-03-18 Caldwell New Jersey 1893-03-04 1897-03-04
25 William McKinley 1843-01-29 Niles Ohio 1897-03-04 1901-09-14
26 Theodore Roosevelt 1858-10-27 Manhattan New York 1901-09-14 1909-03-04
27 William Howard Taft 1857-09-15 Cincinnati Ohio 1909-03-04 1913-03-04
28 Woodrow Wilson 1856-12-28 Staunton Virginia 1913-03-04 1921-03-04
29 Warren G. Harding 1865-11-02 Blooming Grove Ohio 1921-03-04 1923-08-02
30 Calvin Coolidge 1872-07-04 Plymouth Vermont 1923-08-02 1929-03-04
31 Herbert Hoover 1874-08-10 West Branch Iowa 1929-03-04 1933-03-04
32 Franklin D. Roosevelt 1882-01-30 Hyde Park New York 1933-03-04 1945-04-12
33 Harry S. Truman 1884-05-08 Lamar Missouri 1945-04-12 1953-01-20
34 Dwight D. Eisenhower 1890-10-14 Denison Texas 1953-01-20 1961-01-20
35 John F. Kennedy 1917-05-29 Brookline Massachusetts 1961-01-20 1963-11-22
36 Lyndon B. Johnson 1908-08-27 Stonewall Texas 1963-11-22 1969-01-20
37 Richard M. Nixon 1913-01-09 Yorba Linda California 1969-01-20 1974-08-09
38 Gerald R. Ford 1913-07-14 Omaha Nebraska 1974-08-09 1977-01-20
39 Jimmy Carter 1924-10-01 Plains Georgia 1977-01-20 1981-01-20
40 Ronald Reagan 1911-02-06 Tampico Illinois 1981-01-20 1989-01-20
41 George H. W. Bush 1924-06-12 Milton Massachusetts 1989-01-20 1993-01-20
42 Bill Clinton 1946-08-19 Hope Arkansas 1993-01-20 2001-01-20
43 George W. Bush 1946-07-06 New Haven Connecticut 2001-01-20 2009-01-20
44 Barack Obama 1961-08-04 Honolulu Hawaii 2009-01-20 2017-01-20
45 Donald Trump 1946-06-14 Queens New York 2017-01-20 NULL

Now we will create a DataFrame in spark and will load this data into it. For future examples, we will use this DataFrame when required. Now let’s connect to spark. I am using HDP 2.5. So I will open “spark-shell” and few points to note here:
spark version 1.6
Spark context sc
SQL context sqlContext

Now let’s run the statement to create dataframe:

 scala> val df_pres = sqlContext.sql(" SELECT * from testdb.pres_orc")
17/09/10 08:32:58 INFO ParseDriver: Parsing command: SELECT * from testdb.pres_orc
17/09/10 08:32:58 INFO ParseDriver: Parse Completed
df_pres: org.apache.spark.sql.DataFrame = [pres_id: tinyint, pres_name: string, pres_dob: date, pres_bp: string, pres_bs: string, pres_in: date, pres_out: date]

Now I have a DataFrame with name df_pres which has the HIVE Table data in it. I can perform almost all the SQL operations on it in SPARK-SQL. SELECT primarily has two options:

  1. You can either SELECT all columns by specifying “*” in the SQL query
  2. You can mention specific columns in the SQL query to pick only required columns

Now how do we do it in Spark ?

1) Show all columns from DataFrame

scala> df_pres.show()
+-------+--------------------+----------+-------------------+--------------------+----------+----------+
|pres_id|           pres_name|  pres_dob|            pres_bp|             pres_bs|   pres_in|  pres_out|
+-------+--------------------+----------+-------------------+--------------------+----------+----------+
|      1|   George Washington|1732-02-22|Westmoreland County|            Virginia|1789-04-30|1797-03-04|
|      2|          John Adams|1735-10-30|          Braintree|       Massachusetts|1797-03-04|1801-03-04|
|      3|    Thomas Jefferson|1743-04-13|           Shadwell|            Virginia|1801-03-04|1809-03-04|
|      4|       James Madison|1751-03-16|        Port Conway|            Virginia|1809-03-04|1817-03-04|
|      5|        James Monroe|1758-04-28|        Monroe Hall|            Virginia|1817-03-04|1825-03-04|
|      6|   John Quincy Adams|1767-07-11|          Braintree|       Massachusetts|1825-03-04|1829-03-04|
|      7|      Andrew Jackson|1767-03-15|     Waxhaws Region|South/North Carolina|1829-03-04|1837-03-04|
|      8|    Martin Van Buren|1782-12-05|         Kinderhook|            New York|1837-03-04|1841-03-04|
|      9|William Henry Har...|1773-02-09|Charles City County|            Virginia|1841-03-04|1841-04-04|
|     10|          John Tyler|1790-03-29|Charles City County|            Virginia|1841-04-04|1845-03-04|
|     11|       James K. Polk|1795-11-02|          Pineville|      North Carolina|1845-03-04|1849-03-04|
|     12|      Zachary Taylor|1784-11-24|      Barboursville|            Virginia|1849-03-04|1850-07-09|
|     13|    Millard Fillmore|1800-01-07|         Summerhill|            New York|1850-07-09|1853-03-04|
|     14|     Franklin Pierce|1804-11-23|       Hillsborough|       New Hampshire|1853-03-04|1857-03-04|
|     15|      James Buchanan|1791-04-23|           Cove Gap|        Pennsylvania|1857-03-04|1861-03-04|
|     16|     Abraham Lincoln|1809-02-12|     Sinking spring|            Kentucky|1861-03-04|1865-04-15|
|     17|      Andrew Johnson|1808-12-29|            Raleigh|      North Carolina|1865-04-15|1869-03-04|
|     18|    Ulysses S. Grant|1822-04-27|     Point Pleasant|                Ohio|1869-03-04|1877-03-04|
|     19| Rutherford B. Hayes|1822-10-04|           Delaware|                Ohio|1877-03-04|1881-03-04|
|     20|   James A. Garfield|1831-11-19|     Moreland Hills|                Ohio|1881-03-04|1881-09-19|
+-------+--------------------+----------+-------------------+--------------------+----------+----------+
only showing top 20 rows

If you wish to show more number of rows, just enter the number in the show function.

scala> df_pres.show(45)

The above command will show all 45 rows from the data frame.

2) Show specific columns from DataFrame

scala> df_pres.select($"pres_id",$"pres_dob",$"pres_bs").show()
+-------+----------+--------------------+
|pres_id|  pres_dob|             pres_bs|
+-------+----------+--------------------+
|      1|1732-02-22|            Virginia|
|      2|1735-10-30|       Massachusetts|
|      3|1743-04-13|            Virginia|
|      4|1751-03-16|            Virginia|
|      5|1758-04-28|            Virginia|
|      6|1767-07-11|       Massachusetts|
|      7|1767-03-15|South/North Carolina|
|      8|1782-12-05|            New York|
|      9|1773-02-09|            Virginia|
|     10|1790-03-29|            Virginia|
|     11|1795-11-02|      North Carolina|
|     12|1784-11-24|            Virginia|
|     13|1800-01-07|            New York|
|     14|1804-11-23|       New Hampshire|
|     15|1791-04-23|        Pennsylvania|
|     16|1809-02-12|            Kentucky|
|     17|1808-12-29|      North Carolina|
|     18|1822-04-27|                Ohio|
|     19|1822-10-04|                Ohio|
|     20|1831-11-19|                Ohio|
+-------+----------+--------------------+
only showing top 20 rows

In the next post, we will see how to mention ALIAS names for COLUMNS and TABLES while using DATAFRAMES in SPARK.

Leave a Reply