Apache Spark

Spark Dataframe Select

In this post, we will see how to fetch data from HIVE table into SPARK DataFrame and perform few SQL like “SELECT” operations on it. I have a table in HIVE database which has details of all the US Presidents (src: https://en.wikipedia.org/) .
If you don’t know how to create table in hive or load data into hive table kindly read this post:
Create Table in Hive
The data looks like :

0: jdbc:hive2://localhost:10000> select * from pres_orc as tbl;
tbl.pres_id tbl.pres_name tbl.pres_dob tbl.pres_bp tbl.pres_bs tbl.pres_in tbl.pres_out
1 George Washington 1732-02-22 Westmoreland County Virginia 1789-04-30 1797-03-04
2 John Adams 1735-10-30 Braintree Massachusetts 1797-03-04 1801-03-04
3 Thomas Jefferson 1743-04-13 Shadwell Virginia 1801-03-04 1809-03-04
4 James Madison 1751-03-16 Port Conway Virginia 1809-03-04 1817-03-04
5 James Monroe 1758-04-28 Monroe Hall Virginia 1817-03-04 1825-03-04
6 John Quincy Adams 1767-07-11 Braintree Massachusetts 1825-03-04 1829-03-04
7 Andrew Jackson 1767-03-15 Waxhaws Region South/North Carolina 1829-03-04 1837-03-04
8 Martin Van Buren 1782-12-05 Kinderhook New York 1837-03-04 1841-03-04
9 William Henry Harrison 1773-02-09 Charles City County Virginia 1841-03-04 1841-04-04
10 John Tyler 1790-03-29 Charles City County Virginia 1841-04-04 1845-03-04
11 James K. Polk 1795-11-02 Pineville North Carolina 1845-03-04 1849-03-04
12 Zachary Taylor 1784-11-24 Barboursville Virginia 1849-03-04 1850-07-09
13 Millard Fillmore 1800-01-07 Summerhill New York 1850-07-09 1853-03-04
14 Franklin Pierce 1804-11-23 Hillsborough New Hampshire 1853-03-04 1857-03-04
15 James Buchanan 1791-04-23 Cove Gap Pennsylvania 1857-03-04 1861-03-04
16 Abraham Lincoln 1809-02-12 Sinking spring Kentucky 1861-03-04 1865-04-15
17 Andrew Johnson 1808-12-29 Raleigh North Carolina 1865-04-15 1869-03-04
18 Ulysses S. Grant 1822-04-27 Point Pleasant Ohio 1869-03-04 1877-03-04
19 Rutherford B. Hayes 1822-10-04 Delaware Ohio 1877-03-04 1881-03-04
20 James A. Garfield 1831-11-19 Moreland Hills Ohio 1881-03-04 1881-09-19
21 Chester A. Arthur 1829-10-05 Fairfield Vermont 1881-09-19 1885-03-04
22 Grover Cleveland 1837-03-18 Caldwell New Jersey 1885-03-04 1889-03-04
23 Benjamin Harrison 1833-08-20 North Bend Ohio 1889-03-04 1893-03-04
24 Grover Cleveland 1837-03-18 Caldwell New Jersey 1893-03-04 1897-03-04
25 William McKinley 1843-01-29 Niles Ohio 1897-03-04 1901-09-14
26 Theodore Roosevelt 1858-10-27 Manhattan New York 1901-09-14 1909-03-04
27 William Howard Taft 1857-09-15 Cincinnati Ohio 1909-03-04 1913-03-04
28 Woodrow Wilson 1856-12-28 Staunton Virginia 1913-03-04 1921-03-04
29 Warren G. Harding 1865-11-02 Blooming Grove Ohio 1921-03-04 1923-08-02
30 Calvin Coolidge 1872-07-04 Plymouth Vermont 1923-08-02 1929-03-04
31 Herbert Hoover 1874-08-10 West Branch Iowa 1929-03-04 1933-03-04
32 Franklin D. Roosevelt 1882-01-30 Hyde Park New York 1933-03-04 1945-04-12
33 Harry S. Truman 1884-05-08 Lamar Missouri 1945-04-12 1953-01-20
34 Dwight D. Eisenhower 1890-10-14 Denison Texas 1953-01-20 1961-01-20
35 John F. Kennedy 1917-05-29 Brookline Massachusetts 1961-01-20 1963-11-22
36 Lyndon B. Johnson 1908-08-27 Stonewall Texas 1963-11-22 1969-01-20
37 Richard M. Nixon 1913-01-09 Yorba Linda California 1969-01-20 1974-08-09
38 Gerald R. Ford 1913-07-14 Omaha Nebraska 1974-08-09 1977-01-20
39 Jimmy Carter 1924-10-01 Plains Georgia 1977-01-20 1981-01-20
40 Ronald Reagan 1911-02-06 Tampico Illinois 1981-01-20 1989-01-20
41 George H. W. Bush 1924-06-12 Milton Massachusetts 1989-01-20 1993-01-20
42 Bill Clinton 1946-08-19 Hope Arkansas 1993-01-20 2001-01-20
43 George W. Bush 1946-07-06 New Haven Connecticut 2001-01-20 2009-01-20
44 Barack Obama 1961-08-04 Honolulu Hawaii 2009-01-20 2017-01-20
45 Donald Trump 1946-06-14 Queens New York 2017-01-20 NULL

Now we will create a DataFrame in spark and will load this data into it. For future examples, we will use this DataFrame when required. Now let’s connect to spark. I am using HDP 2.5. So I will open “spark-shell” and few points to note here:
spark version 1.6
Spark context sc
SQL context sqlContext

Now let’s run the statement to create dataframe:

 scala> val df_pres = sqlContext.sql(" SELECT * from testdb.pres_orc")
17/09/10 08:32:58 INFO ParseDriver: Parsing command: SELECT * from testdb.pres_orc
17/09/10 08:32:58 INFO ParseDriver: Parse Completed
df_pres: org.apache.spark.sql.DataFrame = [pres_id: tinyint, pres_name: string, pres_dob: date, pres_bp: string, pres_bs: string, pres_in: date, pres_out: date]

Now I have a DataFrame with name df_pres which has the HIVE Table data in it. I can perform almost all the SQL operations on it in SPARK-SQL. SELECT primarily has two options:

  1. You can either SELECT all columns by specifying “*” in the SQL query
  2. You can mention specific columns in the SQL query to pick only required columns

Now how do we do it in Spark ?

1) Show all columns from DataFrame

scala> df_pres.show()
+-------+--------------------+----------+-------------------+--------------------+----------+----------+
|pres_id|           pres_name|  pres_dob|            pres_bp|             pres_bs|   pres_in|  pres_out|
+-------+--------------------+----------+-------------------+--------------------+----------+----------+
|      1|   George Washington|1732-02-22|Westmoreland County|            Virginia|1789-04-30|1797-03-04|
|      2|          John Adams|1735-10-30|          Braintree|       Massachusetts|1797-03-04|1801-03-04|
|      3|    Thomas Jefferson|1743-04-13|           Shadwell|            Virginia|1801-03-04|1809-03-04|
|      4|       James Madison|1751-03-16|        Port Conway|            Virginia|1809-03-04|1817-03-04|
|      5|        James Monroe|1758-04-28|        Monroe Hall|            Virginia|1817-03-04|1825-03-04|
|      6|   John Quincy Adams|1767-07-11|          Braintree|       Massachusetts|1825-03-04|1829-03-04|
|      7|      Andrew Jackson|1767-03-15|     Waxhaws Region|South/North Carolina|1829-03-04|1837-03-04|
|      8|    Martin Van Buren|1782-12-05|         Kinderhook|            New York|1837-03-04|1841-03-04|
|      9|William Henry Har...|1773-02-09|Charles City County|            Virginia|1841-03-04|1841-04-04|
|     10|          John Tyler|1790-03-29|Charles City County|            Virginia|1841-04-04|1845-03-04|
|     11|       James K. Polk|1795-11-02|          Pineville|      North Carolina|1845-03-04|1849-03-04|
|     12|      Zachary Taylor|1784-11-24|      Barboursville|            Virginia|1849-03-04|1850-07-09|
|     13|    Millard Fillmore|1800-01-07|         Summerhill|            New York|1850-07-09|1853-03-04|
|     14|     Franklin Pierce|1804-11-23|       Hillsborough|       New Hampshire|1853-03-04|1857-03-04|
|     15|      James Buchanan|1791-04-23|           Cove Gap|        Pennsylvania|1857-03-04|1861-03-04|
|     16|     Abraham Lincoln|1809-02-12|     Sinking spring|            Kentucky|1861-03-04|1865-04-15|
|     17|      Andrew Johnson|1808-12-29|            Raleigh|      North Carolina|1865-04-15|1869-03-04|
|     18|    Ulysses S. Grant|1822-04-27|     Point Pleasant|                Ohio|1869-03-04|1877-03-04|
|     19| Rutherford B. Hayes|1822-10-04|           Delaware|                Ohio|1877-03-04|1881-03-04|
|     20|   James A. Garfield|1831-11-19|     Moreland Hills|                Ohio|1881-03-04|1881-09-19|
+-------+--------------------+----------+-------------------+--------------------+----------+----------+
only showing top 20 rows

If you wish to show more number of rows, just enter the number in the show function.

scala> df_pres.show(45)

The above command will show all 45 rows from the data frame.

2) Show specific columns from DataFrame

scala> df_pres.select($"pres_id",$"pres_dob",$"pres_bs").show()
+-------+----------+--------------------+
|pres_id|  pres_dob|             pres_bs|
+-------+----------+--------------------+
|      1|1732-02-22|            Virginia|
|      2|1735-10-30|       Massachusetts|
|      3|1743-04-13|            Virginia|
|      4|1751-03-16|            Virginia|
|      5|1758-04-28|            Virginia|
|      6|1767-07-11|       Massachusetts|
|      7|1767-03-15|South/North Carolina|
|      8|1782-12-05|            New York|
|      9|1773-02-09|            Virginia|
|     10|1790-03-29|            Virginia|
|     11|1795-11-02|      North Carolina|
|     12|1784-11-24|            Virginia|
|     13|1800-01-07|            New York|
|     14|1804-11-23|       New Hampshire|
|     15|1791-04-23|        Pennsylvania|
|     16|1809-02-12|            Kentucky|
|     17|1808-12-29|      North Carolina|
|     18|1822-04-27|                Ohio|
|     19|1822-10-04|                Ohio|
|     20|1831-11-19|                Ohio|
+-------+----------+--------------------+
only showing top 20 rows

In the next post, we will see how to mention ALIAS names for COLUMNS and TABLES while using DATAFRAMES in SPARK.

Watch Video to see how Spark Dataframe SELECT works:

Leave a Reply