Apache Hive

Create your first Table in HIVE and load data into it.

Once you have access to HIVE, the first thing you will likely want to do is create a database and a few tables in it. Before we start with the SQL commands, it is good to know how HIVE stores its data. Since everything in HDFS is file based, HIVE also keeps all of its information in files, so knowing the file paths helps in understanding the concepts.

TIP: Open two PuTTY sessions: one to execute Hive queries via beeline, and a second to run OS & HDFS commands.

[hadoop@ip-10-0-0-142 ~]$ beeline
Beeline version 2.3.7-amzn-3 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: hive
Enter password for jdbc:hive2://localhost:10000:
Connected to: Apache Hive (version 2.3.7-amzn-3)
Driver: Hive JDBC (version 2.3.7-amzn-3)
Transaction isolation: TRANSACTION_REPEATABLE_READ

Once you are logged in to HIVE via beeline, run the below command:

0: jdbc:hive2://localhost:10000> set hive.metastore.warehouse.dir;
+----------------------------------------------------+
|                        set                         |
+----------------------------------------------------+
| hive.metastore.warehouse.dir=/user/hive/warehouse  |
+----------------------------------------------------+
1 row selected (0.169 seconds)

So in Hadoop, /user/hive/warehouse is the location where all the files/data are stored. You can check this location by running the following command in Linux:

[hadoop@ip-10-0-0-142 ~]$ hdfs dfs -ls /user/hive/warehouse

We will use the public USA Presidents data set for our example. You can download the DATA SET from this link.

Let’s create a DATABASE first and then we will create tables in it.

CREATE DATABASE in HIVE

0: jdbc:hive2://localhost:10000> create database db_pres;
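
If you re-run the script, you can add the IF NOT EXISTS clause so the command does not fail when the database already exists; the COMMENT clause is optional and the text below is just an illustration:

0: jdbc:hive2://localhost:10000> create database if not exists db_pres comment 'USA Presidents demo database';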


Now let’s check how it has impacted the file directory:

[hadoop@ip-10-0-0-142 ~]$ hdfs dfs -ls /user/hive/warehouse
Found 1 items
drwxrwxrwt   - hive hadoop          0 2021-05-06 05:48 /user/hive/warehouse/db_pres.db
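
You can also confirm the location from within beeline; DESCRIBE DATABASE prints the database name, its comment and its HDFS location (output omitted here):

0: jdbc:hive2://localhost:10000> describe database db_pres;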


I have placed the DATA file at the path below and can validate it:

[hadoop@ip-10-0-0-142 ~]$ head /tmp/hive_data/pres_data.csv
pres_id,pres_name,pres_dob,pres_bp,pres_bs,pres_in,pres_out
1,George Washington,1732-02-22,Westmoreland County,Virginia,1789-04-30,1797-03-04
2,John Adams,1735-10-30,Braintree,Massachusetts,1797-03-04,1801-03-04
3,Thomas Jefferson,1743-04-13,Shadwell,Virginia,1801-03-04,1809-03-04
4,James Madison,1751-03-16,Port Conway,Virginia,1809-03-04,1817-03-04
5,James Monroe,1758-04-28,Monroe Hall,Virginia,1817-03-04,1825-03-04
6,John Quincy Adams,1767-07-11,Braintree,Massachusetts,1825-03-04,1829-03-04
7,Andrew Jackson,1767-03-15,Waxhaws Region,South/North Carolina,1829-03-04,1837-03-04
8,Martin Van Buren,1782-12-05,Kinderhook,New York,1837-03-04,1841-03-04
9,William Henry Harrison,1773-02-09,Charles City County,Virginia,1841-03-04,1841-04-04

Now the file is in place, and the next step is to CREATE A TABLE to store the DATA in. Switch to the beeline CLI and run the below CREATE TABLE command:

CREATE TABLE in HIVE

create table db_pres.usa_prez
(
pres_id integer,
pres_name varchar(100),
pres_dob date,
pres_bp varchar(100),
pres_bs varchar(100),
pres_in date,
pres_out date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
tblproperties ("skip.header.line.count"="1");

Verify table creation at directory level:

[hadoop@ip-10-0-0-142 ~]$ hdfs dfs -ls /user/hive/warehouse/db_pres.db/
Found 1 items
drwxrwxrwt   - hive hadoop          0 2021-05-06 06:08 /user/hive/warehouse/db_pres.db/usa_prez
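
You can also inspect the table definition from within beeline; DESCRIBE FORMATTED lists the columns along with the table's HDFS location and table properties (output omitted here):

0: jdbc:hive2://localhost:10000> describe formatted db_pres.usa_prez;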

Most of the CREATE TABLE statement will look familiar to you; the only difference is at the end of the command, where we specify the file format. The file is comma separated and we are storing it as TEXTFILE. There are other formats as well, however for this example we will store the table as TEXTFILE only. Since the DATA file has a header row in it, we want to skip the first row while loading the data into the table, hence the table property to skip 1 header line.
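
As a quick sketch of one of those other formats (not used in this example), the same columns could be stored as ORC instead; the table name usa_prez_orc is just for illustration, and note that the delimiter clauses and the header-skip property only apply to delimited text files:

create table db_pres.usa_prez_orc
(
pres_id integer,
pres_name varchar(100),
pres_dob date,
pres_bp varchar(100),
pres_bs varchar(100),
pres_in date,
pres_out date
)
STORED AS ORC;
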
Once the table is created, the next step is to load data into it. We will use the below command to load DATA into the HIVE table:

LOAD CSV DATA into Hive Table

LOAD DATA LOCAL INPATH '/tmp/hive_data/pres_data.csv' INTO TABLE db_pres.usa_prez;

The LOCAL keyword is optional. If you specify it, Hive looks for the file on the local OS path; if you skip it, Hive searches for the file in HDFS.
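
For example, if the same CSV had already been copied into HDFS (the HDFS path below is just an illustration), you would omit LOCAL; keep in mind that loading from HDFS moves the file into the table directory rather than copying it:

LOAD DATA INPATH '/tmp/hive_data/pres_data.csv' INTO TABLE db_pres.usa_prez;
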
Once the data is loaded, you can verify it in the table directory:

[hadoop@ip-10-0-0-142 ~]$ hdfs dfs -ls /user/hive/warehouse/db_pres.db/usa_prez/
Found 1 items
-rwxrwxrwt   1 hive hadoop       3375 2021-05-06 06:11 /user/hive/warehouse/db_pres.db/usa_prez/pres_data.csv

So we have created a DATABASE, created a TABLE in it, and loaded data into the table. The next step is to run a few sample queries on the data.

0: jdbc:hive2://localhost:10000> select count(*) from db_pres.usa_prez;
+------+
| _c0  |
+------+
| 46   |
+------+
1 row selected (0.659 seconds)

Let’s look at some sample records too:

0: jdbc:hive2://localhost:10000> select * from db_pres.usa_prez as t limit 5;
+------------+--------------------+-------------+----------------------+----------------+-------------+-------------+
| t.pres_id  |    t.pres_name     | t.pres_dob  |      t.pres_bp       |   t.pres_bs    |  t.pres_in  | t.pres_out  |
+------------+--------------------+-------------+----------------------+----------------+-------------+-------------+
| 1          | George Washington  | 1732-02-22  | Westmoreland County  | Virginia       | 1789-04-30  | 1797-03-04  |
| 2          | John Adams         | 1735-10-30  | Braintree            | Massachusetts  | 1797-03-04  | 1801-03-04  |
| 3          | Thomas Jefferson   | 1743-04-13  | Shadwell             | Virginia       | 1801-03-04  | 1809-03-04  |
| 4          | James Madison      | 1751-03-16  | Port Conway          | Virginia       | 1809-03-04  | 1817-03-04  |
| 5          | James Monroe       | 1758-04-28  | Monroe Hall          | Virginia       | 1817-03-04  | 1825-03-04  |
+------------+--------------------+-------------+----------------------+----------------+-------------+-------------+
5 rows selected (0.132 seconds)
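
You can of course run any other HiveQL against the table as well; as one more illustrative example (output omitted here), the query below counts presidents by birth state:

0: jdbc:hive2://localhost:10000> select pres_bs, count(*) as cnt from db_pres.usa_prez group by pres_bs order by cnt desc;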

Now you know how to create a table in HIVE and load data into it. In the next post we will look at different table formats and how they are stored in HDFS.
