Once you have access to Hive, the first thing you will want to do is create a database and a few tables in it. Before we start with the SQL commands, it helps to understand how Hive stores its data. Since everything in HDFS is file based, Hive also keeps all of its data in files, so knowing the file paths makes the concepts easier to follow.
TIP: Open two PuTTY sessions: one to execute Hive queries via beeline, and a second to run OS & HDFS commands.
[hadoop@ip-10-0-0-142 ~]$ beeline
Beeline version 2.3.7-amzn-3 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: hive
Enter password for jdbc:hive2://localhost:10000:
Connected to: Apache Hive (version 2.3.7-amzn-3)
Driver: Hive JDBC (version 2.3.7-amzn-3)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Once you log in to Hive via beeline, run the command below:
0: jdbc:hive2://localhost:10000> set hive.metastore.warehouse.dir;
+----------------------------------------------------+
|                        set                         |
+----------------------------------------------------+
| hive.metastore.warehouse.dir=/user/hive/warehouse  |
+----------------------------------------------------+
1 row selected (0.169 seconds)
So in Hadoop, /user/hive/warehouse is the location where Hive stores all of its files/data. You can check this location by running the following command in Linux:
[hadoop@ip-10-0-0-142 ~]$ hdfs dfs -ls /user/hive/warehouse
We will use a public USA Presidents data set for our example. You can download the data set from this link.
Let’s create a DATABASE first and then we will create tables in it.
CREATE DATABASE in HIVE
0: jdbc:hive2://localhost:10000> create database db_pres;
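Before checking HDFS, you can also confirm the new database from beeline itself; a quick check (output omitted here) could look like this:

SHOW DATABASES;
-- DESCRIBE DATABASE also shows the HDFS location backing the database
DESCRIBE DATABASE db_pres;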
Now let’s check how this has changed the warehouse directory:
[hadoop@ip-10-0-0-142 ~]$ hdfs dfs -ls /user/hive/warehouse
Found 1 items
drwxrwxrwt   - hive hadoop          0 2021-05-06 05:48 /user/hive/warehouse/db_pres.db
I have placed the data file under /tmp/hive_data/ and can validate it:
[hadoop@ip-10-0-0-142 ~]$ head /tmp/hive_data/pres_data.csv
pres_id,pres_name,pres_dob,pres_bp,pres_bs,pres_in,pres_out
1,George Washington,1732-02-22,Westmoreland County,Virginia,1789-04-30,1797-03-04
2,John Adams,1735-10-30,Braintree,Massachusetts,1797-03-04,1801-03-04
3,Thomas Jefferson,1743-04-13,Shadwell,Virginia,1801-03-04,1809-03-04
4,James Madison,1751-03-16,Port Conway,Virginia,1809-03-04,1817-03-04
5,James Monroe,1758-04-28,Monroe Hall,Virginia,1817-03-04,1825-03-04
6,John Quincy Adams,1767-07-11,Braintree,Massachusetts,1825-03-04,1829-03-04
7,Andrew Jackson,1767-03-15,Waxhaws Region,South/North Carolina,1829-03-04,1837-03-04
8,Martin Van Buren,1782-12-05,Kinderhook,New York,1837-03-04,1841-03-04
9,William Henry Harrison,1773-02-09,Charles City County,Virginia,1841-03-04,1841-04-04
Now that the file is in place, the next step is to create a table to store the data. Switch to the beeline CLI and run the CREATE TABLE command below.
CREATE TABLE in HIVE
create table db_pres.usa_prez
(
  pres_id   integer,
  pres_name varchar(100),
  pres_dob  date,
  pres_bp   varchar(100),
  pres_bs   varchar(100),
  pres_in   date,
  pres_out  date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
tblproperties ("skip.header.line.count"="1");
Verify table creation at directory level:
[hadoop@ip-10-0-0-142 ~]$ hdfs dfs -ls /user/hive/warehouse/db_pres.db/
Found 1 items
drwxrwxrwt   - hive hadoop          0 2021-05-06 06:08 /user/hive/warehouse/db_pres.db/usa_prez
Most of the CREATE TABLE statement will look familiar; the only difference is at the end of the command, where we specify the file format. The file is comma separated, so we declare the field delimiter and store the table as TEXTFILE. There are other formats as well, but for this example we will stick with TEXTFILE. Since the data file has a header row, we want to skip the first line while loading data into the table, hence the table property to skip 1 header line.
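As an aside, if you wanted one of those other formats, say a columnar format such as ORC, a minimal sketch might look like the statement below (the table name usa_prez_orc is just an illustration). Note that LOAD DATA does not convert file formats, so an ORC table is typically populated with an INSERT ... SELECT from a text-format table rather than by loading the CSV directly; we will look at table formats in more detail in the next post.

-- illustrative only: same columns, stored as ORC instead of TEXTFILE
create table db_pres.usa_prez_orc
(
  pres_id   integer,
  pres_name varchar(100),
  pres_dob  date,
  pres_bp   varchar(100),
  pres_bs   varchar(100),
  pres_in   date,
  pres_out  date
)
STORED AS ORC;

-- typically filled from the text table after it has been loaded:
-- insert into db_pres.usa_prez_orc select * from db_pres.usa_prez;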
Once the table is created, the next step is to load data into it. We will use the command below to load data into the Hive table:
LOAD CSV DATA into Hive Table
LOAD DATA LOCAL INPATH '/tmp/hive_data/pres_data.csv' INTO TABLE db_pres.usa_prez;
The LOCAL keyword is optional. If you specify it, Hive looks for the file on the local OS path; if you skip it, Hive looks for the file in HDFS.
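For example, if the file had already been copied into HDFS, the load would be written without LOCAL; the HDFS path below is hypothetical. Also note that without LOCAL the source file is moved into the table's warehouse directory, whereas with LOCAL a copy of the local file is made.

-- hypothetical HDFS staging path; without LOCAL, Hive moves this file into the table directory
LOAD DATA INPATH '/user/hadoop/staging/pres_data.csv' INTO TABLE db_pres.usa_prez;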
Once the data is loaded, you can verify it in the table's directory:
[hadoop@ip-10-0-0-142 ~]$ hdfs dfs -ls /user/hive/warehouse/db_pres.db/usa_prez/
Found 1 items
-rwxrwxrwt   1 hive hadoop       3375 2021-05-06 06:11 /user/hive/warehouse/db_pres.db/usa_prez/pres_data.csv
So far we have created a database, created a table, and loaded data into it. The next step is to run a few sample queries on the data.
0: jdbc:hive2://localhost:10000> select count(*) from db_pres.usa_prez;
+------+
| _c0  |
+------+
| 46   |
+------+
1 row selected (0.659 seconds)
Let’s look at some sample records too:
0: jdbc:hive2://localhost:10000> select * from db_pres.usa_prez as t limit 5;
+------------+--------------------+-------------+----------------------+----------------+-------------+-------------+
| t.pres_id  |    t.pres_name     | t.pres_dob  |      t.pres_bp       |   t.pres_bs    |  t.pres_in  | t.pres_out  |
+------------+--------------------+-------------+----------------------+----------------+-------------+-------------+
| 1          | George Washington  | 1732-02-22  | Westmoreland County  | Virginia       | 1789-04-30  | 1797-03-04  |
| 2          | John Adams         | 1735-10-30  | Braintree            | Massachusetts  | 1797-03-04  | 1801-03-04  |
| 3          | Thomas Jefferson   | 1743-04-13  | Shadwell             | Virginia       | 1801-03-04  | 1809-03-04  |
| 4          | James Madison      | 1751-03-16  | Port Conway          | Virginia       | 1809-03-04  | 1817-03-04  |
| 5          | James Monroe       | 1758-04-28  | Monroe Hall          | Virginia       | 1817-03-04  | 1825-03-04  |
+------------+--------------------+-------------+----------------------+----------------+-------------+-------------+
5 rows selected (0.132 seconds)
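You can also try slightly richer queries on the same table; for example, a quick aggregation on the birth-state column pres_bs (output not shown here):

-- count presidents per birth state
select pres_bs, count(*) as presidents
from db_pres.usa_prez
group by pres_bs
order by presidents desc;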
Now you know how to create a table in Hive and load data into it. In the next post we will look at different table formats and how they are stored in HDFS.