Create your first Table in HIVE and load data into it.

Once you have access to HIVE , the first thing you would like to do is Create a Database and Create few tables in it. Before we start with the SQL commands, it is good to know how HIVE stores the data. Since in HDFS everything is FILE so HIVE stores all the information in FILEs only. So we should know the path of FILEs also for better understanding the concepts.
TIP: Open two putty sessions: one two execute hive queries via beeline , second to run OS & HDFS commands.
Once you login into HIVE via beeline, run below command:

set hive.metastore.warehouse.dir;

0: jdbc:hive2://localhost:10000> set hive.metastore.warehouse.dir;
+—————————————————-+–+
| set |
+—————————————————-+–+
| hive.metastore.warehouse.dir=/apps/hive/warehouse |
+—————————————————-+–+
1 row selected (0.222 seconds)

So in HDP, we have /apps/hive/warehouse as the location to store all the files. You can check the file location by running following command in linux:
[root@sandbox ~]# hdfs dfs -ls /apps/hive/warehouse

We will use Indian Railways public data set for our example. Indian Railways has one of the biggest railway network in the world. You can download DATA SET from this link.

Let’s create a DATABASE first and then we will create tables in it.
0: jdbc:hive2://localhost:10000> create database railways;
No rows affected (1.121 seconds)

Now let’s check how it has impacted file directory:
[root@sandbox Nitin]# hdfs dfs -ls /apps/hive/warehouse/
Found 1 items
drwxrwxrwx - hive hdfs 0 2017-03-17 08:32 /apps/hive/warehouse/railways.db

I have placed DATA file in the path and can validate it.

[root@sandbox hive_data]# pwd
/tmp/hive_data
[root@sandbox hive_data]# ls -ltr
total 7796
-rwxr-xr-x 1 root root 7981193 2017-03-22 05:13 train_detail.csv
[root@sandbox hive_data]# head train_detail.csv
Train No.,train Name,islno,station Code,Station Name,Arrival time,Departure time,Distance,Source Station Code,source Station Name,Destination station Code,Destination Station Name
'00851',BNC SUVIDHA SPL,1,BBS ,BHUBANESWAR ,'00:00:00','22:50:00',0,BBS ,BHUBANESWAR ,BNC ,BANGALORE CANT
'00851',BNC SUVIDHA SPL,2,BAM ,BRAHMAPUR ,'01:10:00','01:12:00',166,BBS ,BHUBANESWAR ,BNC ,BANGALORE CANT
'00851',BNC SUVIDHA SPL,3,VSKP,VISAKHAPATNAM ,'05:10:00','05:30:00',443,BBS ,BHUBANESWAR ,BNC ,BANGALORE CANT
'00851',BNC SUVIDHA SPL,4,BZA ,VIJAYAWADA JN ,'11:10:00','11:20:00',793,BBS ,BHUBANESWAR ,BNC ,BANGALORE CANT
'00851',BNC SUVIDHA SPL,5,RU ,RENIGUNTA JN ,'16:42:00','16:52:00',1169,BBS ,BHUBANESWAR ,BNC ,BANGALORE CANT
'00851',BNC SUVIDHA SPL,6,JTJ ,JOLARPETTAI ,'20:35:00','20:37:00',1367,BBS ,BHUBANESWAR ,BNC ,BANGALORE CANT
'00851',BNC SUVIDHA SPL,7,BNC ,BANGALORE CANT ,'22:40:00','00:00:00',1511,BBS ,BHUBANESWAR ,BNC ,BANGALORE CANT
'00852',BNC BBS SUVIDHA,1,BNC ,BANGALORE CANT ,'00:00:00','01:00:00',0,BNC ,BANGALORE CANT ,BBS ,BHUBANESWAR
'00852',BNC BBS SUVIDHA,2,JTJ ,JOLARPETTAI ,'03:30:00','03:32:00',144,BNC ,BANGALORE CANT ,BBS ,BHUBANESWAR

Now File is in the path and next step is to CREATE A TABLE to store DATA into it. Switch to beeline CLI and run below CREATE TABLE command:

Create Table Train_Route
(
Train_No string,
Train_Name string,
islno tinyint,
Stn_Cd string,
Stn_Name string,
Arr_ts string,
Dep_ts string,
Distance smallint,
Src_Stn_Cd string,
Src_Stn_Name string,
Des_Stn_Cd string,
Des_Stn_Name string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
tblproperties ("skip.header.line.count"="1");

Verify table creation at directory level:

[hive@sandbox root]$ hdfs dfs -ls /apps/hive/warehouse/railways.db
Found 1 items
drwxrwxrwx - hive hdfs 0 2017-03-22 04:46 /apps/hive/warehouse/railways.db/train_route

Most of the create table statements will look familiar to you and only difference is at the end of command where we specify the file format. File is comma separated and we are storing it as TEXTFILE. There are other formats also however for this example we will store the table as TEXTFILE only. Since the DATA file has header in it , we will skip the first row while loading the data into the table.Hence added table property to skip 1 header line.
Once Table is created, Next step is to load data into the table. We will use below command to load DATA into HIVE table:

0: jdbc:hive2://localhost:10000> LOAD DATA LOCAL INPATH '/tmp/hive_data/train_detail.csv' INTO TABLE Train_Route;
INFO : Loading data to table railways.train_route from file:/tmp/hive_data/train_detail.csv
INFO : Table railways.train_route stats: [numFiles=1, totalSize=7981193]
No rows affected (3.331 seconds)

LOCAL keyword is optional.If you specify it then it will look at OS path and if you skip it then it will search for file in HDFS.
Once data is loaded you can verify it in directory:

[root@sandbox hive_data]# hdfs dfs -ls /apps/hive/warehouse/railways.db/train_route
Found 1 items
-rwxrwxrwx 3 hive hdfs 7981193 2017-03-22 05:19 /apps/hive/warehouse/railways.db/train_route/train_detail.csv

So we have created a DATABASE then created a TABLE. Now we have loaded data into the table too. Next step is to run few sample queries on the data.

0: jdbc:hive2://localhost:10000> select count(*) from Train_Route;
INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: select count(*) from Train_Route(Stage-1)
INFO :

INFO : Status: Running (Executing on YARN cluster with App id application_1489644829884_0002)

INFO : Map 1: -/- Reducer 2: 0/1
INFO : Map 1: 0/1 Reducer 2: 0/1
INFO : Map 1: 0(+1)/1 Reducer 2: 0/1
INFO : Map 1: 0(+1)/1 Reducer 2: 0/1
INFO : Map 1: 1/1 Reducer 2: 0/1
INFO : Map 1: 1/1 Reducer 2: 0(+1)/1
INFO : Map 1: 1/1 Reducer 2: 1/1
+--------+--+
| _c0 |
+--------+--+
| 69006 |
+--------+--+
1 row selected (7.138 seconds)

Let’s look at some sample records too:

0: jdbc:hive2://localhost:10000> select * from Train_Route limit 5;
+-----------------------+-------------------------+--------------------+---------------------+-----------------------+---------------------+---------------------+-----------------------+-------------------------+---------------------------+-------------------------+---------------------------+--+
| train_route.train_no | train_route.train_name | train_route.islno | train_route.stn_cd | train_route.stn_name | train_route.arr_ts | train_route.dep_ts | train_route.distance | train_route.src_stn_cd | train_route.src_stn_name | train_route.des_stn_cd | train_route.des_stn_name |
+-----------------------+-------------------------+--------------------+---------------------+-----------------------+---------------------+---------------------+-----------------------+-------------------------+---------------------------+-------------------------+---------------------------+--+
| '00851' | BNC SUVIDHA SPL | 1 | BBS | BHUBANESWAR | '00:00:00' | '22:50:00' | 0 | BBS | BHUBANESWAR | BNC | BANGALORE CANT |
| '00851' | BNC SUVIDHA SPL | 2 | BAM | BRAHMAPUR | '01:10:00' | '01:12:00' | 166 | BBS | BHUBANESWAR | BNC | BANGALORE CANT |
| '00851' | BNC SUVIDHA SPL | 3 | VSKP | VISAKHAPATNAM | '05:10:00' | '05:30:00' | 443 | BBS | BHUBANESWAR | BNC | BANGALORE CANT |
| '00851' | BNC SUVIDHA SPL | 4 | BZA | VIJAYAWADA JN | '11:10:00' | '11:20:00' | 793 | BBS | BHUBANESWAR | BNC | BANGALORE CANT |
| '00851' | BNC SUVIDHA SPL | 5 | RU | RENIGUNTA JN | '16:42:00' | '16:52:00' | 1169 | BBS | BHUBANESWAR | BNC | BANGALORE CANT |
+-----------------------+-------------------------+--------------------+---------------------+-----------------------+---------------------+---------------------+-----------------------+-------------------------+---------------------------+-------------------------+---------------------------+--+
5 rows selected (2.201 seconds)

Now you know how to Create Table and load data into it in HIVE. In the next post we will see different table formats and how they are stored in HDFS.

1 Comment


Leave a Reply

Your email address will not be published. Required fields are marked *