Hadoop is a very popular framework for data storage and data processing. It serves two main purposes:
- Distributed data storage using HDFS (Hadoop Distributed File System)
- Data processing using Map-Reduce.
In Hadoop everything is stored as files, and it can process huge volumes of file data very efficiently. Now the obvious question is: how can I run SQL queries if everything is in files and not tables? That is actually a very good question, because SQL cannot query data sitting in plain files. The dependency on tables is real: SQL works on data organized in rows and columns.
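To see what "everything is a file" means in practice, you can browse HDFS directly from the sandbox shell. The commands below are standard `hdfs dfs` subcommands, but the paths and file name are made up for this sketch; they require a running Hadoop cluster:

```shell
# List files under a user directory in HDFS (path is illustrative)
hdfs dfs -ls /user/maria_dev

# Peek at the first few lines of a raw data file (file name is illustrative)
hdfs dfs -cat /user/maria_dev/ratings.csv | head -5
```

There is no table anywhere here, just delimited text sitting in a distributed file system, which is exactly the gap Hive fills.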
To overcome this dependency, Hive was introduced. Hive is a SQL infrastructure built on top of Hadoop. It gives users RDBMS-like features to create databases, tables and views, and to execute SQL queries on the platform. Internally, each SQL query is compiled and converted into Map-Reduce jobs, so SQL experts need not worry about writing Map-Reduce jobs; Hive takes care of it.
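As a quick illustration (the table and column names here are made up for this sketch), a familiar-looking HiveQL statement like the following is translated by Hive into Map-Reduce jobs behind the scenes:

```sql
-- Define a table over comma-delimited text files in HDFS
CREATE TABLE employees (
  id INT,
  name STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- An aggregate query like this is compiled into Map-Reduce jobs
SELECT name, MAX(salary)
FROM employees
GROUP BY name;
```

You write ordinary SQL; Hive handles the translation to distributed processing.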
Now we will see how we can run queries using Hive. As mentioned in an earlier post, I am using HDP 2.4 for these tutorials. This avoids the many configuration issues I would otherwise face installing each application individually. So I strongly suggest you install Hortonworks HDP or Cloudera CDH, whichever you prefer. It takes care of installation, configuration and settings, and you can focus directly on the actual tasks.
There are multiple ways to use Hive, and we will look at three of the most popular ones:
The first option is the Hive view in Ambari, a graphical web portal for managing the settings and configurations of the various applications installed as part of HDP. Once HDP has started, go to http://127.0.0.1:8080. The default username and password are both maria_dev. Log into the portal and open the Hive view from the option list next to the username. You will see a query editor worksheet where you can write and execute SQL queries.
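Anything you would type in a SQL client can go into the worksheet. For example (the database name is made up for this sketch):

```sql
SHOW DATABASES;
CREATE DATABASE IF NOT EXISTS tutorial_db;
USE tutorial_db;
SHOW TABLES;
```

Results appear in a grid below the editor, much like a desktop SQL tool.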
The second option is the Hive CLI. I will log into my HDP sandbox as root. The Hive CLI used to be very popular, but with HiveServer2 it has been replaced by Beeline, which we will see in the next step. Once logged into the HDP sandbox, I just have to type the hive command and it will open the Hive shell.
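On the sandbox, a session looks roughly like this (the prompt text and output can differ slightly between versions):

```shell
[root@sandbox ~]# hive

hive> SHOW DATABASES;
hive> exit;
```

The `hive>` prompt accepts the same HiveQL statements you would run from the Ambari Hive view.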
The third and one of the best ways to connect to Hive is the Beeline shell. It supports both remote and local (embedded) connections through its JDBC client. To connect to Hive using Beeline, I run the commands below in the sandbox:
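A minimal connection sketch is shown below; the host and port are illustrative, though 10000 is the default HiveServer2 port. This assumes HiveServer2 is up and running on the sandbox:

```shell
[root@sandbox ~]# beeline
beeline> !connect jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000:
Enter password for jdbc:hive2://localhost:10000:
```

Equivalently, you can connect in one line by passing the JDBC URL, username and password as flags:

```shell
beeline -u jdbc:hive2://localhost:10000 -n "" -p ""
```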
Note: In HDP, you can leave the username and password blank when connecting to Hive using Beeline.
Now that we know how to connect to Hive, we will see how to use it to run SQL queries in subsequent posts. I have intentionally not covered the details of Hadoop, HDFS, Map-Reduce or even the Hive architecture in order to keep this post simple; the focus is more on SQL writing.