Apache Hadoop

SQL on RDBMS to SQL on Hadoop/Cloud

We all have been using SQL on RDBMS for so long now. The time has come when we shall switch to SQL on Hadoop/Cloud. SQL (Structured Query Language) help us in communicating with any RDBMS like Teradata, Oracle, Netezza, SQL Server etc which are mostly used for OLAP purposes. Traditional Datawarehouse systems used to store structured data where data is stored in Tables in rows and columns. However in the past couple of years there has been significant changes in the DW/BI world. The need for real time data is ever increasing and traditional RDBMS may not be able to handle streaming data that well. Also the storage costs involved was way too high. Since Hadoop has the answers to most of the limitations in RDBMS, it looks like a suitable alternative to traditional RDBMS systems. Slowly but gradually we can see Hadoop has already started consuming components of Datawarehouse from RDBMS to Hadoop. The concept of Hadoop or BigData warehouse is very much popular nowadays and most of the clients are desperately trying to offload data warehouse components from RDBMS to Hadoop/Cloud.

If we talk about Hadoop from technical skills perspective then clearly there was huge requirement for JAVA resources which is aberration from traditional data warehouse environment. Since the underlying framework of hadoop is built on JAVA and most of the operations (Map-Reduce) support Java, there was huge demand for JAVA experts. Earlier it was more of SQL experts and now it is all JAVA. So there was skill gap which was pretty ocular and the data warehouse world was not enough mature to handle this gap. In order to overcome this, few SQL frameworks were built on top of Hadoop to allow SQL experts run SQL-like queries. One of the most popular such framework is Apache Hive. It provides good infrastructure support to run SQL queries on Hadoop. Although it is not fully ANSI SQL compliant and few SQL queries may not work in HIVE still it is considered as the best SQL infrastructure we have today in Hadoop.

In the subsequent posts we will see how HIVE can be used to build a stable Datawahouse in Hadoop. We will see various SQL options available in HIVE and what are the limitations. How HIVE is better than RDBMS and where it falls behind traditional RDBMS. We will also see how one query works fine in RDBMS but throws error in HIVE and how to overcome the errors. We will also check how to do performance tuning in HIVE and how Hive jobs are converted to Map-Reduce programs.

In the end, I would like to say that be open to new technologies especially Hadoop. If you are SQL expert then running SQL queries on HIVE should be pretty simple for you. You may not need to know the underlying framework in details if the intention is just to run queries on hadoop. I personally have worked for more than 8 years in RDBMS DWH environment and with time it was not that exciting. However with BigData and Hadoop in picture the Datawarehouse world is again a pretty happening place to be. New technologies/learning makes it more interesting and fun to work.

Technologies/Software I will be using for future tutorials:
Hortonworks HDP2.4
Hadoop 2.7
Hive 1.2

Leave a Reply