If you work on Hadoop or any other platform that stores structured data, you have probably heard of columnar storage types. The popularity that “columnar” has gained over the past 7-8 years suggests that the buzz is not a bubble and that, from a storage perspective, this is the future of data analytics. What is Columnar Storage? For simplicity, we will restrict our discussion to RDBMS only. Data is stored in “blocks”. A block is simply a unit of physical storage space, measured in bytes, that is occupied when data is written to it. A typical block may range from a few bytes to several MB depending on
Hadoop is a very popular framework for data storage and data processing. It serves two main purposes: distributed data storage using HDFS (Hadoop Distributed File System) and data processing using MapReduce. In Hadoop, everything is stored as files, and it can process huge volumes of file data very efficiently. The obvious question is: how can I run SQL queries if everything is in files and not tables? That is actually a very good question, as SQL cannot run queries directly on data that sits in files. The dependency on tables is real, and SQL works on data present in rows
We have all been using SQL on RDBMS for a long time. The time has come to switch to SQL on Hadoop. SQL (Structured Query Language) helps us communicate with any RDBMS such as Teradata, Oracle, or Netezza, which are mostly used for OLTP or OLAP purposes. Traditional data warehouse systems store structured data in tables, organized in rows and columns. However, in the past couple of years there have been significant changes in the DW/BI world. The need for real-time data is ever increasing, and traditional RDBMS may not be able to handle streaming data that well.