Apache Hadoop

Columnar Storage & why you must use it

If you are working on Hadoop or any other platform and storing structured data, I am sure you must have heard about columnar storage types. In the past 7-8 years the popularity “columnar” has gained confirms that the buzz is not a bubble and this is the future of Data Analytics from storage perspective.

What is Columnar Storage ?

For simplicity, we will restrict our discussion to RDBMS only. Data is stored in “blocks”. Blocks are nothing but physical storage space in bytes which is occupied when data is written to it. A typical block may range from few bytes to even MB depending on the platform you are working on. Traditional databases store entire row in blocks. So if 1 row has 50 columns then all 50 columns 1 row is stored in a block. Depending on the size of row, it may fit in a single block or may spill to other block or a block may remain under-utilized if row length is less than block size. This results in wastage of storage space. If I may give a more relevant example , try recalling “defragment” feature of Windows machine. Nowadays , it is automatic but earlier we use to do it manually. The intent was to free some space and align data properly into blocks.

The second problem with row format is most of the time, a query may not retrieve all the columns in the row. A row may have 50 columns in it but for the query you may be selecting only 10 columns. Since you are storing entire row together, even to fetch those 10 columns you may end up reading all 50 from blocks. This is wastage of I/O buffer.

Now let’s see how Columnar storage differs from Row based storage and how it helps in overcoming the problems mentioned above. When you store data in columnar format, all the values in a column are stored together in blocks.So the first block will have all the values for column1 (it may be sufficient or may not be sufficient to store it entirely depending on volume). Similarly, block2 will have all the values for column2. So wastage of blocks is minimum in this case as a single block can have more row values for given column.

Also, when you select only 10 columns in SQL query, you may read only those 10 blocks which stores those column values. Other blocks with remaining 40 columns won’t be read. Hence saving lots of I/O. This results in much better query performance too.

Another advantage is better compression can be achieved in columnar storage compared to row format. Like column1 is int, column2 is string so in row format a block will store numeric+string value. However in columnar, block1 will have only numeric values and block2 will have only string. So compression which is better for numeric can be applied to block1 and compression which is better for string can be applied to block2. This is not achievable in row-based storage.

Considering all the above points, it is recommend to use PARQUET & ORC format while working in Hadoop. Also modern cloud data warehouses like redshift is columnar. In Teradata, you have option to create columnar tables too.

2 thoughts on “Columnar Storage & why you must use it”

Leave a Reply