What are partitions in HIVE? How partitions are implemented in HIVE? What is the benefit of partition in HIVE? If you have these or similar questions about partitions in HIVE then we have answered such questions and many more in this post.
What is HIVE ?
HIVE is a software built on top of Hadoop to provide environment where user can run SQL queries rather than map reduce program on data. It is developed primarily for people like me who have good experience in SQL and data warehouse however do not have much expertise in programming languages like Java, Scala etc. So if you have to run SQL queries on Hadoop, HIVE can be your first option.
What are partitions in HIVE?
Now when I said that HIVE could run SQL queries on Hadoop I did not talk about query performance. Now in most of the cases you will find traditional RDBMS like Teradata, Oracle etc outperform HIVE in terms of execution time for executing same queries. It could be because these RDBMS have matured in the last 25 years or so and are built to run SQL queries at core whereas HIVE is very new when compared to it. Also Hadoop or HDFS is not meant for typical ETL like processing and HIVE is just giving the option to run SQL queries. At the back-end, it converts it into a Map Reduce job only. However, with time we can see much better support and performance for SQL on HIVE too.
Now coming to the original question what are partitions in HIVE? In a very simple language, we can say Partition is the way of storing the data, which improves performance of SQL queries significantly. I want you to remember two things here: data distribution & performance. With partitions defined on HIVE table data is distributed differently (based on partition column) which facilitates faster retrieval of data hence much better performance.
Let us look at one example:
Table without any partition:
create table dim_pres ( pres_id tinyint, pres_name string, pres_dob date, pres_bp string, pres_bs string, pres_in date, pres_out date ) row format delimited fields terminated by '~' lines terminated by '\n' stored as textfile tblproperties ("skip.header.line.count"="1");
Load data into non partitioned table:
load data inpath '/tmp/raj/US_President_data.txt' into table dim_pres;
If you look how data is stored in HDFS then you will see single file:
Note: If you would have created a similar non-partitioned table and loaded data into it using INSERT statement rather than loading file, you could have seen file in parts like : /apps/hive/warehouse/dim_pres_1/000000_0
Now let us create a partition table.
create table dim_pres_2 ( pres_id tinyint, pres_name string, pres_dob date, pres_bp string, pres_in date, pres_out date ) partitioned by (pres_bs string) row format delimited fields terminated by '~' lines terminated by '\n' stored as textfile ;
Load data into Partitioned table:
insert into dim_pres_2 partition (pres_bs) select pres_id,pres_name,pres_dob,pres_bp,pres_in,pres_out,pres_bs from dim_pres;
Check how data is stored in HDFS for this table:
Now I want you to observe three points here:
- In the non-partitioned table, there is just one data file, which stores all the information.
- In the partitioned table, single file is actually split into multiple data files based on partitioned column i.e. pres_bs which is president birth state column. So one file results in 21 smaller files while storing the info. This explains the data distribution part in case of partitioned tables.
- The syntax of INSERT statement used to load partitioned table is different from usual INSERT you have seen before. We will discuss more on this in some time.
Now I am assuming data distribution part is clear to you. So how is it affecting performance? In your query if you are using filter or JOINING on pres_bs column then in non-partitioned table, entire data file with all rows are scanned to give you result. However if you run same query on partitioned table, only the specific partition is scanned. Hence, less number of rows are scanned thereby increasing performance drastically. This is called Partition Pruning or Partition Elimination.
Last point I would like to cover in this post is the modified INSERT statement used above. In HIVE there are 2 types of partitions available: STATIC PARTITIONS & DYNAMIC PARTITIONS. When user already have info about the value of partitions and specify that while loading data into partitioned table then it is STATIC PARTITION. In most of the cases, it won’t be possible and user may want to create dynamic partitions as we receive new values for partition column. This is called DYNAMIC PARTITION. To enable DYNAMIC partition, enable below two properties:
set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict;
Now dynamic partitioning is enabled, let’s look into the syntax to load data into the partitioned table. Always mention the partitioned column/s with PARTITION keyword in INSERT part and partitioned column/s must be present at the last in the SELECT part. Hence the query becomes:
insert into dim_pres_2 partition (pres_bs) select pres_id,pres_name,pres_dob,pres_bp,pres_in,pres_out,pres_bs from dim_pres;
To check Partitions in any given table use command: SHOW PARTITIONS TABLENAME
Hope this helps. If you have any question regarding HIVE PARTITIONS, feel free to leave a comment and I will be happy to answer it.