Friday, November 18, 2016

Hive - Creating and loading data in an external partitioned table


I have been ramping up on Hive concepts these days. 
This post is on one interesting and a bit tricky concept that I came across and thought of sharing.
It deals with creating and loading data in an external PARTITIONED table and then querying it for data

Hive version: 2.1.0 from Apache

First let’s create an EXTERNAL table with a three fields and partitioned on one of these fields.

CREATE EXTERNAL TABLE my_dest_table (
                         primary_alias_type string,
                         primary_alias_id string)
PARTITIONED BY (d string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/home/input/testData/my_dest_table/'
As you can see I have created an external table with partitions and at the location mentioned I have placed a csv file having three columns and plenty of data.

Now I run a select statement to view data:
hive> select * from my_dest_table;
OK
Time taken: 1.467 seconds

As you can see I don’t see any output here. Had this been a Non Partitioned External table the query would have returned results but in this case it needs some more effort. You need to create partitions and load data in those partitions as well. 

ALTER TABLE my_dest_table ADD PARTITION(d = 20000)
LOCATION '/home/input/testData/my_dest_table/20000';

As you can see above I have created a partition using ALTER TABLE command to add this info to the metastore. Also created corresponding folder and placed relevant snapshot of csv file data in the folder. You can alternatively create all partition folders in the filesystem and then use MSCK REPAIR TABLE command to load partition info in metastore in one go.

Now when I do a select, I can see the data in query output.

Some more points of note are that an external table is only created in metastore and you wont see its info in the warehouse folder (as you can see for MANAGED tables).