Partitioning Hive data
A brief description of partitions and the performance benefits includes characters you must avoid when creating a partition. Examples of creating a partition and inserting data in a partition introduce basic partition syntax. Best practices for partitioning are mentioned.
A table you create without partitioning puts the data in a single directory. Partitioning
divides the data into multiple directories. Queries of one or more columns based on the
directories can run faster. Lengthy full table scans are avoided. Only data in the relevant
directory is scanned. For example, a
school_records table partitioned on a
year column, segregates values by year into separate directories. A
WHERE condiition such as
YEAR BETWEEN 2001 AND 2010 scans only the data in the
appropriate directory to resolve the query. Using partitions typically improves query
To insert data into this table, you specify the partition key for fast loading:
CREATE TABLE sale(id in, amount decimal) PARTITIONED BY (xdate string, state string);
You do not need to specify dynamic partition columns. Hive generates a partition specification if you enable dynamic partitions.
INSERT INTO sale (xdate='2016-03-08', state='CA') SELECT * FROM staging_table WHERE xdate='2016-03-08' AND state='CA';
Examples of a query on partitioned data
INSERT INTO sale (xdate, state) SELECT * FROM staging_table;
Follow these best practices when you partition tables and query partitioned tables:
- Never partition on a unique ID.
- Size partitions to greater than or equal to 1 GB on average.
- Design queries to process not more than 1000 partitions.
Invalid Characters in a Partition Name
- question mark
If you use these characters in a partition name, your directories will be named using the URL encoding of these characters, as described in "Why some special characters should not be used in a partition name in Hive/Impala."