Partitions and performance
A brief description of partitions and the performance benefits includes characters you must avoid when creating a partition. Examples of creating a partition and inserting data in a partition introduce basic partition syntax. Best practices for partitioning are mentioned.
A table you create without partitioning puts the data in a single directory. Partitioning
divides the data into multiple directories. Queries of one or more columns based on the
directories can run faster. Lengthy full table scans are avoided. Only data in the relevant
directory is scanned. For example, a school_records
table partitioned on a
year
column, segregates values by year into separate directories. A
WHERE
condiition such as YEAR=2020
, YEAR IN
(2020,2019)
, or YEAR BETWEEN 2001 AND 2010
scans only the data in the
appropriate directory to resolve the query. Using partitions typically improves query
performance.
CREATE TABLE sale(id in, amount decimal) PARTITIONED BY (xdate string, state string);
To insert data into this table, you specify the partition key for fast loading:
INSERT INTO sale (xdate='2016-03-08', state='CA') SELECT * FROM staging_table WHERE xdate='2016-03-08' AND state='CA';
You
do not need to specify dynamic partition columns. Hive generates a partition specification if you
enable dynamic partitions. Examples of a query on partitioned data
INSERT INTO sale (xdate, state)
SELECT * FROM staging_table;
Follow these best practices when you partition tables and query partitioned tables:
- Never partition on a unique ID.
- Size partitions to greater than or equal to 1 GB on average.
- Design queries to process not more than 1000 partitions.
Invalid Characters in a Partition Name
- colon
- question mark
- percent
If you use these characters in a partition name, your directories will be named using the URL encoding of these characters, as described in "Why some special characters should not be used in a partition name in Hive/Impala."