Partitioning Hive data

A brief description of partitions and the performance benefits includes characters you must avoid when creating a partition. Examples of creating a partition and inserting data in a partition introduce basic partition syntax. Best practices for partitioning are mentioned.

A table you create without partitioning puts the data in a single directory. Partitioning divides the data into multiple directories. Queries of one or more columns based on the directories can run faster. Lengthy full table scans are avoided. Only data in the relevant directory is scanned. For example, a school_records table partitioned on a year column, segregates values by year into separate directories. A WHERE condiition such as YEAR=2020, YEAR IN (2020,2019), or YEAR BETWEEN 2001 AND 2010 scans only the data in the appropriate directory to resolve the query. Using partitions typically improves query performance.

In a SQL query, you define the partition as shown in the following example:
CREATE TABLE sale(id in, amount decimal) PARTITIONED BY (xdate string, state string);
To insert data into this table, you specify the partition key for fast loading:
INSERT INTO sale (xdate='2016-03-08', state='CA') SELECT * FROM staging_table WHERE xdate='2016-03-08' AND state='CA'; 
You do not need to specify dynamic partition columns. Hive generates a partition specification if you enable dynamic partitions.

Examples of a query on partitioned data

INSERT INTO sale (xdate, state)
SELECT * FROM staging_table;

Follow these best practices when you partition tables and query partitioned tables:

  • Never partition on a unique ID.
  • Size partitions to greater than or equal to 1 GB on average.
  • Design queries to process not more than 1000 partitions.

Invalid Characters in a Partition Name

When you create a partition, do not use the following characters in a partition name:
  • colon
  • question mark
  • percent

If you use these characters in a partition name, your directories will be named using the URL encoding of these characters, as described in "Why some special characters should not be used in a partition name in Hive/Impala."