Apache Hive Performance Tuning
Also available as:
PDF

Design Data Storage Smartly

Figure 3.3. Hive Data Abstractions


Partitions

In Hive, tables are often partitioned. Frequently, tables are partitioned by date-time as incoming data is loaded into Hive each day. Large deployments can have tens of thousands of partitions. Partitions map to physical directories on the file system.

Using partitions can significantly improve performance if a query has a filter on the partitioned columns, which can prune partition scanning to one or a few partitions that match the filter criteria. Partition pruning occurs directly when a partition key is present in the WHERE clause. Pruning occurs indirectly when the partition key is discovered during query processing. For example, after joining with a dimension table, the partition key might come from the dimension table.

Partitioned columns are not written into the main table because they are the same for the entire partition, so they are "virtual columns." However, to SQL queries, there is no difference:

CREATE TABLE sale(id in, amount decimal)
PARTITIONED BY (xdate string, state string);

To insert data into this table, the partition key can be specified for fast loading:

INSERT INTO sale (xdate='2016-03-08', state='CA')
SELECT * FROM staging_table
WHERE xdate='2016-03-08' AND state='CA';

Without the partition key, the data can be loaded using dynamic partitions, but that makes it slower:

hive-site.xml settings:

SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;

SQL query:

INSERT INTO sale (xdate, state)
SELECT * FROM staging_table;

The virtual columns that are used as partitioning keys must be last. Otherwise, you must re-order the columns using a SELECT statement similar to the following:

INSERT INTO sale (xdate, state='CA')
SELECT id, amount, other_columns..., xdate
FROM staging_table
WHERE state='CA';

Buckets

Tables or partitions can be further divided into buckets that are stored as files in the directory for the table or the directories of partitions if the table is partitioned.