Performance improvement using partitions
You must be aware of what partition pruning is, how to enable dynamic partitioning, and the configuration required for bulk-loading of data to ensure significant improvements in performance. You can use partitions to significantly improve performance. You can design Hive table and materialized views partitions to map to physical directories on the file system/object store. For example, a table partitioned by date-time can organize data loaded into Hive each day.
Large deployments can have tens of thousands of partitions. Partition pruning occurs indirectly when Hive discovers the partition key during query processing. For example, after joining with a dimension table, the partition key might come from the dimension table. A query filters columns by partition, limiting scanning that occurs to one or a few matching partitions. Partition pruning occurs directly when a partition key is present in the WHERE clause. Partitioned columns are virtual, not written into the main table because these columns are the same for the entire partition.
You do not need to specify dynamic partition columns. Hive generates a partition specification if you enable dynamic partitions.
- Configuration for loading 1 to 9 partitions:
SET hive.exec.dynamic.partition.mode=nonstrict; SET hive.exec.dynamic.partition=true;
For bulk-loading data into partitioned ORC tables, you use the following property, which optimizes the performance of data loading into 10 or more partitions.
- Configuration for loading 10 or more partitions: