Partitioning granularity recommendations

The following recommendations for table partitioning granularity give the best performance in Impala.

By: Manish Maheshwari, Data Architect and Data Scientist at Cloudera, Inc.

  • Choose a partitioning strategy that ensures there is at least 256 MB of data in each partition.
  • Avoid over-partitioning. It leaves only small files in each partition, and it makes query planning take longer than necessary because Impala has to prune the unnecessary partitions.
  • Cloudera recommends that you keep the number of partitions in tables under 30,000.
  • Always use integer data types for partition key columns:

    • Although partition key values are turned into HDFS directory names anyway, you can minimize memory usage by using numeric values rather than strings for common partition key fields such as YEAR, MONTH, and DAY.
    • Use the smallest integer data type that holds the appropriate range of values. Typically, this is TINYINT for MONTH and DAY, and SMALLINT for YEAR. Use the EXTRACT() function to pull out individual date and time fields from a TIMESTAMP value, and CAST() the return value to the appropriate integer data type.
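
The integer partition key recommendation can be sketched as follows. This is a minimal illustration, not a prescribed schema: the table names sales_by_day and raw_sales and the columns order_id, amount, and order_ts are hypothetical, and the example assumes order_ts is a TIMESTAMP column. Whether year/month/day granularity is appropriate depends on whether each resulting partition still holds at least 256 MB of data.

```sql
-- Hypothetical table and column names, for illustration only.
-- Partition keys use the smallest integer types that fit their ranges.
CREATE TABLE sales_by_day (
  order_id BIGINT,
  amount   DECIMAL(10,2)
)
PARTITIONED BY (year SMALLINT, month TINYINT, day TINYINT)
STORED AS PARQUET;

-- Dynamic partition insert: the partition key expressions come last
-- in the SELECT list, in the same order as the PARTITION clause.
-- EXTRACT() returns a wider integer type, so each value is CAST()
-- down to the type of its partition key column.
INSERT INTO sales_by_day PARTITION (year, month, day)
SELECT
  order_id,
  amount,
  CAST(EXTRACT(YEAR  FROM order_ts) AS SMALLINT),
  CAST(EXTRACT(MONTH FROM order_ts) AS TINYINT),
  CAST(EXTRACT(DAY   FROM order_ts) AS TINYINT)
FROM raw_sales;

-- Lists each partition with its file count and size, which helps to
-- check the result against the 256 MB and 30,000-partition guidelines.
SHOW PARTITIONS sales_by_day;
```

If SHOW PARTITIONS reports many partitions well below 256 MB, a coarser granularity, such as partitioning by year and month only, may be a better fit for the data volume.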