Partitioning granularity recommendations
Following are recommendations for table partitioning granularity that provide the best performance in Impala.
By: Manish Maheshwari, Data Architect and Data Scientist at Cloudera, Inc.
- Choose a partitioning strategy that ensures there is at least 256 MB of data in each partition.
- Over-partitioning results in small files in each partition, and it causes query planning to take longer than necessary because Impala must prune the unnecessary partitions.
- Cloudera recommends that you keep the number of partitions in tables under 30,000.
- Always use integer data types for partition key columns:
  - Partition key values are turned into HDFS directory names, so you can minimize memory usage by using numeric values for common partition key fields such as YEAR, MONTH, and DAY.
  - Use the smallest integer data type that holds the appropriate range of values. Typically, use TINYINT for MONTH and DAY, and SMALLINT for YEAR. Use the EXTRACT() function to pull out individual date and time fields from a TIMESTAMP value, and CAST() the return value to the appropriate integer data type.
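The integer partition key recommendations above can be sketched in Impala SQL roughly as follows. This is a minimal illustration, not a prescribed schema: the table names events_by_day and raw_events and the columns event_id, payload, and event_ts are hypothetical, and daily granularity is only appropriate if each day holds enough data to meet the 256 MB guideline.

```sql
-- Hypothetical target table: partition keys use the smallest integer
-- types that hold the expected ranges (SMALLINT year, TINYINT month/day).
CREATE TABLE events_by_day (
  event_id BIGINT,
  payload  STRING
)
PARTITIONED BY (year SMALLINT, month TINYINT, day TINYINT)
STORED AS PARQUET;

-- Dynamic partition insert from a hypothetical source table that has a
-- TIMESTAMP column named event_ts. EXTRACT() pulls out each date field,
-- and CAST() narrows the result to the partition column's type.
-- The partition key expressions come last in the SELECT list.
INSERT INTO events_by_day PARTITION (year, month, day)
SELECT
  event_id,
  payload,
  CAST(EXTRACT(YEAR  FROM event_ts) AS SMALLINT),
  CAST(EXTRACT(MONTH FROM event_ts) AS TINYINT),
  CAST(EXTRACT(DAY   FROM event_ts) AS TINYINT)
FROM raw_events;

-- List the partitions with their file counts and sizes, to compare the
-- result against the 256 MB per partition and 30,000 partition guidelines.
SHOW PARTITIONS events_by_day;
```

If SHOW PARTITIONS reports many partitions well under 256 MB, a coarser key (for example, dropping day and partitioning by year and month only) may be a better fit for the data volume.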