Impala Performance Guidelines and Best Practices
Here are performance guidelines and best practices that you can use during planning, experimentation, and performance tuning for an Impala-enabled CDH cluster. All of this information is also available in more detail elsewhere in the Impala documentation; it is gathered together here to serve as a cookbook and to emphasize which performance techniques typically provide the highest return on investment.
- Choose the appropriate file format for the data.
- Avoid data ingestion processes that produce many small files.
- Choose partitioning granularity based on actual data volume.
- Use smallest appropriate integer types for partition key columns.
- Choose an appropriate Parquet block size.
- Gather statistics for all tables used in performance-critical or high-volume join queries.
- Minimize the overhead of transmitting results back to the client.
- Verify that your queries are planned in an efficient logical manner.
- Verify performance characteristics of queries.
Choose the appropriate file format for the data.
Typically, for large volumes of data (multiple gigabytes per table or partition), the Parquet file format performs best because of its combination of columnar storage layout, large I/O request size, and compression and encoding. See How Impala Works with Hadoop File Formats for comparisons of all file formats supported by Impala, and Using the Parquet File Format with Impala Tables for details about the Parquet file format.
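For instance, a minimal sketch (the table and column names are illustrative, not from the original text):

    -- Declare a table stored as Parquet so that subsequent
    -- INSERT ... SELECT statements write Parquet data files.
    CREATE TABLE sales_parquet (
      sale_id   BIGINT,
      amount    DECIMAL(12,2),
      sale_date TIMESTAMP
    )
    STORED AS PARQUET;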
Avoid data ingestion processes that produce many small files.
When producing data files outside of Impala, prefer either text format or Avro, where you can build up the files row by row. Once the data is in Impala, you can convert it to the more efficient Parquet format and split into multiple data files using a single INSERT ... SELECT statement. Or, if you have the infrastructure to produce multi-megabyte Parquet files as part of your data preparation process, do that and skip the conversion step inside Impala.
Always use INSERT ... SELECT to copy significant volumes of data from table to table within Impala. Avoid INSERT ... VALUES for any substantial volume of data or performance-critical tables, because each such statement produces a separate tiny data file. See INSERT Statement for examples of the INSERT ... SELECT syntax.
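For instance, assuming an existing text-format staging table (the names logs_text and logs_parquet are hypothetical), the conversion might look like this:

    -- Clone the schema into a Parquet table, then copy the data in bulk;
    -- the single INSERT ... SELECT consolidates many small input files
    -- into a smaller number of larger Parquet files.
    CREATE TABLE logs_parquet LIKE logs_text STORED AS PARQUET;
    INSERT INTO logs_parquet SELECT * FROM logs_text;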
Choose partitioning granularity based on actual data volume.
Partitioning is a technique that physically divides the data based on values of one or more columns, such as by year, month, day, region, city, section of a web site, and so on. When you issue queries that request a specific value or range of values for the partition key columns, Impala can avoid reading the irrelevant data, potentially yielding a huge savings in disk I/O.
When deciding which column(s) to use for partitioning, choose the right level of granularity. For example, should you partition by year, month, and day, or only by year and month? Choose a partitioning strategy that puts at least 1 GB of data in each partition, to take advantage of HDFS bulk I/O and Impala distributed queries.
For example, if you have thousands of partitions in a Parquet table, each with less than 1 GB of data, consider partitioning in a less granular way, such as by year and month rather than year, month, and day. If an inefficient data ingestion process produces thousands of data files in the same table or partition, consider compacting the data by performing an INSERT ... SELECT to copy all the data to a different table; this process reorganizes the data into a smaller number of larger files.
Over-partitioning can also cause query planning to take longer than necessary, as Impala prunes the unnecessary partitions. Ideally, keep the number of partitions in the table under 30,000.
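As an illustrative sketch (the table and columns are hypothetical), a table partitioned by year and month lets Impala skip every partition directory that does not match the query's predicates:

    -- One partition per year/month combination; small integer types
    -- keep the partition key values compact (see the later guideline).
    CREATE TABLE web_logs (
      url         STRING,
      response_ms INT
    )
    PARTITIONED BY (year SMALLINT, month TINYINT)
    STORED AS PARQUET;

    -- Only the single matching partition is read; all other
    -- year/month directories are pruned at planning time.
    SELECT COUNT(*) FROM web_logs WHERE year = 2014 AND month = 6;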
When preparing data files to go in a partition directory, create several large files rather than many small ones. If you receive data in the form of many small files and have no control over the input format, consider using the INSERT ... SELECT syntax to copy data from one table or partition to another, which compacts the files into a relatively small number (based on the number of nodes in the cluster).
If you need to reduce the overall number of partitions and increase the amount of data in each partition, first look for partition key columns that are rarely referenced or are referenced in non-critical queries (not subject to an SLA). For example, your web site log data might be partitioned by year, month, day, and hour, but if most queries roll up the results by day, perhaps you only need to partition by year, month, and day.
If you need to reduce the granularity even more, consider creating "buckets": computed columns that group many partition key values into a coarser set, for example by using the TRUNC() function with a TIMESTAMP column to group date and time values by an interval such as week or quarter.
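A sketch of that idea (hypothetical names; TRUNC() with the 'Q' unit rounds a TIMESTAMP down to the start of its quarter):

    -- Bucket rows by calendar quarter rather than by day, cutting the
    -- number of partitions roughly 90-fold. TO_DATE() turns the
    -- truncated timestamp into a compact string partition key.
    CREATE TABLE events_by_quarter (
      event_id   BIGINT,
      event_time TIMESTAMP
    )
    PARTITIONED BY (quarter_start STRING)
    STORED AS PARQUET;

    INSERT INTO events_by_quarter PARTITION (quarter_start)
    SELECT event_id, event_time,
           TO_DATE(TRUNC(event_time, 'Q')) AS quarter_start
    FROM events_staging;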
See Partitioning for full details and performance considerations for partitioning.
Use smallest appropriate integer types for partition key columns.
Although it is tempting to use strings for partition key columns, since those values are turned into HDFS directory names anyway, you can minimize memory usage by using numeric values for common partition key fields such as YEAR, MONTH, and DAY. Use the smallest integer type that holds the appropriate range of values, typically TINYINT for MONTH and DAY, and SMALLINT for YEAR. Use the EXTRACT() function to pull out individual date and time fields from a TIMESTAMP value, and CAST() the return value to the appropriate integer type.
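A short sketch of this pattern (hypothetical table names; EXTRACT() and CAST() are standard Impala functions):

    -- Narrow integer partition keys instead of strings.
    CREATE TABLE metrics (
      metric_name  STRING,
      metric_value DOUBLE
    )
    PARTITIONED BY (year SMALLINT, month TINYINT, day TINYINT)
    STORED AS PARQUET;

    -- Derive each partition key from a TIMESTAMP column, casting
    -- down to the smallest integer type that holds its range.
    INSERT INTO metrics PARTITION (year, month, day)
    SELECT metric_name, metric_value,
           CAST(EXTRACT(event_time, 'year')  AS SMALLINT) AS year,
           CAST(EXTRACT(event_time, 'month') AS TINYINT)  AS month,
           CAST(EXTRACT(event_time, 'day')   AS TINYINT)  AS day
    FROM metrics_staging;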
Choose an appropriate Parquet block size.
By default, the Impala INSERT ... SELECT statement creates Parquet files with a 1 GB block size. Each file is a single block, allowing the whole file to be processed by a single node. As you copy Parquet files into HDFS or between HDFS filesystems, use hdfs dfs -pb to preserve the original block size.
If there is only one or a few data files in your Parquet table, or in a partition that is the only one accessed by a query, then you might experience a slowdown for a different reason: not enough data to take advantage of Impala's parallel distributed queries. Each data file is processed by a single core on one of the data nodes. In a 100-node cluster of 16-core machines, you could potentially process 1,600 data files simultaneously. You want to find a sweet spot between "many tiny files" and "single giant file" that balances bulk I/O and parallel processing.
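One lever for this balance (a hedged sketch; PARQUET_FILE_SIZE is a standard Impala query option, though the default size varies by release, and the table names here are hypothetical) is to lower the target file size before a bulk insert, so that the same data volume is spread across more files and therefore more cores:

    -- Write roughly 256 MB files (value in bytes) instead of the
    -- default, so a large INSERT produces more, smaller Parquet files
    -- that more cores can scan in parallel.
    SET PARQUET_FILE_SIZE=268435456;
    INSERT OVERWRITE sales_parquet SELECT * FROM sales_staging;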
Gather statistics for all tables used in performance-critical or high-volume join queries.
Gather the statistics with the COMPUTE STATS statement. See Performance Considerations for Join Queries for details.
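For example (the table name is hypothetical):

    -- Gather the table and column statistics that the planner uses
    -- to choose join order and join strategy.
    COMPUTE STATS sales_parquet;

    -- Confirm that the statistics were recorded.
    SHOW TABLE STATS sales_parquet;
    SHOW COLUMN STATS sales_parquet;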
Minimize the overhead of transmitting results back to the client.
Use techniques such as:
- Aggregation. If you need to know how many rows match a condition, the total of the values in some column across the matching rows, the lowest or highest matching value, and so on, call aggregate functions such as COUNT(), SUM(), and MAX() in the query rather than sending the result set to an application and doing those computations there. Remember that the size of an unaggregated result set could be huge, requiring substantial time to transmit across the network. (A combined sketch of these techniques follows this list.)
- Filtering. Use all applicable tests in the WHERE clause of a query to eliminate rows that are not relevant, rather than producing a big result set and filtering it using application logic.
- LIMIT clause. If you only need to see a few sample values from a result set, or the top or bottom values from a query using ORDER BY, include the LIMIT clause to reduce the size of the result set rather than asking for the full result set and then throwing most of the rows away.
- Avoid overhead from pretty-printing the result set and displaying it on the screen. When you retrieve the results through impala-shell, use impala-shell options such as -B and --output_delimiter to produce results without special formatting, and redirect output to a file rather than printing to the screen. Consider using INSERT ... SELECT to write the results directly to new files in HDFS. See impala-shell Command-Line Options for details about the impala-shell command-line options.
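A combined sketch of the first three techniques, reusing the hypothetical web_logs table from above; the server performs the aggregation, filtering, and truncation, so only a handful of rows cross the network:

    -- Aggregate, filter, and limit on the server side rather than
    -- shipping a huge raw result set back to the client.
    SELECT url, COUNT(*) AS hits, MAX(response_ms) AS worst_case
    FROM web_logs
    WHERE year = 2014 AND month = 6
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;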
Verify that your queries are planned in an efficient logical manner.
Examine the EXPLAIN plan for a query before actually running it. See EXPLAIN Statement and Using the EXPLAIN Plan for Performance Tuning for details.
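For example (reusing the hypothetical web_logs table):

    -- Show the plan without executing the query; check that partition
    -- pruning occurs and that scan sizes and join order look sensible.
    EXPLAIN SELECT COUNT(*) FROM web_logs WHERE year = 2014;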
Verify performance characteristics of queries.
Verify that the low-level aspects of I/O, memory usage, network bandwidth, CPU utilization, and so on are within expected ranges by examining the query profile for a query after running it. See Using the Query Profile for Performance Tuning for details.
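For example, in impala-shell (the query is illustrative; PROFILE reports on the most recently completed statement):

    SELECT COUNT(*) FROM web_logs WHERE year = 2014;
    -- Dump the detailed execution profile, including per-node
    -- I/O, memory, and CPU counters.
    PROFILE;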