Recommended SET options for Impala
Following are the recommended SET options for the best performance with Impala.
By: Manish Maheshwari, Data Architect and Data Scientist at Cloudera, Inc.
- Always use the
SET MEM_LIMIT
query option. See MEM_LIMIT Query Option for details. - Use the
APPX_COUNT_DISTINCT
query option to improve performance ofCOUNT(DISTINCT)
queries by converting them to useNDV()
function calls over linear counting. Note thatAPPX_COUNT_DISTINCT
is approximately 98% accurate for smaller data sets and over 99% accurate for larger datasets. See APPX_COUNT_DISTINCT Query Option for details. - Always use a compression algorithm that both encodes and decodes (codec). Cloudera recommends that you use LZ4 or ZSTD with Impala version 3.3 and onwards. Use snappy for earlier versions of Impala.
- Do not use the
DISABLE_CODEGEN
query option because it can significantly slow query execution. Instead, use theDISABLE_CODEGEN_ROWS_THRESHOLD
query option, which automatically disables codegen for small queries. For details, see DISABLE_CODEGEN Query Option. - Set the
EXEC_SINGLE_NODE_ROWS_THRESHOLD
query option to 1000 rows for faster execution of simple queries or queries that containLIMIT
clauses. See EXEC_SINGLE_NODE_ROWS_THRESHOLD Query Option for details. - Set the
NUM_ROWS_PRODUCED_LIMIT
query option for interactive user pools or in Hue to limit the rows produced by interactive queries. See NUM_ROWS_PRODUCED_LIMIT Query Option for details. - Do not use the
SYNC_DDL = true
query option. Instead, use your load balancer configurations to ensure that clients are always directed to the same query coordinators. - Set the
EXEC_TIME_LIMIT_S
query option to a value such as 1,200 seconds (20 minutes). This query option causes an executing queries to be canceled if it is still executing when the time limit expires. If a query is canceled, investigate the query plan to identify the reason for the slowness. See EXEC_TIME_LIMIT_S Query Option for details.