Recommended SET options for Impala
Following are the recommended SET options for the best performance with Impala.
By: Manish Maheshwari, Data Architect and Data Scientist at Cloudera, Inc.
- Always use the
SET MEM_LIMITquery option. See MEM_LIMIT Query Option for details. - Use the
APPX_COUNT_DISTINCTquery option to improve performance ofCOUNT(DISTINCT)queries by converting them to useNDV()function calls over linear counting. Note thatAPPX_COUNT_DISTINCTis approximately 98% accurate for smaller data sets and over 99% accurate for larger datasets. See APPX_COUNT_DISTINCT Query Option for details. - Always use a compression algorithm that both encodes and decodes (codec). Cloudera recommends that you use LZ4 or ZSTD with Impala version 3.3 and onwards. Use snappy for earlier versions of Impala.
- Do not use the
DISABLE_CODEGENquery option because it can significantly slow query execution. Instead, use theDISABLE_CODEGEN_ROWS_THRESHOLDquery option, which automatically disables codegen for small queries. For details, see DISABLE_CODEGEN Query Option. - Set the
EXEC_SINGLE_NODE_ROWS_THRESHOLDquery option to 1000 rows for faster execution of simple queries or queries that containLIMITclauses. See EXEC_SINGLE_NODE_ROWS_THRESHOLD Query Option for details. - Set the
NUM_ROWS_PRODUCED_LIMITquery option for interactive user pools or in Hue to limit the rows produced by interactive queries. See NUM_ROWS_PRODUCED_LIMIT Query Option for details. - Do not use the
SYNC_DDL = truequery option. Instead, use your load balancer configurations to ensure that clients are always directed to the same query coordinators. - Set the
EXEC_TIME_LIMIT_Squery option to a value such as 1,200 seconds (20 minutes). This query option causes an executing queries to be canceled if it is still executing when the time limit expires. If a query is canceled, investigate the query plan to identify the reason for the slowness. See EXEC_TIME_LIMIT_S Query Option for details.
