Recommended SET options for Impala

Following are the recommended SET options for the best performance with Impala.

By: Manish Maheshwari, Data Architect and Data Scientist at Cloudera, Inc.

Always use the SET MEM_LIMIT query option. See MEM_LIMIT Query Option for details.
Use the APPX_COUNT_DISTINCT query option to improve performance of COUNT(DISTINCT) queries by converting them to use NDV() function calls over linear counting. Note that APPX_COUNT_DISTINCT is approximately 98% accurate for smaller data sets and over 99% accurate for larger datasets. See APPX_COUNT_DISTINCT Query Option for details.
Always use a compression algorithm that both encodes and decodes (codec). Cloudera recommends that you use LZ4 or ZSTD with Impala version 3.3 and onwards. Use snappy for earlier versions of Impala.
Do not use the DISABLE_CODEGEN query option because it can significantly slow query execution. Instead, use the DISABLE_CODEGEN_ROWS_THRESHOLD query option, which automatically disables codegen for small queries. For details, see DISABLE_CODEGEN Query Option.
Set the EXEC_SINGLE_NODE_ROWS_THRESHOLD query option to 1000 rows for faster execution of simple queries or queries that contain LIMIT clauses. See EXEC_SINGLE_NODE_ROWS_THRESHOLD Query Option for details.
Set the NUM_ROWS_PRODUCED_LIMIT query option for interactive user pools or in Hue to limit the rows produced by interactive queries. See NUM_ROWS_PRODUCED_LIMIT Query Option for details.
Do not use the SYNC_DDL = true query option. Instead, use your load balancer configurations to ensure that clients are always directed to the same query coordinators.
Set the EXEC_TIME_LIMIT_S query option to a value such as 1,200 seconds (20 minutes). This query option causes an executing queries to be canceled if it is still executing when the time limit expires. If a query is canceled, investigate the query plan to identify the reason for the slowness. See EXEC_TIME_LIMIT_S Query Option for details.