Recommended SET options for Impala

Following are the recommended SET options for the best performance with Impala.

By: Manish Maheshwari, Data Architect and Data Scientist at Cloudera, Inc.

  • Always use the SET MEM_LIMIT query option. See MEM_LIMIT Query Option for details.
  • Use the APPX_COUNT_DISTINCT query option to improve performance of COUNT(DISTINCT) queries by converting them to use NDV() function calls over linear counting. Note that APPX_COUNT_DISTINCT is approximately 98% accurate for smaller data sets and over 99% accurate for larger datasets. See APPX_COUNT_DISTINCT Query Option for details.
  • Always use a compression algorithm that both encodes and decodes (codec). Cloudera recommends that you use LZ4 or ZSTD with Impala version 3.3 and onwards. Use snappy for earlier versions of Impala.
  • Do not use the DISABLE_CODEGEN query option because it can significantly slow query execution. Instead, use the DISABLE_CODEGEN_ROWS_THRESHOLD query option, which automatically disables codegen for small queries. For details, see DISABLE_CODEGEN Query Option.
  • Set the EXEC_SINGLE_NODE_ROWS_THRESHOLD query option to 1000 rows for faster execution of simple queries or queries that contain LIMIT clauses. See EXEC_SINGLE_NODE_ROWS_THRESHOLD Query Option for details.
  • Set the NUM_ROWS_PRODUCED_LIMIT query option for interactive user pools or in Hue to limit the rows produced by interactive queries. See NUM_ROWS_PRODUCED_LIMIT Query Option for details.
  • Do not use the SYNC_DDL = true query option. Instead, use your load balancer configurations to ensure that clients are always directed to the same query coordinators.
  • Set the EXEC_TIME_LIMIT_S query option to a value such as 1,200 seconds (20 minutes). This query option causes an executing queries to be canceled if it is still executing when the time limit expires. If a query is canceled, investigate the query plan to identify the reason for the slowness. See EXEC_TIME_LIMIT_S Query Option for details.