What's New in Apache Impala

Learn about the new features of Impala in Cloudera Runtime 7.2.2.

Expanded SQL support

In this release, we extended SQL support in Impala to increase the completeness of SQL queries.

INTERSECT/EXCEPT

Added support for the set operators EXCEPT and INTERSECT. The EXCEPT and INTERSECT operators allow you to combine the result sets of multiple queries and return distinct rows by comparing the results of two queries.

See UNION/INTERSECT/EXCEPT for more information.

GROUPING SETS/ROLLUP/CUBE

Added support for GROUP BY ROLLUP, CUBE and GROUPING SETS. The GROUP BY ROLLUP clause creates a group for each combination of column expressions. The CUBE clause creates groups for all possible combinations of columns. The GROUPING SETS just lets you list out the combinations of expressions that you want to GROUP BY.

See GROUP BY clause for more information.

GROUPING and GROUPING_ID functions

Added these functions to be used in conjunction with the GROUP BY ROLLUP, CUBE, and GROUPING SETS clauses. GROUPING (column) function indicates whether the column in a GROUP BY list is aggregated or not. GROUPING_ID function computes the level of grouping and returns an integer value which is unique for each combination of grouping values.

See GROUPING and GROUPING_ID functions for more information.

Improved support for scalar subqueries

Uncorrelated scalar subqueries can now be used in SELECT list and in HAVING clause.

See Subqueries in Impala SELECT statements for more information.

New and Improved Performance Optimizations

The following changes have been made to improve the performance.

Pushdown limit to analytic sort operator

To optimize the performance, in 7.2.2, the Impala optimizer applies an outer LIMIT from a top level Sort down to the Sort below associated with an Analytic operator. This pushdown is considered if the Analytic operator is evaluating a ranking function such as RANK, DENSE_RANK, ROW_NUMBER and other qualifying criteria are met. Applying the limit early improves the performance by reducing the amount of data processed by the inner query. When the pushdown is applicable, the Analytic Sort is converted to a TopN Sort. Further, this is split into a bottom TopN and an upper TopN separated by a hash partition exchange. This ensures that the limit is applied as early as possible even before hash partitioning.

Example:


  select * from (select l_partkey, l_quantity, 
  rank() over (partition by l_partkey order by l_quantity desc) rk 
  from lineitem) dt 
  where rk <= 100
  order by l_partkey, l_quantity, rk
  limit 100;
 

In this example, the limit 100 from the outer ORDER BY is pushed down to the analytic sort that is done below the Analytic operator.

Improved performance for sort

Improved the Sort operator efficiency. This reduces the execution time for top-n queries and partitioned loads.

Added distinct aggregation to semi joins

Added a new query option ENABLE_DISTINCT_SEMI_JOIN_OPTIMIZATION to reduce the number of potential plan regressions. When this query option is enabled the query planner considers inserting a distinct aggregation on the inner side of the join based on whether that aggregation would reduce the number of rows by more than 75%.

See Query options for more information.

Improved default runtime filter settings

The following runtime filter settings for DWX Impala have been updated to these new values to provide better results on DWX benchmarks. These are set as DWX defaults and can also be set via the default_query_options config:

  • RUNTIME_FILTER_MIN_SIZE=8192
  • RUNTIME_FILTER_MAX_SIZE=2097152
  • MAX_NUM_RUNTIME_FILTERS=50
  • RUNTIME_FILTER_WAIT_TIME_MS=10000

READ support for FULL-ACID ORC Tables

Until this release, Impala in CDP supported INSERT-ONLY transactional tables allowing both READ and WRITE operations. The latest version of Impala in CDP 7.2.2 now also supports READ of FULL ACID ORC tables without modifying any configurations.

See READ support for FULL-ACID ORC Tables for more information.