What's New in Apache Impala

Learn about the new features of Apache Impala in Cloudera Runtime 7.3.1, its service packs and cumulative hotfixes.

Cloudera Runtime 7.3.1.500 SP3:

Python 3.9 Support for Impala on RHEL 8
Cloudera now supports Python 3.9 on RHEL 8 within Cloudera 7.3.1. This ensures that Impala components that depend on Python libraries function correctly and reliably.
SHOW VIEWS statement
This release introduces the SHOW VIEWS statement, which lists all views within a specified schema or database. Using this command, you can quickly identify and review views, and performance improves because fewer metadata scan operations are needed.
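As an illustration, the statements below list views in a database and filter them by name. The database name sales_db and the pattern are hypothetical, and the optional IN and LIKE clauses are assumed to follow the same form as SHOW TABLES; check the SHOW VIEWS reference for the exact grammar.

```sql
-- List all views in the current database.
SHOW VIEWS;

-- List all views in a specific database (sales_db is a placeholder name).
SHOW VIEWS IN sales_db;

-- List only the views whose names match a pattern; * acts as a wildcard.
SHOW VIEWS IN sales_db LIKE 'daily_*';
```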
Planner changes to improve cardinality estimation
Significant changes have been made to the query planner to improve cardinality estimation, a critical component of workload-aware autoscaling.
In previous versions, Impala would generate a plan first and then search for runtime filters based on the entire plan. In this release, selective runtime filters have been integrated. These filters aim to reduce the cardinality estimates of scan nodes and specific join nodes located above them. This refinement occurs after the generation of runtime filters and before the computation of resource requirements.
Distribute runtime filter aggregation
Aggregating runtime filters while a query runs can impose significant memory overhead on the coordinator. To address this issue, this release distributes runtime filter aggregation across specific Impala backends.
Improvement in catalog observability
This release introduces significant enhancements to the Impala Catalog Web UI, focusing on addressing performance issues associated with delays in processing Hive Metastore (HMS) events. These improvements aim to mitigate the risk of queries using outdated metadata.
Caching codegen functions
In Impala, "codegen" generates specialized machine code for each query based on query-specific information, which enhances query performance through faster execution. This release adds caching of the generated functions.
Support ORDER BY for collections of variable length types in SELECT list
This release introduces support for collections of variable length types in the sorting tuple. While it's now possible to include these collection columns in the SELECT list alongside other columns used for sorting, direct sorting by these collection columns is not supported. Additionally, collections of variable-length types can now serve as non-passthrough children of UNION ALL nodes.

It is important to note that structs containing collections, whether of variable or fixed length, are still not supported in the select list for ORDER BY queries.

Here are examples of supported queries:

select id, arr_string_1d from collection_tbl order by id;
select id, map_1d from collection_tbl order by id;

However, queries such as the following are not supported:

select id, struct_contains_map from collection_struct_mix order by id;

Attempting to execute such queries will result in the error message AnalysisException: Sorting is not supported if the select list contains collection(s) nested in struct(s).

Improved cardinality estimation for aggregation queries
Impala now provides more accurate cardinality estimates for aggregation queries by considering data distribution, predicates, and tuple tracing. Enhancements include:
  • Pre-aggregation Cardinality Adjustments: A new estimation model accounts for duplicate keys across nodes, reducing underestimation errors.
  • Predicate-Aware Cardinality Calculation: The planner now considers filtering conditions on group-by columns to refine cardinality estimates.
  • Tuple Tracing for Better Accuracy: Improved tuple analysis allows deeper tracking across views and intermediate aggregation nodes.
  • Consistent Aggregation Node Stats Computation: The planning process now ensures consistent and efficient recomputation of aggregation node statistics.
  • Tuple-Based Cardinality Analysis: The planner analyzes grouping expressions from the same tuple to ensure that their combined number of distinct values (NDV) does not exceed the output cardinality of the source PlanNode, reducing overestimation.
  • Refined NDV Calculation for CPU Costing: The new approach applies a probabilistic formula to a single global NDV estimate, improving accuracy and reducing overestimation in processing cost calculations.

These improvements lead to better memory estimates, optimized query execution, and more efficient resource utilization.

Apache Jira: IMPALA-2945, IMPALA-13086, IMPALA-13465, IMPALA-13526, IMPALA-13405, IMPALA-13644

Cleanup of host-level remote scratch dir on startup and exit
Impala now removes leftover scratch files from remote storage during startup and shutdown, ensuring efficient storage management. The cleanup targets files in the host-specific directory within the configured remote scratch location.

A new flag, remote_scratch_cleanup_on_start_stop, controls this behavior. Cleanup is enabled by default. To prevent unintended deletions, disable it if multiple Impala daemons on a host, or multiple clusters, share the same remote scratch directory.

Apache Jira: IMPALA-13677, IMPALA-13798

Graceful shutdown with query cancellation
Impala now attempts to cancel running queries before reaching the graceful shutdown deadline, so that resources are released properly. The new shutdown_query_cancel_period_s flag controls this behavior and defaults to 60 seconds. If it is set to a value greater than 0, Impala tries to cancel running queries within this period before forcing shutdown. If the value exceeds 20% of the total shutdown deadline, it is automatically capped at 20% to prevent excessive delays. For example, with a 120-second deadline, a 60-second setting is capped at 24 seconds. This approach helps prevent unfinished queries and unreleased resources during shutdown.
For more information, see Setting Impala Query Cancellation on Shut down
Programmatic query termination
Impala now supports the KILL QUERY statement, enabling you to forcibly terminate queries for better workload management. The KILL QUERY statement cancels and unregisters queries on any coordinator.
For more information, see KILL QUERY statement
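A minimal sketch of the statement, assuming the query ID is passed as a quoted string. The ID shown is a placeholder; a real ID can be taken from the /queries page of the Impala Web UI or from a query profile.

```sql
-- Placeholder query ID; substitute the ID of the query you want to terminate.
KILL QUERY 'd94d76d9c5a14d2b:b0a6b0c400000000';
```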
AI Functions in Impala
Impala’s built-in ai_generate_text function integrates Large Language Models (LLMs) into SQL for tasks such as sentiment analysis and translation. It simplifies workflows, requires no ML expertise, and supports default or custom UDF configurations.

Secure API key storage is supported through a JCEKS keystore. A lightweight tool included in the UDF SDK helps create or update keystores on Amazon S3 or Azure ABFS without a local Hadoop setup.
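The following sketch shows how a sentiment-analysis prompt might look. It assumes the ai_generate_text_default variant of the function, which relies on an endpoint, model, and API key secret that an administrator has already configured. The product_reviews table and its columns are hypothetical.

```sql
-- Assumes default AI endpoint, model, and API key secret are configured for Impala.
SELECT ai_generate_text_default(
  'Classify the sentiment of this review as positive, negative, or neutral: The product arrived late but works well.'
) AS sentiment;

-- Hypothetical table: apply the same prompt to a few stored reviews.
SELECT review_id,
       ai_generate_text_default(
         concat('Classify the sentiment of this review as positive, negative, or neutral: ', review_text)
       ) AS sentiment
FROM product_reviews
LIMIT 5;
```

Each call sends a request to the configured LLM endpoint, so limiting the number of rows is a sensible precaution while experimenting.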

For more information, see Advantages and use cases of Impala AI functions

Ability to log and manage Impala workloads (Technical Preview)
Cloudera Runtime lets you enable logging of Impala queries on an existing Virtual Warehouse or while creating a new Impala Virtual Warehouse. Information about all completed Impala queries is stored in the sys.impala_query_log system table. Information about all actively running and recently completed Impala queries is stored in the sys.impala_query_live system table. Users with appropriate permissions can query these tables using SQL to monitor and optimize the Impala engine.
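As a starting point, the queries below explore the two system tables without assuming any particular column names, since the available columns can differ between releases.

```sql
-- Inspect the columns available in the completed-query table.
DESCRIBE sys.impala_query_log;

-- Look at a sample of completed queries.
SELECT * FROM sys.impala_query_log LIMIT 10;

-- Count the queries that are running or recently completed.
SELECT count(*) FROM sys.impala_query_live;
```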

For more information, see Impala workload management

Support for Impala external JDBC data sources (Technical Preview)
Apache Impala now supports reading from external JDBC data sources. An external JDBC table represents a table or a view in a remote RDBMS database or in another Impala cluster. Using external JDBC tables, you can connect Impala to a database such as MySQL or PostgreSQL, or to another Impala cluster, and read the data in the remote tables.
For more information, see Using Impala to query external JDBC data sources
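The sketch below outlines what an external JDBC table definition for a PostgreSQL source might look like. Every table name, host, path, and credential is a placeholder, and the property keys are assumed to match those described in the linked topic; verify them before use.

```sql
-- All names, URLs, paths, and credentials below are placeholders.
CREATE EXTERNAL TABLE orders_jdbc (
  order_id INT,
  customer STRING,
  amount DOUBLE
)
STORED BY JDBC
TBLPROPERTIES (
  "database.type"="POSTGRES",
  "jdbc.url"="jdbc:postgresql://db.example.com:5432/shop",
  "jdbc.driver"="org.postgresql.Driver",
  "driver.url"="hdfs:///path/to/postgresql-jdbc.jar",
  "dbcp.username"="impala_reader",
  "dbcp.password"="changeme",
  "table"="orders"
);

-- Once defined, the remote table can be read like any other Impala table.
SELECT customer, sum(amount) AS total_amount
FROM orders_jdbc
GROUP BY customer;
```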
Running queries on system tables (Technical Preview)
Queries against Impala system tables, such as sys.impala_query_live, could get delayed due to admission control constraints. These queries, which require only coordinator resources, were previously blocked by queries competing for executor resources. To address this, Impala introduces an "only coordinators" request pool, allowing system table queries to bypass executor queues and run only on the coordinators to prevent delays during admission.
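For illustration, a session could direct a system table query to such a pool with the REQUEST_POOL query option. The pool name onlycoords is a placeholder; it stands for whatever pool an administrator has configured as an "only coordinators" pool.

```sql
-- "onlycoords" is a placeholder for a pool configured as only-coordinators.
SET REQUEST_POOL=onlycoords;

-- This query needs only coordinator resources, so it is not queued behind executor-bound queries.
SELECT count(*) FROM sys.impala_query_live;
```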

Apache Jira: IMPALA-13201

For more information, see Running queries on system tables

User quotas in admission control (Technical Preview)
This release introduces user quotas in Impala admission control, a new feature designed to enhance resource management and ensure fair query distribution across users and groups.

For more information, see User quotas in Admission Control

Cloudera Runtime 7.3.1.400 SP2:

There are no new features in this release.

Cloudera Runtime 7.3.1.300 SP1 CHF 1:

There are no new features in this release.

Cloudera Runtime 7.3.1.200 SP1:

There are no new features in this release.

Cloudera Runtime 7.3.1.100 CHF 1:

There are no new features in this release.

Cloudera Runtime 7.3.1:

Collections of fixed length types as non-passthrough children of unions

This update enables collections of fixed-length types to be used as non-passthrough children in UNION ALL operations. It achieves this by allowing the materialization of these collections.

Apache Jira: IMPALA-12147

Display query execution progress in Impala Web UI

This release adds a query progress indicator to the /queries page in Impala's Web UI, showing the completion status of fragment instances. This feature provides better tracking for computation-intensive queries, supplementing the scan progress bar.

Apache Jira: IMPALA-12048

Allow implicit casts between numeric and string types when inserting into table

By default, Impala requires explicit casts between numeric and string-based literals when inserting into a table. The new allow_unsafe_casts query option, which is turned off by default, allows implicit casting between some numeric types and string types. For more information, see implicit casting
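A short sketch of the option in use follows. The cast_demo table is hypothetical, and the comments describe the assumed behavior for literals in a VALUES clause.

```sql
-- Hypothetical table used only for this illustration.
CREATE TABLE cast_demo (id INT, label STRING);

-- With the option off (the default), the literals below would need explicit casts,
-- for example CAST('1' AS INT).
SET ALLOW_UNSAFE_CASTS=true;

-- Assumed behavior: '1' is implicitly cast to INT and 100 to STRING.
INSERT INTO cast_demo VALUES ('1', 100);
```

Because such casts can fail or lose information for values that do not convert cleanly, enabling the option per session rather than globally keeps the effect contained.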

Apache Jira: IMPALA-10173

Optimize query planning by reducing getLocation() and getFileSystem() calls

The fix reduces planning time by calling HdfsPartition.getLocation() once per partition and caching the FileSystem object based on the URI scheme and authority. This minimizes expensive decompression and redundant getFileSystem() calls, improving performance for queries with many partitions.

Apache Jira: IMPALA-12408

JSON File Reader Prototype

This prototype enables reading JSON files using the rapidjson library with Arrow support. It adds components such as HdfsJsonScanner, callback functions, and a startup flag.

Apache Jira: IMPALA-10798

CREATE TABLE LIKE for Kudu tables

Impala now supports the CREATE TABLE LIKE statement for Kudu tables.

Apache Jira: IMPALA-4052

Dedicated SPNEGO keytab for Impala web console authentication

Impala now supports a dedicated keytab for HTTP SPNEGO authentication, enabling easier management of Kerberos keytabs. A new --spnego_keytab_file flag lets you specify a separate keytab for the web console when --webserver_require_spnego is enabled. If this flag is set, the web server uses the SPNEGO keytab for HTTP authentication, while the main service keytab remains unchanged. If it is not specified, the web server defaults to using the primary service keytab for SPNEGO.

Apache Jira: IMPALA-12318

Non-unique primary keys in Kudu

Kudu now supports non-unique primary keys by automatically adding an auto_incrementing_id column to form a unique composite primary key. This column, a system-generated big integer, ensures uniqueness within each tablet server region and is hidden unless specified in SELECT statements. ALTER TABLE modifications and UPSERT operations for this column are currently unsupported.
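The following sketch shows how a table with a non-unique primary key might be declared from Impala, assuming the NON UNIQUE PRIMARY KEY clause. The table and column names are hypothetical.

```sql
-- Hypothetical Kudu table whose primary key column may contain duplicates.
CREATE TABLE events_kudu (
  device_id INT NON UNIQUE PRIMARY KEY,
  reading DOUBLE
)
PARTITION BY HASH (device_id) PARTITIONS 3
STORED AS KUDU;

-- Both rows are kept even though they share the same device_id.
INSERT INTO events_kudu VALUES (1, 20.5), (1, 21.0);

-- The system-generated column is hidden, so SELECT * returns only device_id and reading.
SELECT * FROM events_kudu;
```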

Apache Jira: IMPALA-11809

Unicode column name support in Impala

Impala now supports Unicode characters in column names, aligning with Hive's support for non-ASCII characters. This enhancement leverages Hive's validateColumnName() function, which removes restrictions on column names at the metadata level. With this update, Impala allows greater flexibility for column naming while remaining consistent with Hive's metadata validation standards.
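For example, a table with non-ASCII column names could be created and queried as sketched below. The table and column names are hypothetical, and the backticks are used here as a conservative way to quote the identifiers.

```sql
-- Hypothetical table with Unicode column names.
CREATE TABLE customers_i18n (
  `名前` STRING,
  `größe` INT
);

SELECT `名前`, `größe` FROM customers_i18n;
```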

Apache Jira: IMPALA-12465

Support custom hash partitions at range level in Kudu tables

Impala now supports specifying custom hash partitions at the range level in Kudu tables. You can define hash schemas within specific partitions using the updated CREATE TABLE and ALTER TABLE syntax, and view them with the new SHOW HASH SCHEMA statement. This update aligns hash partitioning more closely with range partitioning, enhancing flexibility while maintaining backward compatibility.

Apache Jira: IMPALA-11430