What's New in Apache Impala
Learn about the new features of Apache Impala in Cloudera Runtime 7.3.1, its service packs and cumulative hotfixes.
Cloudera Runtime 7.3.1.500 SP3:
- Python 3.9 Support for Impala on RHEL 8
- Cloudera now provides support for Python 3.9 on RHEL 8 within Cloudera 7.3.1. This new capability ensures that Impala components dependent on Python libraries function completely and reliably.
- SHOW VIEWS statement
- This release introduces the SHOW VIEWS statement, which simplifies the task of listing all views within a specified schema or database. Using this command, you can quickly identify and review views, thereby enhancing performance by reducing metadata scan operations.
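As a rough illustration, the statement might be used as sketched below. The optional IN and LIKE clauses are assumed to mirror those of SHOW TABLES, and the database name sales_db and the pattern are hypothetical.

```sql
-- List all views in the current database.
SHOW VIEWS;

-- List all views in a specific database (sales_db is a hypothetical name).
SHOW VIEWS IN sales_db;

-- Restrict the listing to views whose names match a pattern.
SHOW VIEWS IN sales_db LIKE 'daily_*';
```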
- Planner changes to improve cardinality estimation
- Significant changes have been made to the query planner to improve cardinality estimation, a critical component of workload-aware autoscaling.
- Distribute runtime filter aggregation
- Aggregating runtime filters while a query runs can impose significant memory overhead on the coordinator. To address this issue, Impala now distributes runtime filter aggregation across specific Impala backends.
- Improvement in catalog observability
- This release introduces significant enhancements to the Impala Catalog Web UI, focusing on addressing performance issues associated with delays in processing Hive Metastore (HMS) events. These improvements aim to mitigate the risk of queries using outdated metadata.
- Caching codegen functions
- In Impala, "codegen" generates specialized machine code for each query based on query-specific information, which enhances query performance through faster execution. This release adds caching for these generated codegen functions.
- Support ORDER BY for collections of variable length types in SELECT list
- This release introduces support for collections of variable length types in the
sorting tuple. While it's now possible to include these collection columns in the SELECT
list alongside other columns used for sorting, direct sorting by these collection
columns is not supported. Additionally, collections of variable-length types can now
serve as non-passthrough children of UNION ALL nodes.
It is important to note that structs containing collections, whether of variable or fixed length, are still not supported in the select list for ORDER BY queries.
Here are examples of supported queries:
select id, arr_string_1d from collection_tbl order by id;
select id, map_1d from collection_tbl order by id;
However, queries such as the following are not supported:
select id, struct_contains_map from collection_struct_mix order by id;
Attempting to execute such queries results in the error message: AnalysisException: Sorting is not supported if the select list contains collection(s) nested in struct(s).
- Improved cardinality estimation for aggregation queries
- Impala now provides more accurate cardinality estimates for aggregation queries by
considering data distribution, predicates, and tuple tracing. Enhancements include:
- Pre-aggregation Cardinality Adjustments: A new estimation model accounts for duplicate keys across nodes, reducing underestimation errors.
- Predicate-Aware Cardinality Calculation: The planner now considers filtering conditions on group-by columns to refine cardinality estimates.
- Tuple Tracing for Better Accuracy: Improved tuple analysis allows deeper tracking across views and intermediate aggregation nodes.
- Consistent Aggregation Node Stats Computation: The planning process now ensures consistent and efficient recomputation of aggregation node statistics.
- Tuple-Based Cardinality Analysis: The planner analyzes grouping expressions from the same tuple to ensure that their combined number of distinct values (NDV) does not exceed the output cardinality of the source PlanNode, reducing overestimation.
- Refined NDV Calculation for CPU Costing: The new approach applies a probabilistic formula to a single global NDV estimate, improving accuracy and reducing overestimation in processing cost calculations.
These improvements lead to better memory estimates, optimized query execution, and more efficient resource utilization.
Apache Jira: IMPALA-2945, IMPALA-13086, IMPALA-13465, IMPALA-13526, IMPALA-13405, IMPALA-13644
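One way to observe the planner's estimates is to inspect the EXPLAIN output of an aggregation query, where plan nodes report their estimated cardinality. The following is a minimal sketch, assuming a hypothetical table sales with columns region and amount. It does not show the exact output, which varies by release and statistics.

```sql
-- Ask for a more detailed plan so that per-node estimates are shown.
SET EXPLAIN_LEVEL=2;

-- sales, region, and amount are hypothetical names.
-- The filter on the group-by column is the kind of predicate that the
-- predicate-aware estimation described above takes into account.
EXPLAIN
SELECT region, count(*) AS orders, sum(amount) AS total
FROM sales
WHERE region IN ('EMEA', 'APAC')
GROUP BY region;
```

Estimates depend on table and column statistics, so running COMPUTE STATS on the table beforehand is usually needed for meaningful numbers.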
- Cleanup of host-level remote scratch dir on startup and exit
- Impala now removes leftover scratch files from remote storage during startup and shutdown, ensuring efficient storage management. The cleanup targets files in the host-specific directory within the configured remote scratch location.
A new flag, remote_scratch_cleanup_on_start_stop, controls this behavior. Cleanup is enabled by default. To prevent unintended deletions, disable it if multiple Impala daemons on a host, or multiple clusters, share the same remote scratch directory.
Apache Jira: IMPALA-13677, IMPALA-13798
- Graceful shutdown with query cancellation
- Impala now attempts to cancel running queries before reaching the graceful shutdown deadline, ensuring resources are released properly. The new shutdown_query_cancel_period_s flag controls this behavior. The default value is 60 seconds. If set to a value greater than 0, Impala tries to cancel running queries within this period before forcing shutdown. If the value exceeds 20% of the total shutdown deadline, it is automatically capped to prevent excessive delays. This approach helps prevent unfinished queries and unreleased resources during shutdown.
For more information, see Setting Impala Query Cancellation on Shut down
- Programmatic query termination
- Impala now supports the KILL QUERY statement, enabling you to forcibly terminate queries for better workload management. The KILL QUERY statement cancels and unregisters queries on any coordinator.
For more information, see KILL QUERY statement
- AI Functions in Impala
- Cloudera Runtime introduces Impala’s built-in ai_generate_text function, which integrates Large Language Models (LLMs) into SQL for tasks such as sentiment analysis and translation. It simplifies workflows, requires no ML expertise, and supports default or custom UDF configurations.
Secure API key storage is supported through a JCEKS keystore. A lightweight tool included in the UDF SDK helps create or update keystores on Amazon S3 or Azure ABFS without a local Hadoop setup.
For more information, see Advantages and use cases of Impala AI functions
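A sentiment-analysis call might look like the sketch below. It assumes the ai_generate_text_default variant described in the upstream Impala documentation, which takes only a prompt and relies on an endpoint, model, and API key configured by an administrator. The table product_reviews and its columns are hypothetical.

```sql
-- Assumes a default AI endpoint, model, and API key are already configured.
-- product_reviews, review_id, and review_text are hypothetical names.
SELECT review_id,
       ai_generate_text_default(
         concat('Classify the sentiment of this review as positive, ',
                'negative, or neutral: ', review_text)) AS sentiment
FROM product_reviews
LIMIT 10;
```

Keeping the LIMIT small while experimenting is a sensible precaution, because each row typically results in a call to the external model endpoint.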
- Ability to log and manage Impala workloads (Technical Preview)
- Cloudera Runtime provides the option to enable logging of Impala queries on an existing Virtual Warehouse or while creating a new Impala Virtual Warehouse. Information about all completed Impala queries is stored in the sys.impala_query_log system table. Information about all actively running and recently completed Impala queries is stored in the sys.impala_query_live system table. Users with appropriate permissions can query these tables using SQL to monitor and optimize the Impala engine.
For more information, see Impala workload management
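As an illustration of the kind of monitoring query this enables, the sketch below lists the slowest completed queries. The column names (query_id, db_user, total_time_ms, sql) are assumptions based on the upstream Impala workload management schema and should be checked with DESCRIBE sys.impala_query_log in your environment.

```sql
-- Column names are assumed; verify them with DESCRIBE sys.impala_query_log.
SELECT query_id,
       db_user,
       total_time_ms,
       sql
FROM sys.impala_query_log
ORDER BY total_time_ms DESC
LIMIT 10;
```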
- Support for Impala external JDBC data sources (Technical Preview)
- Apache Impala now supports reading from external JDBC data sources. An external JDBC table represents a table or a view in a remote RDBMS database or another Impala cluster. Using external JDBC tables, you can connect Impala to a database, such as MySQL, PostgreSQL, or another Impala cluster and read the data in the remote tables.
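The sketch below shows roughly how an external JDBC table over a PostgreSQL table might be declared. The STORED BY JDBC clause and the table property keys follow the upstream Impala documentation as best understood here. The host, database, driver path, credentials, and table names are all hypothetical placeholders.

```sql
-- All connection details below are hypothetical placeholders.
CREATE EXTERNAL TABLE pg_orders (
  id INT,
  customer STRING,
  amount DOUBLE
)
STORED BY JDBC
TBLPROPERTIES (
  "database.type"="POSTGRES",
  "jdbc.url"="jdbc:postgresql://db.example.com:5432/sales",
  "jdbc.driver"="org.postgresql.Driver",
  "driver.url"="/path/to/postgresql-jdbc.jar",
  "dbcp.username"="impala_reader",
  "dbcp.password"="change_me",
  "table"="orders"
);

-- Once created, the remote table is read like any other Impala table.
SELECT customer, sum(amount) AS total
FROM pg_orders
GROUP BY customer;
```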
- Running queries on system tables (Technical Preview)
- Queries against Impala system tables, such as sys.impala_query_live, could be delayed by admission control constraints. These queries, which require only coordinator resources, were previously blocked by queries competing for executor resources. To address this, Impala introduces an "only coordinators" request pool, which allows system table queries to bypass executor queues and run only on the coordinators, preventing delays during admission.
Apache Jira: IMPALA-13201
For more information, see Running queries on system tables
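A session might route a system table query to such a pool as sketched below. This assumes an administrator has already configured an "only coordinators" pool. The pool name onlycoords is hypothetical, and the column names are assumptions to be verified with DESCRIBE.

```sql
-- onlycoords is a hypothetical pool name; use the one configured for your cluster.
SET REQUEST_POOL=onlycoords;

-- Column names are assumed; verify them with DESCRIBE sys.impala_query_live.
SELECT query_id, db_user, query_state
FROM sys.impala_query_live;
```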
- User quotas in admission control (Technical Preview)
- This release introduces user quotas in Impala admission control, a new feature
designed to enhance resource management and ensure fair query distribution across users
and groups.
For more information, see User quotas in Admission Control
Cloudera Runtime 7.3.1.400 SP2:
There are no new features in this release.
Cloudera Runtime 7.3.1.300 SP1 CHF 1:
There are no new features in this release.
Cloudera Runtime 7.3.1.200 SP1:
There are no new features in this release.
Cloudera Runtime 7.3.1.100 CHF 1:
There are no new features in this release.
Cloudera Runtime 7.3.1:
- Collections of fixed length types as non-passthrough children of unions
-
This update enables collections of fixed-length types to be used as non-passthrough children in UNION ALL operations. It achieves this by allowing the materialization of these collections.
Apache Jira: IMPALA-12147
- Display query execution progress in Impala Web UI
-
Adds a query progress indicator to the /queries page in Impala's Web UI, showing the completion status of fragment instances. This feature provides better tracking for computation-intensive queries, supplementing the scan progress bar.
Apache Jira: IMPALA-12048
- Allow implicit casts between numeric and string types when inserting into table
-
By default, Impala requires explicit casts between numeric and string-based literals. The new query option allow_unsafe_casts, which is turned off by default, allows implicit casting between some numeric types and string types when inserting into a table. For more information, see Implicit casting.
Apache Jira: https://issues.apache.org/jira/browse/IMPALA-10173
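The effect of the option can be sketched as follows. The table cast_demo is hypothetical, and the exact set of type combinations accepted is an assumption that may vary by release.

```sql
-- cast_demo is a hypothetical table.
CREATE TABLE cast_demo (id INT, label STRING);

-- With the default setting, the insert further below is expected to fail
-- analysis because the literal types do not match the column types.
SET ALLOW_UNSAFE_CASTS=true;

-- The string literal '42' is implicitly cast to INT,
-- and the numeric literal 7 is implicitly cast to STRING.
INSERT INTO cast_demo VALUES ('42', 7);
```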
- Optimize query planning by reducing getLocation() and getFileSystem() calls
-
This change reduces planning time by calling HdfsPartition.getLocation() once per partition and caching the FileSystem object based on the URI scheme and authority. This minimizes expensive decompression and redundant getFileSystem() calls, improving performance for queries with many partitions.
Apache Jira: IMPALA-12408
- JSON File Reader Prototype
-
This prototype enables reading JSON files using the rapidjson library with Arrow support such as HdfsJsonScanner, callback functions, and startup flag.
Apache Jira: IMPALA-10798
- CREATE TABLE LIKE for Kudu tables
-
Impala now supports the CREATE TABLE LIKE statement for Kudu tables.
Apache Jira: IMPALA-4052
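For the CREATE TABLE LIKE support for Kudu tables named in this entry, a minimal sketch might look as follows. The table names are hypothetical, and the assumption is that the new table takes its column definitions from the source Kudu table.

```sql
-- kudu_orders is a hypothetical existing Kudu table.
CREATE TABLE kudu_orders_copy LIKE kudu_orders;

-- Inspect the resulting definition to confirm what was carried over.
SHOW CREATE TABLE kudu_orders_copy;
```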
- Dedicated SPNEGO keytab for Impala web console authentication
-
Impala now supports a dedicated keytab for HTTP SPNEGO authentication, enabling easier management of Kerberos keytabs. A new --spnego_keytab_file flag lets you specify a separate keytab for the web console when --webserver_require_spnego is enabled. If this flag is set, the web server uses the SPNEGO keytab for HTTP authentication, while the main service keytab remains unchanged. If it is not specified, the web server defaults to using the primary service keytab for SPNEGO.
Apache Jira: IMPALA-12318
- Non-unique primary keys in Kudu
-
Kudu now supports non-unique primary keys by automatically adding an auto_increment_id column to form a unique composite primary key. This column, a system-generated big integer, ensures uniqueness within each tablet server region and is hidden unless specified in SELECT statements. ALTER TABLE modifications and UPSERT operations for this column are currently unsupported.
Apache Jira: IMPALA-11809
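From the Impala side, a table with a non-unique primary key might be declared as sketched below. The NON UNIQUE PRIMARY KEY syntax follows the upstream Impala documentation as best understood here, and the table and column names are hypothetical.

```sql
-- events, event_id, and payload are hypothetical names.
CREATE TABLE events (
  event_id INT NON UNIQUE PRIMARY KEY,
  payload STRING
)
PARTITION BY HASH (event_id) PARTITIONS 3
STORED AS KUDU;

-- Two rows can share the same event_id because the hidden
-- auto_increment_id column makes the composite key unique.
INSERT INTO events VALUES (1, 'first'), (1, 'second');

-- The system-generated column appears only when named explicitly.
SELECT event_id, payload, auto_increment_id FROM events;
```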
- Unicode column name support in Impala
-
Impala now supports Unicode characters in column names, aligning with Hive's support for non-ASCII characters. This enhancement leverages Hive's validateColumnName() function, which removes restrictions on column names at the metadata level. With this update, Impala allows greater flexibility for column naming while remaining consistent with Hive's metadata validation standards.
Apache Jira: IMPALA-12465
- Support custom hash partitions at range level in Kudu tables
-
Impala now supports specifying custom hash partitions at the range level in Kudu tables. You can define hash schemas within specific partitions using the updated CREATE TABLE and ALTER TABLE syntax, and view them with the new SHOW HASH SCHEMA statement. This update aligns hash partitioning more closely with range partitioning, enhancing flexibility while maintaining backward compatibility.
Apache Jira: IMPALA-11430
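The sketch below illustrates the idea using ALTER TABLE. The placement of the HASH clause after the range bounds is an assumption based on the upstream change description, and the table and column names are hypothetical. Check the exact grammar against the Impala SQL reference for your release.

```sql
-- metrics, id, ts_bucket, and val are hypothetical names.
-- The table-wide hash schema gives every range 3 hash buckets on id by default.
CREATE TABLE metrics (
  id INT,
  ts_bucket INT,
  val DOUBLE,
  PRIMARY KEY (id, ts_bucket)
)
PARTITION BY HASH (id) PARTITIONS 3,
RANGE (ts_bucket) (
  PARTITION 0 <= VALUES < 10,
  PARTITION 10 <= VALUES < 20
)
STORED AS KUDU;

-- Add a range partition that uses its own hash schema (5 buckets instead of 3).
ALTER TABLE metrics ADD RANGE PARTITION 20 <= VALUES < 30
  HASH (id) PARTITIONS 5;

-- List the hash schema in effect for each range partition.
SHOW HASH SCHEMA metrics;
```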
