What's New in Apache Impala
Learn about the new features of Impala in Cloudera Runtime 7.1.9.
Java Dependencies - JDK 17 support
Although Impala is primarily written in C++, it uses Java to communicate with various Hadoop components. Starting with this release, the officially supported JVMs for Impala include JDK 17 along with the existing Oracle and OpenJDK variants of JDK 8 and 11. When adding hosts to a new cluster, you are prompted to choose the JDK version, and you can configure any of the supported JDK versions. Internally, the impalad daemon relies on the JAVA_HOME environment variable to locate the system Java libraries.
RHEL 9 support
SLES 15 support
SLES 15 is supported from CDP 7.1.8-CHF8 and later.
Python 3.8 support on all CDP certified OSs
Note for Impala users running impala-shell from the parcel: Python 3 is supported from 7.1.8 CHF3 onward; CDP versions prior to that hotfix release support only Python 2. If you must use Python 3, make sure to use a version of CDP that supports it.
Note for Impala users running impala-shell downloaded from https://pypi.org/project/impala-shell/: with any version of CDP 7.x, you can use the latest PyPI release. If you must use Python 3 in your environment, make sure to use the latest PyPI impala-shell.
Impala WebUI improvements
This release enhanced the Impala daemon’s Web UI to display the following additional details:
- Backends start time and version: In a large cluster, you can now use the Impala daemon's Web UI to view the start time and version of all the backends.
- Query performance characteristics: For a detailed report on how a query was executed, and to understand its detailed performance characteristics, you can use the built-in web server's UI and look at the timeline shown in the Gantt chart. This chart is an alternative to the PROFILE command: a graphical display in the Web UI that renders timing information and dependencies.
- Export query plan and timeline: To understand the detailed performance characteristics of a query, you issue the PROFILE command in impala-shell immediately after executing it. As an alternative to the profile download page, this release adds support for exporting the graphical query plan and for downloading the timeline in SVG/HTML format. Once you export the query plan or the timeline, the memory consumed by the ObjectURLs is cleared.
- Historical/in-flight query performance: You can now use the query list and query details pages to analyze historical or in-flight query performance by viewing the memory consumed, the amount of data read, and other information about the query.
Ranger audit behavior enhancements
Before this release, Ranger authorization was invoked for each Impala object (database, table, and each column), which generated bulky audit logs when many columns were accessed. This release consolidates the access log entries for multiple columns of the same table into a single entry, which saves space.
Removing self events
Before this release, some metadata consistency issues led to query failures because, when processing metadata updates from multiple coordinators, a coordinator could not differentiate between self-generated events and events generated by a different coordinator. This issue is now resolved by adding a coordinator flag to each event; when processing an event, Impala checks this flag to decide whether to ignore the event.
Query hints for table cardinalities
Currently, Impala uses only simple estimation to compute predicate selectivity. For some predicates, the estimate can deviate significantly from the actual value, which leads to a worse query plan. You can now use a new query hint, SELECTIVITY, to specify a selectivity value for a predicate.
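A minimal sketch of the hint in use. The table, predicate, and selectivity value are illustrative, and the hint is written as a comment attached to the parenthesized predicate (check the Impala docs for the exact placement rules):

```sql
-- Tell the planner that roughly 10% of rows match this predicate.
-- Table and column names are hypothetical.
SELECT c_name, c_custkey
FROM customer
WHERE (c_nationkey = 20) /* +SELECTIVITY(0.1) */;
```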
JWT auth for Impala
Authentication is the mechanism to ensure that only specified hosts and users can connect to Impala. To use JWT authentication, you must configure it in CDP using Cloudera Manager. Clients, such as Impala shell, can then authenticate to Impala using a JWT instead of a username/password.
Improvements in rolling restart
This release supports rolling restart of the Impala service during a rolling upgrade. However, zero-downtime upgrade is not yet supported, because Impala is not an HA service and has singleton components such as the catalog and statestore. The rolling restart has nonetheless been made faster by restarting half of the cluster at a time.
Using Knox as a proxy
In both CDP public cloud data hubs and private cloud base, clients access Impala through Knox as a proxy because of Knox's ability to provide SSO. This is the officially encouraged technique, but it requires setting the impala-shell parameter --http_cookie_names=KNOX_BACKEND-IMPALA so that the cookie Knox uses for stateful connections is included. This configuration is needed for Active-Active HA to work for Impala.
Downgrade for Impala
After upgrading to CDP 7.1.9, if you must roll the software back to the pre-upgrade release (7.1.8) while preserving user data, do the following for the Impala service:
- Stop the running Impala service in Cloudera Manager.
- Roll back the parcel to the older release parcel.
- Start the Impala service.
Impala Ozone EC support
Impala now supports reading Ozone data stored with Erasure Coding (EC). The Ozone EC feature provides fault tolerance with reduced storage space while ensuring data durability similar to the Ratis THREE replication approach, so EC can be considered an alternative to replication.
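Reads from an EC bucket are transparent to SQL. A hedged sketch, assuming an Ozone path addressed via the ofs:// scheme; the service, volume, bucket, and table names are all hypothetical:

```sql
-- External table whose data lives in an EC-enabled Ozone bucket.
-- Service/volume/bucket names below are placeholders.
CREATE EXTERNAL TABLE sales_ec (id BIGINT, amount DECIMAL(10,2))
STORED AS PARQUET
LOCATION 'ofs://ozone-service/vol1/ec-bucket/sales';

-- Queries read the EC-coded data like any other table.
SELECT COUNT(*) FROM sales_ec;
```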
Spill to Ozone
You can now use Ozone as a scratch space for writing intermediate files during large sorts, joins, aggregations, or analytic function operations.
Ability to create an external table
A user can now create an external Kudu table pointing to an existing Kudu table if the user is granted the RWSTORAGE privilege on the resource specified by a storage handler URI. Before this release, the user needed the ALL privilege on SERVER to do this. The requirement has been simplified by introducing a new resource type, the storage handler URI, and a new access type, RWSTORAGE, both supported by Apache Ranger.
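The DDL itself is unchanged; only the required privilege is. A sketch with illustrative names, assuming the user holds RWSTORAGE on the corresponding storage handler URI:

```sql
-- External table mapped onto an existing Kudu table.
-- The Kudu table name 'impala::default.logs' is illustrative.
CREATE EXTERNAL TABLE logs_ext
STORED AS KUDU
TBLPROPERTIES ('kudu.table_name' = 'impala::default.logs');
```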
Ability to create a non-unique primary key for Kudu
Impala now supports creating a Kudu table with a non-unique primary key. When creating a Kudu table, specifying PRIMARY KEY is now optional. If no primary key attribute is specified, the partition key columns can be promoted to a non-unique primary key, provided they are the leading columns of the table.
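A minimal sketch of the explicit form, with an illustrative schema:

```sql
-- Kudu table with an explicitly non-unique primary key.
-- Multiple rows may share the same (event_time, user_id) pair.
CREATE TABLE events (
  event_time TIMESTAMP,
  user_id    BIGINT,
  detail     STRING,
  NON UNIQUE PRIMARY KEY (event_time, user_id)
)
PARTITION BY HASH (user_id) PARTITIONS 4
STORED AS KUDU;
```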
TPC-DS performance improvements
This release introduces the following enhancements across the planner and executor to improve query performance on the TPC decision support (TPC-DS) benchmark:
- Improved cardinality estimation for joins involving multiple conjuncts.
- Introduced new query options to improve memory estimation for aggregation nodes.
- Changed the query planner to improve parallel sizing and resource estimation. This is done for workload-aware autoscaling and is exposed as query options for tuning. This new functionality allows more customers to enable multi-threaded queries globally for improved performance.
Impala late materialization of columns
This release introduces late materialization, which optimizes certain queries on Parquet tables by limiting table scanning. Only relevant data is materialized to improve query response.
Binary support
Impala now supports BINARY columns for all table formats except Kudu. See the BINARY support topic for more information on using this arbitrary-length byte array data type in CREATE TABLE and SELECT statements.
ALTER VIEW support
Before this release, only the VIEW definition, VIEW name, and owner could be altered. Impala now also supports altering the table properties of a VIEW by using the SET TBLPROPERTIES clause.
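A minimal sketch; the view name and property key/value are illustrative:

```sql
-- Attach an arbitrary property to an existing view.
ALTER VIEW sales_v SET TBLPROPERTIES ('purpose' = 'quarterly reporting');
```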
BYTES function support
Impala now supports the BYTES() function. This function returns the number of bytes contained in a byte string.
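A minimal illustration (the literal is arbitrary):

```sql
-- Returns the number of bytes in the byte string.
SELECT BYTES('hello');
```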
Resolving ORC columns by names
Before this release, Impala resolved ORC columns by index. This release adds the query option ORC_SCHEMA_RESOLUTION, which supports resolving ORC columns by name.
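A session-level sketch; the value NAME is assumed here as the name-based setting, so check the query option's documentation for the exact accepted values:

```sql
-- Switch ORC column resolution from positional to name-based
-- for this session (value NAME is an assumption; table name
-- is illustrative).
SET ORC_SCHEMA_RESOLUTION=NAME;
SELECT * FROM orc_table;
```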
Retrieving the data file name
Impala now supports the virtual column INPUT__FILE__NAME in a standard SELECT statement, which retrieves the name of the data file that stores the actual row of a table.
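A minimal sketch; the table and column names are illustrative:

```sql
-- Show which data file each row comes from.
SELECT input__file__name, id
FROM sales
LIMIT 5;
```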
Min/Max filtering in Impala
When using the Parquet format, Impala can apply min/max filtering to a column at the partition, row group, page, or row level.
Reading and writing Parquet bloom filters
A Bloom filter is a performance optimization feature now available in Impala. The filter tells you, rapidly and memory-efficiently, whether the data you are looking for is present in a file.
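A hedged sketch of enabling Bloom filter writing for selected columns via a table property. The property name follows the Hive-style convention and is an assumption here; verify the exact property and any related query options in the Impala documentation:

```sql
-- Assumed property name: request Parquet Bloom filters for the
-- listed columns when Impala writes data files for this table.
CREATE TABLE probe (id BIGINT, code STRING)
STORED AS PARQUET
TBLPROPERTIES ('parquet.bloom.filter.columns' = 'id,code');
```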
Printing query results in vertical format
Impala-shell now includes a new command option '-E' or '--vertical' to support printing of query results in vertical format.
Added support for thrift-0.16.0
Limited support for Hive Generic UDFs
Hive has two generations of UDFs. This release adds limited support for the second generation, called GenericUDFs. The main limitations are as follows:
- Decimal types are not supported.
- Complex types are not supported.
- Functions are not extracted from the JAR file.
GenericUDFs cannot be made permanent; they must be recreated every time the server is restarted.
Reset all query options
UNSET ALL unsets all query options. This is especially useful when connections are reused, for example when a connection pool is used.
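A sketch of the pattern; the options set beforehand are arbitrary examples:

```sql
-- Options tuned earlier in the session...
SET MEM_LIMIT=2g;
SET EXPLAIN_LEVEL=2;

-- ...are all returned to their defaults before the connection
-- is handed back to the pool.
UNSET ALL;
```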