What's New in Apache Impala
Learn about the new features of Impala in Cloudera Runtime 7.1.9.
Java Dependencies - JDK 17 support
Although Impala is primarily written in C++, it uses Java to communicate with various Hadoop components. Starting with this release, the officially supported JVMs for Impala include JDK 17 along with the existing Oracle and OpenJDK variants of JDK 8 and 11. When adding hosts to a new cluster, you are prompted to choose the JDK version, and you can configure any of the supported JDK versions. Internally, the impalad daemon relies on the JAVA_HOME environment variable to locate the system Java libraries.
RHEL 9 support
SLES 15 support
SLES 15 is supported from CDP 7.1.8-CHF8 and later.
Python 3.8 support on all CDP certified OSs
Note for Impala users running impala-shell from the parcel: Python 3 is supported from 7.1.8 CHF3 onward; CDP versions prior to that hotfix release support only Python 2. If you must use Python 3, make sure to use a version of CDP that supports it.
Note for Impala users running impala-shell downloaded from https://pypi.org/project/impala-shell/: with any version of CDP 7.x, you can use the latest PyPI release. If you must use Python 3 in your environment, make sure to use the latest PyPI impala-shell.
Impala WebUI improvements
This release enhanced the Impala daemon’s Web UI to display the following additional details:
- Backends start time and version: In a large cluster, you can now use the Impala daemon's Web UI to view the start time and version of all the backends.
- Query performance characteristics: For a detailed report on how a query was executed, and to understand its detailed performance characteristics, you can use the built-in web server's UI and look at the timeline shown in the Gantt chart. This chart is an alternative to the PROFILE command: a graphical display in the Web UI that renders timing information and dependencies.
- Export query plan and timeline: To understand the detailed performance characteristics of a query, you issue the PROFILE command in impala-shell immediately after executing it. As an alternative to the profile download page, this release adds support for exporting the graphical query plan and for downloading the timeline in SVG/HTML format. Once you export the query plan or the timeline, the memory consumed by the ObjectURLs is cleared.
- Historical/in-flight query performance: You can now use the query list and query details pages to analyze historical or in-flight query performance by viewing the memory consumed, the amount of data read, and other information about the query.
Ranger audit behavior enhancements
Before this release, Ranger authorization was invoked for each Impala object (database, table, and each column), which generated bulky audit logs when many columns were accessed. This release consolidates the access log entries for multiple columns of the same table into a single entry, which saves space.
Removing self events
Before this release, some metadata consistency issues led to query failures because, when processing metadata updates from multiple coordinators, a coordinator could not differentiate between self-generated events and events generated by a different coordinator. This issue is now resolved by adding a coordinator flag to each event; when processing an event, Impala checks this flag to decide whether to ignore the event.
Query hints for table cardinalities
Currently, Impala uses only simple estimation to compute predicate selectivity. For some predicates, the estimate can deviate significantly from the actual value, which leads to a worse query plan. You can now use a new query hint, SELECTIVITY, to specify a selectivity value for a predicate.
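A minimal sketch of the hint in use. The table, predicate, and selectivity value are illustrative, and the hint is written as a comment attached to the parenthesized predicate (check the Impala docs for the exact placement rules):

```sql
-- Tell the planner that roughly 10% of rows match this predicate.
-- Table and column names are hypothetical.
SELECT c_name, c_custkey
FROM customer
WHERE (c_nationkey = 20) /* +SELECTIVITY(0.1) */;
```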
JWT auth for Impala
Authentication is the mechanism to ensure that only specified hosts and users can connect to Impala. To use JWT authentication, you must configure it in CDP using Cloudera Manager. Clients, such as Impala shell, can then authenticate to Impala using a JWT instead of a username/password.
Improvements in rolling restart
This release supports rolling restart of the Impala service during a rolling upgrade. However, zero-downtime upgrade is not yet supported, because Impala is not an HA service and has singleton components such as the catalog and statestore. The rolling restart has nonetheless been made faster by restarting half of the cluster at a time.
Using Knox as a proxy
In both CDP public cloud data hubs and private cloud base, clients access Impala through Knox as a proxy because of Knox's ability to provide SSO. This is the officially encouraged technique, but it requires setting the impala-shell parameter --http_cookie_names=KNOX_BACKEND-IMPALA so that the cookie Knox uses for stateful connections is included. This configuration is needed for Active-Active HA to work for Impala.
Downgrade for Impala
After upgrading to CDP 7.1.9, if you must roll the software back to the pre-upgrade release (7.1.8) while preserving user data, do the following for the Impala service:
- Stop the running Impala service in Cloudera Manager.
- Roll back the parcel to the older release parcel.
- Start the Impala service.
Impala Ozone EC support
Impala now supports reading Ozone data stored with Erasure Coding (EC). The Ozone EC feature provides fault tolerance with reduced storage space while ensuring data durability similar to the Ratis THREE replication approach, so EC can be considered an alternative to replication.
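Reads from an EC bucket are transparent to SQL. A hedged sketch, assuming an Ozone path addressed via the ofs:// scheme; the service, volume, bucket, and table names are all hypothetical:

```sql
-- External table whose data lives in an EC-enabled Ozone bucket.
-- Service/volume/bucket names below are placeholders.
CREATE EXTERNAL TABLE sales_ec (id BIGINT, amount DECIMAL(10,2))
STORED AS PARQUET
LOCATION 'ofs://ozone-service/vol1/ec-bucket/sales';

-- Queries read the EC-coded data like any other table.
SELECT COUNT(*) FROM sales_ec;
```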
Spill to Ozone
You can now use Ozone as a scratch space for writing intermediate files during large sorts, joins, aggregations, or analytic function operations.
Ability to create an external table
A user can now create an external Kudu table pointing to an existing Kudu table if the user is granted the RWSTORAGE privilege on the resource specified by a storage handler URI. Before this release, the user needed the ALL privilege on SERVER to do this. The requirement has been simplified by introducing a new resource type, the storage handler URI, and a new access type, RWSTORAGE, both supported by Apache Ranger.
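The DDL itself is unchanged; only the required privilege is. A sketch with illustrative names, assuming the user holds RWSTORAGE on the corresponding storage handler URI:

```sql
-- External table mapped onto an existing Kudu table.
-- The Kudu table name 'impala::default.logs' is illustrative.
CREATE EXTERNAL TABLE logs_ext
STORED AS KUDU
TBLPROPERTIES ('kudu.table_name' = 'impala::default.logs');
```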
Ability to create a non-unique primary key for Kudu
Impala now supports creating a Kudu table with a non-unique primary key. When creating a Kudu table, specifying PRIMARY KEY is now optional. If no primary key attribute is specified, the partition key columns can be promoted to a non-unique primary key, provided they are the leading columns of the table.
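A minimal sketch of the explicit form, with an illustrative schema:

```sql
-- Kudu table with an explicitly non-unique primary key.
-- Multiple rows may share the same (event_time, user_id) pair.
CREATE TABLE events (
  event_time TIMESTAMP,
  user_id    BIGINT,
  detail     STRING,
  NON UNIQUE PRIMARY KEY (event_time, user_id)
)
PARTITION BY HASH (user_id) PARTITIONS 4
STORED AS KUDU;
```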
TPC-DS performance improvements
This release introduces the following enhancements across the planner and executor to improve query performance on the TPC decision support (TPC-DS) benchmark:
- Improved cardinality estimation for joins involving multiple conjuncts.
- Introduced new query options to improve memory estimation for aggregation nodes.
- Changed the query planner to improve parallel sizing and resource estimation. This is done for workload-aware autoscaling and is exposed as query options for tuning. This new functionality allows more customers to enable multi-threaded queries globally for improved performance.
Impala late materialization of columns
This release introduces late materialization, which optimizes certain queries on Parquet tables by limiting table scanning. Only relevant data is materialized to improve query response.
Binary support
Impala now supports BINARY columns for all table formats except Kudu. See the BINARY support topic for more information on using this arbitrary-length byte array data type in CREATE TABLE and SELECT statements.
ALTER VIEW support
Before this release, only the VIEW definition, VIEW name, and owner could be altered. Impala now also supports altering the table properties of a VIEW by using the SET TBLPROPERTIES clause.
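A minimal sketch; the view name and property key/value are illustrative:

```sql
-- Attach an arbitrary property to an existing view.
ALTER VIEW sales_v SET TBLPROPERTIES ('purpose' = 'quarterly reporting');
```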
BYTES function support
Impala now supports the BYTES() function. This function returns the number of bytes contained in a byte string.
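A minimal illustration (the literal is arbitrary):

```sql
-- Returns the number of bytes in the byte string.
SELECT BYTES('hello');
```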
Resolving ORC columns by names
Before this release, Impala resolved ORC columns by index. This release adds the query option ORC_SCHEMA_RESOLUTION, which supports resolving ORC columns by name.
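A session-level sketch; the value NAME is assumed here as the name-based setting, so check the query option's documentation for the exact accepted values:

```sql
-- Switch ORC column resolution from positional to name-based
-- for this session (value NAME is an assumption; table name
-- is illustrative).
SET ORC_SCHEMA_RESOLUTION=NAME;
SELECT * FROM orc_table;
```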
Retrieving the data file name
Impala now supports the virtual column INPUT__FILE__NAME in a standard SELECT statement, which retrieves the name of the data file that stores the actual row of a table.
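A minimal sketch; the table and column names are illustrative:

```sql
-- Show which data file each row comes from.
SELECT input__file__name, id
FROM sales
LIMIT 5;
```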
Min/Max filtering in Impala
When using the Parquet format, Impala can apply min/max filtering to a column at the partition, row group, page, or row level.
Reading and writing Parquet bloom filters
A Bloom filter is a performance optimization feature now available in Impala. The filter tells you, rapidly and memory-efficiently, whether the data you are looking for is present in a file.
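A hedged sketch of enabling Bloom filter writing for selected columns via a table property. The property name follows the Hive-style convention and is an assumption here; verify the exact property and any related query options in the Impala documentation:

```sql
-- Assumed property name: request Parquet Bloom filters for the
-- listed columns when Impala writes data files for this table.
CREATE TABLE probe (id BIGINT, code STRING)
STORED AS PARQUET
TBLPROPERTIES ('parquet.bloom.filter.columns' = 'id,code');
```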
Printing query results in vertical format
Impala-shell now includes a new command option '-E' or '--vertical' to support printing of query results in vertical format.
Added support for thrift-0.16.0
Limited support for Hive Generic UDFs
Hive has two generations of UDFs. This release adds limited support for the second generation, called GenericUDFs. The main limitations are as follows:
- Decimal types are not supported.
- Complex types are not supported.
- Functions are not extracted from the JAR file.
GenericUDFs cannot be made permanent; they must be recreated every time the server is restarted.
Reset all query options
UNSET ALL unsets all query options. This is especially useful when connections are reused, for example when a connection pool is used.
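A sketch of the pattern; the options set beforehand are arbitrary examples:

```sql
-- Options tuned earlier in the session...
SET MEM_LIMIT=2g;
SET EXPLAIN_LEVEL=2;

-- ...are all returned to their defaults before the connection
-- is handed back to the pool.
UNSET ALL;
```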