What's New in Apache Impala

Removing self-generated events

In previous releases, metadata consistency issues could result in query failures. The cause was that a coordinator processing metadata update events could not distinguish events it had generated itself from events generated by a different coordinator. This release addresses the issue by adding a coordinator flag to each event. When processing an event, Impala now examines the coordinator flag to determine whether to ignore the event or process it, which resolves the inconsistency and prevents these query failures.
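As a rough illustration of the idea, the sketch below models events tagged with the coordinator that produced them and skips those that originate from the processing coordinator itself. The `Event` class, the `coordinator_id` field, and `events_to_process` are hypothetical names chosen for this example; they are not Impala's internal API, and the real flag may be represented differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A metadata event tagged with its originating coordinator (hypothetical model)."""
    event_id: int
    table: str
    coordinator_id: str  # stands in for the "coordinator flag" described above


def events_to_process(events, local_coordinator_id):
    """Return only the events generated by other coordinators.

    Events whose flag matches the local coordinator are treated as
    self-generated and ignored (an assumption of this sketch).
    """
    return [e for e in events if e.coordinator_id != local_coordinator_id]


events = [
    Event(1, "sales", "coord-a"),
    Event(2, "sales", "coord-b"),
    Event(3, "users", "coord-a"),
]

# From coord-a's point of view, events 1 and 3 are self-generated.
remaining = events_to_process(events, "coord-a")
print([e.event_id for e in remaining])
```

Without the flag, a coordinator would have no way to tell events 1 and 3 apart from event 2 and might process its own changes a second time.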

Impala WebUI improvements

This release introduces significant enhancements to the Impala daemon’s Web UI, providing users with additional insights into the system's performance:

Backends Start Time and Version:

In large clusters, the Impala daemon’s Web UI now allows you to easily access and view the start time and version details for all backends.

Query Performance Characteristics:

Gain deeper insights into query execution with a detailed report on how a query was executed. The built-in web server’s UI features a Gantt chart timeline, serving as an alternative to the PROFILE command. This graphical display in the Web UI renders timing information and dependencies.

Export Query Plan and Timeline:

As an alternative to the PROFILE command's profile download page, this release adds support for exporting the graphical query plan and downloading the timeline in SVG/HTML format. The export also releases the memory resources consumed by the ObjectURLs.

Historical/In-flight Query Performance:

The query list and query details page now offer the capability to analyze historical or in-flight query performance. Users can access information such as memory consumption, data read, and other relevant details about each query.

JWT auth for Impala

Authentication is a crucial mechanism to secure connections to Impala, ensuring that only designated hosts and users can access the system. To implement JSON Web Token (JWT) authentication for Impala, follow these steps:

Configuration in CDP with Cloudera Manager:

Begin by configuring JWT authentication in Cloudera Data Platform (CDP) using Cloudera Manager. This involves setting up the necessary parameters and security settings to enable JWT authentication.

Client Authentication:

Once JWT authentication is configured, clients—such as the Impala shell—can authenticate to Impala using a JWT instead of the traditional username/password combination. This enhances security and provides an alternative, token-based approach to authentication.

By adopting JWT authentication, Impala ensures a more secure and efficient authentication process for connecting hosts and users. This method offers a modern and flexible alternative to the conventional username/password authentication mechanism.

TPC-DS performance improvements

This release incorporates several enhancements across the planner and executor components to elevate query performance and align with the TPC decision support (TPC-DS) benchmark standards. The key improvements include:

Cardinality Estimation for Joins:

Significantly enhances cardinality estimation for joins involving multiple conjuncts, leading to more accurate query execution plans and improved performance.

Memory Estimation for Aggregation Nodes:

Introduces new query options specifically designed to enhance memory estimation for aggregation nodes. This optimization contributes to more efficient memory utilization during query execution.

Planner changes for CPU usage:

Implements changes in the query planner to enhance parallel sizing and resource estimation, catering to workload-aware autoscaling. The introduced query options allow users to fine-tune these settings for improved CPU utilization and overall performance. This feature enables the global activation of multi-threaded queries, offering enhanced scalability.

Late Materialization of Columns:

Introduces late materialization, a feature optimizing certain queries on Parquet tables. This optimization minimizes table scanning by materializing only the relevant data, thereby improving query response times.

These improvements collectively contribute to a more robust and efficient Impala system, ensuring optimal performance and compliance with TPC-DS benchmark standards. Users can leverage the new query options for tuning purposes and take advantage of late materialization to enhance the processing of queries on Parquet tables.
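The idea behind late materialization can be sketched as follows: evaluate the predicate on the filter column first, then read the remaining columns only for the rows that passed. This is a conceptual model over in-memory lists that assumes a simple columnar layout; `scan_with_late_materialization` and the sample table are hypothetical and do not reflect Impala's Parquet scanner.

```python
def scan_with_late_materialization(columns, filter_column, predicate, projected):
    """Return projected rows, reading non-filter columns only for matching rows.

    columns: dict mapping column name -> list of values (a toy columnar table).
    Returns (rows, values_read), where values_read counts the values touched
    in the projected columns other than the filter column.
    """
    # Step 1: evaluate the predicate on the filter column alone.
    matching = [i for i, v in enumerate(columns[filter_column]) if predicate(v)]

    # Step 2: materialize the other columns only at the matching positions.
    rows = []
    values_read = 0
    for i in matching:
        row = {}
        for name in projected:
            row[name] = columns[name][i]
            if name != filter_column:
                values_read += 1
        rows.append(row)
    return rows, values_read


table = {
    "id": [1, 2, 3, 4, 5, 6],
    "country": ["us", "de", "us", "fr", "de", "us"],
    "payload": ["a", "b", "c", "d", "e", "f"],
}

rows, values_read = scan_with_late_materialization(
    table, "country", lambda c: c == "de", ["id", "country", "payload"]
)
print(rows, values_read)
```

In this toy table, only 4 values are read from the `id` and `payload` columns, where an eager scan would read all 12 before filtering.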

Resetting all query options

The UNSET ALL command provides a convenient way to reset all query options. It is particularly valuable where connections are reused, such as with a connection pool. Executing UNSET ALL unsets every query option, so that subsequent queries on the same connection run with default settings.
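One way a connection pool could use this is to issue the command whenever a connection is handed back. The sketch below assumes a DB-API style cursor with an `execute` method. `RecordingCursor` is a stand-in so the example runs without a cluster, `borrowed` is a hypothetical helper, and whether a particular client driver forwards `UNSET ALL` unchanged is an assumption to verify.

```python
from contextlib import contextmanager


class RecordingCursor:
    """Stand-in for a DB-API cursor; it only records the statements it receives."""

    def __init__(self):
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)


@contextmanager
def borrowed(cursor):
    """Lend out a pooled cursor and reset its query options on return."""
    try:
        yield cursor
    finally:
        # Clear any options the borrower set, so the next user starts from defaults.
        cursor.execute("UNSET ALL")


cursor = RecordingCursor()
with borrowed(cursor) as cur:
    cur.execute("SET MEM_LIMIT=2g")
    cur.execute("SELECT 1")

print(cursor.statements)
```

Placing the reset in the `finally` clause means it also runs when the borrower's query raises an exception.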

Limited support for Hive Generic UDFs

In this release, support for the second generation of Hive User-Defined Functions (UDFs), known as GenericUDFs, is introduced. However, it comes with certain limitations that users should be aware of:

Decimal Types Not Supported:

GenericUDFs in this release do not support decimal types, and using them with decimal data may lead to errors.

Complex Types Not Supported:

Complex types are not currently supported with GenericUDFs. Keep this restriction in mind when working with UDFs that involve complex data structures.

Functions Not Extracted from JAR Files:

Unlike other UDF types, GenericUDFs do not automatically extract functions from JAR files. Users need to manually manage and ensure that the required functions are appropriately included for use.

Non-Permanent Nature:

GenericUDFs created in this release are not permanent and will not persist across server restarts. Recreating them is necessary after each server restart to maintain functionality.

These limitations are important considerations for users employing GenericUDFs in their workflows. Evaluate these constraints and plan accordingly when incorporating GenericUDFs into Impala queries.

Printing Query Results in Vertical Format

In this release, impala-shell introduces a new command-line option, -E or --vertical, that prints query results in a vertical format. This provides a more readable display of query output.
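To show what a vertical layout means, the sketch below prints each row as one `column: value` pair per line. It does not reproduce the exact separators or alignment that impala-shell emits; `format_vertical` is a hypothetical helper written for this illustration.

```python
def format_vertical(columns, rows):
    """Render rows vertically: one 'name: value' line per column, per row."""
    width = max(len(name) for name in columns)
    lines = []
    for index, row in enumerate(rows, start=1):
        lines.append(f"--- row {index} ---")
        for name, value in zip(columns, row):
            # Right-align the column names so the values line up.
            lines.append(f"{name.rjust(width)}: {value}")
    return "\n".join(lines)


print(format_vertical(["id", "name"], [(1, "alice"), (2, "bob")]))
```

A vertical layout is most helpful for rows with many or wide columns, which would otherwise wrap in a tabular display.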

Retrieving the Data File Name

Impala now supports the virtual column INPUT__FILE__NAME in a standard SELECT statement. With the syntax SELECT INPUT__FILE__NAME FROM <tablename>, users can retrieve the name of the data file that stores each row of a table. This provides insight into how the underlying data is organized.

Resolving ORC Columns by Names

In previous releases, Impala resolved ORC columns by index. This release adds a new query option, ORC_SCHEMA_RESOLUTION, that allows users to resolve ORC columns by name, offering a more flexible and intuitive approach to working with ORC data.
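The difference between the two strategies shows up when the column order in a file differs from the table definition. The sketch below is a conceptual model and not Impala's ORC reader; `resolve_by_position` and `resolve_by_name` are hypothetical helpers, and treating a missing column as `None` is an assumption of this example.

```python
def resolve_by_position(table_columns, file_columns, file_row):
    """Map the i-th table column to the i-th value in the file row."""
    # file_columns is unused here: positional resolution ignores file column names.
    return {name: file_row[i] for i, name in enumerate(table_columns)}


def resolve_by_name(table_columns, file_columns, file_row):
    """Map each table column to the file value stored under the same name."""
    by_name = dict(zip(file_columns, file_row))
    # Columns missing from the file resolve to None in this sketch.
    return {name: by_name.get(name) for name in table_columns}


table_columns = ["id", "name"]
file_columns = ["name", "id"]  # file written with a different column order
file_row = ["alice", 1]

print(resolve_by_position(table_columns, file_columns, file_row))
print(resolve_by_name(table_columns, file_columns, file_row))
```

Positional resolution pairs `id` with `"alice"` here, while name-based resolution returns the intended values.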

Reading and Writing Parquet Bloom Filters

This release introduces a performance optimization in Impala: support for reading and writing Parquet Bloom filters. A Bloom filter provides a rapid and memory-efficient check of whether the desired data may be present in a file, which improves efficiency when working with Parquet files.
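A Bloom filter can report that a value is definitely absent or that it may be present. The sketch below is a classic Bloom filter built on `hashlib`; it illustrates the principle only and is not the on-disk Bloom filter format that Parquet defines. The class name, bit count, and hashing scheme are illustrative choices.

```python
import hashlib


class BloomFilter:
    """Classic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, value):
        # Derive num_hashes bit positions by salting the value with an index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )


bloom = BloomFilter()
for customer_id in ("c-100", "c-200", "c-300"):
    bloom.add(customer_id)

print(bloom.might_contain("c-200"))
```

A `False` answer allows a reader to skip the data entirely, while a `True` answer still requires reading it to confirm the match.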

BYTES Function Support

Impala now incorporates support for the BYTES() function. This function efficiently returns the number of bytes contained within a byte string. Users can leverage this functionality to gain insights into the size of byte strings within their data.
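A byte count differs from a character count whenever a string contains multi-byte characters. The sketch below shows that distinction in Python, assuming UTF-8 encoded text; it is an analogy for what a byte-length function measures and does not call Impala. `byte_length` is a hypothetical helper.

```python
def byte_length(text):
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


# Escapes are used so the code points are unambiguous:
# "caf\u00e9" ends in a single precomposed e-acute, "\u65e5\u672c" is two CJK characters.
for sample in ("impala", "caf\u00e9", "\u65e5\u672c"):
    print(sample, "characters:", len(sample), "bytes:", byte_length(sample))
```

For ASCII-only text the two counts agree; they diverge as soon as accented or non-Latin characters appear.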

Min/Max Filtering in Impala

For tables in the Parquet format, Impala can now perform min/max filtering at the Parquet row group, page, and row levels, skipping the row group, page, or row during scans. This provides a more granular and targeted approach to data analysis. For more information, see the documentation on minimum or maximum filtering.
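The row-group level of this technique can be sketched as follows. Each row group exposes minimum and maximum statistics for a column, and a scan skips any group whose range cannot contain a match. `RowGroup` and `scan_equal` are hypothetical names, and the statistics are computed on the fly here, whereas a real file stores them in its metadata.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """A toy row group with min/max statistics for a single column."""
    values: List[int]

    @property
    def min_value(self):
        return min(self.values)

    @property
    def max_value(self):
        return max(self.values)


def scan_equal(row_groups, target):
    """Return matching values and the number of row groups actually read."""
    matches = []
    groups_read = 0
    for group in row_groups:
        # Skip the group when the target lies outside its [min, max] range.
        if target < group.min_value or target > group.max_value:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v == target)
    return matches, groups_read


row_groups = [RowGroup([1, 5, 9]), RowGroup([10, 14, 19]), RowGroup([20, 25, 29])]
matches, groups_read = scan_equal(row_groups, 14)
print(matches, groups_read)
```

Only one of the three groups is read for the value 14. The benefit depends on how well the data is clustered, since overlapping ranges cannot be skipped.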

DDL Support for Bucketed Tables

In the latest release, Impala introduces Data Definition Language (DDL) support for bucketed tables. This feature enables users to optimize query performance by creating tables with bucketing. Leveraging the CLUSTER BY clause, this functionality facilitates the partitioning of data into smaller, more manageable segments based on specified columns. This enhancement contributes to improved query efficiency and data organization.

Support for Collections of Fixed-Length Types as Non-Passthrough Children of Unions

In this release, Impala introduces support for collections of fixed-length types as non-passthrough children of unions. While plain UNIONs are not yet supported for any collections, UNION ALL operations are fully supported. Users can take advantage of this feature to combine and analyze data efficiently within complex queries.

Example:

select id, int_array from complextypestbl
union all
select cast(id as tinyint), int_array from complextypestbl;

Support for ORDER BY in Collections of Fixed-Length Types in SELECT List

With this release, Impala now supports collections of fixed-length types in the sorting tuple. Although sorting directly by these collection columns is not permitted, they can be included in the SELECT list alongside other columns by which sorting is applied. This enhancement provides users with greater flexibility in organizing and presenting query results.

Support for Complex Types in SELECT List

In this release, Impala introduces comprehensive support for complex types in the SELECT list. While collections and structs were previously supported, the nesting and mixing of complex types were not. Now, users can leverage the flexibility of embedding complex types into other complex types, providing enhanced versatility in query results. For detailed information and any limitations, refer to the "Allowing Embedding Complex Types into Other Complex Types" section in the Complex types documentation.

Structs in SELECT List with Beeswax

In previous releases, structs in the select list were limited to the HS2 protocol. With this release, the support for structs in the select list is extended to Beeswax as well. Users can now benefit from using structs in the select list when interacting with Beeswax, improving the consistency of functionality across different protocols.

Query Hints for Table Cardinalities

Impala now offers improved control over query planning with the introduction of query hints for table cardinalities. Previously, Impala relied on simple estimation to compute selectivity, which could deviate significantly from actual values for certain predicates, leading to suboptimal query plans. With the addition of a new query hint, 'SELECTIVITY', users can now specify selectivity values for predicates, enabling more accurate query planning and better overall performance.
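Why selectivity matters can be seen with simple arithmetic: a planner commonly estimates the rows that survive a predicate as the input row count multiplied by the predicate's selectivity. The sketch below uses made-up numbers and a hypothetical `estimate_rows` helper. It shows how a user-supplied selectivity changes the estimate; it does not show the hint's SQL syntax or Impala's actual estimation code.

```python
def estimate_rows(input_rows, selectivity):
    """Estimated output cardinality of a predicate: input rows times selectivity."""
    if not 0.0 <= selectivity <= 1.0:
        raise ValueError("selectivity must be between 0 and 1")
    return round(input_rows * selectivity)


table_rows = 1_000_000

# A generic planner guess (the value 0.1 is made up for illustration).
default_guess = estimate_rows(table_rows, 0.1)

# A selectivity supplied by a user who knows the predicate is very restrictive.
hinted = estimate_rows(table_rows, 0.0005)

print(default_guess, hinted)
```

An estimate that is off by orders of magnitude, as in this example, can change decisions such as join order, which is why an accurate selectivity can lead to a better plan.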