What's New in Apache Impala

Learn about the new features of Impala in Cloudera Runtime 7.2.18.

Removing self-generated events

In previous releases, metadata consistency issues could cause query failures. Coordinators processing metadata update events could not distinguish events they had generated themselves from events generated by a different coordinator. This release adds a coordinator flag to each event. When processing events, Impala examines the flag to decide whether to ignore the event or process it, which resolves the inconsistency and prevents the query failures.
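The flag check described above can be sketched roughly as follows. This is an illustrative model only, not Impala's event-processing code: the `MetastoreEvent` class, the `coordinator_id` field (standing in for the coordinator flag), and the `events_to_apply` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetastoreEvent:
    event_id: int
    table: str
    coordinator_id: str  # hypothetical stand-in for the coordinator flag


def events_to_apply(events, local_coordinator_id):
    """Return only the events that another coordinator generated."""
    return [e for e in events if e.coordinator_id != local_coordinator_id]


events = [
    MetastoreEvent(1, "sales", "coord-a"),
    MetastoreEvent(2, "sales", "coord-b"),
    MetastoreEvent(3, "returns", "coord-a"),
]

# coord-a ignores events 1 and 3 because it generated them itself.
applied = events_to_apply(events, "coord-a")
print([e.event_id for e in applied])
```

A coordinator that generated none of the events, such as a hypothetical `coord-c`, would apply all three.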

Impala WebUI improvements

This release introduces significant enhancements to the Impala daemon’s Web UI, providing users with additional insights into the system's performance:

Backends Start Time and Version:

  • In large clusters, the Impala daemon’s Web UI now allows you to easily access and view the start time and version details for all backends.

Query Performance Characteristics:

  • Gain deeper insights into query execution with a detailed report on how a query was executed. The built-in web server’s UI features a Gantt chart timeline, serving as an alternative to the PROFILE command. This graphical display in the Web UI renders timing information and dependencies.

Export Query Plan and Timeline:

  • As an alternative to the PROFILE command's profile download page, this release adds support for exporting the graphical query plan and downloading the timeline in SVG or HTML format. Exporting these elements also releases the memory consumed by the ObjectURLs.

Historical/In-flight Query Performance:

  • The query list and query details page now offer the capability to analyze historical or in-flight query performance. Users can access information such as memory consumption, data read, and other relevant details about each query.

JWT auth for Impala

Authentication is a crucial mechanism to secure connections to Impala, ensuring that only designated hosts and users can access the system. To implement JSON Web Token (JWT) authentication for Impala, follow these steps:

Configuration in CDP with Cloudera Manager:

Client Authentication:

  • Once JWT authentication is configured, clients such as the Impala shell can authenticate to Impala with a JWT instead of the traditional username and password combination. This provides a token-based alternative to password authentication.

By adopting JWT authentication, Impala ensures a more secure and efficient authentication process for connecting hosts and users. This method offers a modern and flexible alternative to the conventional username/password authentication mechanism.
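For readers unfamiliar with the token format, the sketch below shows the general shape of a JWT: three base64url segments (header, payload, signature) joined by dots. It only builds and decodes a sample token with the Python standard library. The claim values and the placeholder signature are invented for illustration, and the sketch performs no signature verification, which a real server must do.

```python
import base64
import json


def b64url_encode(raw: bytes) -> str:
    """Base64url-encode without padding, as JWT segments are encoded."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    """Restore the stripped padding, then decode."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def decode_claims(token: str) -> dict:
    """Return the payload claims. This does NOT verify the signature."""
    _header_b64, payload_b64, _signature = token.split(".")
    return json.loads(b64url_decode(payload_b64))


# Sample values are assumptions for illustration only.
header = b64url_encode(json.dumps({"alg": "RS256", "typ": "JWT"}).encode())
payload = b64url_encode(json.dumps({"sub": "analyst", "exp": 1893456000}).encode())
token = f"{header}.{payload}.placeholder-signature"

print(decode_claims(token))
```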

TPC-DS performance improvements

This release incorporates several enhancements across the planner and executor components to elevate query performance and align with the TPC decision support (TPC-DS) benchmark standards. The key improvements include:

Cardinality Estimation for Joins:

Memory Estimation for Aggregation Nodes:

  • Introduces new query options specifically designed to enhance memory estimation for aggregation nodes. This optimization contributes to more efficient memory utilization during query execution.

Planner changes for CPU usage:

  • Implements changes in the query planner to enhance parallel sizing and resource estimation, catering to workload-aware autoscaling. The introduced query options allow users to fine-tune these settings for improved CPU utilization and overall performance. This feature enables the global activation of multi-threaded queries, offering enhanced scalability.

Late Materialization of Columns:

  • Introduces late materialization, a feature optimizing certain queries on Parquet tables. This optimization minimizes table scanning by materializing only the relevant data, thereby improving query response times.

Together, these improvements make Impala more efficient on TPC-DS-style workloads. You can use the new query options for tuning and take advantage of late materialization to speed up queries on Parquet tables.
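The idea behind late materialization can be sketched as follows: evaluate the predicate on the filter column first, then read the remaining columns only for the rows that survive. The column-dictionary layout and the `scan_with_late_materialization` helper are simplifications invented for this example and do not reflect Impala's Parquet scanner internals.

```python
def scan_with_late_materialization(columns, filter_column, predicate, wanted):
    """Read the filter column first, then materialize only surviving rows.

    `columns` maps a column name to a list of values, a toy stand-in for
    column chunks in a columnar file.
    """
    matching = [i for i, v in enumerate(columns[filter_column]) if predicate(v)]
    # Values of the other columns are touched only for matching row indexes.
    return [{name: columns[name][i] for name in wanted} for i in matching]


table = {
    "id": [1, 2, 3, 4],
    "region": ["eu", "us", "eu", "apac"],
    "amount": [10.0, 25.5, 7.25, 99.0],
}

rows = scan_with_late_materialization(
    table, "region", lambda r: r == "eu", wanted=["id", "amount"]
)
print(rows)
```

In this toy table only two of the four rows have their `id` and `amount` values read; the more selective the predicate, the more work is avoided.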

Resetting all query options

The UNSET ALL command resets all query options in a single step. This is particularly useful when connections are reused, for example through a connection pool. Running UNSET ALL unsets every query option, so subsequent queries on that connection run with default settings instead of inheriting options set earlier.
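A connection pool might apply this pattern by issuing UNSET ALL whenever a session is returned. The sketch below is a minimal illustration that assumes a cursor object with an `execute(sql)` method. `RecordingCursor` and `PooledSession` are hypothetical classes written for this example, and the cursor only records statements rather than talking to Impala.

```python
class RecordingCursor:
    """Hypothetical stand-in for a database cursor; it only records SQL."""

    def __init__(self):
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)


class PooledSession:
    """Hypothetical pool wrapper that resets query options on release."""

    def __init__(self, cursor):
        self.cursor = cursor

    def __enter__(self):
        return self.cursor

    def __exit__(self, exc_type, exc, tb):
        # Clear any options the borrower set before the next reuse.
        self.cursor.execute("UNSET ALL")
        return False


cursor = RecordingCursor()
with PooledSession(cursor) as cur:
    cur.execute("SET MEM_LIMIT=2g")
    cur.execute("SELECT 1")

print(cursor.statements)
```

Resetting on release rather than on checkout is a design choice of this sketch; either placement gives the next borrower default settings.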

Limited support for Hive Generic UDFs

In this release, support for the second generation of Hive User-Defined Functions (UDFs), known as GenericUDFs, is introduced. However, it comes with certain limitations that users should be aware of:

Decimal Types Not Supported:

  • GenericUDFs in this release do not provide support for decimal types, and their usage with such data types may lead to limitations or errors.

Complex Types Not Supported:

  • Complex types are not currently supported. Keep this restriction in mind when working with UDFs that involve complex data structures.

Functions Not Extracted from JAR Files:

  • Unlike other UDF types, GenericUDFs do not automatically extract functions from JAR files. Users need to manually manage and ensure that the required functions are appropriately included for use.

Non-Permanent Nature:

  • GenericUDFs created in this release are not permanent and will not persist across server restarts. Recreating them is necessary after each server restart to maintain functionality.

Evaluate these limitations and plan accordingly before using GenericUDFs in Impala queries.

Printing Query Results in Vertical Format

impala-shell adds a new command-line option, -E or --vertical, that prints query results in vertical format. This gives a more readable display of query output.

Retrieving the Data File Name

Impala now supports a virtual column, INPUT__FILE__NAME, in a standard SELECT statement. With the syntax SELECT INPUT__FILE__NAME FROM <tablename>, you can retrieve the name of the data file that stores each row of a table. This shows how the underlying data is organized.

Resolving ORC Columns by Names

In previous releases, Impala resolved ORC columns by index. This release adds a new query option, ORC_SCHEMA_RESOLUTION, that lets you resolve ORC columns by name instead, which is a more flexible and intuitive way to work with ORC data.
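The difference between the two resolution strategies shows up when a file's column order differs from the table's. The sketch below models both strategies on plain lists of column names. The helper functions are hypothetical and ignore details such as case sensitivity and nested types.

```python
def resolve_by_position(table_columns, file_columns):
    """Map each table column to the file column at the same index."""
    return dict(zip(table_columns, file_columns))


def resolve_by_name(table_columns, file_columns):
    """Map each table column to the file column with the same name, if any."""
    available = set(file_columns)
    return {name: name for name in table_columns if name in available}


table_columns = ["id", "name", "price"]
file_columns = ["id", "price", "name"]  # file written with a different order

by_position = resolve_by_position(table_columns, file_columns)
by_name = resolve_by_name(table_columns, file_columns)

print(by_position)  # "name" and "price" are swapped
print(by_name)      # every column maps to its namesake
```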

Reading and Writing Parquet Bloom Filters

Impala can now read and write Parquet Bloom filters as a performance optimization. A Bloom filter gives a fast, memory-efficient way to determine whether the desired data might be present in a file: a negative answer is definitive, while a positive answer can be a false positive. This makes working with Parquet files more efficient.
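To illustrate the general technique, here is a minimal classic Bloom filter. It is a teaching sketch only: the bit-array size, the number of hash functions, and the use of SHA-256 are arbitrary choices for this example and are not the layout or hash function that the Parquet format specifies.

```python
import hashlib


class BloomFilter:
    """Classic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits  # assumed to be a multiple of 8
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        data = str(value).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


bloom = BloomFilter()
for customer_id in ("c1", "c2", "c3"):
    bloom.add(customer_id)

print(bloom.might_contain("c2"))  # True: added values are always reported
```

A reader that gets `False` from `might_contain` can skip the data entirely; a `True` result still requires reading the data to confirm.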

BYTES Function Support

Impala now supports the BYTES() function, which returns the number of bytes in a byte string. Use it to check the size of byte strings in your data.
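Byte length and character length differ for multi-byte characters, which is why a byte-counting function is useful. The Python sketch below illustrates that distinction; it assumes UTF-8 encoded text and is an analogy, not a call to Impala's BYTES() function.

```python
def byte_length(text, encoding="utf-8"):
    """Return the number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


ascii_word = "impala"
accented = "é"  # one character, two bytes in UTF-8

print(len(ascii_word), byte_length(ascii_word))
print(len(accented), byte_length(accented))
```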

Min/Max Filtering in Impala

For tables in Parquet format, Impala can now perform min/max filtering. Impala uses the minimum and maximum values of a column at several levels (partition, row group, page, or row) to filter data, which gives a more granular and targeted way to narrow down the data a query reads.
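One common use of min/max information in columnar formats is to skip chunks of data whose value range cannot match a filter. The sketch below shows that idea at row-group granularity. The statistics dictionary and the `row_groups_to_scan` helper are invented for illustration and do not reflect Impala's implementation.

```python
def row_groups_to_scan(row_group_stats, filter_min, filter_max):
    """Keep row groups whose [min, max] range overlaps the filter range."""
    return [
        name
        for name, (low, high) in row_group_stats.items()
        if high >= filter_min and low <= filter_max
    ]


# Hypothetical per-row-group (min, max) statistics for one column.
stats = {"rg0": (1, 100), "rg1": (101, 200), "rg2": (201, 300)}

selected = row_groups_to_scan(stats, filter_min=150, filter_max=220)
print(selected)  # rg0 is skipped: its maximum (100) is below 150
```

The same overlap test applies at other granularities, such as pages, as long as minimum and maximum values are available at that level.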

DDL Support for Bucketed Tables

This release adds Data Definition Language (DDL) support for bucketed tables, so you can create tables with bucketing to optimize query performance. Using the CLUSTERED BY clause, you divide the data into smaller, more manageable segments (buckets) based on the specified columns, which improves query efficiency and data organization.
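Bucketing in general assigns each row to one of a fixed number of buckets by hashing the bucketing column. The sketch below illustrates that general idea. CRC32 is an arbitrary stable hash chosen for the example and is not the hash function Impala or Hive uses; `bucket_for` and `bucketize` are hypothetical helpers.

```python
import zlib


def bucket_for(key, num_buckets):
    """Assign a key to a bucket using a stable hash (CRC32, for illustration)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, column, num_buckets):
    """Group rows into buckets by the hash of the given column."""
    buckets = {b: [] for b in range(num_buckets)}
    for row in rows:
        buckets[bucket_for(row[column], num_buckets)].append(row)
    return buckets


rows = [{"customer_id": f"c{i}", "amount": i * 10} for i in range(8)]
buckets = bucketize(rows, "customer_id", num_buckets=4)

for bucket_id, members in buckets.items():
    print(bucket_id, [r["customer_id"] for r in members])
```

Because the hash is stable, rows with the same key always land in the same bucket, so a lookup on the bucketing column needs to examine only one bucket.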

Support for Collections of Fixed-Length Types as Non-Passthrough Children of Unions

In this release, Impala introduces support for collections of fixed-length types as non-passthrough children of unions. While plain UNIONs are not yet supported for any collections, UNION ALL operations are fully supported. Users can take advantage of this feature to combine and analyze data efficiently within complex queries.

Example:

select id, int_array from complextypestbl
union all select cast(id as tinyint), int_array from complextypestbl

Support for ORDER BY in Collections of Fixed-Length Types in SELECT List

With this release, Impala now supports collections of fixed-length types in the sorting tuple. Although sorting directly by these collection columns is not permitted, they can be included in the SELECT list alongside other columns by which sorting is applied. This enhancement provides users with greater flexibility in organizing and presenting query results.

Support for Complex Types in SELECT List

In this release, Impala introduces comprehensive support for complex types in the SELECT list. While collections and structs were previously supported, the nesting and mixing of complex types were not. Now, users can leverage the flexibility of embedding complex types into other complex types, providing enhanced versatility in query results. For detailed information and any limitations, refer to the "Allowing Embedding Complex Types into Other Complex Types" section in the Complex types documentation.

Structs in SELECT List with Beeswax

In previous releases, structs in the select list were limited to the HS2 protocol. With this release, the support for structs in the select list is extended to Beeswax as well. Users can now benefit from using structs in the select list when interacting with Beeswax, improving the consistency of functionality across different protocols.

Query Hints for Table Cardinalities

Impala now gives you more control over query planning through query hints for table cardinalities. Previously, Impala relied on simple estimation to compute selectivity, which could deviate significantly from actual values for certain predicates and lead to suboptimal query plans. With the new query hint, 'SELECTIVITY', you can specify selectivity values for predicates, enabling more accurate query planning and better overall performance.
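To see why a selectivity value matters, recall that a planner typically estimates a predicate's output cardinality as the input row count multiplied by the predicate's selectivity. The sketch below shows that arithmetic. The row count and both selectivity values are made-up numbers for illustration, not Impala defaults.

```python
def estimated_rows(table_rows, selectivity):
    """Estimate output cardinality as input rows times predicate selectivity."""
    if not 0.0 <= selectivity <= 1.0:
        raise ValueError("selectivity must be between 0 and 1")
    return round(table_rows * selectivity)


table_rows = 1_000_000

# Hypothetical values: a generic guess versus a user-supplied hint.
generic_guess = estimated_rows(table_rows, 0.1)
with_hint = estimated_rows(table_rows, 0.001)

print(generic_guess, with_hint)
```

A hundredfold difference in the estimate, as in this example, can change decisions such as join order, which is the kind of planning error a hint is meant to correct.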