What's New in Apache Impala

This topic lists new features for Apache Impala in this release of Cloudera Runtime.

Increased Compatibility with Components in Cloudera Data Platform

Impala is integrated with the following components:

Parquet Page Index

To improve performance when using Parquet files, Impala can now write page indexes in Parquet files and use those indexes to skip pages for faster scans.

See Using Parquet in Impala for more information.

The Remote File Handle Cache Supports S3

Impala can now cache remote HDFS file handles for tables that store their data in Amazon S3 cloud storage.

See Impala Scalability Considerations for information on the remote file handle cache.

Support for Kudu Integrated with Hive Metastore

Kudu is integrated with Hive Metastore (HMS), and from Impala, you can create, update, delete, and query the tables in the Kudu services integrated with HMS.

See Using Kudu with Impala for information on using Kudu tables in Impala.

New Compressions Supported for Parquet Files

The following compression codecs are now supported for tables in the Parquet file format:
  • Zstandard (Zstd)

    Zstd is a real-time compression algorithm that trades off compression speed against compression ratio. Compression levels from 1 to 22 are supported. Lower levels compress faster at the cost of compression ratio.

  • Lz4

    Lz4 is a lossless compression algorithm providing extremely fast and scalable compression and decompression.

Data Cache for Remote Reads

You can execute queries faster on multi-cluster HDFS environments and on object store environments as Impala now caches data for non-local reads (e.g. S3, ABFS, ADLS) on local storage.

The data cache is enabled with the --data_cache startup flag.

See Impala Remote Data Cache for information and the steps to enable the remote data cache.

Metadata Performance Improvements

The following features for improving metadata performance are enabled by default in this release:

  • Incremental stats are now compressed in memory in catalogd, reducing its memory footprint.

  • impalad coordinators fetch incremental stats from catalogd on-demand, reducing the memory footprint and the network requirements for broadcasting metadata.

  • Time-based and memory-based automatic invalidation of metadata keeps the size of metadata bounded and reduces the chances of the catalogd cache running out of memory.

  • Automatic invalidation of metadata

    With automatic metadata management enabled, you no longer have to issue INVALIDATE / REFRESH in a number of conditions.

See Impala Metadata Management for information on the above features.
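
As a rough illustration of the time-based and memory-based invalidation described above, the following Python sketch models a metadata cache that drops entries that have not been accessed within a timeout and, when the total size exceeds a limit, evicts the least recently used entries. This is a toy model under stated assumptions, not catalogd's implementation: the class name, the parameters, and the least-recently-used eviction policy are all hypothetical.

```python
from collections import OrderedDict


class MetadataCache:
    """Toy cache with time-based and memory-based invalidation (hypothetical)."""

    def __init__(self, timeout_s, max_bytes):
        self.timeout_s = timeout_s
        self.max_bytes = max_bytes
        # Maps table name -> (size in bytes, last access time); least recent first.
        self._entries = OrderedDict()

    def access(self, table, size_bytes, now):
        """Load or touch a table's metadata at time `now` (in seconds)."""
        self._entries.pop(table, None)
        self._entries[table] = (size_bytes, now)  # most recently used goes last
        self.invalidate(now)

    def invalidate(self, now):
        """Drop stale entries, then evict least recently used ones until within budget."""
        # Time-based: remove tables not accessed within the timeout.
        for table, (_, last_used) in list(self._entries.items()):
            if now - last_used > self.timeout_s:
                del self._entries[table]
        # Memory-based: evict least recently used tables while over the limit,
        # always keeping the most recently used entry.
        while self.total_bytes() > self.max_bytes and len(self._entries) > 1:
            self._entries.popitem(last=False)

    def total_bytes(self):
        return sum(size for size, _ in self._entries.values())

    def tables(self):
        return list(self._entries)
```

In this model, both limits are checked on every access, which keeps the cache bounded without any explicit invalidation call from the user.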

Scalable Pool Configuration in Admission Controller

To offer more dynamic and flexible resource management, Impala supports new configuration parameters that scale with the number of executors. You can use the parameters to control the number of running queries, the number of queued queries, and the maximum amount of memory allocated for Impala resource pools.

See Impala Admission Control for information about the new parameters and how to use them for admission control.
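
To make the idea of limits that scale with the number of executors concrete, here is a small Python sketch. It assumes that each limit is configured as a per-executor multiple and that the effective limit is that multiple times the current number of executors, rounded up for query counts. The class, function, and field names are hypothetical and are not Impala's actual configuration keys.

```python
import math
from dataclasses import dataclass


@dataclass
class PoolMultiples:
    # Hypothetical per-executor settings for one resource pool.
    running_queries_per_executor: float
    queued_queries_per_executor: float
    memory_bytes_per_executor: int


def effective_limits(multiples, num_executors):
    """Scale the per-executor settings by the current number of executors."""
    return {
        "max_running_queries": math.ceil(
            multiples.running_queries_per_executor * num_executors
        ),
        "max_queued_queries": math.ceil(
            multiples.queued_queries_per_executor * num_executors
        ),
        "max_memory_bytes": multiples.memory_bytes_per_executor * num_executors,
    }
```

With settings of this kind, growing or shrinking the cluster changes the pool limits automatically, so fixed values do not have to be retuned by hand.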

Query Profile

The following metrics were added to the Query Profile output for better monitoring and troubleshooting of query performance:

  • Network I/O throughput
  • System disk I/O throughput

See Impala Query Profile for information on generating and reading query profiles.

DATE Data Type and Functions

You can use the new DATE data type to describe particular year/month/day values.

This initial DATE type supports the Text, Parquet, and HBase file formats.

Most of the built-in functions for TIMESTAMP now also accept DATE type arguments.

Support for the DATE data type includes the following features:

  • DATE type column as a partitioning key column
  • DATE literal

  • Implicit casting between DATE and other types, namely, STRING and TIMESTAMP

See DATE Data Type and Impala Date and Time Functions for information on using the DATE type.

Support for Hive Insert-Only Transactional Tables

Impala now supports creating, dropping, querying, and inserting into insert-only transactional tables.

Use Hive compaction to merge small files and improve the performance and scalability of metadata in transactional tables.

See Impala Transactions for more information.

HiveServer2 HTTP Connection for Clients

Client applications can now connect to Impala over HTTP via HiveServer2, with the option to use Kerberos SPNEGO and LDAP for authentication. See Impala Clients for details.

Default File Format Changed to Parquet

When you create a table without the STORED AS clause, the default file format for the table data is now Parquet.

For backward compatibility, you can use the DEFAULT_FILE_FORMAT query option to set the default file format to the previous default, text, or to another format.

Built-in Function to Process JSON Objects

The GET_JSON_OBJECT() function extracts a JSON object from a string based on the specified path and returns the extracted JSON object.

See Impala Miscellaneous Functions.
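
The path-based extraction described above can be sketched in Python as follows. This is a simplified model for illustration only, not Impala's implementation: it assumes a path that starts with `$` followed by `.key` and `[index]` steps, and it serializes every result as JSON text. The path syntax and return formatting of the real function may differ, and the Python function name here is only borrowed for readability.

```python
import json
import re

# One path step: either ".key" or "[index]".
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def get_json_object(json_str, path):
    """Return the value at `path` as a JSON string, or None if it cannot be found."""
    if not path.startswith("$"):
        return None
    try:
        value = json.loads(json_str)
    except ValueError:
        return None  # input is not valid JSON
    pos = 1  # skip the leading "$"
    while pos < len(path):
        step = _STEP.match(path, pos)
        if step is None:
            return None  # unsupported or malformed path
        key, index = step.group(1), step.group(2)
        if key is not None:
            # ".key" step: the current value must be an object containing the key.
            if not isinstance(value, dict) or key not in value:
                return None
            value = value[key]
        else:
            # "[index]" step: the current value must be an array long enough.
            if not isinstance(value, list) or int(index) >= len(value):
                return None
            value = value[int(index)]
        pos = step.end()
    return json.dumps(value)
```

For example, with the input `{"a": {"b": [10, 20]}}` and the path `$.a.b[1]`, this sketch walks into `a`, then `b`, then the second array element.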

Graceful Shutdown of Impala Daemons

You can perform a graceful shutdown of Impala daemons in Cloudera Manager.

When you initiate a shutdown process for an Impala daemon, the daemon notifies other Impala daemons that it is shutting down, waits for a grace period, and then shuts itself down once no more queries or fragments are executing on it or when the configurable deadline is reached.

See Graceful Shutdown for the steps.
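
The shutdown sequence described above can be summarized as a small decision function. The Python sketch below is an illustrative model only: the names are hypothetical, and it assumes that both the grace period and the deadline are measured from the moment the shutdown is initiated.

```python
from dataclasses import dataclass


@dataclass
class ShutdownState:
    started_at: float      # time the shutdown was initiated (seconds)
    grace_period_s: float  # time for other daemons to learn about the shutdown
    deadline_s: float      # maximum total time before shutting down regardless


def can_shut_down(state, now, running_fragments):
    """Return True if the daemon may exit at time `now`."""
    elapsed = now - state.started_at
    if elapsed >= state.deadline_s:
        return True   # deadline reached: stop even if work is still running
    if elapsed < state.grace_period_s:
        return False  # still in the grace period
    return running_fragments == 0  # after the grace period, wait for work to drain
```

In this model, a daemon with no running work still waits out the grace period, and a daemon with long-running work exits no later than the deadline.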

Object Ownership Support

Object ownership for tables, views, and databases is enabled by default in Impala. When you create a database, a table, or a view, as the owner of that object, you implicitly have the privileges on the object. The privileges that owners have are specified in Ranger on the special user, {OWNER}.

The {OWNER} user must be defined in Ranger for the object ownership privileges to work in Impala.

See Authorization for details.