What's New in Apache Impala
This topic lists new features for Apache Impala in this release of Cloudera Runtime.
Increased Compatibility with Components in Cloudera Data Platform
Impala is integrated with the following components:
- Apache Ranger: Use Apache Ranger to manage authorization in Impala. See Impala Authorization for details.
- Apache Atlas: Use Apache Atlas to manage data governance in Impala. See Atlas Metadata Collection Overview for details.
- Hive 3
Parquet Page Index
To improve performance when reading Parquet files, Impala can now write page indexes in Parquet files and use those indexes to skip pages for faster scans.
See Using Parquet in Impala for more information.
New Compression Supported for Parquet Files
Impala now supports LZ4 compression for tables in the Parquet file format.
LZ4 is a lossless compression algorithm that provides extremely fast and scalable compression and decompression.
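As a sketch of how this might be used, the Parquet codec is chosen through the COMPRESSION_CODEC query option; the table and column names below are illustrative:

```sql
-- Write subsequent Parquet tables in this session with LZ4 compression
SET COMPRESSION_CODEC=LZ4;

-- Illustrative table: data files are written as LZ4-compressed Parquet
CREATE TABLE sales_lz4 STORED AS PARQUET
AS SELECT * FROM sales;
```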
Default File Format Changed to Parquet
When you create a table, the default file format for that table's data is now Parquet unless the STORED AS clause specifies a different format. For backward compatibility, you can use the DEFAULT_FILE_FORMAT query option to set the default file format back to the previous default, text, or to other formats.
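For example, a session can restore the old text default before creating tables; the table definitions below are illustrative:

```sql
-- Restore the previous default of text for this session
SET DEFAULT_FILE_FORMAT=TEXT;
CREATE TABLE t_text (id INT, name STRING);   -- created as a text table

-- Without the option, the same statement now produces a Parquet table;
-- an explicit STORED AS clause always wins:
CREATE TABLE t_explicit (id INT, name STRING) STORED AS TEXTFILE;
```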
Metadata Performance Improvements
The following features for improving metadata performance are enabled by default in this release:
- Incremental stats are now compressed in memory in catalogd, reducing the memory footprint in catalogd.
- impalad coordinators fetch incremental stats from catalogd on demand, reducing the memory footprint and the network requirements for broadcasting metadata.
- Time-based and memory-based automatic invalidation of metadata keeps the size of metadata bounded and reduces the chance of the catalogd cache running out of memory.
- Automatic invalidation of metadata: with automatic metadata management enabled, you no longer have to issue INVALIDATE/REFRESH statements in a number of situations.
See Impala Metadata Management for information on the above features.
Support for Kudu Integrated with Hive Metastore
Kudu is now integrated with the Hive Metastore (HMS). From Impala, you can create, update, delete, and query tables in the Kudu services that are integrated with HMS.
See Using Kudu with Impala for information on using Kudu tables in Impala.
DATE Data Type and Functions
You can use the new DATE data type to describe particular year/month/day values. The initial DATE type supports the Text, Parquet, Avro, and HBase file formats. Most of the built-in functions for TIMESTAMP now accept DATE type arguments as well.
Support for the DATE data type includes the following features:
- DATE type columns as partitioning key columns
- DATE literals
- Implicit casting between DATE and other types, namely STRING and TIMESTAMP
See DATE Data Type and Impala Date and Time Functions for using the DATE type.
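A minimal sketch of these capabilities; the table, column, and partition values are illustrative:

```sql
-- DATE column used as a partitioning key
CREATE TABLE events (id BIGINT, payload STRING)
PARTITIONED BY (event_date DATE)
STORED AS PARQUET;

-- DATE literal
INSERT INTO events PARTITION (event_date = DATE '2019-12-01')
VALUES (1, 'first event');

-- Implicit cast from STRING to DATE in the comparison
SELECT count(*) FROM events WHERE event_date = '2019-12-01';
```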
Support for Hive Insert-Only Transactional Tables
Impala now supports creating, dropping, truncating, querying, and inserting into insert-only transactional tables.
By default, Impala creates an insert-only transactional table when you issue the CREATE TABLE statement.
Use Hive compaction to compact small files to improve the performance and scalability of metadata in transactional tables.
See Impala Transactions for more information.
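Assuming the usual Hive table properties for insert-only transactional tables, an explicit definition might look like this sketch (table name is illustrative):

```sql
-- Illustrative insert-only transactional table; per this release,
-- a plain CREATE TABLE produces one by default.
CREATE TABLE txn_demo (id INT, val STRING)
TBLPROPERTIES ('transactional'='true',
               'transactional_properties'='insert_only');

INSERT INTO txn_demo VALUES (1, 'a'), (2, 'b');
```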
Server-side Spooling of Query Results
You can use the SPOOL_QUERY_RESULTS query option to control how query results are returned to the client.
By default, query results are produced in batches as the client fetches them, until all the result rows have been returned. If a client issues a query without fetching all the results, the query fragments continue to hold on to their resources until the query is canceled and unregistered, potentially tying up resources and causing other queries to wait in admission control.
When the query result spooling feature is enabled, the result sets of queries are eagerly fetched and buffered until they are read by the client, and resources are freed up for other queries.
See Spooling Impala Query Results for the new feature and the query options.
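A minimal sketch of enabling the feature for a session; the memory-bound option shown is an assumption based on the related spooling query options, and the value is illustrative:

```sql
-- Buffer all result rows on the Impala side until the client fetches them
SET SPOOL_QUERY_RESULTS=TRUE;

-- Optionally bound the memory used for buffered results (128 MB here)
SET MAX_RESULT_SPOOLING_MEM=134217728;
```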
HiveServer2 HTTP Connection for Clients
Client applications can now connect to Impala over HTTP via HiveServer2, with the option to use Kerberos SPNEGO and LDAP for authentication. See Impala Clients for details.
Cookie-based Authentication
Starting in this version, Impala supports cookies for authentication when clients connect via HiveServer2 over HTTP.
You can use the --max_cookie_lifetime_s startup flag to:
- Disable the use of cookies
- Control how long generated cookies are valid
See Impala Clients for more information.
Support of Kerberos Authentication in Impala Web UI
Starting in this release, you can configure Kerberos authentication in Cloudera Manager to secure the debug Web UI pages for Impala Daemon, Catalog Server, and StateStore.
See Configuring Impala Web UI for the steps to enable Kerberos authentication for the Impala Web UI pages.
Object Ownership Support
Object ownership for tables, views, and databases is enabled by default in Impala. When you create a database, a table, or a view, as the owner of that object, you implicitly have privileges on the object. The privileges that owners have are specified in Ranger on the special user, {OWNER}.
The {OWNER} user must be defined in Ranger for the object ownership privileges to work in Impala.
See Authorization for details.
Graceful Shutdown of Impala Daemons
You can perform a graceful shutdown of Impala Daemons in Cloudera Manager.
When you initiate a shutdown process for an Impala Daemon, the Impala Daemon notifies other Impala Daemons that it is shutting down, waits for a grace period, then shuts itself down once no more queries or fragments are executing on that daemon or when the configurable deadline is reached.
See Graceful Shutdown for the steps.
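A graceful shutdown can also be initiated from impala-shell with the :shutdown() command; the hostname below is illustrative:

```sql
-- Shut down the daemon you are connected to
:shutdown();

-- Or target a specific daemon (hostname is illustrative)
:shutdown('impalad-host-1.example.com');
```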
New Built-in Functions for Fuzzy Matching of Strings
Use the new Jaro or Jaro-Winkler functions to perform fuzzy matches on relatively short strings, for example, to match user-entered names against the records in a database.
- JARO_DISTANCE, JARO_DST
- JARO_SIMILARITY, JARO_SIM
- JARO_WINKLER_DISTANCE, JW_DST
- JARO_WINKLER_SIMILARITY, JW_SIM
See Impala String Functions for details.
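For example, a fuzzy lookup of a user-entered name could rank candidates by similarity; the table and column names are illustrative:

```sql
SELECT name,
       jaro_winkler_similarity(name, 'John Smith') AS sim
FROM customers
ORDER BY sim DESC
LIMIT 5;
```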
New Built-in Function to Process JSON Objects
The GET_JSON_OBJECT() function extracts a JSON object from a string, based on the path specified as an input argument, and returns the extracted JSON object.
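A small sketch of the function with a JSONPath-style selector:

```sql
-- Extract a field from a JSON string; returns the string 'Impala'
SELECT GET_JSON_OBJECT('{"name":"Impala","tags":["sql","mpp"]}', '$.name');

-- Array elements can be addressed by index
SELECT GET_JSON_OBJECT('{"name":"Impala","tags":["sql","mpp"]}', '$.tags[0]');
```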
Capacity Quota for Scratch Disks
When configuring scratch space for intermediate files used in large sorts, joins, aggregations, or analytic function operations, use the --scratch_dirs startup flag to optionally specify a capacity quota per scratch directory, e.g., --scratch_dirs=/dir1:5MB,/dir2.
See Impala and HDFS for details.
S3 Supported for Remote File Handle Cache
Impala can now cache remote HDFS file handles for tables that store their data in Amazon S3 cloud storage.
See Impala Scalability Considerations for information on the remote file handle cache.
Data Cache for Remote Reads
You can execute queries faster in multi-cluster HDFS environments and in object store environments because Impala now caches data for non-local reads (e.g. S3, ABFS, ADLS) on local storage.
The data cache is enabled with the --data_cache startup flag.
See Impala Remote Data Cache for the information and steps to enable remote data cache.
Query Option for Disabling HBase Row Estimation
During query plan generation, Impala samples underlying HBase tables to estimate row count and row size, but the sampling process can negatively impact planning time. To alleviate the issue, when the HBase table stats do not change much in a short time, disable the sampling with the DISABLE_HBASE_NUM_ROWS_ESTIMATE query option so that the Impala planner falls back to using Hive Metastore (HMS) table stats instead.
See Impala Query Options.
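For example, to skip the sampling for a session:

```sql
-- Plan HBase queries from HMS table stats instead of runtime sampling
SET DISABLE_HBASE_NUM_ROWS_ESTIMATE=TRUE;
```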
Query Option for Controlling Size of Parquet Splits on Non-block Stores
To optimize query performance, the Impala planner previously used the value of the fs.s3a.block.size configuration setting when calculating the split size on non-block-based stores, e.g. S3, ADLS, etc. Starting in this release, the Impala planner uses the PARQUET_OBJECT_STORE_SPLIT_SIZE query option to get the Parquet file format specific split size.
For Parquet files, the fs.s3a.block.size setting is no longer used.
The default value of the PARQUET_OBJECT_STORE_SPLIT_SIZE query option is 256 MB.
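For example, to halve the split size for a session (the value is given in bytes, an assumption based on the 256 MB default; 128 MB shown):

```sql
-- Use 128 MB Parquet splits on object stores such as S3 or ADLS
SET PARQUET_OBJECT_STORE_SPLIT_SIZE=134217728;
```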
See Impala with Amazon S3 for tuning Impala query performance for S3.
Scalable Pool Configuration in Admission Controller
To offer more dynamic and flexible resource management, Impala supports new configuration parameters that scale with the number of executors. You can use these parameters to control the number of running queries, the number of queued queries, and the maximum amount of memory allocated for Impala resource pools.
See Impala Admission Control for the information about the new parameters and using them for admission control.
Query Profile
The following metrics were added to the Query Profile output for better monitoring and troubleshooting of query performance.
- Network I/O throughput
- System disk I/O throughput
See Impala Query Profile for generating and reading query profiles.
Query Profile Exported to JSON
On the Query Details page of the Impala Daemon Web UI, you have a new option, in addition to the existing Thrift and Text formats, to export the query profile output in JSON format.
See Impala Daemon Web UI for generating JSON query profile outputs in the Web UI.