What's New in Apache Impala
This topic lists new features for Apache Impala in this release of Cloudera Runtime.
Server-side Spooling of Query Results
You can use the SPOOL_QUERY_RESULTS query option to control how query results are returned to the client.
By default, query results are produced in batches as the client fetches them, until all the result rows have been returned. If a client issues a query without fetching all the results, the query fragments continue to hold their resources until the query is canceled and unregistered, potentially tying up resources and causing other queries to wait in admission control.
When query result spooling is enabled, result sets are eagerly fetched and buffered until the client reads them, and the query's resources are freed and made available to other queries.
See Spooling Impala Query Results for the new feature and the query options.
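As a minimal sketch of turning the feature on for one session, the statements below could be run from impala-shell. The table name web_logs is hypothetical, and the example assumes the option accepts true or false like other boolean query options.

```sql
-- Enable server-side result spooling for the current session.
SET SPOOL_QUERY_RESULTS=true;

-- With spooling on, the rows of this query are buffered until the client
-- fetches them. web_logs is a hypothetical table used only for illustration.
SELECT * FROM web_logs LIMIT 1000;
```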
New Built-in Functions for Fuzzy Matching of Strings
Use the new Jaro or Jaro-Winkler functions to perform fuzzy matches on relatively short strings, for example, to check user-entered names against the records in a database.
- JARO_DISTANCE, JARO_DST
- JARO_SIMILARITY, JARO_SIM
- JARO_WINKLER_DISTANCE, JW_DST
- JARO_WINKLER_SIMILARITY, JW_SIM
See Impala String Functions for details.
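The queries below are an illustrative sketch of how the functions listed above might be called with two string arguments. The customers table, its columns, and the 0.9 threshold are hypothetical and chosen only for the example.

```sql
-- Similarity scores: higher values mean a closer match.
SELECT JARO_SIMILARITY('martha', 'marhta')         AS jaro_sim,
       JARO_WINKLER_SIMILARITY('martha', 'marhta') AS jw_sim;

-- Each distance is the complement of the corresponding similarity
-- (1 - similarity), so lower values mean a closer match.
SELECT JARO_DISTANCE('martha', 'marhta')           AS jaro_dst,
       JARO_WINKLER_DISTANCE('martha', 'marhta')   AS jw_dst;

-- Hypothetical use: find stored names that closely match a user-entered name.
-- JW_SIM is the short alias of JARO_WINKLER_SIMILARITY.
SELECT id, name
FROM customers
WHERE JW_SIM(name, 'Jon Smith') > 0.9;
```

A suitable threshold depends on the data. The Jaro-Winkler variants give extra weight to strings that share a common prefix, which often suits personal names.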
Query Profile Exported to JSON
On the Query Details page of the Impala Daemon Web UI, you can now export the query profile in JSON format, in addition to the existing Thrift and Text formats.
See Impala Daemon Web UI for how to generate JSON query profile output in the Web UI.
DATE Data Type Support for Avro
You can now use the DATE data type to query date values from Avro tables.
See DATE Data Type and Using the Avro File Format with Impala Tables for details.
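A minimal sketch of such a query follows. It assumes an Avro table named events_avro already exists (for example, one created and populated through Hive) with a column event_date of type DATE. Both names are hypothetical.

```sql
-- Filter an Avro-backed table on a DATE column using a DATE literal.
-- events_avro and event_date are hypothetical names.
SELECT id, event_date
FROM events_avro
WHERE event_date >= DATE '2020-01-01'
ORDER BY event_date;
```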
Cookie-based Authentication
Starting in this version, Impala supports cookies for authentication when clients connect via HiveServer2 over HTTP.
You can use the --max_cookie_lifetime_s startup flag to:
- Disable the use of cookies
- Control how long generated cookies are valid for
See Impala Clients for more information.
Capacity Quota for Scratch Disks
When configuring scratch space for intermediate files used in large sorts, joins, aggregations, or analytic function operations, use the --scratch_dirs startup flag to optionally specify a capacity quota per scratch directory, e.g., --scratch_dirs=/dir1:5MB,/dir2.
See Impala and HDFS for details.
TRUNCATE for Insert-only Transactional Tables
You can now truncate insert-only transactional tables in Impala with the TRUNCATE statement.
See Impala Transactions for more information on transactional tables.
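The following sketch shows the statement applied to an insert-only transactional table. The table name and columns are hypothetical, and the table properties shown are the usual way to mark a table as insert-only transactional.

```sql
-- Create a hypothetical insert-only transactional table.
CREATE TABLE sales_staging (id INT, note STRING)
TBLPROPERTIES ('transactional'='true',
               'transactional_properties'='insert_only');

INSERT INTO sales_staging VALUES (1, 'first'), (2, 'second');

-- Remove all rows while keeping the table definition.
TRUNCATE TABLE sales_staging;
```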
Query Option for Disabling HBase Row Estimation
During query plan generation, Impala samples the underlying HBase tables to estimate row count and row size, but the sampling can increase planning time. When the HBase table stats do not change much over a short period, you can avoid this cost by disabling the sampling with the DISABLE_HBASE_NUM_ROWS_ESTIMATE query option, so that the Impala planner falls back to using Hive Metastore (HMS) table stats instead.
See Impala Query Options.
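As a sketch, the option could be set for a session before planning a query against an HBase-backed table. The table name hbase_orders and its column are hypothetical, and the example assumes the option accepts true or false like other boolean query options.

```sql
-- Skip HBase sampling during planning; the planner then relies on
-- the table stats stored in the Hive Metastore.
SET DISABLE_HBASE_NUM_ROWS_ESTIMATE=true;

-- Inspect the plan for a query on a hypothetical HBase-backed table.
EXPLAIN SELECT COUNT(*) FROM hbase_orders WHERE order_status = 'OPEN';
```

Because the planner then depends on HMS stats, this setting is most useful when those stats are reasonably up to date.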
Query Option for Controlling Size of Parquet Splits on Non-block Stores
To optimize query performance, the Impala planner uses the value of the fs.s3a.block.size startup flag when calculating the split size on non-block-based stores, such as S3 and ADLS. Starting in this release, the Impala planner uses the PARQUET_OBJECT_STORE_SPLIT_SIZE query option to get the split size specific to the Parquet file format.
For Parquet files, the fs.s3a.block.size startup flag is no longer used.
The default value of the PARQUET_OBJECT_STORE_SPLIT_SIZE query option is 256 MB.
See Impala with Amazon S3 for tuning Impala query performance for S3.
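The sketch below overrides the default split size for one session. It assumes the option takes a value in bytes. The 128 MB figure and the table name s3_parquet_events are hypothetical choices for illustration, not tuning recommendations.

```sql
-- Use 128 MB splits (128 * 1024 * 1024 bytes) for Parquet files
-- on object stores in this session. Assumes the value is given in bytes.
SET PARQUET_OBJECT_STORE_SPLIT_SIZE=134217728;

-- Query a hypothetical Parquet table stored on S3.
SELECT COUNT(*) FROM s3_parquet_events;
```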
Support of Kerberos Authentication in Impala Web UI
Starting in this release, you can configure Kerberos authentication in Cloudera Manager to secure the debug Web UI pages for Impala Daemon, Catalog Server, and StateStore.
See Configuring Impala Web UI for the steps to enable Kerberos authentication for the Impala Web UI pages.