What's New in Apache Impala

This release of Impala contains the following changes and enhancements from previous releases.

Continue reading:

New Features in CDH 5.16.x

The following are some of the most significant new features in this release.

Fine Grained Privileges

Sentry and Impala introduced fine grained privileges to provide object-level privileges to roles.

Fine grained privileges include the REFRESH and CREATE privileges, which allow users to create databases and tables, and to execute commands that update metadata information on Impala databases and tables. See Impala Sentry documentation for the new privileges and the scopes of the objects that you can grant the new privileges on.

The following new privileges were added:

  • The REFRESH privilege
  • The CREATE privilege
  • The SELECT and INSERT privileges on SERVER

If a role has SELECT or INSERT privilege on an object in Impala before upgrading to CDH 5.16.1, that role will automatically get the REFRESH privilege during the upgrade.

Object Ownership

Object ownership designates an owner for a database, table, or view in Sentry. The owner of an object has the OWNER privilege which is the equivalent of the ALL privilege on the object. See the Object Ownership documentation for information about enabling object ownership.

If the object ownership feature is enabled, Sentry grants the user the OWNER privilege. Whether or not object ownership is enabled, HMS stores the object creator as the default object owner. Previously, HMS stored the Kerberos user as the object owner.

The following statements were added to Impala to support object ownership via Sentry:

  • ALTER DATABASE SET OWNER
  • ALTER TABLE SET OWNER
  • ALTER VIEW SET OWNER
  • SHOW GRANT USER

Admission Control Enhancement

A new query option, MAX_MEM_ESTIMATE_FOR_ADMISSION, was added. Use the new option to set an upper limit on the memory estimates of a query as a workaround for over-estimates precluding a query from being admitted.

General Performance Improvements

A new query option, SHUFFLE_DISTINCT_EXPRS, controls the shuffling behavior when a query has both grouping and distinct expressions.

Metadata Performance Improvements

  • Incremental Stats

    The following enhancements improve Impala stability. The features reduce chances of having catalogd and impalad crash due to be out of memory when using incremental stats.

    • Incremental stats are now compressed in memory in catalogd, reducing memory footprint in catalogd.
    • Incremental stats are fetched on demand from catalogd by impalad coordinators. This enhancement reduces memory footprint for impalad coordinators and statestored and also reduces network requirements to broadcast metadata.

    See Loading Incremental Statistics from Catalogd for details.

  • Automatic Invalidation of Metadata

    To keep the size of metadata bounded and to reduce the chances of catalogd cache running out of memory, this release introduces an automatic metadata invalidation feature with time-based and memory-based invalidation.

    Automatic invalidation of metadata provides more stability with lower chances of running out of memory, but could potentially cause performance risks. The feature is turned off by default.

    See Startup Options for Automatic Invalidation of Metadata for details.

Compatibility and Usability Enhancements

  • Additional separators are supported between date and time in default TIMESTAMP format, specifically, the multi-space separator and the 'T' separator. See TIMESTAMP Data Type for more information on TIMESTAMP format.

  • New hint placement is supported for INSERT statements. See Optimizer Hints in Impala for detail.

  • The REGEX_ESCAPE() function was implemented for escaping special characters to treat them literally in string literals.

  • SHOW CREATE VIEW was implemented with the same functionality as SHOW CREATE TABLE.

  • Impala Shell can connect directly to impalad when configured with proxy load balancer and Kerberos. See Configuration Options for the new flag that enables the direct connection.

New Features in CDH 5.15.x

The following are some of the most significant new features in this release:

  • Added the TABLESAMPLE clause in the COMPUTE STATS statement. See COMPUTE STATS Statement for the new TABLESAMPLE clause.
  • Extended COMPUTE STATS to support a list of columns. See COMPUTE STATS Statement for the new syntax.
  • Added the new COMPUTE_STATS_MIN_SAMPLE_SIZE query option. The query option specifies the minimum number of bytes that will be scanned in COMPUTE STATS TABLESAMPLE, regardless of the user-supplied sampling percent. See COMPUTE_STATS_MIN_SAMPLE_SIZE Query Option.
  • Added a TBLPROPERTY for controlling stats extrapolation on a per-table basis: impala.enable.stats.extrapolation=true/false. See Table and Column Statistics for information about stats extrapolation.
  • Added the new built-in regex_escape function. The function instructs Impala to interpret the following special characters literally rather than as special characters: .\+*?[^]$(){}=!<>|:-

    See Impala String Functions for information about the regex_escape function.

  • Enhanced the existing ltrim and rtrim string functions to accept an argument that specifies a set of characters to be trimmed from the input string. See STRING Data Type for information about the functions.
  • Implemented murmur_hash function. See Impala Mathematical Functions for information about the new function.
  • Introduced the support for the Kudu DECIMAL type in Kudu 1.7.0.
  • Now Impala maps a signed integer logical type in Parquet to a supported Impala column type as below:
    • INT_8 -> TINYINT
    • INT_16 -> SMALLINT
    • INT_32 -> INT
    • INT_64 -> BIGINT
  • Parquet dictionary filtering now works on nested data.
  • Based on the existing Parquet column chunk level statistics null_count, Impala's Parquet scanner was enhanced to skip an entire row group if the null_count statistics indicate that all the values under the predicated column are NULL as no result rows would be returned from that row group.
  • The Oracle-style hint placement for INSERT statements is now supported. See Optimizer Hints in Impala for information on hints in Impala SQL.

  • Insert plan hints for CREATE TABLE AS SELECT are now supported. See Optimizer Hints in Impala for information on hints in Impala SQL.
  • Improved concurrency of DDL and DML operations during catalog updates.
  • The statestore update logic was improved to reduce issues, such as too many queries being admitted by different coordinators, or queries being queued for longer than necessary and blocking subsequent updates to different topics.
  • The size limit for statestore updates was increased, and copying of the metadata and reduce the memory footprint were reduced. Now the catalog objects are passed and (de)compressed between FE and BE one at a time.

New Features in CDH 5.14.x

The following are some of the most significant new features in this release:

  • New Boolean operators: IS TRUE, IS NOT TRUE, IS FALSE, IS NOT FALSE, IS UNKNOWN, and IS NOT UNKNOWN. These operators always return TRUE or FALSE, even if the left-hand expression evaluates to NULL. IS [NOT] TRUE and IS [NOT] FALSE are NULL-safe equivalents for equality and inequality comparisons. IS [NOT] UNKNOWN is identical to IS [NOT] NULL.

  • The new built-in function DATE_TRUNC() lets you truncate a TIMESTAMP value to a given precision. For example, calling date_trunc('hour',ts) truncates ts to the beginning of the corresponding hour, with all minutes, seconds, milliseconds, and so on set to zero.

  • The impala-shell command now accepts a command-line argument --query_option=option=value to set a query option for that shell session. You can specify multiple query options by supplying multiple instances of --query_option in the impala-shell invocation.

    Configuration files for impala-shell, such as .impalarc, can also contain a section beginning with [impala.query_options], followed by lines of the form query_option=value. These query options take effect by default for all impala-shell sessions that use the specified configuration file.

  • The SET statement, when run without any argument, now divides its output into groups corresponding to "regular", "advanced", "development" (unfinished, not ready for general use), and "deprecated" (subject to removal). By default, the "development" and "deprecated" option groups are not shown in SET output. Issue the statement SET ALL to see the options and values from all groups.

  • Impala optimizes some join queries against Kudu tables by constructing runtime filters representing the minimum and maximum matching values from the join columns. The min-max information is "pushed down" to Kudu so that Kudu can prune unnecessary rows before sending the result set back to Impala.

  • The allowed range for the BATCH_SIZE query option is now 0 to 65536. Formerly, the upper bound was undefined.

  • Arithmetic expressions involving both DECIMAL and FLOAT or DOUBLE arguments now produce results of DECIMAL type instead of DOUBLE.

New Features in CDH 5.13.x

The following are some of the most significant new features in this release:

  • Improvements to memory management through the use of a buffer pool. This mechanism allows queries to use less memory, reserves the required memory during query startup, and reduces the frequency of out-of-memory errors. It makes query planning and memory estimation more accurate, so that if a query begins executing it is unlikely to encounter an out-of-memory error partway through. The memory buffer used during spill-to-disk processing is smaller: instead of 8 MiB, this buffer defaults to 2 MiB and Impala can reduce it to as little as 64 KiB when appropriate.

    This feature includes new query options for fine-tuning memory areas used during query processing: MIN_SPILLABLE_BUFFER_SIZE, DEFAULT_SPILLABLE_BUFFER_SIZE, MAX_ROW_SIZE, and BUFFER_POOL_LIMIT.

  • Improvements to the mechanism for caching HDFS file handles. This caching mechanism improves the performance and scalability of queries that access the same file multiple times, for example to retrieve different columns from a Parquet file. Caching the file handle across open() calls reduces the load on the HDFS NameNode.

    This feature is currently disabled by default. It is enabled by setting a non-zero value for the max_cached_file_handles configuration setting. Currently, ETL processes that append to existing HDFS files or overwrite HDFS files in place can interact with this setting in a way that turns off short-circuit reads for some impalad hosts. See HDFS-12528 for tracking information.

  • A new command in impala-shell, rerun or its abbreviation @, lets you re-execute previous commands based on their numbering in the history output.

  • You can specify the minimum required TLS/SSL version using the --ssl_minimum_version setting, for example --ssl_minimum_version=tlsv1.2.

  • You can specify the set of allowed TLS ciphers using the --ssl_cipher_list configuration setting. See the output of man ciphers for the full set of keywords and notation allowed in the argument string.

  • New or enhanced built-in functions:

    • trunc() can now apply to numeric types (FLOAT, DOUBLE, and DECIMAL) in addition to TIMESTAMP. Although this functionality was already available through the truncate() function, the new signatures for trunc() make it easier to port code from other popular database systems to Impala.

    • A new date/time function utc_timestamp() provides a simple way to get a stable, interoperable representation of a TIMESTAMP value without using a chain of functions to convert between representations and apply a specific timezone.

  • The CREATE TABLE LIKE PARQUET statement can now handle Parquet files produced outside of Impala and containing ENUM types. The ENUM columns become STRING columns in the target table, and the ENUM values are turned into corresponding STRING values.

  • Kudu enhancements:

    • You can now create a Kudu table without using a PARTITION BY clause. Kudu automatically creates a single partition to cover the entire possible range of values. This feature is intended for small lookup tables, where the overhead of partitioning does not make sense for tables that typically have a full table scan for each query.

    • More granular Sentry authorization for Kudu tables. Kudu tables can now use column-level privileges. The SELECT and INSERT statements can now use the corresponding SELECT and INSERT privileges. Other Kudu statements still require the ALL privilege.

    • The ALTER TABLE statement can modify a number of storage attributes for the columns of Kudu tables. You can use the ALTER COLUMN clause of ALTER TABLE along with the SET keyword to change the properties DEFAULT, BLOCK_SIZE, ENCODING, and COMPRESSION. You can use the DROP DEFAULT clause to remove the default value from a column.

  • For non-Kudu tables, you can use the ALTER TABLE syntax ALTER COLUMN col SET COMMENT 'text' to change the comment for an individual column.

New Features in CDH 5.12.x

The following are some of the most significant new features in this release:

  • Impala can now read and write data stored on the Microsoft Azure Data Lake Store (ADLS). See Using Impala with the Azure Data Lake Store (ADLS) for details.

  • New built-in functions:

    • A new string function, replace(), which is faster than regexp_replace() for simple string substitutions. See Impala String Functions for details.

    • A new conditional function, nvl2(), which offers more flexibility than the nvl() function. It lets you return one value for NOT NULL arguments, and a different value for NULL arguments. See Impala Conditional Functions for details.

    • A new date/time function, last_day(), for finding the date corresponding to the last day of a particular month. See Impala Date and Time Functions for details.

  • New syntax, REFRESH FUNCTIONS db_name, lets Impala recognize newly added functions, such as UDFs created through Hive. Impala scans the metadata for a specified database to locate the new functions, which is faster and more convenient than doing a full INVALIDATE METADATA operation.

  • The new TABLESAMPLE clause for the SELECT statement lets you specify a percentage of the overall data volume within a table to process during a query.

  • Startup flags for the impalad daemon, is_executor and is_coordinator, let you divide the work on a large, busy cluster between a small number of hosts acting as query coordinators, and a larger number of hosts acting as query executors. By default, each host can act in both roles, potentially introducing bottlenecks during heavily concurrent workloads. See How to Configure Impala with Dedicated Coordinators for details.

  • A new query option, DEFAULT_JOIN_DISTRIBUTION_MODE, lets you change the default assumption about how join queries should handle tables with no statistics. This can help to avoid out-of-memory conditions for join queries, without manual tuning to add the /* +SHUFFLE */ hint for queries on large tables with missing statistics.

  • The SORT BY clause lets you create Parquet files with more efficient compression and smaller ranges of values for specified columns, allowing Impala to apply optimizations to skip reading data from Parquet files that do not contain any values that match equality and range operators in the WHERE clause. See CREATE TABLE Statement for details.

  • The max_audit_event_log_files lets you perform log rotation for the audit event log files, similar to the rotation for regular Impala log files.

  • Kudu enhancements:

    • The ALTER TABLE statement can specify more attributes for a Kudu table with the ADD COLUMNS clause. Now you can specify [NOT] NULL, ENCODING COMPRESSION, DEFAULT, and BLOCK_SIZE. See ALTER TABLE Statement for details.

    • The TIMESTAMP type is now available for Kudu tables.
    • The INSERT and CREATE TABLE AS SELECT statements are more efficient when writing to Kudu tables. Formerly, the overhead for the write operations could result in timeouts when writing large numbers of rows in a single operation.

New Features in CDH 5.12 / Impala 2.9

For the full list of issues closed in this release, including the issues marked as "new features", see the changelog for CDH 5.12.

The following are some of the most significant new features in this release:

  • A new function, replace(), which is faster than regexp_replace() for simple string substitutions. See Impala String Functions for details.

  • Startup flags for the impalad daemon, is_executor and is_coordinator, let you divide the work on a large, busy cluster between a small number of hosts acting as query coordinators, and a larger number of hosts acting as query executors. By default, each host can act in both roles, potentially introducing bottlenecks during heavily concurrent workloads. See How to Configure Impala with Dedicated Coordinators for details.

New Features in CDH 5.11.x

The following are some of the most significant new features in this release:

  • Improved security for Kudu tables. This enhancement comes from upgrading the Kudu client code shipped with CDH 5.11.

    • Encryption on the wire, and in the web UI. (Not on the client side.)

    • Kerberos authentication.

New Features in Impala 2.8.x / CDH 5.10.x

  • Performance and scalability improvements:

    • Improved scalability for highly concurrent loads by reducing the possibility of TCP/IP timeouts. A configuration setting, accepted_cnxn_queue_depth, can be adjusted upwards to avoid this type of timeout on large clusters.

    • Several performance improvements were made to the mechanism for generating native code:

      • Some queries involving analytic functions can take better advantage of native code generation.

      • Modules produced during intermediate code generation are organized to be easier to cache and reuse during the lifetime of a long-running or complicated query.

      • The COMPUTE STATS statement is more efficient (less time for the codegen phase) for tables with a large number of columns, especially for tables containing TIMESTAMP columns.

      • The logic for determining whether or not to use a runtime filter is more reliable, and the evaluation process itself is faster because of native code generation.

    • The MT_DOP query option enables multithreading for a number of Impala operations. COMPUTE STATS statements for Parquet tables use a default of MT_DOP=4 to improve the intra-node parallelism and CPU efficiency of this data-intensive operation. See MT_DOP Query Option for details.

    • A new hint, CLUSTERED, allows Impala INSERT operations on a Parquet table that use dynamic partitioning to process a high number of partitions in a single statement. The data is ordered based on the partition key columns, and each partition is only written by a single host, reducing the amount of memory needed to buffer Parquet data while the data blocks are being constructed.

    • The new configuration setting inc_stats_size_limit_bytes lets you reduce the load on the catalog server when running the COMPUTE INCREMENTAL STATS statement for very large tables.

    • Impala folds many constant expressions within query statements, rather than evaluating them for each row. This optimization is especially useful when using functions to manipulate and format TIMESTAMP values, such as the result of an expression such as to_date(now() - interval 1 day).

    • Parsing of complicated expressions is faster. This speedup is especially useful for queries containing large CASE expressions.

    • Evaluation is faster for IN operators with many constant arguments. The same performance improvement applies to other functions with many constant arguments.

    • Impala optimizes identical comparison operators within multiple OR blocks.

    • The reporting for wall-clock times and total CPU time in profile output is more accurate.

    • A new query option, SCRATCH_LIMIT, lets you restrict the amount of space used when a query exceeds the memory limit and activates the "spill to disk" mechanism. This option helps to avoid runaway queries or make queries "fail fast" if they require more memory than anticipated. You can prevent runaway queries from using excessive amounts of spill space, without restarting the cluster to turn the spilling feature off entirely. See SCRATCH_LIMIT Query Option for details.

  • Integration with Apache Kudu:

    • The experimental Impala support for the Kudu storage layer has been folded into the main Impala development branch. Impala can now directly access Kudu tables, opening up new capabilities such as enhanced DML operations and continuous ingestion.

    • The DELETE statement is a flexible way to remove data from a Kudu table. Previously, removing data from an Impala table involved removing or rewriting the underlying data files, dropping entire partitions, or rewriting the entire table. This Impala statement only works for Kudu tables.

    • The UPDATE statement is a flexible way to modify data within a Kudu table. Previously, updating data in an Impala table involved replacing the underlying data files, dropping entire partitions, or rewriting the entire table. This Impala statement only works for Kudu tables.

    • The UPSERT statement is a flexible way to ingest, modify, or both data within a Kudu table. Previously, ingesting data that might contain duplicates involved an inefficient multi-stage operation, and there was no built-in protection against duplicate data. The UPSERT statement, in combination with the primary key designation for Kudu tables, lets you add or replace rows in a single operation, and automatically avoids creating any duplicate data.

    • The CREATE TABLE statement gains some new clauses that are specific to Kudu tables: PARTITION BY, PARTITIONS, STORED AS KUDU, and column attributes PRIMARY KEY, NULL and NOT NULL, ENCODING, COMPRESSION, DEFAULT, and BLOCK_SIZE. These clauses replace the explicit TBLPROPERTIES settings that were required in the early experimental phases of integration between Impala and Kudu.

    • The ALTER TABLE statement can change certain attributes of Kudu tables. You can add, drop, or rename columns. You can add or drop range partitions. You can change the TBLPROPERTIES value to rename or point to a different underlying Kudu table, independently from the Impala table name in the metastore database. You cannot change the data type of an existing column in a Kudu table.

    • The SHOW PARTITIONS statement displays information about the distribution of data between partitions in Kudu tables. A new variation, SHOW RANGE PARTITIONS, displays information about the Kudu-specific partitions that apply across ranges of key values.

    • Not all Impala data types are supported in Kudu tables. In particular, currently the Impala TIMESTAMP type is not allowed in a Kudu table. Impala does not recognize the UNIXTIME_MICROS Kudu type when it is present in a Kudu table. (These two representations of date/time data use different units and are not directly compatible.) You cannot create columns of type TIMESTAMP, DECIMAL, VARCHAR, or CHAR within a Kudu table. Within a query, you can cast values in a result set to these types. Certain types, such as BOOLEAN, cannot be used as primary key columns.

    • Currently, Kudu tables are not interchangeable between Impala and Hive the way other kinds of Impala tables are. Although the metadata for Kudu tables is stored in the metastore database, currently Hive cannot access Kudu tables.

    • The INSERT statement works for Kudu tables. The organization of the Kudu data makes it more efficient than with HDFS-backed tables to insert data in small batches, such as with the INSERT ... VALUES syntax.

    • Some audit data is recorded for data governance purposes. All UPDATE, DELETE, and UPSERT statements are characterized as INSERT operations in the audit log. Currently, lineage metadata is not generated for UPDATE and DELETE operations on Kudu tables.

    • Currently, Kudu tables have limited support for Sentry:
      • Access to Kudu tables must be granted to roles as usual.

      • Currently, access to a Kudu table through Sentry is "all or nothing". You cannot enforce finer-grained permissions such as at the column level, or permissions on certain operations such as INSERT.

      • Only users with ALL privileges on SERVER can create external Kudu tables.

      Because non-SQL APIs can access Kudu data without going through Sentry authorization, currently the Sentry support is considered preliminary.
    • Equality and IN predicates in Impala queries are pushed to Kudu and evaluated efficiently by the Kudu storage layer.

  • Security:

    • Impala can take advantage of the S3 encrypted credential store, to avoid exposing the secret key when accessing data stored on S3.

  • [IMPALA-1654] Several kinds of DDL operations can now work on a range of partitions. The partitions can be specified using operators such as <, >=, and != rather than just an equality predicate applying to a single partition. This new feature extends the syntax of several clauses of the ALTER TABLE statement (DROP PARTITION, SET [UN]CACHED, SET FILEFORMAT | SERDEPROPERTIES | TBLPROPERTIES), the SHOW FILES statement, and the COMPUTE INCREMENTAL STATS statement. It does not apply to statements that are defined to only apply to a single partition, such as LOAD DATA, ALTER TABLE ... ADD PARTITION, SET LOCATION, and INSERT with a static partitioning clause.

  • The instr() function has optional second and third arguments, representing the character to position to begin searching for the substring, and the Nth occurrence of the substring to find.

  • Improved error handling for malformed Avro data. In particular, incorrect precision or scale for DECIMAL types is now handled.

  • Impala debug web UI:

    • In addition to "inflight" and "finished" queries, the web UI now also includes a section for "queued" queries.

    • The /sessions tab now clarifies how many of the displayed sections are active, and lets you sort by Expired status to distinguish active sessions from expired ones.

  • Improved stability when DDL operations such as CREATE DATABASE or DROP DATABASE are run in Hive at the same time as an Impala INVALIDATE METADATA statement.

  • The "out of memory" error report was made more user-friendly, with additional diagnostic information to help identify the spot where the memory limit was exceeded.

  • Improved disk space usage for Java-based UDFs. Temporary copies of the associated JAR files are removed when no longer needed, so that they do not accumulate across restarts of the catalogd daemon and potentially cause an out-of-space condition. These temporary files are also created in the directory specified by the local_library_dir configuration setting, so that the storage for these temporary files can be independent from any capacity limits on the /tmp filesystem.

New Features in Impala 2.7.x / CDH 5.9.x

  • Performance improvements:

    • [IMPALA-3206] Speedup for queries against DECIMAL columns in Avro tables. The code that parses DECIMAL values from Avro now uses native code generation.

    • [IMPALA-3674] Improved efficiency in LLVM code generation can reduce codegen time, especially for short queries.

    • [IMPALA-2979] Improvements to scheduling on worker nodes, enabled by the REPLICA_PREFERENCE query option. See REPLICA_PREFERENCE Query Option (CDH 5.9 or higher only) for details.

  • [IMPALA-1683] The REFRESH statement can be applied to a single partition, rather than the entire table. See REFRESH Statement and Refreshing a Single Partition for details.

  • Improvements to the Impala web user interface:

    • [IMPALA-2767] You can now force a session to expire by clicking a link in the web UI, on the /sessions tab.

    • [IMPALA-3715] The /memz tab includes more information about Impala memory usage.

    • [IMPALA-3716] The Details page for a query now includes a Memory tab.

  • [IMPALA-3499] Scalability improvements to the catalog server. Impala handles internal communication more efficiently for tables with large numbers of columns and partitions, where the size of the metadata exceeds 2 GiB.

  • [IMPALA-3677] You can send a SIGUSR1 signal to any Impala-related daemon to write a Breakpad minidump. For advanced troubleshooting, you can now produce a minidump without triggering a crash. See Breakpad Minidumps for Impala (CDH 5.8 or higher only) for details about the Breakpad minidump feature.

  • [IMPALA-3687] The schema reconciliation rules for Avro tables have changed slightly for CHAR and VARCHAR columns. Now, if the definition of such a column is changed in the Avro schema file, the column retains its CHAR or VARCHAR type as specified in the SQL definition, but the column name and comment from the Avro schema file take precedence. See Creating Avro Tables for details about column definitions in Avro tables.

  • [IMPALA-3575] Some network operations now have additional timeout and retry settings. The extra configuration helps avoid failed queries for transient network problems, to avoid hangs when a sender or receiver fails in the middle of a network transmission, and to make cancellation requests more reliable despite network issues.

New Features in Impala 2.6.x / CDH 5.8.x

  • Improvements to Impala support for the Amazon S3 filesystem:

  • Performance improvements for the runtime filtering feature:

    • The default for the RUNTIME_FILTER_MODE query option is changed to GLOBAL (the highest setting). See RUNTIME_FILTER_MODE Query Option (CDH 5.7 or higher only) for details about this option.

    • The RUNTIME_BLOOM_FILTER_SIZE setting is now only used as a fallback if statistics are not available; otherwise, Impala uses the statistics to estimate the appropriate size to use for each filter. See RUNTIME_BLOOM_FILTER_SIZE Query Option (CDH 5.7 or higher only) for details about this option.

    • New query options RUNTIME_FILTER_MIN_SIZE and RUNTIME_FILTER_MAX_SIZE let you fine-tune the sizes of the Bloom filter structures used for runtime filtering. If the filter size derived from Impala internal estimates or from the RUNTIME_FILTER_BLOOM_SIZE falls outside the size range specified by these options, any too-small filter size is adjusted to the minimum, and any too-large filter size is adjusted to the maximum. See RUNTIME_FILTER_MIN_SIZE Query Option (CDH 5.8 or higher only) and RUNTIME_FILTER_MAX_SIZE Query Option (CDH 5.8 or higher only) for details about these options.

    • Runtime filter propagation now applies to all the operands of UNION and UNION ALL operators.

    • Runtime filters can now be produced during join queries even when the join processing activates the spill-to-disk mechanism.

    See Runtime Filtering for Impala Queries (CDH 5.7 or higher only) for general information about the runtime filtering feature.
  • Admission control and dynamic resource pools are enabled by default. See Admission Control and Query Queuing for details about admission control.

  • Impala can now manually set column statistics, using the ALTER TABLE statement with a SET COLUMN STATS clause. See for details.

  • Impala can now write lightweight "minidump" files, rather than large core files, to save diagnostic information when any of the Impala-related daemons crash. This feature uses the open source breakpad framework. See Breakpad Minidumps for Impala (CDH 5.8 or higher only) for details.

  • New query options improve interoperability with Parquet files:
    • The PARQUET_FALLBACK_SCHEMA_RESOLUTION query option lets Impala locate columns within Parquet files based on column name rather than ordinal position. This enhancement improves interoperability with applications that write Parquet files with a different order or subset of columns than are used in the Impala table. See PARQUET_FALLBACK_SCHEMA_RESOLUTION Query Option (CDH 5.8 or higher only) for details.

    • The PARQUET_ANNOTATE_STRINGS_UTF8 query option makes Impala include the UTF-8 annotation metadata for STRING, CHAR, and VARCHAR columns in Parquet files created by INSERT or CREATE TABLE AS SELECT statements. See PARQUET_ANNOTATE_STRINGS_UTF8 Query Option (CDH 5.8 or higher only) for details.

    See Using the Parquet File Format with Impala Tables for general information about working with Parquet files.
  • Improvements to security and reduction in overhead for secure clusters:

    • Overall performance improvements for secure clusters. (TPC-H queries on a secure cluster were benchmarked at roughly 3x as fast as the previous release.)

    • Impala now recognizes the auth_to_local setting, specified through the HDFS configuration setting hadoop.security.auth_to_local. This feature is disabled by default; to enable it, specify --load_auth_to_local_rules=true in the impalad configuration settings. See Mapping Kerberos Principals to Short Names for Impala for details.

    • Timing improvements in the mechanism for the impalad daemon to acquire Kerberos tickets. This feature spreads out the overhead on the KDC during Impala startup, especially for large clusters.

    • For Kerberized clusters, the Catalog service now uses the Kerberos principal instead of the operating sytem user that runs the catalogd daemon. This eliminates the requirement to configure a hadoop.user.group.static.mapping.overrides setting to put the OS user into the Sentry administrative group, on clusters where the principal and the OS user name for this user are different.

  • Overall performance improvements for join queries, by using a prefetching mechanism while building the in-memory hash table to evaluate join predicates. See PREFETCH_MODE Query Option (CDH 5.8 or higher only) for the query option to control this optimization.

  • The impala-shell interpreter has a new command, SOURCE, that lets you run a set of SQL statements or other impala-shell commands stored in a file. You can run additional SOURCE commands from inside a file, to set up flexible sequences of statements for use cases such as schema setup, ETL, or reporting. See impala-shell Command Reference for details and Running Commands and SQL Statements in impala-shell for examples.

  • The millisecond() built-in function lets you extract the fractional seconds part of a TIMESTAMP value. See Impala Date and Time Functions for details.

  • New built-in functions base64encode() and base64decode() let you encode arbitrary string values in plaintext format. See Impala String Functions for details.

  • If an Avro table is created without column definitions in the CREATE TABLE statement, and columns are later added through ALTER TABLE, the resulting table is now queryable. Missing values from the newly added columns now default to NULL. See Using the Avro File Format with Impala Tables for general details about working with Avro files.

  • The mechanism for interpreting DECIMAL literals is improved, no longer going through an intermediate conversion step to DOUBLE:
    • Casting a DECIMAL value to TIMESTAMP DOUBLE produces a more precise value for the TIMESTAMP than formerly.

    • Certain function calls involving DECIMAL literals now succeed, when formerly they failed due to lack of a function signature with a DOUBLE argument.

    • Faster runtime performance for DECIMAL constant values, through improved native code generation for all combinations of precision and scale.

    See DECIMAL Data Type for details about the DECIMAL type.
  • Improved type accuracy for CASE return values. If all WHEN clauses of the CASE expression are of CHAR type, the final result is also CHAR instead of being converted to STRING. See Impala Conditional Functions for details about the CASE function.

  • Uncorrelated queries using the NOT EXISTS operator are now supported. Formerly, the NOT EXISTS operator was only available for correlated subqueries.

  • Improved performance for reading Parquet files.

  • Improved performance for top-N queries, that is, those including both ORDER BY and LIMIT clauses.

  • Impala optionally skips an arbitrary number of header lines from text input files on HDFS based on the skip.header.line.count value in the TBLPROPERTIES field of the table metadata. See Data Files for Text Tables for details.

  • Trailing comments are now allowed in queries processed by the impala-shell options -q and -f.

  • Impala can run COUNT queries for RCFile tables that include complex type columns. See CDH 5.5 for general information about working with complex types, and CDH 5.5, CDH 5.5, and CDH 5.5 for syntax details of each type.

New Features in Impala 2.5.x / CDH 5.7.x

  • Dynamic partition pruning. When a query refers to a partition key column in a WHERE clause, and the exact set of column values are not known until the query is executed, Impala evaluates the predicate and skips the I/O for entire partitions that are not needed. For example, if a table was partitioned by year, Impala would apply this technique to a query such as SELECT c1 FROM partitioned_table WHERE year = (SELECT MAX(year) FROM other_table).

    The dynamic partition pruning optimization technique lets Impala avoid reading data files from partitions that are not part of the result set, even when that determination cannot be made in advance. This technique is especially valuable when performing join queries involving partitioned tables. For example, if a join query includes an ON clause and a WHERE clause that refer to the same columns, the query can find the set of column values that match the WHERE clause, and only scan the associated partitions when evaluating the ON clause.

    Dynamic partition pruning is controlled by the same settings as the runtime filtering feature. By default, this feature is enabled at a medium level, because the maximum setting can use slightly more memory for queries than in previous releases. To fully enable this feature, set the query option RUNTIME_FILTER_MODE=GLOBAL.

  • Runtime filtering. This is a wide-ranging set of optimizations that are especially valuable for join queries. Using the same technique as with dynamic partition pruning, Impala uses the predicates from WHERE and ON clauses to determine the subset of column values from one of the joined tables could possibly be part of the result set. Impala sends a compact representation of the filter condition to the hosts in the cluster, instead of the full set of values or the entire table. See Runtime Filtering for Impala Queries (CDH 5.7 or higher only) for details.

    By default, this feature is enabled at a medium level, because the maximum setting can use slightly more memory for queries than in previous releases. To fully enable this feature, set the query option RUNTIME_FILTER_MODE=GLOBAL. See RUNTIME_FILTER_MODE Query Option (CDH 5.7 or higher only) for details.

    This feature involves some new query options: RUNTIME_FILTER_MODE, MAX_NUM_RUNTIME_FILTERS, RUNTIME_BLOOM_FILTER_SIZE, RUNTIME_FILTER_WAIT_TIME_MS, and DISABLE_ROW_RUNTIME_FILTERING. See RUNTIME_FILTER_MODE, MAX_NUM_RUNTIME_FILTERS, RUNTIME_BLOOM_FILTER_SIZE, RUNTIME_FILTER_WAIT_TIME_MS, and DISABLE_ROW_RUNTIME_FILTERING for details.

  • More efficient use of the HDFS caching feature, to avoid hotspots and bottlenecks that could occur if heavily used cached data blocks were always processed by the same host. By default, Impala now randomizes which host processes each cached HDFS data block, when cached replicas are available on multiple hosts. (Remember to use the WITH REPLICATION clause with the CREATE TABLE or ALTER TABLE statement when enabling HDFS caching for a table or partition, to cache the same data blocks across multiple hosts.) The new query option SCHEDULE_RANDOM_REPLICA lets you fine-tune the interaction with HDFS caching even more. See Using HDFS Caching with Impala (CDH 5.3 or higher only) for details.

  • The TRUNCATE TABLE statement now accepts an IF EXISTS clause, making TRUNCATE TABLE easier to use in setup or ETL scripts where the table might or might not exist. See TRUNCATE TABLE Statement (CDH 5.5 or higher only) for details.

  • Improved performance and reliability for the DECIMAL data type:
    • Using DECIMAL values in a GROUP BY clause now triggers the native code generation optimization, speeding up queries that group by values such as prices.

    • Checking for overflow in DECIMAL multiplication is now substantially faster, making DECIMAL a more practical data type in some use cases where formerly DECIMAL was much slower than FLOAT or DOUBLE.

    • Multiplying a mixture of DECIMAL and FLOAT or DOUBLE values now returns the DOUBLE rather than DECIMAL. This change avoids some cases where an intermediate value would underflow or overflow and become NULL unexpectedly.

    See DECIMAL Data Type for details.
  • For UDFs written in Java, or Hive UDFs reused for Impala, Impala now allows parameters and return values to be primitive types. Formerly, these things were required to be one of the "Writable" object types. See Using Hive UDFs with Impala for details.

  • Performance improvements for HDFS I/O. Impala now caches HDFS file handles to avoid the overhead of repeatedly opening the same file.

  • Performance improvements for queries involving nested complex types. Certain basic query types, such as counting the elements of a complex column, now use an optimized code path.

  • Improvements to the memory reservation mechanism for the Impala admission control feature. You can specify more settings, such as the timeout period and maximum aggregate memory used, for each resource pool instead of globally for the Impala instance. The default limit for concurrent queries (the max requests setting) is now unlimited instead of 200. The Cloudera Manager user interface for admission control has been reworked, with the settings available under the Dynamic Resource Pools window.

  • Performance improvements related to code generation. Even in queries where code generation is not performed for some phases of execution (such as reading data from Parquet tables), Impala can still use code generation in other parts of the query, such as evaluating functions in the WHERE clause.

  • Performance improvements for queries using aggregation functions on high-cardinality columns. Formerly, Impala could do unnecessary extra work to produce intermediate results for operations such as DISTINCT or GROUP BY on columns that were unique or had few duplicate values. Now, Impala decides at run time whether it is more efficient to do an initial aggregation phase and pass along a smaller set of intermediate data, or to pass raw intermediate data back to next phase of query processing to be aggregated there. This feature is known as streaming pre-aggregation. In case of performance regression, this feature can be turned off using the DISABLE_STREAMING_PREAGGREGATIONS query option. See DISABLE_STREAMING_PREAGGREGATIONS Query Option (CDH 5.7 or higher only) for details.

  • Spill-to-disk feature now always recommended. In earlier releases, the spill-to-disk feature could be turned off using a pair of configuration settings, enable_partitioned_aggregation=false and enable_partitioned_hash_join=false. The latest improvements in the spill-to-disk mechanism, and related features that interact with it, make this feature robust enough that disabling it is now no longer needed or supported. In particular, some new features in CDH 5.7 / Impala 2.5 and higher do not work when the spill-to-disk feature is disabled.

  • Improvements to scripting capability for the impala-shell command, through user-specified substitution variables that can appear in statements processed by impala-shell:

    • The --var command-line option lets you pass key-value pairs to impala-shell. The shell can substitute the values into queries before executing them, where the query text contains the notation ${var:varname}. For example, you might prepare a SQL file containing a set of DDL statements and queries containing variables for database and table names, and then pass the applicable names as part of the impala-shell -f filename command. See Running Commands and SQL Statements in impala-shell for details.

    • The SET and UNSET commands within the impala-shell interpreter now work with user-specified substitution variables, as well as the built-in query options. The two kinds of variables are divided in the SET output. As with variables defined by the --var command-line option, you refer to the user-specified substitution variables in queries by using the notation ${var:varname} in the query text. Because the substitution variables are processed by impala-shell instead of the impalad backend, you cannot define your own substitution variables through the SET statement in a JDBC or ODBC application. See SET Statement for details.

  • Performance improvements for query startup. Impala better parallelizes certain work when coordinating plan distribution between impalad instances, which improves startup time for queries involving tables with many partitions on large clusters, or complicated queries with many plan fragments.

  • Performance and scalability improvements for tables with many partitions. The memory requirements on the coordinator node are reduced, making it substantially faster and less resource-intensive to do joins involving several tables with thousands of partitions each.

  • Whitelisting for access to internal APIs. For applications that need direct access to Impala APIs, without going through the HiveServer2 or Beeswax interfaces, you can specify a list of Kerberos users who are allowed to call those APIs. By default, the impala and hdfs users are the only ones authorized for this kind of access. Any users not explicitly authorized through the internal_principals_whitelist configuration setting are blocked from accessing the APIs. This setting applies to all the Impala-related daemons, although currently it is primarily used for HDFS to control the behavior of the catalog server.

  • Improvements to Impala integration and usability for Hue. (The code changes are actually on the Hue side.)

    • The list of tables now refreshes dynamically.

  • Usability improvements for case-insensitive queries. You can now use the operators ILIKE and IREGEXP to perform case-insensitive wildcard matches or regular expression matches, rather than explicitly converting column values with UPPER or LOWER. See ILIKE Operator and IREGEXP Operator for details.

  • Performance and reliability improvements for DDL and insert operations on partitioned tables with a large number of partitions. Impala only re-evaluates metadata for partitions that are affected by a DDL operation, not all partitions in the table. While a DDL or insert statement is in progress, other Impala statements that attempt to modify metadata for the same table wait until the first one finishes.

  • Reliability improvements for the LOAD DATA statement. Previously, this statement would fail if the source HDFS directory contained any subdirectories at all. Now, the statement ignores any hidden subdirectories, for example _impala_insert_staging.

  • A new operator, IS [NOT] DISTINCT FROM, lets you compare values and always get a true or false result, even if one or both of the values are NULL. The IS NOT DISTINCT FROM operator, or its equivalent <=> notation, improves the efficiency of join queries that treat key values that are NULL in both tables as equal. See IS DISTINCT FROM Operator for details.

  • Security enhancements for the impala-shell command. A new option, --ldap_password_cmd, lets you specify a command to retrieve the LDAP password. The resulting password is then used to authenticate the impala-shell command with the LDAP server. See impala-shell Configuration Options for details.

  • The CREATE TABLE AS SELECT statement now accepts a PARTITIONED BY clause, which lets you create a partitioned table and insert data into it with a single statement. See CREATE TABLE Statement for details.

  • User-defined functions (UDFs and UDAFs) written in C++ now persist automatically when the catalogd daemon is restarted. You no longer have to run the CREATE FUNCTION statements again after a restart.

  • User-defined functions (UDFs) written in Java can now persist when the catalogd daemon is restarted, and can be shared transparently between Impala and Hive. You must do a one-time operation to recreate these UDFs using new CREATE FUNCTION syntax, without a signature for arguments or the return value. Afterwards, you no longer have to run the CREATE FUNCTION statements again after a restart. Although Impala does not have visibility into the UDFs that implement the Hive built-in functions, user-created Hive UDFs are now automatically available for calling through Impala. See CREATE FUNCTION Statement for details.

  • Reliability enhancements for memory management. Some aggregation and join queries that formerly might have failed with an out-of-memory error due to memory contention, now can succeed using the spill-to-disk mechanism.

  • The SHOW DATABASES statement now returns two columns rather than one. The second column includes the associated comment string, if any, for each database. Adjust any application code that examines the list of databases and assumes the result set contains only a single column. See SHOW DATABASES for details.

  • A new optimization speeds up aggregation operations that involve only the partition key columns of partitioned tables. For example, a query such as SELECT COUNT(DISTINCT k), MIN(k), MAX(k) FROM t1 can avoid reading any data files if T1 is a partitioned table and K is one of the partition key columns. Because this technique can produce different results in cases where HDFS files in a partition are manually deleted or are empty, you must enable the optimization by setting the query option OPTIMIZE_PARTITION_KEY_SCANS. See OPTIMIZE_PARTITION_KEY_SCANS Query Option (CDH 5.7 or higher only) for details.

  • The DESCRIBE statement can now display metadata about a database, using the syntax DESCRIBE DATABASE db_name. See DESCRIBE Statement for details.

  • The uuid() built-in function generates an alphanumeric value that you can use as a guaranteed unique identifier. The uniqueness applies even across tables, for cases where an ascending numeric sequence is not suitable. See Impala Miscellaneous Functions for details.

New Features in Impala 2.4.x / CDH 5.6.x

  • From a user perspective, the Impala features are the same as in CDH 5.5 / Impala 2.3.

New Features in Impala 2.3.x / CDH 5.5.x

The following are the major new features in Impala 2.3.x. This major release, available as part of CDH 5.5.x, contains improvements to SQL syntax (particularly new support for complex types), performance, manageability, security.

  • Complex data types: STRUCT, ARRAY, and MAP. These types can encode multiple named fields, positional items, or key-value pairs within a single column. You can combine these types to produce nested types with arbitrarily deep nesting, such as an ARRAY of STRUCT values, a MAP where each key-value pair is an ARRAY of other MAP values, and so on. Currently, complex data types are only supported for the Parquet file format. See CDH 5.5 for usage details and CDH 5.5, CDH 5.5, and CDH 5.5 for syntax.

  • Column-level authorization lets you define access to particular columns within a table, rather than the entire table. This feature lets you reduce the reliance on creating views to set up authorization schemes for subsets of information. See Hive SQL Syntax for Use with Sentry for background details, and GRANT Statement (CDH 5.2 or higher only) and REVOKE Statement (CDH 5.2 or higher only) for Impala-specific syntax.

  • The TRUNCATE TABLE statement removes all the data from a table without removing the table itself. See TRUNCATE TABLE Statement (CDH 5.5 or higher only) for details.

  • Nested loop join queries. Some join queries that formerly required equality comparisons can now use operators such as < or >=. This same join mechanism is used internally to optimize queries that retrieve values from complex type columns. See Joins in Impala SELECT Statements for details about Impala join queries.

  • Reduced memory usage and improved performance and robustness for spill-to-disk feature. See SQL Operations that Spill to Disk for details about this feature.

  • Performance improvements for querying Parquet data files containing multiple row groups and multiple data blocks:

    • For files written by Hive, SparkSQL, and other Parquet MR writers and spanning multiple HDFS blocks, Impala now scans the extra data blocks locally when possible, rather than using remote reads.

    • Impala queries benefit from the improved alignment of row groups with HDFS blocks for Parquet files written by Hive, MapReduce, and other components in CDH 5.5 and higher. (Impala itself never writes multiblock Parquet files, so the alignment change does not apply to Parquet files produced by Impala.) These Parquet writers now add padding to Parquet files that they write to align row groups with HDFS blocks. The parquet.writer.max-padding setting specifies the maximum number of bytes, by default 8 megabytes, that can be added to the file between row groups to fill the gap at the end of one block so that the next row group starts at the beginning of the next block. If the gap is larger than this size, the writer attempts to fit another entire row group in the remaining space. Include this setting in the hive-site configuration file to influence Parquet files written by Hive, or the hdfs-site configuration file to influence Parquet files written by all non-Impala components.

    See Using the Parquet File Format with Impala Tables for instructions about using Parquet data files with Impala, and Using Apache Parquet Data Files with CDH for instructions for other components that can read and write Parquet files.

  • Many new built-in scalar functions, for convenience and enhanced portability of SQL that uses common industry extensions.

    Math functions (see Impala Mathematical Functions for details):

    • ATAN2
    • COSH
    • COT
    • DCEIL
    • DEXP
    • DFLOOR
    • DLOG10
    • DPOW
    • DROUND
    • DSQRT
    • DTRUNC
    • FACTORIAL, and corresponding ! operator
    • FPOW
    • RADIANS
    • RANDOM
    • SINH
    • TANH

    String functions (see Impala String Functions for details):

    • BTRIM
    • CHR
    • REGEXP_LIKE
    • SPLIT_PART

    Date and time functions (see Impala Date and Time Functions for details):

    • INT_MONTHS_BETWEEN
    • MONTHS_BETWEEN
    • TIMEOFDAY
    • TIMESTAMP_CMP

    Bit manipulation functions (see Impala Bit Functions for details):

    • BITAND
    • BITNOT
    • BITOR
    • BITXOR
    • COUNTSET
    • GETBIT
    • ROTATELEFT
    • ROTATERIGHT
    • SETBIT
    • SHIFTLEFT
    • SHIFTRIGHT

    Type conversion functions (see Impala Type Conversion Functions for details):

    • TYPEOF

    The effective_user() function (see Impala Miscellaneous Functions for details).

  • New built-in analytic functions: PERCENT_RANK, NTILE, CUME_DIST. See Impala Analytic Functions for details.

  • The DROP DATABASE statement now works for a non-empty database. When you specify the optional CASCADE clause, any tables in the database are dropped before the database itself is removed. See DROP DATABASE Statement for details.

  • The DROP TABLE and ALTER TABLE DROP PARTITION statements have a new optional keyword, PURGE. This keyword causes Impala to immediately remove the relevant HDFS data files rather than sending them to the HDFS trashcan. This feature can help to avoid out-of-space errors on storage devices, and to avoid files being left behind in case of a problem with the HDFS trashcan, such as the trashcan not being configured or being in a different HDFS encryption zone than the data files. See DROP TABLE Statement and ALTER TABLE Statement for syntax.

  • The impala-shell command has a new feature for live progress reporting. This feature is enabled through the --live_progress and --live_summary command-line options, or during a session through the LIVE_SUMMARY and LIVE_PROGRESS query options. See LIVE_PROGRESS Query Option (CDH 5.5 or higher only) and LIVE_SUMMARY Query Option (CDH 5.5 or higher only) for details.

  • The impala-shell command also now displays a random "tip of the day" when it starts.

  • The impala-shell option -f now recognizes a special filename - to accept input from stdin. See impala-shell Configuration Options for details about the options for running impala-shell in non-interactive mode.

  • Format strings for the unix_timestamp() function can now include numeric timezone offsets. See Impala Date and Time Functions for details.

  • Impala can now run a specified command to obtain the password to decrypt a private-key PEM file, rather than having the private-key file be unencrypted on disk. See Configuring TLS/SSL for Impala for details.

  • Impala components now can use SSL for more of their internal communication. SSL is used for communication between all three Impala-related daemons when the configuration option ssl_server_certificate is enabled. SSL is used for communication with client applications when the configuration option ssl_client_ca_certificate is enabled. See Configuring TLS/SSL for Impala for details.

    Currently, you can only use one of server-to-server TLS/SSL encryption or Kerberos authentication. This limitation is tracked by the issue IMPALA-2598.

  • Improved flexibility for intermediate data types in user-defined aggregate functions (UDAFs). See Writing User-Defined Aggregate Functions (UDAFs) for details.

In CDH 5.5.2 / Impala 2.3.2, the bug fix for IMPALA-2598 removes the restriction on using both Kerberos and SSL for internal communication between Impala components.

New Features in Impala 2.2.x for CDH 5.4.3 and 5.4.4

No new features added to the Impala code. The certification of Impala with EMC Isilon under CDH 5.4.4 means that now you can query data stored on Isilon storage devices through Impala. See Installing Cloudera Manager and CDH with Isilon for details. The same level of Impala is included with both CDH 5.4.3 and 5.4.4.

New Features in Impala 2.2.x / CDH 5.4.x

The following are the major new features in Impala 2.2.x. This release, available as part of CDH 5.4.x, contains improvements to performance, manageability, security, and SQL syntax.

  • Several improvements to date and time features enable higher interoperability with Hive and other database systems, provide more flexibility for handling time zones, and future-proof the handling of TIMESTAMP values:

    • The WITH REPLICATION clause for the CREATE TABLE and ALTER TABLE statements lets you control the replication factor for HDFS caching for a specific table or partition. By default, each cached block is only present on a single host, which can lead to CPU contention if the same host processes each cached block. Increasing the replication factor lets Impala choose different hosts to process different cached blocks, to better distribute the CPU load.

    • Startup flags for the impalad daemon enable a higher level of compatibility with TIMESTAMP values written by Hive, and more flexibility for working with date and time data using the local time zone instead of UTC. To enable these features, set the impalad startup flags -use_local_tz_for_unix_timestamp_conversions=true and -convert_legacy_hive_parquet_utc_timestamps=true.

      The -use_local_tz_for_unix_timestamp_conversions setting controls how the unix_timestamp(), from_unixtime(), and now() functions handle time zones. By default (when this setting is turned off), Impala considers all TIMESTAMP values to be in the UTC time zone when converting to or from Unix time values. When this setting is enabled, Impala treats TIMESTAMP values passed to or returned from these functions to be in the local time zone. When this setting is enabled, take particular care that all hosts in the cluster have the same timezone settings, to avoid inconsistent results depending on which host reads or writes TIMESTAMP data.

      The -convert_legacy_hive_parquet_utc_timestamps setting causes Impala to convert TIMESTAMP values to the local time zone when it reads them from Parquet files written by Hive. This setting only applies to data using the Parquet file format, where Impala can use metadata in the files to reliably determine that the files were written by Hive. If in the future Hive changes the way it writes TIMESTAMP data in Parquet, Impala will automatically handle that new TIMESTAMP encoding.

      See TIMESTAMP Data Type for details about time zone handling and the configuration options for Impala / Hive compatibility with Parquet format.

    • In Impala 2.2.0 and higher, built-in functions that accept or return integers representing TIMESTAMP values use the BIGINT type for parameters and return values, rather than INT. This change lets the date and time functions avoid an overflow error that would otherwise occur on January 19th, 2038 (known as the "Year 2038 problem" or "Y2K38 problem"). This change affects the FROM_UNIXTIME() and UNIX_TIMESTAMP() functions. You might need to change application code that interacts with these functions, change the types of columns that store the return values, or add CAST() calls to SQL statements that call these functions.

      See Impala Date and Time Functions for the current function signatures.

  • The SHOW FILES statement lets you view the names and sizes of the files that make up an entire table or a specific partition. See SHOW FILES Statement for details.

  • Impala can now run queries against Parquet data containing columns with complex or nested types, as long as the query only refers to columns with scalar types.

  • Performance improvements for queries that include IN() operators and involve partitioned tables.

  • The new -max_log_files configuration option specifies how many log files to keep at each severity level. The default value is 10, meaning that Impala preserves the latest 10 log files for each severity level (INFO, WARNING, and ERROR) for each Impala-related daemon (impalad, statestored, and catalogd). Impala checks to see if any old logs need to be removed based on the interval specified in the logbufsecs setting, every 5 seconds by default. See Rotating Impala Logs for details.

  • Redaction of sensitive data from Impala log files. This feature protects details such as credit card numbers or tax IDs from administrators who see the text of SQL statements in the course of monitoring and troubleshooting a Hadoop cluster. See Redacting Sensitive Information from Impala Log Files for background information for Impala users, and How to Enable Sensitive Data Redaction for usage details.

  • Lineage information is available for data created or queried by Impala. This feature lets you track who has accessed data through Impala SQL statements, down to the level of specific columns, and how data has been propagated between tables. See Viewing Lineage Information for Impala Data for background information for Impala users, Managing Hive and Impala Lineage Properties for usage details, and Using the Lineage View for how to interpret the lineage information.

  • Impala tables and partitions can now be located on the Amazon Simple Storage Service (S3) filesystem, for convenience in cases where data is already located in S3 and you prefer to query it in-place. Queries might have lower performance than when the data files reside on HDFS, because Impala uses some HDFS-specific optimizations. Impala can query data in S3, but cannot write to S3. Therefore, statements such as INSERT and LOAD DATA are not available when the destination table or partition is in S3. See Using Impala with the Amazon S3 Filesystem for details.

  • Improved support for HDFS encryption. The LOAD DATA statement now works when the source directory and destination table are in different encryption zones. See http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_sg_component_kms.html for details about using HDFS encryption with Impala.

  • Additional arithmetic function mod(). See Impala Mathematical Functions for details.

  • Flexibility to interpret TIMESTAMP values using the UTC time zone (the traditional Impala behavior) or using the local time zone (for compatibility with TIMESTAMP values produced by Hive).

  • Enhanced support for ETL using tools such as Flume. Impala ignores temporary files typically produced by these tools (filenames with suffixes .copying and .tmp).

  • The CPU requirement for Impala, which had become more restrictive in Impala 2.0.x and 2.1.x, has now been relaxed.

    The prerequisite for CPU architecture has been relaxed in Impala 2.2.0 and higher. From this release onward, Impala works on CPUs that have the SSSE3 instruction set. The SSE4 instruction set is no longer required. This relaxed requirement simplifies the upgrade planning from Impala 1.x releases, which also worked on SSSE3-enabled processors.

  • Enhanced support for CHAR and VARCHAR types in the COMPUTE STATS statement.

  • The amount of memory required during setup for "spill to disk" operations is greatly reduced. This enhancement reduces the chance of a memory-intensive join or aggregation query failing with an out-of-memory error.

  • Several new conditional functions provide enhanced compatibility when porting code that uses industry extensions. The new functions are: isfalse(), isnotfalse(), isnottrue(), istrue(), nonnullvalue(), and nullvalue(). See Impala Conditional Functions for details.

  • The Impala debug web UI now can display a visual representation of the query plan. On the /queries tab, select Details for a particular query. The Details page includes a Plan tab with a plan diagram that you can zoom in or out (using scroll gestures through mouse wheel or trackpad).

New Features in Impala 2.1.8 / CDH 5.3.10

This point release is exclusively a bug fix release.

New Features in Impala 2.1.7 / CDH 5.3.9

This point release is exclusively a bug fix release.

New Features in Impala 2.1.6 / CDH 5.3.8

This point release is exclusively a bug fix release.

New Features in Impala 2.1.5 / CDH 5.3.6

This point release is exclusively a bug fix release.

New Features in Impala 2.1.0 / CDH 5.3.0

This release contains the following enhancements to query performance and system scalability:

  • Impala can now collect statistics for individual partitions in a partitioned table, rather than processing the entire table for each COMPUTE STATS statement. This feature is known as incremental statistics, and is controlled by the COMPUTE INCREMENTAL STATS syntax. (You can still use the original COMPUTE STATS statement for nonpartitioned tables or partitioned tables that are unchanging or whose contents are entirely replaced all at once.) See COMPUTE STATS Statement and Table and Column Statistics for details.

  • Optimization for small queries lets Impala process queries that process very few rows without the unnecessary overhead of parallelizing and generating native code. Reducing this overhead lets Impala clear small queries quickly, keeping YARN resources and admission control slots available for data-intensive queries. The number of rows considered to be a "small" query is controlled by the EXEC_SINGLE_NODE_ROWS_THRESHOLD query option. See EXEC_SINGLE_NODE_ROWS_THRESHOLD Query Option (CDH 5.3 or higher only) for details.

  • An enhancement to the statestore component lets it transmit heartbeat information independently of broadcasting metadata updates. This optimization improves reliability of health checking on large clusters with many tables and partitions.

  • The memory requirement for querying gzip-compressed text is reduced. Now Impala decompresses the data as it is read, rather than reading the entire gzipped file and decompressing it in memory.

New Features in Impala 2.0.0 / CDH 5.2.0

The following are the major new features in Impala 2.0. This major release, available both with CDH 5.2 and for CDH 4, contains improvements to performance, scalability, security, and SQL syntax.

  • Queries with joins or aggregation functions involving high volumes of data can now use temporary work areas on disk, reducing the chance of failure due to out-of-memory errors. When the required memory for the intermediate result set exceeds the amount available on a particular node, the query automatically uses a temporary work area on disk. This "spill to disk" mechanism is similar to the ORDER BY improvement from Impala 1.4. For details, see SQL Operations that Spill to Disk.

  • Subquery enhancements:
    • Subqueries are now allowed in the WHERE clause, for example with the IN operator.
    • The EXISTS and NOT EXISTS operators are available. They are always used in conjunction with subqueries.
    • The IN and NOT IN queries can now operate on the result set from a subquery, not just a hardcoded list of values.
    • Uncorrelated subqueries let you compare against one or more values for equality, IN, and EXISTS comparisons. For example, you might use WHERE clauses such as WHERE column = (SELECT MAX(some_other_column FROM table) or WHERE column IN (SELECT some_other_column FROM table WHERE conditions).
    • Correlated subqueries let you cross-reference values from the outer query block and the subquery.
    • Scalar subqueries let you substitute the result of single-value aggregate functions such as MAX(), MIN(), COUNT(), or AVG(), where you would normally use a numeric value in a WHERE clause.

    For details about subqueries, see Subqueries in Impala SELECT Statements For information about new and improved operators, see EXISTS Operator and IN Operator.

  • Analytic functions such as RANK(), LAG(), LEAD(), and FIRST_VALUE() let you analyze sequences of rows with flexible ordering and grouping. Existing aggregate functions such as MAX(), SUM(), and COUNT() can also be used in an analytic context. See Impala Analytic Functions for details. See Impala Aggregate Functions for enhancements to existing aggregate functions.

  • New data types provide greater compatibility with source code from traditional database systems:

  • Security enhancements:

    The new security-related SQL statements work along with the Sentry authorization framework. See Enabling Sentry Authorization for Impala for details.

  • Impala can now read compressed text files compressed by gzip, bzip, or Snappy. These files do not require any special table settings to work in an Impala text table. Impala recognizes the compression type automatically based on file extensions of .gz, .bz2, and .snappy respectively. These types of compressed text files are intended for convenience with existing ETL pipelines. Their non-splittable nature means they are not optimal for high-performance parallel queries. See Using gzip, bzip2, or Snappy-Compressed Text Files for details.

  • Query hints can now use comment notation, /* +hint_name */ or -- +hint_name, at the same places in the query where the hints enclosed by [ ] are recognized. This enhancement makes it easier to reuse Impala queries on other database systems. See Optimizer Hints in Impala for details.

  • A new query option, QUERY_TIMEOUT_S, lets you specify a timeout period in seconds for individual queries.

    The working of the --idle_query_timeout configuration option is extended. If no QUERY_OPTION_S query option is in effect, --idle_query_timeout works the same as before, setting the timeout interval. When the QUERY_OPTION_S query option is specified, its maximum value is capped by the value of the --idle_query_timeout option.

    That is, the system administrator sets the default and maximum timeout through the --idle_query_timeout startup option, and then individual users or applications can set a lower timeout value if desired through the QUERY_TIMEOUT_S query option. See Setting Timeout Periods for Daemons, Queries, and Sessions and QUERY_TIMEOUT_S Query Option (CDH 5.2 or higher only) for details.

  • New functions VAR_SAMP() and VAR_POP() are aliases for the existing VARIANCE_SAMP() and VARIANCE_POP() functions.

  • A new date and time function, DATE_PART(), provides similar functionality to EXTRACT(). You can also call the EXTRACT() function using the SQL-99 syntax, EXTRACT(unit FROM timestamp). These enhancements simplify the porting process for date-related code from other systems. See Impala Date and Time Functions for details.

  • New approximation features provide a fast way to get results when absolute precision is not required:

    The APPX_MEDIAN() aggregate function produces an estimate for the median value of a column by using sampling. See APPX_MEDIAN Function for details.
  • Impala now supports a DECODE() function. This function works as a shorthand for a CASE() expression, and improves compatibility with SQL code containing vendor extensions. See Impala Conditional Functions for details.

  • The STDDEV(), STDDEV_POP(), STDDEV_SAMP(), VARIANCE(), VARIANCE_POP(), VARIANCE_SAMP(), and NDV() aggregate functions now all return DOUBLE results rather than STRING. Formerly, you were required to CAST() the result to a numeric type before using it in arithmetic operations.

  • The default settings for Parquet block size, and the associated PARQUET_FILE_SIZE query option, are changed. Now, Impala writes Parquet files with a size of 256 MB and an HDFS block size of 256 MB. Previously, Impala attempted to write Parquet files with a size of 1 GB and an HDFS block size of 1 GB. In practice, Impala used a conservative estimate of the disk space needed for each Parquet block, leading to files that were typically 512 MB anyway. Thus, this change will make the file size more accurate if you specify a value for the PARQUET_FILE_SIZE query option. It also reduces the amount of memory reserved during INSERT into Parquet tables, potentially avoiding out-of-memory errors and improving scalability when inserting data into Parquet tables.

  • Anti-joins are now supported, expressed using the LEFT ANTI JOIN and RIGHT ANTI JOIN clauses. These clauses returns results from one table that have no match in the other table. You might use this type of join in the same sorts of use cases as the NOT EXISTS and NOT IN operators. See Joins in Impala SELECT Statements for details.

  • The SET command in impala-shell has been promoted to a real SQL statement. You can now set query options such as PARQUET_FILE_SIZE, MEM_LIMIT, and SYNC_DDL within JDBC, ODBC, or any other kind of application that submits SQL without going through the impala-shell interpreter. See SET Statement for details.

  • The impala-shell interpreter now reads settings from an optional configuration file, named $HOME/.impalarc by default. See impala-shell Configuration File for details.

  • The library used for regular expression parsing has changed from Boost to Google RE2. This implementation change adds support for non-greedy matches using the .*? notation. This and other changes in the way regular expressions are interpreted means you might need to re-test queries that use functions such as regexp_extract() or regexp_replace(), or operators such as REGEXP or RLIKE. See Apache Impala Incompatible Changes and Limitations for those details.

New Features in Impala 1.4.0 / CDH 5.1.0

The following are the major new features in Impala 1.4.

  • The DECIMAL data type lets you store fixed-precision values, for working with currency or other fractional values where it is important to represent values exactly and avoid rounding errors. This feature includes enhancements to built-in functions, numeric literals, and arithmetic expressions. See DECIMAL Data Type for details.

  • On CDH 5, Impala can take advantage of the HDFS caching feature to "pin" entire tables or individual partitions in memory, to speed up queries on frequently accessed data and reduce the CPU overhead of memory-to-memory copying. When HDFS files are cached in memory, Impala can read the cached data without any disk reads, and without making an additional copy of the data in memory. Other Hadoop components that read the same data files also experience a performance benefit.

    For background information about HDFS caching in CDH, see the documentation for HDFS caching. For performance information about using this feature with Impala, see Using HDFS Caching with Impala (CDH 5.3 or higher only). For the SET CACHED and SET UNCACHED clauses that let you control cached table data through DDL statements, see CREATE TABLE Statement and ALTER TABLE Statement.

  • Impala can now use Sentry-based authorization based either on the original policy file, or on rules defined by GRANT and REVOKE statements issued through Hive. See Enabling Sentry Authorization for Impala for details.

  • For interoperability with Parquet files created through other Hadoop components, such as Pig or MapReduce jobs, you can create an Impala table that automatically sets up the column definitions based on the layout of an existing Parquet data file. See CREATE TABLE Statement for the syntax, and Creating Parquet Tables in Impala for usage information.

  • ORDER BY queries no longer require a LIMIT clause. If the size of the result set to be sorted exceeds the memory available to Impala, Impala uses a temporary work space on disk to perform the sort operation. See ORDER BY Clause for details.

  • LDAP connections can be secured through either SSL or TLS. See Enabling LDAP Authentication for Impala for details.

  • The following new built-in scalar and aggregate functions are available:

    • A new built-in function, EXTRACT(), returns one date or time field from a TIMESTAMP value. See Impala Date and Time Functions for details.

    • A new built-in function, TRUNC(), truncates date/time values to a particular granularity, such as year, month, day, hour, and so on. See Impala Date and Time Functions for details.

    • ADD_MONTHS() built-in function, an alias for the existing MONTHS_ADD() function. See Impala Date and Time Functions for details.

    • A new built-in function, ROUND(), rounds DECIMAL values to a specified number of fractional digits. See Impala Mathematical Functions for details.

    • Several built-in aggregate functions for computing properties for statistical distributions: STDDEV(), STDDEV_SAMP(), STDDEV_POP(), VARIANCE(), VARIANCE_SAMP(), and VARIANCE_POP(). See STDDEV, STDDEV_SAMP, STDDEV_POP Functions and VARIANCE, VARIANCE_SAMP, VARIANCE_POP, VAR_SAMP, VAR_POP Functions for details.

    • Several new built-in functions, such as MAX_INT(), MIN_SMALLINT(), and so on, let you conveniently check whether data values are in an expected range. You might be able to switch a column to a smaller type, saving memory during processing. See Impala Mathematical Functions for details.

    • New built-in functions, IS_INF() and IS_NAN(), check for the special values infinity and "not a number". These values could be specified as inf or nan in text data files, or be produced by certain arithmetic expressions. See Impala Mathematical Functions for details.

  • The SHOW PARTITIONS statement displays information about the structure of a partitioned table. See SHOW Statement for details.

  • New configuration options for the impalad daemon let you specify initial memory usage for all queries. The initial resource requests handled by Llama and YARN can be expanded later if needed, avoiding unnecessary over-allocation and reducing the chance of out-of-memory conditions. See Resource Management for Impala for details.

  • The Impala CREATE TABLE statement now has a STORED AS AVRO clause, allowing you to create Avro tables through Impala. See Using the Avro File Format with Impala Tables for details and examples.
  • New impalad configuration options let you fine-tune the calculations Impala makes to estimate resource requirements for each query. These options can help avoid problems due to overconsumption due to too-low estimates, or underutilization due to too-high estimates. See Resource Management for Impala for details.

  • A new SUMMARY command in the impala-shell interpreter provides a high-level summary of the work performed at each stage of the explain plan. The summary is also included in output from the PROFILE command. See impala-shell Command Reference and Using the SUMMARY Report for Performance Tuning for details.

  • Performance improvements for the COMPUTE STATS statement:

    • The NDV function is speeded up through native code generation.
    • Because the NULL count is not currently used by the Impala query planner, in Impala 1.4.0 and higher, COMPUTE STATS does not count the NULL values for each column. (The #Nulls field of the stats table is left as -1, signifying that the value is unknown.)

    See COMPUTE STATS Statement for general details about the COMPUTE STATS statement, and Table and Column Statistics for how to use the statistics to improve query performance.

  • Performance improvements for partition pruning. This feature reduces the time spent in query planning, for partitioned tables with thousands of partitions. Previously, Impala typically queried tables with up to approximately 3000 partitions. With the performance improvement in partition pruning, now Impala can comfortably handle tables with tens of thousands of partitions. See Partition Pruning for Queries for information about partition pruning.

  • The documentation provides additional guidance for planning tasks. See Planning for Impala Deployment.

  • The impala-shell interpreter now supports UTF-8 characters for input and output. You can control whether impala-shell ignores invalid Unicode code points through the --strict_unicode option. (Although this option is removed in Impala 2.0.)

New Features in Impala 1.3.2 / CDH 5.0.4

No new features. This point release is exclusively a bug fix release for the IMPALA-1019 issue related to HDFS caching.

New Features in Impala 1.3.1 / CDH 5.0.3

This point release is primarily a vehicle to deliver bug fixes. Any new features are minor changes resulting from fixes for performance, reliability, or usability issues.

Because 1.3.1 is the first 1.3.x release for CDH 4, if you are on CDH 4, also consult New Features in Impala 1.3.0 / CDH 5.0.0 for more features that are new to you.

  • A new impalad startup option, --insert_inherit_permissions, causes Impala INSERT statements to create each new partition with the same HDFS permissions as its parent directory. By default, INSERT statements create directories for new partitions using default HDFS permissions. See INSERT Statement for examples of INSERT statements for partitioned tables.

  • The SHOW FUNCTIONS statement now displays the return type of each function, in addition to the types of its arguments. See SHOW Statement for examples.

  • You can now specify the clause FIELDS TERMINATED BY '\0' with a CREATE TABLE statement to use text data files that use ASCII 0 (nul) characters as a delimiter. See Using Text Data Files with Impala Tables for details.

  • In Impala 1.3.1 and higher, the REGEXP and RLIKE operators now match a regular expression string that occurs anywhere inside the target string, the same as if the regular expression was enclosed on each side by .*. See REGEXP Operator for examples. Previously, these operators only succeeded when the regular expression matched the entire target string. This change improves compatibility with the regular expression support for popular database systems. There is no change to the behavior of the regexp_extract() and regexp_replace() built-in functions.

New Features in Impala 1.3.0 / CDH 5.0.0

  • The admission control feature lets you control and prioritize the volume and resource consumption of concurrent queries. This mechanism reduces spikes in resource usage, helping Impala to run alongside other kinds of workloads on a busy cluster. It also provides more user-friendly conflict resolution when multiple memory-intensive queries are submitted concurrently, avoiding resource contention that formerly resulted in out-of-memory errors. See Admission Control and Query Queuing for details.

  • Enhanced EXPLAIN plans provide more detail in an easier-to-read format. Now there are four levels of verbosity: the EXPLAIN_LEVEL option can be set from 0 (most concise) to 3 (most verbose). See EXPLAIN Statement for syntax and Understanding Impala Query Performance - EXPLAIN Plans and Query Profiles for usage information.

  • The TIMESTAMP data type accepts more kinds of input string formats through the UNIX_TIMESTAMP function, and produces more varieties of string formats through the FROM_UNIXTIME function. The documentation now also lists more functions for date arithmetic, used for adding and subtracting INTERVAL expressions from TIMESTAMP values. See Impala Date and Time Functions for details.

  • New conditional functions, NULLIF(), NULLIFZERO(), and ZEROIFNULL(), simplify porting SQL containing vendor extensions to Impala. See Impala Conditional Functions for details.

  • New utility function, CURRENT_DATABASE(). See Impala Miscellaneous Functions for details.

  • Integration with the YARN resource management framework. Only available in combination with CDH 5. This feature makes use of the underlying YARN service, plus an additional service (Llama) that coordinates requests to YARN for Impala resources, so that the Impala query only proceeds when all requested resources are available. See Resource Management for Impala for full details.

    On the Impala side, this feature involves some new startup options for the impalad daemon:

    • -enable_rm
    • -llama_host
    • -llama_port
    • -llama_callback_port
    • -cgroup_hierarchy_path

    For details of these startup options, see Modifying Impala Startup Options.

    This feature also involves several new or changed query options that you can set through the impala-shell interpreter and apply within a specific session:

    • MEM_LIMIT: the function of this existing option changes when Impala resource management is enabled.
    • REQUEST_POOL: a new option. (Renamed to RESOURCE_POOL in Impala 1.3.0.)
    • V_CPU_CORES: a new option.
    • RESERVATION_REQUEST_TIMEOUT: a new option.

    For details of these query options, see impala-shell Query Options for Resource Management.

New Features in Impala 1.2.4

  • On Impala startup, the metadata loading and synchronization mechanism has been improved and optimized, to give more responsiveness when starting Impala on a system with a large number of databases, tables, or partitions. The initial metadata loading happens in the background, allowing queries to be run before the entire process is finished. When a query refers to a table whose metadata is not yet loaded, the query waits until the metadata for that table is loaded, and the load operation for that table is prioritized to happen first.

  • Formerly, if you created a new table in Hive, you had to issue the INVALIDATE METADATA statement (with no table name) which was an expensive operation that reloaded metadata for all tables. Impala did not recognize the name of the Hive-created table, so you could not do INVALIDATE METADATA new_table to get the metadata for just that one table. Now, when you issue INVALIDATE METADATA table_name, Impala checks to see if that name represents a table created in Hive, and if so recognizes the new table and loads the metadata for it. Additionally, if the new table is in a database that was newly created in Hive, Impala also recognizes the new database.

  • If you issue INVALIDATE METADATA table_name and the table has been dropped through Hive, Impala will recognize that the table no longer exists.

  • New startup options let you control the parallelism of the metadata loading during startup for the catalogd daemon:

    • --load_catalog_in_background makes Impala load and cache metadata using background threads after startup. It is true by default. Previously, a system with a large number of databases, tables, or partitions could be unresponsive or even time out during startup.

    • --num_metadata_loading_threads determines how much parallelism Impala devotes to loading metadata in the background. The default is 16. You might increase this value for systems with huge numbers of databases, tables, or partitions. You might lower this value for busy systems that are CPU-constrained due to jobs from components other than Impala.

New Features in Impala 1.2.3

Impala 1.2.3 contains exactly the same feature set as Impala 1.2.2. Its only difference is one additional fix for compatibility with Parquet files generated outside of Impala by components such as Hive, Pig, or MapReduce.

New Features in Impala 1.2.2

Impala 1.2.2 includes new features for performance, security, and flexibility. The major enhancements over 1.2.1 are performance related, primarily for join queries.

New user-visible features include:

  • Join order optimizations. This highly valuable feature automatically distributes and parallelizes the work for a join query to minimize disk I/O and network traffic. The automatic optimization reduces the need to use query hints or to rewrite join queries with the tables in a specific order based on size or cardinality. The new COMPUTE STATS statement gathers statistical information about each table that is crucial for enabling the join optimizations. See Performance Considerations for Join Queries for details.

  • COMPUTE STATS statement to collect both table statistics and column statistics with a single statement. Intended to be more comprehensive, efficient, and reliable than the corresponding Hive ANALYZE TABLE statement, which collects statistics in multiple phases through MapReduce jobs. These statistics are important for query planning for join queries, queries on partitioned tables, and other types of data-intensive operations. For optimal planning of join queries, you need to collect statistics for each table involved in the join. See COMPUTE STATS Statement for details.

  • Reordering of tables in a join query can be overridden by the STRAIGHT_JOIN operator, allowing you to fine-tune the planning of the join query if necessary, by using the original technique of ordering the joined tables in descending order of size. See Overriding Join Reordering with STRAIGHT_JOIN for details.

  • The CROSS JOIN clause in the SELECT statement to allow Cartesian products in queries, that is, joins without an equality comparison between columns in both tables. Because such queries must be carefully checked to avoid accidental overconsumption of memory, you must use the CROSS JOIN operator to explicitly select this kind of join. See Cross Joins and Cartesian Products with the CROSS JOIN Operator for examples.

  • The ALTER TABLE statement has new clauses that let you fine-tune table statistics. You can use this technique as a less-expensive way to update specific statistics, in case the statistics become stale, or to experiment with the effects of different data distributions on query planning.

  • LDAP username/password authentication in JDBC/ODBC. See Enabling LDAP Authentication for Impala for details.

  • GROUP_CONCAT() aggregate function to concatenate column values across all rows of a result set.

  • The INSERT statement now accepts hints, [SHUFFLE] and [NOSHUFFLE], to influence the way work is redistributed during INSERT...SELECT operations. The hints are primarily useful for inserting into partitioned Parquet tables, where using the [SHUFFLE] hint can avoid problems due to memory consumption and simultaneous open files in HDFS, by collecting all the new data for each partition on a specific node.

  • Several built-in functions and operators are now overloaded for more numeric data types, to reduce the requirement to use CAST() for type coercion in INSERT statements. For example, the expression 2+2 in an INSERT statement formerly produced a BIGINT result, requiring a CAST() to be stored in an INT variable. Now, addition, subtraction, and multiplication only produce a result that is one step "bigger" than their arguments, and numeric and conditional functions can return SMALLINT, FLOAT, and other smaller types rather than always BIGINT or DOUBLE.

  • New fnv_hash() built-in function for constructing hashed values. See Impala Mathematical Functions for details.

  • The clause STORED AS PARQUET is accepted as an equivalent for STORED AS PARQUETFILE. This more concise form is recommended for new code.

Because Impala 1.2.2 builds on a number of features introduced in 1.2.1, if you are upgrading from an older 1.1.x release straight to 1.2.2, also review New Features in Impala 1.2.1 to see features such as the SHOW TABLE STATS and SHOW COLUMN STATS statements, and user-defined functions (UDFs).

New Features in Impala 1.2.1

Impala 1.2.1 includes new features for security, performance, and flexibility.

New user-visible features include:

  • SHOW TABLE STATS table_name and SHOW COLUMN STATS table_name statements, to verify that statistics are available and to see the values used during query planning.

  • CREATE TABLE AS SELECT syntax, to create a new table and transfer data into it in a single operation.

  • OFFSET clause, for use with the ORDER BY and LIMIT clauses to produce "paged" result sets such as items 1-10, then 11-20, and so on.

  • NULLS FIRST and NULLS LAST clauses to ensure consistent placement of NULL values in ORDER BY queries.

  • New built-in functions: least(), greatest(), initcap().

  • New aggregate function: ndv(), a fast alternative to COUNT(DISTINCT col) returning an approximate result.

  • The LIMIT clause can now accept a numeric expression as an argument, rather than only a literal constant.

  • The SHOW CREATE TABLE statement displays the end result of all the CREATE TABLE and ALTER TABLE statements for a particular table. You can use the output to produce a simplified setup script for a schema.

  • The --idle_query_timeout and --idle_session_timeout options for impalad control the time intervals after which idle queries are cancelled, and idle sessions expire. See Setting Timeout Periods for Daemons, Queries, and Sessions for details.

  • User-defined functions (UDFs). This feature lets you transform data in very flexible ways, which is important when using Impala as part of an ETL or ELT pipeline. Prior to Impala 1.2, using UDFs required switching into Hive. Impala 1.2 can run scalar UDFs and user-defined aggregate functions (UDAs). Impala can run high-performance functions written in C++, or you can reuse existing Hive functions written in Java.

    You create UDFs through the CREATE FUNCTION statement and drop them through the DROP FUNCTION statement. See User-Defined Functions (UDFs) for instructions about coding, building, and deploying UDFs, and CREATE FUNCTION Statement and DROP FUNCTION Statement for related SQL syntax.

  • A new service automatically propagates changes to table data and metadata made by one Impala node, sending the new or updated metadata to all the other Impala nodes. The automatic synchronization mechanism eliminates the need to use the INVALIDATE METADATA and REFRESH statements after issuing Impala statements such as CREATE TABLE, ALTER TABLE, DROP TABLE, INSERT, and LOAD DATA.

    For even more precise synchronization, you can enable the SYNC_DDL query option before issuing a DDL, INSERT, or LOAD DATA statement. This option causes the statement to wait, returning only after the catalog service has broadcast the applicable changes to all Impala nodes in the cluster.

    This service is implemented by the catalogd daemon. See The Impala Catalog Service for details.

  • CREATE TABLE ... AS SELECT syntax, to create a table and copy data into it in a single operation. See CREATE TABLE Statement for details.

  • The CREATE TABLE and ALTER TABLE statements have new clauses TBLPROPERTIES and WITH SERDEPROPERTIES. The TBLPROPERTIES clause lets you associate arbitrary items of metadata with a particular table as key-value pairs. The WITH SERDEPROPERTIES clause lets you specify the serializer/deserializer (SerDes) classes that read and write data for a table; although Impala does not make use of these properties, sometimes particular values are needed for Hive compatibility. See CREATE TABLE Statement and ALTER TABLE Statement for details.

  • Delegation support lets you authorize certain OS users associated with applications (for example, hue), to submit requests using the credentials of other users. Only available in combination with CDH 5. See Configuring Impala Delegation for Hue and BI Tools for details.

  • Enhancements to EXPLAIN output. In particular, when you enable the new EXPLAIN_LEVEL query option, the EXPLAIN and PROFILE statements produce more verbose output showing estimated resource requirements and whether table and column statistics are available for the applicable tables and columns. See EXPLAIN Statement for details.

  • SHOW CREATE TABLE summarizes the effects of the original CREATE TABLE statement and any subsequent ALTER TABLE statements, giving you a CREATE TABLE statement that will re-create the current structure and layout for a table.

  • The LIMIT clause for queries now accepts an arithmetic expression, in addition to numeric literals.

New Features in Impala 1.2.0 (Beta)

The Impala 1.2.0 beta includes new features for security, performance, and flexibility.

New user-visible features include:

  • User-defined functions (UDFs). This feature lets you transform data in very flexible ways, which is important when using Impala as part of an ETL or ELT pipeline. Prior to Impala 1.2, using UDFs required switching into Hive. Impala 1.2 can run scalar UDFs and user-defined aggregate functions (UDAs). Impala can run high-performance functions written in C++, or you can reuse existing Hive functions written in Java.

    You create UDFs through the CREATE FUNCTION statement and drop them through the DROP FUNCTION statement. See User-Defined Functions (UDFs) for instructions about coding, building, and deploying UDFs, and CREATE FUNCTION Statement and DROP FUNCTION Statement for related SQL syntax.

  • A new service automatically propagates changes to table data and metadata made by one Impala node, sending the new or updated metadata to all the other Impala nodes. The automatic synchronization mechanism eliminates the need to use the INVALIDATE METADATA and REFRESH statements after issuing Impala statements such as CREATE TABLE, ALTER TABLE, DROP TABLE, INSERT, and LOAD DATA.

    This service is implemented by the catalogd daemon. See The Impala Catalog Service for details.

  • Integration with the YARN resource management framework. Only available in combination with CDH 5. This feature makes use of the underlying YARN service, plus an additional service (Llama) that coordinates requests to YARN for Impala resources, so that the Impala query only proceeds when all requested resources are available. See Resource Management for Impala for full details.

    On the Impala side, this feature involves some new startup options for the impalad daemon:

    • -enable_rm
    • -llama_host
    • -llama_port
    • -llama_callback_port
    • -cgroup_hierarchy_path

    For details of these startup options, see Modifying Impala Startup Options.

    This feature also involves several new or changed query options that you can set through the impala-shell interpreter and apply within a specific session:

    • MEM_LIMIT: the function of this existing option changes when Impala resource management is enabled.
    • YARN_POOL: a new option. (Renamed to RESOURCE_POOL in Impala 1.3.0.)
    • V_CPU_CORES: a new option.
    • RESERVATION_REQUEST_TIMEOUT: a new option.

    For details of these query options, see impala-shell Query Options for Resource Management.

  • CREATE TABLE ... AS SELECT syntax, to create a table and copy data into it in a single operation. See CREATE TABLE Statement for details.

  • The CREATE TABLE and ALTER TABLE statements have a new TBLPROPERTIES clause that lets you associate arbitrary items of metadata with a particular table as key-value pairs. See CREATE TABLE Statement and ALTER TABLE Statement for details.

  • Delegation support lets you authorize certain OS users associated with applications (for example, hue), to submit requests using the credentials of other users. Only available in combination with CDH 5. See Configuring Impala Delegation for Hue and BI Tools for details.

  • Enhancements to EXPLAIN output. In particular, when you enable the new EXPLAIN_LEVEL query option, the EXPLAIN and PROFILE statements produce more verbose output showing estimated resource requirements and whether table and column statistics are available for the applicable tables and columns. See EXPLAIN Statement for details.

New Features in Impala 1.1.1

Impala 1.1.1 includes new features for security and stability.

New user-visible features include:

  • Additional security feature: auditing. New startup options for impalad let you capture information about Impala queries that succeed or are blocked due to insufficient privileges. To take full advantage of this feature with Cloudera Manager, upgrade to Cloudera Manager 4.7 or higher. For details, see Impala Security Overview.
  • Parquet data files generated by Impala 1.1.1 are now compatible with the Parquet support in Hive. See Apache Impala Incompatible Changes and Limitations for the procedure to update older Impala-created Parquet files to be compatible with the Hive Parquet support.
  • Additional improvements to stability and resource utilization for Impala queries.
  • Additional enhancements for compatibility with existing file formats.

New Features in Impala 1.1

Impala 1.1 includes new features for security, performance, and usability.

New user-visible features include:

  • Extensive new security features, built on top of the Sentry open source project. Impala now supports fine-grained authorization based on roles. A policy file determines which privileges on which schema objects (servers, databases, tables, and HDFS paths) are available to users based on their membership in groups. By assigning privileges for views, you can control access to table data at the column level. For details, see Impala Security Overview.
  • Impala 1.1 works with Cloudera Manager 4.6 or higher. To use Cloudera Manager to manage authorization for the Impala web UI (the web pages served from port 25000 by default), use Cloudera Manager 4.6.2 or higher.
  • Impala can now create, alter, drop, and query views. Views provide a flexible way to set up simple aliases for complex queries; hide query details from applications and users; and simplify maintenance as you rename or reorganize databases, tables, and columns. See the overview section Overview of Impala Views and the statements CREATE VIEW Statement, ALTER VIEW Statement, and DROP VIEW Statement.
  • Performance is improved through a number of automatic optimizations. Resource consumption is also reduced for Impala queries. These improvements apply broadly across all kinds of workloads and file formats. The major areas of performance enhancement include:
    • Improved disk and thread scheduling, which applies to all queries.
    • Improved hash join and aggregation performance, which applies to queries with large build tables or a large number of groups.
    • Dictionary encoding with Parquet, which applies to Parquet tables with short string columns.
    • Improved performance on systems with SSDs, which applies to all queries and file formats.
  • Some new built-in functions are implemented: translate() to substitute characters within strings, user() to check the login ID of the connected user.
  • The new WITH clause for SELECT statements lets you simplify complicated queries in a way similar to creating a view. The effects of the WITH clause only last for the duration of one query, unlike views, which are persistent schema objects that can be used by multiple sessions or applications. See WITH Clause.
  • An enhancement to DESCRIBE statement, DESCRIBE FORMATTED table_name, displays more detailed information about the table. This information includes the file format, location, delimiter, ownership, external or internal, creation and access times, and partitions. The information is returned as a result set that can be interpreted and used by a management or monitoring application. See DESCRIBE Statement.
  • You can now insert a subset of columns for a table, with other columns being left as all NULL values. Or you can specify the columns in any order in the destination table, rather than having to match the order of the corresponding columns in the source. VALUES clause. This feature is known as "column permutation". See INSERT Statement.
  • The new LOAD DATA statement lets you load data into a table directly from an HDFS data file. This technique lets you minimize the number of steps in your ETL process, and provides more flexibility. For example, you can bring data into an Impala table in one step. Formerly, you might have created an external table where the data files are not entirely under your control, or copied the data files to Impala data directories manually, or loaded the original data into one table and then used the INSERT statement to copy it to a new table with a different file format, partitioning scheme, and so on. See LOAD DATA Statement.
  • Improvements to Impala-HBase integration:
  • You can issue REFRESH as a SQL statement through any of the programming interfaces that Impala supports. REFRESH formerly had to be issued as a command through the impala-shell interpreter, and was not available through a JDBC or ODBC API call. As part of this change, the functionality of the REFRESH statement is divided between two statements. In Impala 1.1, REFRESH requires a table name argument and immediately reloads the metadata; the new INVALIDATE METADATA statement works the same as the Impala 1.0 REFRESH did: the table name argument is optional, and the metadata for one or all tables is marked as stale, but not actually reloaded until the table is queried. When you create a new table in the Hive shell or through a different Impala node, you must enter INVALIDATE METADATA with no table parameter before you can see the new table in impala-shell. See REFRESH Statement and INVALIDATE METADATA Statement.

New Features in Impala 1.0.1

The primary enhancements in Impala 1.0.1 are internal, for compatibility with the new Cloudera Manager 4.6 release. Try out the new Impala Query Monitoring feature in Cloudera Manager 4.6, which requires Impala 1.0.1.

New user-visible features include:

  • The VALUES clause lets you INSERT one or more rows using literals, function return values, or other expressions. For performance and scalability, you should still use INSERT ... SELECT for bringing large quantities of data into an Impala table. The VALUES clause is a convenient way to set up small tables, particularly for initial testing of SQL features that do not require large amounts of data. See VALUES Clause for details.
  • The -B and -o options of the impala-shell command can turn query results into delimited text files and store them in an output file. The plain text results are useful for using with other Hadoop components or Unix tools. In benchmark tests, it is also faster to produce plain rather than pretty-printed results, and write to a file rather than to the screen, giving a more accurate picture of the actual query time.
  • Several bug fixes. See Issues Fixed in the 1.0.1 Release for details.

New Features in Impala 1.0

This version has multiple performance improvements and adds the following functionality:

New Features in Version 0.7 of the Impala Beta Release

This version has multiple performance improvements and adds the following functionality:

  • Several bug fixes. See Issues Fixed in Version 0.7 of the Beta Release.
  • Support for the Parquet file format. For more information on file formats, see How Impala Works with Hadoop File Formats.
  • Added support for Avro.
  • Support for the memory limits. For more information, see the example on modifying memory limits in Modifying Impala Startup Options.
  • Bigger and faster joins through the addition of partitioned joins to the already supported broadcast joins.
  • Fully distributed aggregations.
  • Fully distributed top-n computation.
  • Support for creating and altering tables.
  • Support for GROUP BY with floats and doubles.

In this version, both CDH 4.1 and 4.2 are supported, but due to performance improvements added, we highly recommend you use CDH 4.2 or higher to see the full benefit. If you are using Cloudera Manager, version 4.5 is required.

New Features in Version 0.6 of the Impala Beta Release

  • Several bug fixes. See Issues Fixed in Version 0.6 of the Beta Release.
  • Added support for Impala on SUSE and Debian/Ubuntu. Impala is now supported on:
    • RHEL5.7/6.2 and Centos5.7/6.2
    • SUSE 11 with Service Pack 1 or higher
    • Ubuntu 10.04/12.04 and Debian 6.03
  • Cloudera Manager 4.5 and CDH 4.2 support Impala 0.6.
  • Support for the RCFile file format. For more information on file formats, see Understanding File Formats.

New Features in Version 0.5 of the Impala Beta Release

New Features in Version 0.4 of the Impala Beta Release

  • Several bug fixes. See Issues Fixed in Version 0.4 of the Beta Release.
  • Added support for Impala on RHEL5.7/Centos5.7. Impala is now supported on RHEL5.7/6.2 and Centos5.7/6.2.
  • Cloudera Manager 4.1.3 supports Impala 0.4.
  • The Impala debug webserver now has the ability to serve static files from ${IMPALA_HOME}/www. This can be disabled by setting --enable_webserver_doc_root=false on the command line. As a result, Impala now uses the Twitter Bootstrap library to style its debug webpages, and the /queries page now tracks the last 25 queries run by each Impala daemon.
  • Additional metrics available on the Impala Debug Webpage.

New Features in Version 0.3 of the Impala Beta Release

  • Several bug fixes. See Issues Fixed in Version 0.3 of the Beta Release.
  • The state-store-service binary has been renamed statestored.
  • The location of the Impala configuration files has changed from the /usr/lib/impala/conf directory to the /etc/impala/conf directory.

New Features in Version 0.2 of the Impala Beta Release

  • Several bug fixes. See Issues Fixed in Version 0.2 of the Beta Release.
  • Added Default Query Options Default query options override all default QueryOption values when starting impalad. The format is:
    -default_query_options='key=value;key=value'