What's New in CDH 5.13.x

The following sections describe new features introduced in CDH 5.13.0.

Apache Hive / Hive-on-Spark / HCatalog

  • Support for dynamic partition pruning for map joins in Hive on Spark. Dynamic partition pruning (DPP) is a database optimization that can significantly decrease the amount of data a query scans, making queries run faster. It is disabled by default, but can be enabled by setting the hive.spark.dynamic.partition.pruning.map.join.only property to true. When enabled, DPP is triggered only for queries where the join on the partitioned column is a map join. A short sketch at the end of this section shows how to enable it. For details, see Dynamic Partition Pruning for Hive Map Joins.

  • Sentry supports Hive metastore high availability. In CDH 5.13 and above, you can use Sentry with Hive metastore high availability, with or without Sentry high availability. For information about the high availability architecture and steps to set up Sentry high availability, see Sentry High Availability.

  • Apache Pig now supports writing partitioned Hive tables in the Parquet format using HCatalog. For details, see Using HCatalog to Write to Parquet Hive Tables with Pig.
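
As a minimal sketch of how dynamic partition pruning for map joins might be enabled in practice (the table, column, and literal names below are hypothetical), the property is set for the session and a partitioned fact table is joined to a small dimension table:

    -- Enable DPP for map joins only (the feature is off by default).
    SET hive.spark.dynamic.partition.pruning.map.join.only=true;

    -- Hypothetical tables: sales is partitioned on sale_date, and dim_date is
    -- small enough to be broadcast as a map join. With DPP enabled, only the
    -- sales partitions matching the filtered dim_date rows are scanned.
    SELECT s.item_id, s.amount
    FROM sales s
    JOIN dim_date d ON (s.sale_date = d.date_key)
    WHERE d.fiscal_quarter = 'Q3';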

Apache Impala

The following are some of the most significant new Impala features in this release:

  • Improvements to memory management through the use of a buffer pool. This mechanism allows queries to use less memory, reserves the required memory during query startup, and reduces the frequency of out-of-memory errors. It makes query planning and memory estimation more accurate, so that if a query begins executing it is unlikely to encounter an out-of-memory error partway through. The memory buffer used during spill-to-disk processing is smaller: instead of 8 MiB, this buffer defaults to 2 MiB and Impala can reduce it to as little as 64 KiB when appropriate.

    This feature includes new query options for fine-tuning the memory areas used during query processing: MIN_SPILLABLE_BUFFER_SIZE, DEFAULT_SPILLABLE_BUFFER_SIZE, MAX_ROW_SIZE, and BUFFER_POOL_LIMIT. A sketch at the end of this section shows how these options can be set.

  • Improvements to the mechanism for caching HDFS file handles. This caching mechanism improves the performance and scalability of queries that access the same file multiple times, for example to retrieve different columns from a Parquet file. Caching the file handle across open() calls reduces the load on the HDFS NameNode.

    This feature is currently disabled by default. It is enabled by setting a non-zero value for the max_cached_file_handles configuration setting. Currently, ETL processes that append to existing HDFS files or overwrite HDFS files in place can interact with this setting in a way that turns off short-circuit reads for some impalad hosts. See HDFS-12528 for tracking information.

  • A new command in impala-shell, rerun or its abbreviation @, lets you re-execute previous commands based on their numbering in the history output.

  • You can specify the minimum required TLS/SSL version using the --ssl_minimum_version setting, for example --ssl_minimum_version=tlsv1.2.

  • You can specify the set of allowed TLS ciphers using the --ssl_cipher_list configuration setting. See the output of man ciphers for the full set of keywords and notation allowed in the argument string.

  • New or enhanced built-in functions (illustrated in a sketch at the end of this section):

    • trunc() can now apply to numeric types (FLOAT, DOUBLE, and DECIMAL) in addition to TIMESTAMP. Although this functionality was already available through the truncate() function, the new signatures for trunc() make it easier to port code from other popular database systems to Impala.

    • A new date/time function utc_timestamp() provides a simple way to get a stable, interoperable representation of a TIMESTAMP value without using a chain of functions to convert between representations and apply a specific timezone.

  • The CREATE TABLE LIKE PARQUET statement can now handle Parquet files produced outside of Impala that contain ENUM types. The ENUM columns become STRING columns in the target table, and the ENUM values are turned into the corresponding STRING values. A brief example appears in a sketch at the end of this section.

  • Kudu enhancements:

    • You can now create a Kudu table without using a PARTITION BY clause. Kudu automatically creates a single partition to cover the entire possible range of values. This feature is intended for small lookup tables, which are typically read with a full table scan and therefore do not benefit from the overhead of partitioning. A sketch at the end of this section shows such a table.

    • More granular Sentry authorization for Kudu tables. Kudu tables can now use column-level privileges, and the SELECT and INSERT statements now require only the corresponding SELECT and INSERT privileges. Other Kudu statements still require the ALL privilege. Example GRANT statements appear in a sketch at the end of this section.

    • The ALTER TABLE statement can modify a number of storage attributes for the columns of Kudu tables. You can use the ALTER COLUMN clause of ALTER TABLE along with the SET keyword to change the DEFAULT, BLOCK_SIZE, ENCODING, and COMPRESSION attributes, and the DROP DEFAULT clause to remove the default value from a column. Examples appear in a sketch at the end of this section.

  • For non-Kudu tables, you can use the ALTER TABLE syntax ALTER COLUMN col SET COMMENT 'text' to change the comment for an individual column. This clause is also shown in the sketch at the end of this section.
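
The following sketch shows how the new memory-related query options might be adjusted in an impala-shell session before running a memory-intensive query. The option names come from this release, but the values are arbitrary examples rather than recommendations:

    -- Cap the amount of buffer pool memory this query can reserve.
    SET BUFFER_POOL_LIMIT=1gb;
    -- Allow rows of up to 1 MB, for tables with very wide rows.
    SET MAX_ROW_SIZE=1mb;
    -- Tune the buffer sizes used for spill-to-disk processing.
    SET DEFAULT_SPILLABLE_BUFFER_SIZE=2mb;
    SET MIN_SPILLABLE_BUFFER_SIZE=64kb;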
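
The next sketch illustrates the new built-in function signatures; the literal values are arbitrary:

    -- trunc() on numeric values, mirroring the existing truncate() function:
    SELECT trunc(3.789);     -- 3
    SELECT trunc(3.789, 2);  -- 3.78

    -- utc_timestamp() returns the current time in UTC as a TIMESTAMP:
    SELECT utc_timestamp();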
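
A CREATE TABLE LIKE PARQUET statement pointed at a data file written by an external tool might look like the following; the HDFS path and table name are hypothetical, and any ENUM columns in the file become STRING columns in the new table:

    -- Derive the table schema from an externally produced Parquet file
    -- (hypothetical path); ENUM columns map to STRING columns.
    CREATE TABLE external_events
      LIKE PARQUET '/user/etl/staging/events.parquet'
      STORED AS PARQUET;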
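
For the Kudu enhancements, a sketch of an unpartitioned lookup table and of the finer-grained privileges might look like this; the table, column, and role names are hypothetical:

    -- Small lookup table with no PARTITION BY clause; Kudu creates a single
    -- partition covering the entire possible range of values.
    CREATE TABLE country_codes (
      code STRING PRIMARY KEY,
      name STRING
    )
    STORED AS KUDU;

    -- Hypothetical roles: SELECT and INSERT on a Kudu table no longer need
    -- the ALL privilege, and SELECT can be granted at the column level.
    GRANT SELECT(name) ON TABLE country_codes TO ROLE analyst_role;
    GRANT INSERT ON TABLE country_codes TO ROLE etl_role;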
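
Finally, the new ALTER TABLE column clauses might be used as follows; the table and column names are hypothetical and the attribute values are arbitrary:

    -- Kudu table: adjust storage attributes one column at a time.
    ALTER TABLE kudu_metrics ALTER COLUMN payload SET BLOCK_SIZE 65536;
    ALTER TABLE kudu_metrics ALTER COLUMN payload SET COMPRESSION LZ4;
    ALTER TABLE kudu_metrics ALTER COLUMN status  SET ENCODING DICT_ENCODING;
    ALTER TABLE kudu_metrics ALTER COLUMN status  SET DEFAULT 'unknown';
    ALTER TABLE kudu_metrics ALTER COLUMN status  DROP DEFAULT;

    -- Non-Kudu table: change the comment on an individual column.
    ALTER TABLE web_logs ALTER COLUMN referrer SET COMMENT 'Referring URL';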

Apache Kudu

Starting with Apache Kudu 1.5.0 / CDH 5.13.x, Kudu is fully integrated into CDH. Kudu now ships as part of the CDH parcel and packages, and the Kudu documentation has been incorporated into the Cloudera Enterprise documentation.

For a complete list of new features and changes introduced in Kudu (in CDH 5.13), see What's New in Apache Kudu.

Apache Pig

Pig-related changes in this release are described under Apache Hive / Hive-on-Spark / HCatalog above: Pig can now write partitioned Hive tables in the Parquet format through HCatalog.

Apache Sentry

  • High availability for Sentry provides automatic failover in the event that your primary Sentry host goes down or is unavailable. In CDH 5.13.0, you can have two Sentry hosts. For information about how to configure high availability for Sentry, see Sentry High Availability.
  • In the Hive Metastore Access Control and Proxy User Groups Override setting, you must add Sentry to the list of groups that the Hive user can impersonate. See Configuring the Sentry Service for more information about Sentry service configuration.
  • Sentry supports Hive metastore high availability.