New Features in CDH 6.3.0

OpenJDK 11 support for Cloudera Manager and CDH 6.3 and higher

You can now use OpenJDK 11 with Cloudera Enterprise 6.3.

When you install OpenJDK 11 on your cluster, most services use the G1 garbage collector (G1GC) by default, which may require tuning to avoid overcommitting memory. See Tuning JVM Garbage Collection.
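As an illustrative starting point only (the values are hypothetical and depend on your workload and available memory), G1GC heap and pause-time options can be adjusted through each service's Java configuration options, for example:

```
-Xms4g -Xmx4g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:G1ReservePercent=10
```

These are standard HotSpot flags; G1GC is already the JVM default in OpenJDK 11, so -XX:+UseG1GC is shown only for explicitness.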

OPSAPS-50993, OPSAPS-49390, OPSAPS-51643

Apache Accumulo

There are no notable new features in this release.

Apache Avro

There are no notable new features in this release.

Apache Crunch

There are no notable new features in this release.

Apache Flume

There are no notable new features in this release.

Apache Hadoop

Hadoop Common

There are no notable new features in this release.

HDFS

There are no notable new features in this release.

MapReduce

There are no notable new features in this release.

YARN

YARN Distributed Shell with File Localization

YARN distributed shell is a tool for testing YARN features. The file localization feature allows you to localize remote files that are specified on the command line to the launched containers.
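A sketch of a distributed shell invocation that localizes a file to the containers; the jar path and file URI are placeholders, and the -localize_files option name is an assumption that should be checked against the tool's usage output:

```shell
# Launch two containers that each run a command against a localized file
yarn org.apache.hadoop.yarn.applications.distributedshell.Client \
  -jar hadoop-yarn-applications-distributedshell.jar \
  -shell_command "cat config.json" \
  -localize_files hdfs:///tmp/config.json \
  -num_containers 2
```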

Queue Based Maximum Container Allocation Limit for Fair Scheduler

The yarn.scheduler.maximum-allocation-mb property allows you to limit the overall size of a container at the scheduler level. The maxContainerAllocation property sets the maximum resources at the queue level, expressed in the form "X mb, Y vcores" or "vcores=X, memory-mb=Y". If this queue-specific configuration is defined, it overrides the scheduler-level configuration for that particular queue. If the queue-based maximum allocation limit is not set, the scheduler-level setting is used.
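For illustration, a queue-level limit might be expressed in the fair-scheduler.xml allocation file as follows (the queue name and values are hypothetical):

```xml
<allocations>
  <queue name="analytics">
    <!-- Containers in this queue cannot exceed 8 GB / 4 vcores,
         overriding the scheduler-level maximum for this queue -->
    <maxContainerAllocation>8192 mb, 4 vcores</maxContainerAllocation>
  </queue>
</allocations>
```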

Apache HBase

There are no notable new features in this release.

Apache Hive / Hive on Spark / HCatalog

Apache Hive

There are no notable new features in this release.

Hive on Spark

There are no notable new features in this release.

HCatalog

There are no notable new features in this release.

Hue

There are no notable new features in this release.

Apache Impala

The following are some of the notable new features in this release of Impala.

Automatic Invalidate/Refresh Metadata

With automatic metadata management enabled, you no longer have to issue INVALIDATE or REFRESH statements in a number of situations. In CDH 6.3, the following additional event in the Hive Metastore can trigger an automatic INVALIDATE or REFRESH of metadata:

  • INSERT into tables and partitions from Impala or from Spark, in both same-cluster and multi-cluster configurations.

This is a preview feature in CDH 6.3 and is disabled by default.

See Impala Metadata Management for the information and steps to enable the Zero Touch Metadata feature.

Data Cache for Remote Reads

To improve performance on multi-cluster HDFS environments as well as on object store environments, Impala now caches data for non-local reads (e.g. S3, ABFS, ADLS) on local storage.

This is a preview feature in CDH 6.3 and is disabled by default.

The data cache is enabled with the --data_cache startup flag.
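For example, the flag takes a list of cache directories and a capacity quota; the paths and quota below are placeholders:

```
--data_cache=/mnt/ssd0/impala-cache,/mnt/ssd1/impala-cache:500GB
```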

See Impala Remote Data Cache for the information and steps to enable remote data cache.

Query Profile

The following information was added to the Query Profile output for better monitoring and troubleshooting of query performance.

  • Network I/O throughput

  • System disk I/O throughput

See Impala Query Profile for information on generating and reading query profiles.

Support for Kudu integrated with Hive Metastore

In CDH 6.3, Kudu is integrated with Hive Metastore (HMS), and from Impala, you can create, update, delete, and query the tables in the Kudu services integrated with HMS.

See Using Kudu with Impala for information on using Kudu tables in Impala.

See Using the Hive Metastore with Kudu for upgrading existing tables.

Support for zstd compression for Parquet files

Zstandard (Zstd) is a real-time compression algorithm offering a tradeoff between compression speed and compression ratio. Compression levels from 1 up to 22 are supported. The lower the level, the faster the compression, at the cost of compression ratio.
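As a sketch, writing zstd-compressed Parquet files from Impala might look like the following; the table names are hypothetical, and the codec and level are set through the COMPRESSION_CODEC query option:

```sql
-- Write Parquet data files compressed with zstd at level 12
SET COMPRESSION_CODEC=ZSTD:12;
CREATE TABLE sales_zstd STORED AS PARQUET AS SELECT * FROM sales;
```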

Apache Kafka

The following are some of the notable new features in the CDH 6.3.0 release of Kafka.

Rebase on Apache Kafka 2.2.1

The Kafka version in CDH 6.3.0 is based on Apache Kafka 2.2.1. For upstream release notes, see Apache Kafka version 2.2.0 and 2.2.1 release notes.

Kafka Topics Tool Able to Connect Directly to Brokers

The kafka-topics command line tool can now connect directly to brokers with the --bootstrap-server option instead of going through ZooKeeper. The old --zookeeper option is still available for now. For more information, see KIP-377.
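For example, listing topics with the new broker connection versus the legacy ZooKeeper connection (host names are placeholders):

```shell
# New: connect directly to a broker
kafka-topics --bootstrap-server broker1.example.com:9092 --list

# Old: connect through ZooKeeper (still available for now)
kafka-topics --zookeeper zk1.example.com:2181 --list
```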

Apache Kudu

The following are some of the notable new features in this release of Kudu:

  • Kudu supports both full and incremental table backups via a job implemented using Apache Spark. Additionally, it supports restoring tables from full and incremental backups via a restore job implemented using Apache Spark. See the backup documentation for more details.

  • Kudu can now synchronize its internal catalog with the Apache Hive Metastore, by automatically updating Hive Metastore table entries upon table creation, deletion, and alterations in Kudu. See the HMS synchronization documentation for more details.

  • Kudu also supports native, fine-grained authorization via integration with Apache Sentry. Kudu may now enforce access control policies defined for the Kudu tables and columns, as well as policies defined on the Hive servers and databases that may store the Kudu tables. See the authorization documentation for more details.

  • Kudu’s web UI now supports SPNEGO, a protocol for securing HTTP requests with Kerberos by passing negotiation through the HTTP headers. To enable authentication using SPNEGO, set the --webserver_require_spnego command line flag.

  • Column comments can now be stored in the Kudu tables, and can be updated using the AlterTable API.

  • The Java scan token builder can now create multiple tokens per tablet. To use this functionality, call setSplitSizeBytes() to specify how many bytes of data each token should scan. The same API is also available in Kudu’s Spark integration, where it can be used to spawn multiple Spark tasks per scanned tablet.

  • Apache Kudu now has an experimental Kubernetes StatefulSet manifest and Helm chart which can be used to define and provision Kudu clusters using Kubernetes.

  • The Kudu CLI now has a rudimentary, YAML-based configuration file support, which can be used to provide cluster connection information via cluster name instead of keying in comma-separated lists of master addresses. See the cluster name documentation for more details.

  • The kudu perf table_scan command scans a table and displays the table’s row count as well as the time it took to run the scan.

  • The kudu table copy command copies data from one table to another, within the same cluster or across clusters. Note that this implementation leverages a single client, and therefore, it may not be suitable for large tables.

  • The tablet history retention time can now be configured on a table-by-table basis.
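The two CLI additions above can be sketched as follows; the master addresses and table names are placeholders, and the -dst_table flag spelling is an assumption that should be verified against the kudu table copy usage output:

```shell
# Display a table's row count and the time the scan took
kudu perf table_scan master1:7051,master2:7051,master3:7051 my_table

# Copy a table to another cluster (single-client; may be slow for large tables)
kudu table copy master1:7051 my_table dest-master1:7051 -dst_table=my_table_copy
```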

The following are some of the notable optimizations and improvements in this release of Kudu:

  • The performance of mutations (i.e. UPDATE, DELETE, and re-INSERT) to not-yet-flushed Kudu data has been significantly optimized.

  • Predicate performance for primitive columns has been optimized.

  • IS NULL and IS NOT NULL predicate performance has been optimized.

  • The performance of fetching tablet locations from the master for tables with large numbers of partitions has been optimized. This can improve the performance of short-running Spark or Impala queries, as well as user applications which make use of short-lived client instances.

  • The tableExists() (Java) and TableExists() (C++) APIs are more performant.

  • Fault tolerant scans are much more performant and they consume far less memory.

  • kudu cluster ksck now sends more requests in parallel. This improves the speed when running against clusters with many tables, or when there is a high latency between the node running the CLI and the cluster nodes.

  • Kudu’s block manager now deletes the spent block containers when needed instead of just at server startup. This reduces the server startup time.

  • DNS resolutions are now cached by the Kudu masters, the tablet servers, and the Kudu C++ clients. By default, the time-to-live (TTL) for a resolved DNS entry in the cache is 15 seconds.

  • Tables created in Kudu 1.10.0 or later will show their creation time as well as their last alteration time in the web UI.

  • The Kudu CLI and the C++ client now support overriding the local username using the ‘KUDU_USER_NAME’ environment variable. This enables you to operate against a Kudu cluster using an identity which differs from the local Unix user on the client. Note that this has no effect on secure clusters, where client identity is determined by Kerberos authentication.

  • The Kudu C++ client now performs a stricter verification on the input data of the INSERT and the UPSERT operations with respect to the table schema constraints. This helps in spotting the schema violations before sending the data to a tablet server.

  • The KuduScanner class in the Java client is now iterable. Additionally, the KuduScannerIterator automatically sends keep-alive requests so that scanners do not time out while iterating.

  • A KuduPartitioner API has been added to the Java client. The KuduPartitioner API allows a client to determine which partition a row falls into without actually writing that row. For example, the KuduPartitioner is used in the Spark integration to optionally repartition and pre-sort the data before writing to Kudu.

  • The PartialRow and the RowResult Java APIs have new methods that accept and return Java Objects. These methods are useful when you don’t care about autoboxing and your existing type handling logic is based on Java types. See the javadoc for more details.

  • The Kudu Java client now logs RPC trace summaries instead of full RPC traces when the log level is INFO or higher. This reduces the log noise and makes the RPC issues visible in a more compact format.

  • The Kudu servers now display the time at which they were started in their web UIs.

  • The Kudu tablet servers now display a table’s total column count in the web UI.

  • The /metrics web UI endpoint now supports filtering data by entity types, entity IDs, entity attributes, and metric names. This can be used to collect important metrics more efficiently when there is a large number of tablets on a tablet server.

  • The Kudu rebalancer now accepts the --ignored_tservers command line argument, which can be used to ignore the health status of specific tablet servers (i.e. if they are down) when deciding whether or not it is safe to rebalance the cluster.

  • The kudu master list command now displays the Raft consensus role (a LEADER or a FOLLOWER) of each master in the cluster.

  • The kudu table scan command no longer interleaves its output, and it projects all columns by default, without requiring you to manually list the column names.

  • The kudu perf loadgen command now supports creating empty tables. The semantics of the special value of 0 for the --num_rows_per_thread flag has changed. A value of 0 now indicates that no rows should be generated, and -1 indicates that there should be no limit to the number of rows generated.

  • Running the make install command after building Kudu from the source will now install the Kudu binaries into appropriate locations.
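Two of the CLI improvements above, sketched with placeholder master addresses (flag and variable names as described in the notes):

```shell
# Operate against a non-secure cluster under a different local identity
KUDU_USER_NAME=admin kudu table list master1:7051

# Create the loadgen table without generating any rows
kudu perf loadgen master1:7051 --num_rows_per_thread=0
```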

Apache Oozie

There are no notable new features in this release.

Apache Parquet

There are no notable new features in this release.

Apache Pig

There are no notable new features in this release.

Cloudera Search

There are no notable new features in this release.

Apache Sentry

There are no notable new features in this release.

Apache Spark

There are no notable new features in this release.

Apache Sqoop

There are no notable new features in this release.

Apache ZooKeeper

There are no notable new features in this release.