New Features in CDH 6.0.0

Apache Accumulo

Running Apache Accumulo on top of a CDH 6.0.0 cluster is not currently supported. If you try to upgrade to CDH 6.0.0, you will be asked to remove the Accumulo service from your cluster. Running Accumulo on top of CDH 6 will be supported in a future release.

Apache Avro

The following are new features from the upstream release Avro 1.8.2 that are available in the CDH 6.0.0 release. For more information on Avro 1.8.2, see Apache Avro 1.8.2 Documentation.
  • AVRO-834: Data File corruption recovery tool
  • AVRO-1502: Avro objects should implement Serializable
  • AVRO-1684: Add date, time, and timestamp to specific object model classes
  • AVRO-1402: Support for DECIMAL type
  • AVRO-1439: MultipleInputs equivalent for Avro MR

Apache Crunch

There are no notable new features in this release.

Apache Flume

Apache Flume contains the following new features:
  • CDH 6.0.0 uses HBase 2.0.
  • HBaseSink is replaced with HBase2Sink. For more information about Flume HBase2Sink changes in CDH 6.0.0, see Incompatible Changes in CDH 6.0.0.

Serializer Class Names

The HBase serializer classes have been renamed to change Hbase to HBase and to include the version number.

The following class names have been updated:
  • org.apache.flume.sink.hbase2.HBase2EventSerializer (legacy name: org.apache.flume.sink.hbase.HbaseEventSerializer)
  • org.apache.flume.sink.hbase2.SimpleHBase2EventSerializer (legacy name: org.apache.flume.sink.hbase.SimpleHbaseEventSerializer)
  • org.apache.flume.sink.hbase2.RegexHBase2EventSerializer (legacy name: org.apache.flume.sink.hbase.RegexHbaseEventSerializer)
  • org.apache.flume.sink.hbase2.HBase2Sink (legacy name: org.apache.flume.sink.hbase.HBaseSink)

In CDH, the legacy names have been kept for backward compatibility and are aliases to the new names. Existing configurations will work without changes, but new configurations should use the HBase 2.x serializer class names. In the upstream Apache Flume version, the legacy names refer to HBase 1.x classes and the new names refer to HBase 2.x classes.

For example, the following configuration contains the new class name and will use HBase2Sink for CDH and upstream Apache Flume:

host1.sinks.sink1.type = org.apache.flume.sink.hbase2.HBase2Sink # or hbase2
host1.sinks.sink1.serializer = org.apache.flume.sink.hbase2.SimpleHBase2EventSerializer

However, the following configuration, which contains the legacy name, will use HBase2Sink for CDH, but will use the old HBaseSink for upstream Apache Flume:

host1.sinks.sink1.type = org.apache.flume.sink.hbase.HBaseSink # or hbase
host1.sinks.sink1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer

Apache Hadoop

Hadoop Common

As part of the Hadoop 3 rebase, CDH 6.0.0 supports the updated version of the Hadoop shell discussed in HADOOP-9902. This introduces several incompatible changes, as noted in HADOOP-9902. Customers using the Hadoop shell may need to adjust their scripts.
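
For example, the per-daemon wrapper scripts have been folded into the main commands. The following is a brief sketch of the reworked daemon syntax (the daemons shown are only illustrative, and site-specific options are omitted):

# Old form: sbin/hadoop-daemon.sh start namenode
hdfs --daemon start namenode

# Old form: sbin/yarn-daemon.sh start resourcemanager
yarn --daemon start resourcemanager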

For more information, see HADOOP-9902 and the Apache Hadoop shell documentation.

HDFS

There are no notable new features in this release.

MapReduce

There are no notable new features in this release.

YARN

YARN JARs

YARN jobs running MapReduce2 are now configured to read MapReduce JARs from HDFS instead of from local disk. This makes jobs more robust during rolling upgrades when the local binaries are modified while a job is executing. Clusters created in or upgrading to CDH 6.0 will use this new behavior.

Apache HBase

CDH 6.0.0 includes most of the features introduced in HBase 2.0 and ensures that they are fully compatible with other CDH components. However, HBase 2.0 may include features that are not supported by Cloudera as part of CDH 6.0.0.

Supportability

CDH 6.0.0 contains a significant number of enhancements designed to make HBase easier to configure and to support.

Assignment Manager

CDH 6.0.0 has some significant architectural changes to the assignment manager. The assignment manager is a component in the HBase master that manages region-to-region-server assignment and ensures that any one region replica is assigned to just one region server. The overall goal is to make HBase more resilient by providing a unified place within HBase to handle some tricky scenarios related to region assignment, including:

  • Multi step procedures with rollback and rollforward capabilities in case of failure.
  • Notifications across multiple region servers.
  • Coordination of long-running and/or computationally expensive procedures.
  • Procedures that require precise coordination across multiple machines (e.g., snapshots).

These changes should eliminate a number of issues in earlier versions of HBase commonly referred to as "regions stuck in transition."

Note that HBCK is read-only in CDH 6.0.0; HBASE-18792 is listed in the CDH HBase Incompatible Changes.

Replication

Replication peers can be defined for entire namespaces instead of for each individual table. For clusters with hundreds or thousands of tables, defining replication configuration settings for each table is cumbersome and error prone, so being able to define the configuration at the namespace level is a significant improvement (a shell sketch follows the list below).

  • HBASE-16447: Replication by namespaces config in peer; Set a namespace in peer config means that all tables in this namespace will be replicated.
  • HBASE-19293: Support adding a new replication peer in disabled state.
  • HBASE-17314: (Replication) Limit total buffered size for all replication sources - Adds a configuration, "replication.total.buffer.quota", to limit the total size of buffered entries in all replication peers. This prevents the server from running out of memory when there are many peers. The default value is 256MB.
  • HBASE-17296: Peer level throttling for replication - Adds a bandwidth upper limit to ReplicationPeerConfig and a new shell command, set_peer_bandwidth, to update the bandwidth when needed.
  • HBASE-16466: HBase snapshots supported by VerifyReplication tool - verifyrep can now compare source table snapshot against peer table snapshot which reduces load on RS by reading data from HDFS directly using Snapshot scanners.
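
The following is a brief sketch of how these features surface in the HBase shell; the peer ID, ZooKeeper quorum, namespace, and bandwidth values are placeholders:

# Replicate every table in namespace "ns1" to the peer cluster (HBASE-16447).
add_peer '1', CLUSTER_KEY => "zk1,zk2,zk3:2181:/hbase", NAMESPACES => ["ns1"]

# Cap this peer's replication bandwidth at 2 MB/s (HBASE-17296).
set_peer_bandwidth '1', 2097152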

Balancer Improvements

Periodically, the HBase master will look at load across region servers and attempt to balance it out by reassigning regions. The handling of this cluster maintenance has improved in CDH 6.0.0:

  • HBASE-18164: Much faster stochastic load balancer; Improvements in locality cost function and candidate generator
  • HBASE-17178: Region balancing throttling - New config to protect availability of cluster by specifying maximum number of regions (in percentage) that can be in transition at any time.
  • HBASE-14309: Added a boolean parameter, force, to the 'balancer' command so that an admin can force region balancing even when there are regions (other than hbase:meta) in transition.

Shell

HBase’s command line shell has been enhanced with several new commands that allow for easier cluster administration. New commands allow for administrators to better understand the current compaction state of a table and to clear the compaction queue if needed. There is also a new command to display metadata associated with a region.

New shell commands:
  • HBASE-17928: Clear compaction queues : clear_compaction_queues.
  • HBASE-16147: Get compaction state of a table : compaction_state.
  • HBASE-14925: Displaying the table's region info : list_regions.
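
For example (the table and region server names below are placeholders):

# Check the compaction state of a table (HBASE-16147).
compaction_state 'my_table'

# Clear the compaction queues on a specific region server (HBASE-17928).
clear_compaction_queues 'host1.example.com,16020,1536094567890'

# Display the region metadata of a table (HBASE-14925).
list_regions 'my_table'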

Major Compaction Tool

HBase deployments with a heavy write workload often disable HBase's automatic handling of major compactions in order to reduce I/O during peak cluster load. Operators then have to ensure that major compactions happen during off-hours, and HBase now ships a tool that makes this easier.

  • HBASE-19528: Major Compaction Tool - Tool allows you to compact a cluster with given concurrency of regionservers compacting at a given time.

Metrics

In CDH 6.0.0, there are a significant number of new metrics available. Additionally, client side metrics are now collected by default. Some examples include:

  • Block cache metrics for the primary region replica instead of lumping metrics for the primary, secondary, and tertiary replicas.
  • Per table metrics that describe the memstore, storefiles, and region size.
  • Metrics that track assignment manager responsibilities, such as merging and splitting regions, assigning and unassigning regions to region servers, and the state information available in the event of a server crash.
  • Latency metrics for checkAndPut, checkAndDelete, putBatch, and deleteBatch.
  • Counts for hedgedReads and hedgedReadWins.
  • Latency histograms on a per-region-server basis.
  • Metrics for tracking coprocessor usage.
New metrics:
  • HBASE-14314: 3 new Block cache metrics for primary region replica.
  • HBASE-15671: Per-table metrics on memstore, storefile and regionsize.
  • HBASE-15518: Per-table metrics aggregated from per-region metrics in region server metrics.
  • HBASE-16549: Few AM metrics.
  • HBASE-18374: Latency metrics for checkAndPut, checkAndDelete, putBatch and deleteBatch.
  • HBASE-12220: hedgedReads and hedgedReadWins counts.
  • HBASE-19285: Per-RegionServer table latency histograms have been returned to HBase.
Metrics changes:
  • HBASE-15943: New "Process Metrics" tab that dumps mbean -- mostly JVM -- metrics.
  • HBASE-14583: Client side metrics are enabled by default - Sets the default value of hbase.client.metrics.enable=true.

Spark 2 Integration

Spark integration for accessing data in HBase has been updated to work with the Spark 2 version that ships with CDH 6.x. It should provide equivalent functionality to the CDH 5.x integration with Spark 1.6.

Improved Performance

CDH 6.0.0 has an extensive set of enhancements aimed at improving overall performance. The following sections describe some of the improvements, but this list is not exhaustive.

Off-Heap Write Cache

CDH 6.0.0 includes the ability to define an off-heap write cache. Off-heap caches allow HBase to manage memory directly instead of delegating memory management to Java. HBase has much more context on how to safely reclaim the memory it is using than the Java garbage collection process does, which allows HBase to manage memory usage more efficiently. Off-heap caching allows HBase to use more memory than would be practical if delegating the responsibility to Java, where “stop the world” garbage collection can cause large spikes in latency. One consequence of allowing significantly more write-side cache memory through an off-heap cache is an increase in the density of regions per region server.

Off-Heap Read Path

It is now possible to configure HBase so that data remains off-heap along the entire read path, avoiding copies of the data onto the Java heap. With this feature enabled, cached data has latency similar to on-heap caching in earlier versions of HBase, but without putting GC pressure on the JVM. For more information on configuring and running a deployment with an off-heap read path, see the Apache HBase reference guide section on the Offheap Read Path.
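
As a minimal configuration sketch, an off-heap bucket cache can be enabled in hbase-site.xml along the following lines; the 8 GB size is only an illustrative value, and the region server must also be granted enough direct memory (for example, through HBASE_OFFHEAPSIZE):

<property>
     <name>hbase.bucketcache.ioengine</name>
     <value>offheap</value>
</property>
<property>
     <name>hbase.bucketcache.size</name>
     <value>8192</value>
     <description>Bucket cache size in MB (illustrative value)</description>
</property>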

Improved Cleaning of Old Files

CDH 6.0.0 has an option to allow HFileCleaner to use multiple threads. HFileCleaner is a background process that removes files containing dereferenced data. This new configuration is useful when a system is under substantial write load to ensure that clean up operations complete in a timely manner to prevent the archive directory from ballooning in size.

  • HBASE-18083: Configure HFileCleaner to use multiple threads for large/small (archived) hfile cleaning.
  • HBASE-17215: Separate small/large file delete threads in HFileCleaner to accelerate archived hfile cleanup speed.
  • HBASE-18309: CleanerChore (for cleaning up HFiles and old log files) now supports multiple threads.

Streaming Scan API

An additional performance enhancement in CDH 6.0.0 is the introduction of a “streaming scan” API. This API can improve the performance of intensive workloads by allowing multiple chunks of data to be prefetched concurrently. Previously, prefetch operations were synchronous; that is, they required the previous prefetch operation to complete. Synchronous prefetches can result in a “stop-and-wait” access pattern, where an application has finished processing its previous chunk of data and is now waiting for the next chunk of data to arrive. Prefetching a second chunk of data before the first is entirely processed helps eliminate the “stop-and-wait” access pattern.

  • HBASE-13071: A pipelined scan API for speeding up applications that combine massive data traversal with compute-intensive processing.

New RPC Server

A new RPC server based on Netty4 is included in CDH 6.0.0. This RPC server option can increase the performance of workloads with large amounts of random reads and writes. This new RPC server option is on by default.

  • HBASE-17263, HBASE-19323: NettyRpcServer- A new RPC server based on Netty4 which can improve random read (get) performance. It is also the default RPC server replacing SimpleRpcServer.
  • HBASE-15136: New RPC scheduler - CoDel - The purpose is to prevent long standing call queues caused by discrepancy between request rate and available throughput, caused by kernel/disk IO/networking stalls.
Note:
  • HBASE-15212: New configuration to limit RPC request size to protect the server against very large incoming RPC requests. All requests larger than this size will be immediately rejected before allocating any resources.

  • Split/Merge have moved to the Master; it runs them now. RegionServer hooks are now noops. To intercept Split/Merge phases, CPs need to intercept on MasterObserver.
  • HBASE-16414: Improve performance for RPC encryption with Apache Common Crypto.
  • HBASE-16023: Fastpath for the FIFO rpcscheduler - Will shine best when high random read workload (YCSB workloadc for instance).
  • HBASE-15994: Allow selection of RpcSchedulers.

Improved MapReduce over Snapshots

HBase provides users the ability to run map/reduce jobs over HBase snapshots. Snapshots are immutable, and by running map/reduce jobs directly against them, clients can avoid the performance overhead of going through the region servers. In prior HBase versions, exactly one map task was created per region. In CDH 6.0.0, you can assign multiple mappers to a particular region. This change can further improve the performance of map/reduce jobs over HBase snapshots and eliminates the need to design schemas optimized for the prior limitation.

  • HBASE-18090: TableSnapshotInputFormat now supports multiple mappers per region.

Heterogeneous Storage Management

HDFS has supported heterogeneous storage management (HSM) for several releases. This feature allows users to specify that certain HDFS files or directories should reside on specific physical media. In CDH 6.0.0, users can do something very similar at the logical, column-family level. Currently supported storage policies include ALL_SSD/ONE_SSD/HOT/WARM/COLD.

Note that HBase has supported setting storage policies on WALs since CDH 5.7.

  • HBASE-14061: Column Family-level Storage Policy - Currently supported storage policies include ALL_SSD/ONE_SSD/HOT/WARM/COLD
  • HBASE-15172: Support setting storage policy in bulkload

Client Robustness Against Slow Region Servers

Clients can now optionally elect to handle slow region servers in application code. By setting the configuration value hbase.client.perserver.requests.threshold, an application can control how many concurrent requests to a single region server are allowed. After reaching this threshold, the HBase client code throws a ServerTooBusyException rather than attempting to establish an additional connection. If an HBase cluster has a small number of region servers that are having trouble, this allows an application to react in some way rather than simply waiting for the server to eventually respond. Because using this feature requires application-side changes to handle retry logic, you must opt in to it; by default, hbase.client.perserver.requests.threshold is set to a practically unlimited value.

  • HBASE-16388: Prevent client threads being blocked by only one slow region server - Added a new configuration to limit the maximum number of concurrent requests to one region server.
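
A minimal client-side configuration sketch for hbase-site.xml (the threshold value is only illustrative):

<property>
     <name>hbase.client.perserver.requests.threshold</name>
     <value>100</value>
     <description>Maximum concurrent requests allowed to a single region server</description>
</property>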

Maven Archetypes

  • HBASE-14877: Maven archetype: client application - Introduces a new infrastructure for the creation and maintenance of Maven archetypes in the context of the HBase project, and it also introduces the first archetype, which end users may use to generate a simple hbase-client-dependent project.

JIRAs

CDH 6.0.0 contains the following new features:
  • HBASE-18519: Use builder pattern to create cell
  • HBASE-13259: mmap() based BucketCache IOEngine - Can be configured using the property hbase.bucketcache.ioengine.
  • HBASE-14247: Separate the old WALs into different regionserver directories - Disabled by default. Can be enabled by using the configuration hbase.separate.oldlogdir.by.regionserver.
  • HBASE-12706: Support multiple port numbers in ZK quorum string - The hbase.zookeeper.quorum configuration now allows specifying servers together with client ports, consistent with the way the ZooKeeper Java client accepts the quorum string. In this case, using hbase.zookeeper.clientPort is not needed. For example: hbase.zookeeper.quorum=myserver1:2181,myserver2:20000,myserver3:31111.
  • HBASE-15806: A coprocessor service based export tool - org.apache.hadoop.hbase.coprocessor.Export.
  • HBASE-15536: Changed default WALProvider to AsyncFSWALProvider.
  • HBASE-18533: New configurations for BucketCache.
  • HBASE-17583: Add inclusive/exclusive support for startRow and endRow of scan for sync client.
  • HBASE-16672: Added option for bulk load to always copy hfile(s) instead of renaming.
  • HBASE-19336: Two new RSGroup commands to operate on namespace level - move_namespaces_rsgroup and move_servers_namespaces_rsgroup.
  • HBASE-17437: Support WAL directory outside of the root directory - Useful for cloud deployments because it allows keeping WALs on a faster network drive than hfiles, giving better latency and throughput.
  • HBASE-16213: New data block encoding for fast random gets.
  • HBASE-15265: Asynchronous WALs.
  • HBASE-14969: Throughput controller for flush - Added a means of throttling flush throughput. By default there is no limit (NoLimitThroughputController). An alternative controller, PressureAwareFlushThroughputController, allows specifying throughput bounds. See the PressureAwareFlushThroughputController class for details.
  • HBASE-15187: Added CSRF prevention filter to REST gateway - Disabled by default, can be enabled by using the configuration hbase.rest.csrf.enabled.
  • HBASE-15711: Added client side property to provide more logging details for batch errors.
  • HBASE-15511: New ClusterStatus.Option field to limit the scope of response.
  • HBASE-15576: Scanning cursor to prevent blocking long time on ResultScanner.next().
  • HBASE-15921: New AsyncTable client.
  • HBASE-11344: Hide row keys from the web UIs
  • HBASE-3462: UI pages for splitting/merging now operate by taking a row key prefix from the user rather than a full region name.
  • HBASE-8410: Basic quota support for namespaces - Namespace auditor provides basic quota support for namespaces in terms of number of tables and number of regions.

Apache Hive / Hive on Spark / HCatalog

Apache Hive

Query Vectorization Support for Parquet Files

By default, the Hive query execution engine processes one row of a table at a time. The single row of data goes through all the operators in the query before the next row is processed, resulting in very inefficient CPU usage. In vectorized query execution, data rows are batched together and represented as a set of column vectors. The query engine then processes these vectors of columns, which greatly reduces CPU usage for typical query operations like scans, filters, aggregates, and joins.

Hive query vectorization is enabled by setting the hive.vectorized.execution.enabled property to true. In both CDH 5 and CDH 6, this property is set to true by default. However, in CDH 5, vectorized query execution in Hive is only possible on ORC-formatted tables, which Cloudera recommends that you do not use for overall compatibility with the CDH platform. Instead, Cloudera recommends that you use tables in the Parquet format, because that format is supported by all CDH components. In CDH 6, query vectorization is supported for Parquet tables in Hive.
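
For example, the following session sketch (the table and column names are placeholders) shows the property together with a typical scan/filter/aggregate query that can be vectorized when the table is stored as Parquet:

SET hive.vectorized.execution.enabled=true;

SELECT col1, COUNT(*) AS cnt
FROM parquet_table
WHERE col2 > 100
GROUP BY col1;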

For more information, see Query Vectorization for Apache Hive in CDH and Apache Parquet Tables with Hive in CDH.

Support for UNION DISTINCT

Support has been added for the UNION DISTINCT clause in Hive. See HIVE-9039 and the Apache wiki for more details. This feature introduces the following incompatible changes to Hive (a brief SQL sketch follows the list):

  • Behavior in CDH 5:

    • SORT BY, CLUSTER BY, ORDER BY, LIMIT, and DISTRIBUTE BY can be specified without delineating parentheses either before a UNION ALL clause or at the end of the query, resulting in the following behaviors:

      • When specified before, these clauses are applied to the query before UNION ALL is applied.
      • When specified at the end of the query, these clauses are applied to the query after UNION ALL is applied.
    • The UNION clause is equivalent to UNION ALL, in which no duplicates are removed.
  • Behavior in CDH 6:

    • SORT BY, CLUSTER BY, ORDER BY, LIMIT, and DISTRIBUTE BY can be specified without delineating parentheses only at the end of the query, resulting in the following behaviors:

      • These clauses are applied to the entire query.
      • Specifying these clauses before the UNION ALL clause results in a parsing error.
    • The UNION clause is equivalent to UNION DISTINCT, in which all duplicates are removed.
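
The following sketch illustrates the CDH 6 behavior, assuming two tables t1 and t2 that share a compatible column id:

-- UNION now removes duplicates, the same as UNION DISTINCT:
SELECT id FROM t1
UNION
SELECT id FROM t2;

-- To keep duplicates, specify UNION ALL explicitly:
SELECT id FROM t1
UNION ALL
SELECT id FROM t2;

-- ORDER BY and LIMIT at the end of the query apply to the entire union:
SELECT id FROM t1
UNION ALL
SELECT id FROM t2
ORDER BY id
LIMIT 10;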

Support for NULLS FIRST/NULLS LAST

Support has been added for NULLS FIRST and NULLS LAST options. These options can be used to determine whether null values appear before or after non-null data values when the ORDER BY clause is used. Hive follows the SQL:2003 standard for this feature, but the SQL standard does not specify the behavior by default. By default in Hive, null values are sorted as if lower than non-null values. This means that NULLS FIRST is the default behavior for ASC order, and NULLS LAST is the default behavior for DESC order. See Syntax of Order By on the Apache Hive wiki and HIVE-12994 for further details.

Here are some usage examples:

SELECT x.* FROM table1 x ORDER BY a ASC NULLS FIRST;
SELECT x.* FROM table1 x ORDER BY a ASC NULLS LAST;
        

Added Support for Windowing and Analytics Functions

Support for the following has been added to CDH 6.0:

  • Using DISTINCT with windowing functions. See HIVE-9534 for details.
  • Support for ORDER BY and a windowing clause when DISTINCT is used in a partitioning clause. See HIVE-13453 for details.
  • Support to reference aggregate functions within the OVER clause. See HIVE-13475 for details.

For further details, see the Apache Language Manual on Windowing and Analytics.
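
As a brief sketch (the table and column names are placeholders):

-- DISTINCT inside a windowing function (HIVE-9534):
SELECT deptno, COUNT(DISTINCT job) OVER (PARTITION BY deptno) AS distinct_jobs
FROM emp;

-- An aggregate function referenced inside the OVER clause (HIVE-13475):
SELECT deptno, RANK() OVER (ORDER BY SUM(salary)) AS salary_rank
FROM emp
GROUP BY deptno;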

Table or Partition Statistics Editing

Support has been added for editing the statistics information that is stored for a table or a partition. For example, you can run the following statement to set the number of rows for a table to 1000:

ALTER TABLE table1 UPDATE STATISTICS SET ('numRows'='1000');
        

For more information, see HIVE-12730 and the Apache wiki.

SHOW CREATE DATABASE Support

Support has been added for the SHOW CREATE DATABASE command, which prints out the DDL statement that was used to create a database:

SHOW CREATE DATABASE database1;
        

For more information, see HIVE-11706

Support for Multiple-Column IN Clause

Support has been added so that the IN clause in queries operates over multiple column references. The new syntax is shown in the following example:

CREATE TABLE test (col1 int, col2 int);
INSERT INTO TABLE test VALUES (1, 1), (1, 2), (2, 1), (2, 2);
SELECT * FROM test;
+------------+------------+
| test.col1  | test.col2  |
+------------+------------+
| 1          | 1          |
| 1          | 2          |
| 2          | 1          |
| 2          | 2          |
+------------+------------+
SELECT * FROM test WHERE (col1, col2) IN ((1, 1));
+------------+------------+
| test.col1  | test.col2  |
+------------+------------+
| 1          | 1          |
+------------+------------+
        

For more information, see HIVE-11600

Support for More Hive Functions

Support has been added for the following Hive UDFs:

bround, chr, factorial, floor_day, floor_hours, floor_minute, floor_month, floor_quarter,
floor_second, floor_week, floor_year, grouping, mask, mask_first_n, mask_hash, mask_last_n,
mask_show_first_n, mask_show_last_n, quarter, replace, sha1, sha, shiftleft, shiftright,
shiftrightunsigned, substring_index

All built-in Hive functions can be listed with the command SHOW FUNCTIONS; and a short description that explains what a function does is returned by the command DESCRIBE FUNCTION <function_name>;. For more information about Hive functions, see the Apache wiki and Managing UDFs in the Cloudera enterprise documentation.

Hive on Spark

Dynamic RDD Caching for Hive on Spark

An optimization has been added to Hive on Spark that enables automatic caching of reused RDDs (Resilient Distributed Datasets). This optimization can improve query performance when the query or sub-query must scan a table multiple times. For example, TPC-DS query 39 is a query that requires multiple table scans. This optimization is disabled by default in CDH 6.0, but can be enabled by setting the hive.combine.equivalent.work.optimization property to true in the hive-site.xml file.

To configure this property in Cloudera Manager:

  1. In the Admin Console, select the Hive service.
  2. Click the Configuration tab.
  3. Search for the HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml.
  4. Enter the following property configuration information:

    • Name: hive.combine.equivalent.work.optimization
    • Value: true
    • Description: Enables dynamic RDD caching for HoS

    To disable this configuration, set the Value field to false.

    To set this configuration property in the XML editor, enter the following code:

    <property>
         <name>hive.combine.equivalent.work.optimization</name>
         <value>true</value>
         <description>Enables dynamic RDD caching for HoS</description>
    </property>
                   
  5. Click Save Changes, and restart the service.

For more information see HIVE-10844 and HIVE-10550.

Optimized Hash Tables Enabled for Hive on Spark

Support has been added for optimized hash tables for Hive on Spark to reduce memory overhead. This feature is enabled by default in CDH 6.0, but can be disabled by setting the hive.mapjoin.optimized.hashtable property to false in the hive-site.xml file. To configure this property in Cloudera Manager:

  1. In the Admin Console, select the Hive service.
  2. Click the Configuration tab.
  3. Search for the HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml.
  4. Enter the following property configuration information:

    • Name: hive.mapjoin.optimized.hashtable
    • Value: false
    • Description: Disables optimized hash tables for HoS

    To enable this configuration, set the Value field to true.

    To set this configuration property in the XML editor, enter the following code:

    <property>
         <name>hive.mapjoin.optimized.hashtable</name>
         <value>false</value>
         <description>Disables optimized hash tables for HoS</description>
    </property>
                   
  5. Click Save Changes, and restart the service.

For more details, see HIVE-11182 and HIVE-6430.

Hue

There are no notable new features in this release.

Apache Impala

The following are some of the significant new features in this release of Impala.

List of Reserved Words Updated with SQL:2016 Reserved Words

The list of reserved words in Impala was updated in CDH 6.0 to be in line with the SQL:2016 standard.

See Impala Reserved Words for the updated list of reserved words.
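
If an existing table or column name collides with a newly reserved word, it can be quoted with backticks. For example, assuming a hypothetical table with a column named position:

SELECT `position`, COUNT(*) FROM game_stats GROUP BY `position`;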

Decimal V2 Used by Default

Impala supports two different behaviors for the DECIMAL type. In CDH 6.0, the V2 behavior of DECIMAL is used by default. See DECIMAL Type for detailed information.
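
The behavior is controlled per session through the DECIMAL_V2 query option; for example, a session can temporarily fall back to the old behavior while migrating:

SET DECIMAL_V2=FALSE;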

Behavior of Column Aliases Changed

To conform to the SQL standard, Impala no longer performs alias substitution in the subexpressions of GROUP BY, HAVING, and ORDER BY. See Overview of Impala Aliases for detailed information.
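
As a brief sketch (assuming a hypothetical table t1 with column int_col), an alias can still be used on its own, but an alias embedded in a larger expression must be replaced with the underlying expression:

-- Still valid: the alias is used by itself.
SELECT int_col / 2 AS half, COUNT(*) FROM t1 GROUP BY half;

-- No longer substituted: the alias appears inside a subexpression,
-- so repeat the underlying expression instead.
SELECT int_col / 2 AS half, COUNT(*) FROM t1 GROUP BY (int_col / 2) + 1;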

Default PARQUET_ARRAY_RESOLUTION Changed

The default value of the PARQUET_ARRAY_RESOLUTION query option was changed to THREE_LEVEL in CDH 6.0.

The PARQUET_ARRAY_RESOLUTION setting controls the path-resolution behavior for Parquet files with nested arrays. See PARQUET_ARRAY_RESOLUTION Query Option for information about the option.

Clustering Hint for Inserts Enabled by Default

In CDH 6.0, the CLUSTERED hint is enabled by default. The hint adds a local sort by the partitioning columns to the query plan.

The CLUSTERED hint is effective only for HDFS and Kudu tables.

Apache Kafka

The following are some of the notable new features in the Kafka release included in CDH 6.0.0.

Kafka Bundled with CDH

As of CDH 6.0.0, Apache Kafka is now included in the CDH bundle. You can choose to install Kafka from the list of custom services or you can add Kafka later to an existing cluster.

The Kafka version in CDH 6.0.0 is based on Apache Kafka 1.0.1.

Cloudera documentation for Apache Kafka in CDH 6.0.0 is part of the CDH documentation; see the Kafka Guide.

New Metrics

Base Metrics

  • Global Topic Count
  • In-Sync Replica Counts
    • Under Min ISR Partition Count
    • Under Min ISR

Broker Metrics

  • Health Checks:
    • Error rates
    • Message conversion rate and time
    • Request size and temporary memory size
    • Authentication success and failure rates
    • ZooKeeper status and latency

Security Improvements

This release includes improved diagnostics for SASL and SSL authentication failures as described in KIP-152.

This release includes an upgrade to the PrincipalBuilder interface that improves SSL authentication and extends the functionality to SASL authentication. See KAFKA-5783 for details.

Log Context Added to Logger Messages

This release introduces a LogContext object that automatically adds a log prefix to every message written by loggers constructed from it.

  • The consumer log messages now contain the consumer group and client IDs, which is very helpful when multiple consumers are run on the same instance.
  • The producer log messages are automatically prefixed with client ID and transactional ID.
  • The NetworkClient log messages now include client IDs.

Default Behavior Changes

  • KAFKA-3356 - Removed ConsumerOffsetChecker, which was deprecated in Apache Kafka version 0.9.
  • KAFKA-5384 - KIP-162: Enabled topic deletion by default. This release changes the default of delete.topic.enable to true.
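
For example, deployments that rely on the old behavior can set the property back in the broker configuration (server.properties, or the corresponding Cloudera Manager safety valve):

delete.topic.enable=false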

Apache Kudu

The following are some of the notable new features in this release of Kudu 1.6 / CDH 6.0.
  • Cloudera Manager now supports installing Kudu as part of the Express Wizard installation.
  • Cloudera Manager has improved monitoring of Kudu node health. Cloudera Manager will indicate whether the health of the service is concerning or bad according to the number of master or tablet server roles that are not running.
  • The disk failure tolerance feature is enabled by default on tablet servers, and the capability was extended to handle data directory failures at runtime. In the event of a runtime disk failure, any tablets with data on a failed disk will be shut down and restarted on another tablet server. By default, tablets are striped across all available disks. Note that the first configured data directory and the WAL directory cannot currently tolerate disk failures. This will be further improved in future Kudu releases.

  • Kudu servers can now adopt new data directories via the kudu fs update_dirs tool. The new directory will be used by new tablet replicas only. Removing directories is not yet supported (KUDU-2202).

  • Kudu servers have two new flags to control the webui TLS/HTTPS settings: --webserver_tls_ciphers and --webserver_tls_min_protocol. These flags allow the advertised TLS ciphers and TLS protocol versions to be configured. Additionally, the webserver now excludes insecure legacy ciphers by default (KUDU-2190).

  • The strategy Kudu uses for automatically healing tablets that have lost a replica has been improved. When a tablet loses a replica due to a server or disk failure, the new re-replication strategy, or replica management scheme, first adds a replacement tablet replica before evicting the failed one. With the previous replica management scheme, the system first evicts the failed replica and then adds a replacement. The new replica management scheme allows for much faster recovery of tablets in scenarios where a tablet server goes down and then comes back after about 5 minutes. The new scheme also provides substantially better overall stability on clusters with frequent server failures (KUDU-1097).

The following are the notable optimizations and improvements made in Kudu.

  • Kudu servers can now tolerate short interruptions in NTP clock synchronization. NTP synchronization is still required when any Kudu daemon starts up. If NTP synchronization is not available, diagnostic information is now logged to help pinpoint the issue (KUDU-1578).

  • Tablet server startup time has been improved significantly on servers containing large numbers of blocks.

  • The log block manager now performs disk data deletion in batches. This optimization can significantly reduce the time taken to delete data on a tablet.

  • The usage of the sensitive data redaction flag has changed. Setting the --redact=log flag disables redaction in the web UI but retains it for server logs. Alternatively, --redact=none can be used to disable redaction completely.

  • The Spark DataSource integration now can take advantage of scan locality for better scan performance. The scan will take place at the closest replica instead of going to the leader.

  • Various optimizations were made to reduce the 99th percentile latency of writes on tablet servers. This can also improve throughput on certain write workloads, particularly on larger clusters.

  • You can configure Kudu to ignore the system-wide auth_to_local mappings configured in /etc/krb5.conf by setting the configuration flag --use_system_auth_to_local=false (KUDU-2198).

  • The performance of the compaction scheduler has been improved. In previous versions, certain types of time series workloads could cause compaction scheduling to take tens of seconds. These workloads now schedule compactions an order of magnitude more efficiently.

  • The compaction scheduler has been improved to avoid running a compaction when the benefit of that compaction is extremely small.

  • Tablet servers now consider the health of all replicas of a tablet before deciding to evict one. This can improve stability of the Kudu cluster after experiencing multiple simultaneous daemon failures (KUDU-2048).

  • Several performance improvements have been made to the Kudu master, particularly in concurrency of clients opening tables. This should improve performance in highly concurrent workloads.

  • The on-disk size metric for a tablet now includes all data and metadata. Previously, it excluded WAL segments and consensus metadata (KUDU-1755).

  • Added a verbose mode for the Kudu cluster ksck command to enable output of detailed information on the cluster’s metadata, even when no errors are detected.

Apache Oozie

The following new features and improvements are available in this release of Oozie:
  • Major revamp of the Hadoop components: Hadoop 3, HBase 2, Hive 2, Spark 2
  • Actions are now launched via YARN (instead of MapReduce)
  • Switch from Tomcat 6 (end-of-life) to Jetty 9
  • Local ShareLib improvements:
    • New option to use local paths in the ShareLib instead of HDFS paths
    • Usability improvement: users don’t have to upload .jar files to HDFS
    • After first-time cluster deployment, the high volume of HDFS round trips for container resource localization is reduced
    • Improved version management
    • Automatic updating of .jar files on cluster machines
  • Supportability improvements:
    • Transient database outage resiliency
    • Spark action wrap-up:
      • Extended parameter parsing
      • Spark 2 support
    • Database migration tool can cope with large databases
    • Integration test framework improvements
    • Diagnostic bundle collector
    • Coordinator improvements

Apache Parquet

The new features and improvements in Apache Parquet versions 1.6.0 through 1.9.0 are included in the CDH 6.0 release. For the complete list of new features and enhancements that are available upstream in Parquet, see Apache Parquet Release Notes.

In addition to the contents of the upstream Parquet 1.9.0 release, Parquet in CDH 6.0 includes new features and bug fixes that were added upstream after that release. The following are the noteworthy new features added on top of Parquet 1.9.0.

PARQUET-386: parquet-tools Prints Column Metadata Statistics

When you run parquet-tools with schema as an argument, the tool prints the statistics data of the columns.

See Using the Parquet File Format with Impala Tables for using parquet-tools to examine the structure and data of Parquet files.

PARQUET-1025: New min-max Statistics in parquet-mr

The new min-max statistics are supported in parquet-mr. The old min and max statistics were only applicable to integer and floating-point types. Additionally, the new min_value and max_value fields allow efficient filtering of decimal and string values as well. You can read more about these new statistics fields in the Cloudera Engineering blog post.

PARQUET-1115: Warning Message in parquet-tools merge

To prevent users from using the parquet-tools merge command inappropriately, a warning was added to the help text of the parquet-tools command and to the output of the parquet-tools merge command if the sizes of the original row groups are below a threshold.

Many users are tempted to use the new parquet-tools merge functionality because they want to achieve good performance, which historically has been associated with large Parquet files. However, in practice Hive performance does not change significantly after using parquet-tools merge, and Impala performance becomes much worse. The reason is that good performance is the result of large row groups (up to the HDFS block size), not of large files.

However, parquet-tools merge does not merge rowgroups, it just places them one after the other. It was intended to be used for Parquet files that are already arranged in row groups of the desired size. When used to merge many small files, the resulting file will still contain small row groups and one loses most of the advantages of larger files (the only one that remains is that it takes a single HDFS operation to read them).

Additional Features and Enhancements

The additional new features and enhancements added to CDH 6.0 are:

  • PARQUET-321: Default maximum block padding to 8MB.
  • PARQUET-801: Allow UserDefinedPredicates in DictionaryFilter.
  • PARQUET-822: Upgrade the Java dependencies.
  • PARQUET-1026: Allow unsigned binary stats when the min value equals the max value.
  • PARQUET-1170: String values are interpreted based on their logical type and are represented properly in tools and logs.

Apache Pig

There are no notable new features in this release.

Cloudera Search

Cloudera Search in CDH 6.0 is rebased on Apache Solr 7.0, which has added many new features since the 4.10 version of Apache Solr used in recent CDH 5 releases.

For information on the new features added in Solr 5, Solr 6, and Solr 7, see the upstream release notes:

Solr JSON Request API

You can now create search queries in JSON format, which is much easier to read and write than the previous Solr query syntax.
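
As a minimal sketch (the collection name, field names, and URL are placeholders):

curl -H 'Content-Type: application/json' http://localhost:8983/solr/techproducts/query -d '
{
  "query": "name:memory",
  "filter": "inStock:true",
  "limit": 5
}'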

More information:

JSON Facet API

You can now write facet queries in JSON format. Facet structure is inherently nested, and is much easier to navigate in JSON format.
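
For example, a terms facet with a nested aggregation might look like the following sketch (the collection and field names are placeholders):

curl http://localhost:8983/solr/techproducts/query -d '
{
  "query": "*:*",
  "facet": {
    "top_categories": {
      "type": "terms",
      "field": "cat",
      "limit": 3,
      "facet": { "avg_price": "avg(price)" }
    }
  }
}'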

More information:

Nested Faceting

You can now switch the facet domain between parents and children for the purposes of calculating facets.

More information:

Per-Collection clusterstate.json

In CDH 5, Solr stores the states of all cores and collections in a single file in ZooKeeper, named clusterstate.json. This creates a single point of failure and a point of contention in large clusters with many actively changing collections. In CDH 6, Solr stores core state in a separate clusterstate.json file for each collection.

More information:

Improved Integration with Apache Sentry

Cloudera Search in CDH 6 supports pluggable authorization modules, including Apache Sentry. The Sentry privilege model for Cloudera Search has been improved to support more granular control over privileges.

More information:

HyperLogLog Based Distributed Cardinality

This is a faster (but less accurate) way to calculate the number of distinct values on high cardinality fields. This implementation uses less memory than the previous one.

More information:

Improved Support for DocValues

More information:

Distributed IDF

In CDH 5, Solr calculates TF-IDF at the shard level. This results in misleading relevancy when TF-IDF varies a lot among shards. In CDH 6, Solr can use global TF-IDF statistics.

More information:

Jetty 9

In CDH 6, Solr runs inside a Jetty container instead of Tomcat.

Apache Sentry

There are no notable new features in this release.

Apache Spark

The following list describes what's new and changed in Apache Spark for CDH 6. Because CDH 6 includes features from Spark 2.x, this list includes the features across several releases of the CDS Powered By Apache Spark parcel, which was available separately for CDH 5.

  • More flexibility to interpret TIMESTAMP values written by Impala. Setting the spark.sql.parquet.int96TimestampConversion configuration setting to true makes Spark interpret TIMESTAMP values, when reading from Parquet files written by Impala, without applying any adjustment from the UTC to the local time zone of the server. This behavior provides better interoperability for Parquet data written by Impala, which does not apply any time zone adjustment to TIMESTAMP values when reading or writing them.
  • The Spark "blacklisting" feature is now turned on by default. This feature takes hosts out of the pool of executors when they exhibit unreliable behavior, even if those hosts are not completely down. See the Blacklisting in Apache Spark blog post for details.
  • Hive On Spark now runs on top of Apache Spark 2.x.
  • Spark On HBase now runs on top of Apache Spark 2.x.
  • The Oozie Spark action now runs on top of Apache Spark 2.x.
  • Support for using Spark 2 jobs to read and write data on the Azure Data Lake Store (ADLS) cloud service.

  • Version 2.2 or higher of CDS Powered By Apache Spark requires JDK 8.

  • New direct connector to Kafka that uses the new Kafka consumer API. See Using Kafka with Apache Spark Streaming for Stream Processing for details.

  • New SparkSession object replaces HiveContext and SQLContext (a short sketch follows this list).
    • Most of the Hive logic has been reimplemented in Spark.
    • Some Hive dependencies still exist:
      • SerDe support.
      • UDF support.
  • Added support for the unified Dataset API.
  • Faster Spark SQL achieved with whole stage code generation.
  • More complete SQL syntax now supports subqueries.
  • Adds the spark-csv library.
  • Backport of SPARK-5847. The root for metrics is now the app name (spark.app.name) instead of the app ID. The app ID requires investigation to match to the app name, and changes when streaming jobs are stopped and restarted.
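
The following is a short Scala sketch of the unified entry point and the Parquet timestamp setting described above; the application name and path are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cdh6-example")
  .enableHiveSupport()
  .getOrCreate()

// Read Impala-written Parquet TIMESTAMP values without a UTC-to-local adjustment.
spark.conf.set("spark.sql.parquet.int96TimestampConversion", "true")

val events = spark.read.parquet("/user/hive/warehouse/events")
events.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) FROM events").show()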

Apache Sqoop

The following are the new features in this release of Sqoop. All the new features available upstream in Sqoop 1.4.7 have been included in the CDH 6.0 release. For more information on Sqoop 1.4.7, see Release Notes for Sqoop 1.4.7.

Sqoop connector parcel locations

Starting in CDH 6.0, Sqoop Connector parcels can be found in two separate parcel repositories: https://archive.cloudera.com/sqoop-teradata-connector1/latest/ and https://archive.cloudera.com/sqoop-netezza-connector1/latest/.

SQOOP-816: Add support for external Hive tables

You can import data into an external table in Hive (instead of a managed table, which is the default behavior). To do this, use the --external-table-dir option to specify the path. Include the --hive-import flag in the command-line arguments if you use the import tool.

Example commands:

  • # Importing from Oracle RDBMS server into external HIVE table:
    sqoop import --hive-import --connect $CONN --table $TABLENAME --username $USER --password $PASS --external-table-dir /tmp/external_table_example
  • # Creating a hive table with a different name than in the database:
    sqoop create-hive-table --connect $CONN --table $TABLENAME --username $USER --password $PASS --external-table-dir /tmp/foobar_example --hive-table foobar

SQOOP-1904: Add support for DB2's XML data type

Added support for DB2's XML data type when importing to HDFS. Avro and Parquet files are not supported. The --as-textfile and --as-sequencefile options work as expected.

Example commands:

  • # the import command works as expected:
    sqoop import --connect $CONN --table $TABLENAME --username $USER --password $PASS
    # export back to table in the database:
    sqoop export --connect $CONN --table TEST_FOOBAR --username $USER --password $PASS --export-dir path/to/imported/data

SQOOP-1905: Add --schema option for import-all-tables and list-tables for DB2 connector

Sqoop now supports the --schema tool option for DB2. If the option is not present, then the schema of the current user will be used as default.

Example commands:

  • sqoop list-tables --connect $CONN --username $USER --password $PASS -- --schema DB2INST2
  • sqoop import-all-tables --connect $CONN --username $USER --password $PASS -- --schema DB2INST2

SQOOP-2936: Provide Apache Atlas integration for hcatalog-based exports

Sqoop already supported publishing information about its data operations via the SqoopJobDataPublisher class before this change. This class has now been extended so that Apache Atlas can identify data lineage when importing to or exporting from Hive, by publishing the Hive database and table names along with the previously published data.

SQOOP-2976: Flag to expand decimal values to fit Avro schema

The sqoop.avro.decimal_padding.enable flag was added to Sqoop to allow padding decimal values when importing data into Avro files. This flag must be used together with sqoop.avro.logical_types.decimal.enable set to true. The padding enables the user to import decimal and number types (with a declared precision and scale) into Avro files when using databases that do not store the decimal values padded, such as the Oracle RDBMS.

Example command:

sqoop import -Dsqoop.avro.decimal_padding.enable=true -Dsqoop.avro.logical_types.decimal.enable=true --connect $CON --username $USER --password $PASS --query "select * from table_name where \$CONDITIONS" --target-dir hdfs://nameservice1//etl/target_path --as-avrodatafile --verbose -m 1

SQOOP-3178: Incremental Merging for Parquet File Format

Sqoop now supports the incremental merging of parquet files. The merge tool combines two datasets where entries in one dataset should overwrite entries of an older dataset.
# As with avro data file, the merge tool requires two already existing parquet imports:
sqoop import --connect $CONN --table $TABLENAME --username $USER --password $PASS --target-dir incr_merge_test_1 --as-parquetfile
sqoop import --connect $CONN --table $TABLENAME --username $USER --password $PASS --target-dir incr_merge_test_2 --as-parquetfile
# the merge command itself looks like this:
sqoop merge --new-data incr_merge_test_2 --onto incr_merge_test_1 --target-dir merged --jar-file datatypes.jar --class-name Foo --merge-key id

The feature also works with incremental imports (--incremental lastmodified).

SQOOP-3216: Expanded Metastore support for MySQL, Oracle, PostgreSQL, Microsoft SQL Server, and DB2

In addition to HSQLDB, Sqoop can now store metastore metadata in MySQL, Oracle, PostgreSQL, Microsoft SQL Server, and DB2.

The JDBC connection string of the database storing the metastore has to be configured with the --meta-connect option or using the sqoop.metastore.client.autoconnect.url property in sqoop-site.xml.

The following table includes the supported connection string formats:
Service                 Connect String Format
MySQL                   jdbc:mysql://<server>:<port>/<dbname>
HSQLDB                  jdbc:hsqldb:hsql://<server>:<port>/<dbname>
PostgreSQL              jdbc:postgresql://<server>:<port>/<dbname>
Oracle                  jdbc:oracle:thin:@//<server>:<port>/<SID>
DB2                     jdbc:db2://<server>:<port>/<dbname>
Microsoft SQL Server    jdbc:sqlserver://<server>:<port>;database=<dbname>

If the --meta-connect option is present, then Sqoop tries to connect to the metastore database specified in this parameter value. Sqoop uses the username and password specified in --meta-username and --meta-password parameters. If they are not present, Sqoop uses empty username and password values. If the database in the connection string is not supported, Sqoop throws an exception.

If the --meta-connect parameter is not present, Sqoop checks if sqoop.metastore.client.enable.autoconnect configuration parameter is set to false. The default value is true. If the parameter is false, Sqoop throws an error, since there are no applicable metastore implementations.

Next, Sqoop checks the sqoop.metastore.client.autoconnect.url configuration parameter. In this parameter, users can specify a JDBC connection string to a metastore database. Supported databases are HSQLDB, MySQL, Oracle, PostgreSQL, DB2 and SQL Server. If this parameter is not present, Sqoop creates an embedded HSQLDB metastore in the home directory of the OS user that executes the sqoop command. The username and password can be specified in sqoop.metastore.client.autoconnect.username (default value: SA) and sqoop.metastore.client.autoconnect.password (default value is the empty string). If the connection string in sqoop.metastore.client.autoconnect.url is not supported, Sqoop throws an exception.

As in previous versions, the metastore tool can only start a shared HSQLDB instance. To use a metastore backed by a database other than HSQLDB, start and configure it manually and provide the correct JDBC connection string to Sqoop. Upon the first connection, Sqoop creates the necessary metadata tables, so the database user specified in the --meta-username option has to have CREATE TABLE permission.

Example commands

  • Listing available jobs in the metastore:
    sqoop job --list --meta-connect jdbc:oracle:thin:@//myhost:1521/ORCLCDB --meta-username ms_user --meta-password ms_password
    
  • Creating a new job in the metastore:
    sqoop job --create myjob1 --meta-connect jdbc:oracle:thin:@//myhost:1521/ORCLCDB --meta-username ms_user --meta-password ms_password -- import --connect jdbc:mysql://mysqlhost:3306/sqoop --username sqoop --password sqoop --table "TestTable" -m 1
    
  • Executing an existing job:
    sqoop job --exec myjob1 --meta-connect jdbc:oracle:thin:@//myhost:1521/ORCLCDB --meta-username ms_user --meta-password ms_password
    
  • Showing the definition of an existing job:
    sqoop job --show myjob2 --meta-connect jdbc:oracle:thin:@//myhost:1521/ORCLCDB --meta-username ms_user --meta-password ms_password
    
  • Deleting an existing job:
    sqoop job --delete myjob1 --meta-connect jdbc:oracle:thin:@//myhost:1521/ORCLCDB --meta-username ms_user --meta-password ms_password
    

Apache ZooKeeper

There are no notable new features in this release.