What's New in CDH 5.12.x

Apache Hadoop

Apache Hive / Hive-on-Spark

  • Support for Microsoft Azure Data Lake Store (ADLS) as a secondary filesystem for both Hive on MapReduce2 (YARN) and Hive-on-Spark. You can now use both Hive on MapReduce2 and Hive-on-Spark to read and write data stored on ADLS.

    See Configuring ADLS Connectivity for CDH

  • The Hive schematool is integrated with Cloudera Manager where you can use it to upgrade or validate the Hive metastore schema.

    See Using the Hive Schema Tool in CDH for details.

  • HIVE-1575: Added support for JSON arrays at the root level by the get_json_object function. For example:
    SELECT get_json_object('[1,2,3]', '$[0]')...        


Hue 4 is out and jam-packed with great new features.

New Layout in Hue 4!
  • Apps are consolidated under blue button–set your favorite as default landing page
  • Top search bar lets you search for saved queries and other data
  • Left and right assist panels let you search and filter schema objects
  • Cursor position determines which of multiple queries to run
  • New Pig editor, Job Designer, and Job Browser
  • Access old Hue 3 layout under user drop down or remove "hue" from URL.

Load Balancer Added by Default

During a new installation of CDH/Hue, one Load Balancer is automatically promoted to ensure optimal performance–it can reduce the Hue server load by up to 90%! In existing clusters, administrators are prompted to add a load balancer role and users are then guided on how to enable it. See the Cloudera Blog on Automatic HA.

Test LDAP Configuration

Verify your LDAP configuration, on-the-fly, with this new feature in Cloudera Manager under Hue > Actions > Test LDAP Configuration. See Authenticate Hue Users with LDAP.

Navigator Optimizer Integrated (Phase 1)

With Navigator Optimizer enabled in Hue, popular tables, columns, joins, filters are displayed in the autocompleter. Risky statements, such as missing filters on partitioned tables, trigger an alert.

Navigator Search & Tag Enabled by Default

With Navigator enabled in Hue, you can search and tag metadata. This feature is now enabled by default (with Cloudera Navigator installed). See How to Use Governance-Based Data Discovery.

Other Cool Features

  • You can create partitioned tables from files
  • Impala metadata is refreshed automatically
  • SQL autocompleter handles more advanced corner cases
  • Remote Load balancer works with SSL
  • Query history is paginated!

Apache Impala

The following are some of the most significant new Impala features in this release:

  • Impala can now read and write data stored on the Microsoft Azure Data Lake Store (ADLS). See Using Impala with the Azure Data Lake Store (ADLS) for details.

  • New built-in functions:

    • A new string function, replace(), which is faster than regexp_replace() for simple string substitutions. See Impala String Functions for details.

    • A new conditional function, nvl2(), which offers more flexibility than the nvl() function. It lets you return one value for NOT NULL arguments, and a different value for NULL arguments. See Impala Conditional Functions for details.

    • A new date/time function, last_day(), for finding the date corresponding to the last day of a particular month. See Impala Date and Time Functions for details.

  • New syntax, REFRESH FUNCTIONS db_name, lets Impala recognize newly added functions, such as UDFs created through Hive. Impala scans the metadata for a specified database to locate the new functions, which is faster and more convenient than doing a full INVALIDATE METADATA operation.

  • The new TABLESAMPLE clause for the SELECT statement lets you specify a percentage of the overall data volume within a table to process during a query.

  • Startup flags for the impalad daemon, is_executor and is_coordinator, let you divide the work on a large, busy cluster between a small number of hosts acting as query coordinators, and a larger number of hosts acting as query executors. By default, each host can act in both roles, potentially introducing bottlenecks during heavily concurrent workloads. See How to Configure Impala with Dedicated Coordinators for details.

  • A new query option, DEFAULT_JOIN_DISTRIBUTION_MODE, lets you change the default assumption about how join queries should handle tables with no statistics. This can help to avoid out-of-memory conditions for join queries, without manual tuning to add the /* +SHUFFLE */ hint for queries on large tables with missing statistics.

  • The SORT BY clause lets you create Parquet files with more efficient compression and smaller ranges of values for specified columns, allowing Impala to apply optimizations to skip reading data from Parquet files that do not contain any values that match equality and range operators in the WHERE clause. See CREATE TABLE Statement for details.

  • The max_audit_event_log_files lets you perform log rotation for the audit event log files, similar to the rotation for regular Impala log files.

  • Kudu enhancements:

    • The ALTER TABLE statement can specify more attributes for a Kudu table with the ADD COLUMNS clause. Now you can specify [NOT] NULL, ENCODING COMPRESSION, DEFAULT, and BLOCK_SIZE. See ALTER TABLE Statement for details.

    • The TIMESTAMP type is now available for Kudu tables.
    • The INSERT and CREATE TABLE AS SELECT statements are more efficient when writing to Kudu tables. Formerly, the overhead for the write operations could result in timeouts when writing large numbers of rows in a single operation.

Apache HBase

  • Apache HBase now has ADLS support and recommendations for Azure deployment.
  • Outside of cloud, HBase now has support for long-lived Spark applications via token renewal.

Apache Spark

Spark can read and write data on the Azure Data Lake Store (ADLS) cloud service. See Accessing Data Stored in Azure Data Lake Store (ADLS) through Spark for details.

S3Guard List-After-Delete Consistency

This release adds Delete Tracking for S3Guard. Running CDH 5.11.0 clusters against the same DynamoDB table as clusters running CDH5.12.0 or higher is not supported. S3Guard in CDH5.11.0 does not work with the newer schema. CDH 5.11.1 and higher releases are compatible with the newer tables when running CDH 5.12.0 and higher.

S3Guard does not track deletes unless you are running a version with Delete Tracking support (CDH 5.12.0 and higher).