CDP Public Cloud: August 2023 Release Summary

Data Hub

This release of the Data Hub service introduces the following changes:

Support for Local SSDs in GCP

Data Hub now supports using Local SSDs as storage in GCP. During Data Hub cluster creation in CDP, you can navigate to the advanced Hardware and Storage options and select “Local scratch disk (SSD)” for certain instance types. Before using Local SSDs with CDP, review the Google Cloud Platform documentation on Local SSDs and familiarize yourself with the applicable restrictions and limitations (such as the 12 TB maximum capacity and the limited configurations).

Note: Stopping and restarting Data Hub clusters that use Local SSDs is not supported. When you stop or suspend a VM, all data on the attached Local SSD is discarded.

Data Warehouse

This release (1.7.1-b755) of Cloudera Data Warehouse Public Cloud introduces the following changes:

Amazon Machine Image updates

This release supports dynamically updating the Amazon Machine Image (AMI) to prevent potential problems running workloads on an old AMI. You can update the AMI of the CloudFormation stack while keeping the current Elastic Kubernetes Service (EKS) version.

AWS Elastic Kubernetes Service 1.26 support

New CDW clusters that use AWS environments activated in this release (1.7.1-b755, released August 30, 2023) of Cloudera Data Warehouse use Amazon Elastic Kubernetes Service (EKS) version 1.26. For more information, see Upgrading Amazon Kubernetes Service.

Amazon EKS 1.23 and from EKS 1.23 to 1.24 upgrades

The release includes reduced permissions mode support for the following Amazon Elastic Kubernetes Service upgrades:

AWS restricted policy updates

The AWS restricted policy has been updated to conform to AWS file size requirements. The update divides the policy into two parts as described in the documentation.

Automatic backup and restoration of Hue

The CDW backup and restore processes for Hue have been automated. Manually backing up and restoring Hue is still available, but optional.

Correcting the Virtual Warehouse Size

After creating an Impala Virtual Warehouse, you can tune, or correct, the T-shirt size of the executor groups that drive the Impala Virtual Warehouse. The size of the executor groups is critical for achieving cost and performance goals.

Deprecation of DAS

Hue now replaces Data Analytics Studio (DAS). DAS has been deprecated and is no longer available in CDW Public Cloud. DAS features that support Hive and Tez, such as running queries, defining HPL/SQL, the Job Browser, query explorer, and query compare, have been migrated to Hue and the Hue Query Processor. After you upgrade to this release, you will not see the option to launch DAS from your Virtual Warehouse. Cloudera recommends that you use Hue for all use cases where you might have previously used DAS.

Instance types available for compute nodes

In this release, when you activate an AWS environment, you can select the compute instance type you want to use. This release also adds instance types that you can select when you activate an Azure environment.

Java tool options now configurable with limitations

After creating an Impala Virtual Warehouse, you can change the XMX Java tool option.

Managing high partition loads

In this release, you can identify an error related to high partition workloads and tune your Hive Virtual Warehouse to run successfully.

Private Cluster control removed from UI

Enable Private Cluster has been removed from the environment activation dialog. Use CDP CLI for advanced configurations.

Query results cache support

Unified Analytics now supports the caching of Hive/Impala query results. Caching results of repetitive queries can reduce the load.

Cloudera Data Warehouse Public Cloud Runtime 2023.0.15.0-243 changes:

Histogram statistics

In this release, when you generate column statistics in a Hive Virtual Warehouse in Unified Analytics mode, histogram statistics are used to generate efficient query plans. This feature can improve performance.

Hive to Iceberg table migration from Impala

In this release, you can use Impala, as well as Hive, to migrate a Hive table to an Iceberg table by using the ALTER TABLE statement. The syntax is described in Migrate Hive table to Iceberg feature, and a step-by-step procedure is covered in Migrating a Hive table to Iceberg.

Iceberg position delete feature support

In this release, Impala, in addition to Hive, can delete rows from Iceberg V2 tables using position delete files, a format defined by the Iceberg Spec. A position delete query evaluates the rows of one table against a WHERE clause and deletes all the rows that match the WHERE conditions.
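The idea behind position deletes can be sketched in a few lines of Python. This is a conceptual model only, not Impala or Iceberg code: the in-memory `files` dictionary, the file paths, and the function names are assumptions made for illustration. It shows that a delete records (file path, row position) pairs instead of rewriting data files, and that a reader skips the recorded positions.

```python
from typing import Callable, Dict, List, Set, Tuple

Row = Dict[str, int]
# Hypothetical in-memory stand-in for immutable data files: path -> rows.
DataFiles = Dict[str, List[Row]]
# A position delete entry names a data file and a row position inside it.
PositionDelete = Tuple[str, int]


def plan_position_deletes(
    files: DataFiles, predicate: Callable[[Row], bool]
) -> Set[PositionDelete]:
    """Evaluate the WHERE predicate and record (file_path, pos) for each match."""
    return {
        (path, pos)
        for path, rows in files.items()
        for pos, row in enumerate(rows)
        if predicate(row)
    }


def read_table(files: DataFiles, deletes: Set[PositionDelete]) -> List[Row]:
    """Return the live rows, skipping every position listed in the delete set."""
    return [
        row
        for path, rows in sorted(files.items())
        for pos, row in enumerate(rows)
        if (path, pos) not in deletes
    ]


files = {
    "data/f1.parquet": [{"id": 1}, {"id": 2}],
    "data/f2.parquet": [{"id": 3}, {"id": 4}],
}
# Equivalent in spirit to: DELETE FROM t WHERE id % 2 = 0
deletes = plan_position_deletes(files, lambda row: row["id"] % 2 == 0)
live_rows = read_table(files, deletes)
```

The data files are never modified in this model; only the delete set grows, which is why the approach suits tables with frequent small deletes.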

Impala support for complex types

Complex types are now supported in the SELECT list. Although collections and structs were previously supported, nesting and mixing of complex types was not. For more information, including limitations, see “Allowing embedding complex types into other complex types” in Complex types.

Support ORDER BY for collections of fixed length types in SELECT list

This release supports collections of fixed length types in the sorting tuple. You cannot sort by these collection columns, but they can appear in the SELECT list along with the other columns by which you sort.

Support collections of fixed length types as non-passthrough children of unions

This release adds support for collections of fixed length types as non-passthrough children of unions. Plain UNIONs are not supported yet for any collections, but UNION ALL operations are supported.

Example:

select id, int_array from complextypestbl
union all select cast(id as tinyint), int_array from complextypestbl

Allow implicit casts between numeric and string types when inserting into table

Before this release, Impala required explicit casts for numeric and string-based literals. This release relaxes the implicit casting rules for these cases. The behavior is controlled by the allow_unsafe_casts query option, which is turned off by default. When enabled, the option allows implicit casting between some numeric types and string types.

Improved memory estimation for aggregates and pre aggregates

This release introduces new query options to improve memory estimation for aggregation nodes. It also introduces better cardinality estimates that help cap memory limits early in query planning.

Improved CPU costing

This release changes the query planner to improve parallel sizing and resource estimation. The changes support workload-aware autoscaling and are exposed as additional query options for tuning purposes. This new functionality will allow more customers to enable multi-threaded queries globally for improved performance.

Added processing cost to control scan parallelism

Before this release, when a user executed a query with COMPUTE_PROCESSING_COST=1, Impala relied on the MT_DOP option to decide the degree of parallelism of the scan fragment. This release introduces the scan node’s processing cost as another factor to consider when raising scan parallelism beyond MT_DOP.

Scan node cost now includes the number of effective scan ranges. Each scan range is given a weight of (0.5% * min_processing_per_thread), which roughly means that one scan node instance can handle at most 200 scan ranges. This release also introduces a new query option MAX_FRAGMENT_INSTANCES_PER_NODE to cap the maximum number of fragment instances per node. This newly introduced query option works in conjunction with PROCESSING_COST_MIN_THREADS.
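The 200-range figure follows directly from the 0.5% weight, as the short Python sketch below works out. The `instances_for` helper is a hypothetical simplification added for illustration: it assumes the per-node cap simply multiplies by the node count, which is not a description of how the Impala planner actually combines MAX_FRAGMENT_INSTANCES_PER_NODE with its other inputs.

```python
from fractions import Fraction
from math import ceil

# Weight of one scan range, as a share of min_processing_per_thread (0.5%).
SCAN_RANGE_WEIGHT = Fraction(5, 1000)


def max_ranges_per_instance() -> int:
    """Number of ranges whose combined weight equals one min_processing_per_thread."""
    return int(1 / SCAN_RANGE_WEIGHT)


def instances_for(scan_ranges: int, per_node_cap: int, nodes: int) -> int:
    """Illustrative only: instances suggested by the range count, then capped.

    Assumes the cap applies per node and scales with the node count.
    """
    wanted = max(1, ceil(Fraction(scan_ranges, max_ranges_per_instance())))
    return min(wanted, per_node_cap * nodes)


ranges_per_instance = max_ranges_per_instance()
uncapped = instances_for(1000, per_node_cap=4, nodes=2)
capped = instances_for(5000, per_node_cap=4, nodes=2)
```

With these illustrative inputs, 1,000 scan ranges suggest 5 instances, while 5,000 scan ranges would suggest 25 but are held to the assumed cap of 8.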

Impala WebUI improvements

This release enhances the Impala daemon’s Web UI to display the following additional details:

  • Backends start time and version: In a large cluster, you can now use the Impala daemon’s Web UI to view the start time and version for all the backends.

  • Query performance characteristics: For a detailed report on how a query was executed and to understand the detailed performance characteristics of a query, you can use the built-in web server’s UI and look at the timeline shown in the Gantt chart. This chart is an alternative to the PROFILE command and is a graphical display in the WebUI that renders timing information and dependencies.

  • Export query plan and timeline: To understand the detailed performance characteristics of a query, you can issue the PROFILE command in impala-shell immediately after executing the query. As an alternative to the profile download page, this release adds support for exporting the graphical query plan and for downloading the timeline in SVG/HTML format. Once you export the query plan or the timeline, the memory consumed by the ObjectURLs is released.

  • Historical/in-flight query performance: You can now use the query list and query details page to analyze historical or in-flight query performance by viewing the memory consumed, the amount of data read, and other information about the query.

  • Aggregate CPU node utilization: You can now see the recent aggregate CPU node utilization samples for the different nodes.

  • Scaling of timeticks and fragment timing diagram for better accessibility: You can now use the query timeline display to scroll horizontally through the fragment timing diagram and utilization chart. You can also zoom by horizontally scaling through mouse wheel events in addition to increasing/decreasing the precision of timetick values.

Skip reloading file metadata for some ALTER_TABLE events

Before this release, EventProcessor ignored trivial ALTER_TABLE events that only modify tblproperties like “transient_lastDdlTime,” “totalSize,” “numFilesErasureCoded,” and “numFiles”. For other non-rename ALTER_TABLE events, it triggered a full refresh on the table, which becomes expensive for tables with a large number of partitions or files.

From this release, to be more cost-efficient, the event processor skips reloading file metadata for some ALTER_TABLE events.

The following list contains the events that skip reloading file metadata:

  • changing table comment
  • adding/dropping columns
  • changing column definition (name/type/comment)
  • changing ownership
  • setting customized tblproperties

For interoperability purposes, this release introduces a new start-up flag file_metadata_reload_properties to list the table properties that need the file metadata reloaded when the properties are changed.

Note: To disable this optimization (in case of any unexpected issues), set file_metadata_reload_properties to an empty string.
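The decision described above can be modeled roughly as follows. This Python sketch is not the EventProcessor implementation: it assumes the flag value is a comma-separated list of property names and that matching is exact, and the function names and the property name `my_custom_prop` are hypothetical.

```python
from typing import Iterable, Set


def parse_flag(value: str) -> Set[str]:
    """Split the flag value into property names (comma-separated is an assumption)."""
    return {name.strip() for name in value.split(",") if name.strip()}


def needs_file_metadata_reload(changed_props: Iterable[str], flag_value: str) -> bool:
    """Decide whether a non-rename ALTER_TABLE event reloads file metadata.

    An empty flag disables the optimization, so every event triggers a reload.
    Otherwise a reload happens only when a listed property was changed.
    """
    watched = parse_flag(flag_value)
    if not watched:
        return True
    return any(prop in watched for prop in changed_props)


flag = "EXTERNAL, transactional"
comment_change = needs_file_metadata_reload([], flag)  # no tblproperties touched
custom_prop_change = needs_file_metadata_reload(["my_custom_prop"], flag)
listed_prop_change = needs_file_metadata_reload(["transactional"], flag)
optimization_disabled = needs_file_metadata_reload(["my_custom_prop"], "")
```

In this model, changing a comment or a customized property skips the reload, while changing a listed property, or running with an empty flag, keeps the earlier full-refresh behavior.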

JWT auth for Impala

Impala clients, such as Impala shell, can now authenticate to Impala using a JWT instead of a username/password. To connect to Impala using JWT authentication, specify JWT command-line options to the impala-shell command interpreter and enter the password when prompted.

Native High Availability (HA) for Impala Catalog Service

The High Availability (HA) mode of catalog service in CDW reduces the outage duration of the Impala cluster when the primary catalog service fails. Before this release, catalog HA was supported using the K8s leader election mechanism, and now it is natively supported in Impala.

Codegen for STRUCT type

Codegen uses query-specific information to generate specialized machine code for each query. As an Impala user, when you run a standard query, the query optimizer generates an optimized query plan and passes it to the executor for processing. With the codegen capability for the STRUCT type in the SELECT list, the query-specific information is converted to machine code for faster execution.

Before this release, having structs in the select list was only supported with codegen turned off. This release lifts this restriction, adding full codegen support for structs in the select list.

Example:

select small_struct from complextypes_structs

DataFlow

This release (2.5.0-h2-b1) of Cloudera DataFlow (CDF) on CDP Public Cloud resolves an issue with NiFi cluster autoscaling behavior that can cause unnecessary flow disruptions. The release includes no new features.

Management Console

This release of the Management Console service introduces the following changes:

Data Lake database upgrade and default major version change

Newly deployed Data Lake clusters on AWS or GCP with Cloudera Runtime 7.2.7 or above are now configured to use a PostgreSQL version 14 database by default.

Newly deployed Data Lake clusters on Azure with Cloudera Runtime 7.2.7 or above will continue to use a PostgreSQL version 11 database by default.

The database for Data Lake clusters on AWS and GCP can now be upgraded to PostgreSQL version 14. If your AWS or GCP cluster requires an upgrade to PostgreSQL 14, you will receive a notification in the Management Console UI.

Cloudera strongly recommends that you upgrade the database to PostgreSQL 14 on all AWS and GCP clusters running PostgreSQL version 11 by November 9, 2023.

A database upgrade to PostgreSQL 14 for Azure Data Lakes will be available in the future. Any Data Lake clusters on Azure that require a database upgrade will be upgraded from PostgreSQL 10 to PostgreSQL 11.

For more information, see Upgrading Data Lake/Data Hub database.

Configure proxy for TLS interception and Deep Packet Inspection

After setting up a web proxy server in CDP, you can further configure it to perform TLS interception and Deep Packet Inspection (DPI). For instructions, see Setting up a web proxy for TLS inspection.

Setting a default identity provider in CDP

You can optionally set a default identity provider (IdP) in CDP. If you do so, CDP will use the default IdP instead of the oldest IdP. For instructions, see Setting a default identity provider in CDP.

Operational Database

Cloudera Operational Database (COD) version 1.34 supports selecting the JDK version during COD creation and deploying COD on GCS.

COD supports creating an operational database using JDK8 and JDK11

COD adds a new CLI option, --java-version, which you can use to configure the major Java version on your COD cluster. Use the option with the create-database command to specify the Java version. The supported Java versions are JDK8 and JDK11. If the parameter is not specified, JDK8 is used. The command syntax and a sample command follow.

cdp opdb create-database --environment-name <environment_name> --database-name <database_name> --java-version <value>

cdp opdb create-database --environment-name cod7215 --database-name testenv --java-version 11
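If you drive the CLI from a script, the version check can be done before the command is issued. The following Python sketch is one possible approach, not part of the CDP CLI: the helper name `build_create_database_cmd` is hypothetical, and it uses only the flags shown in the sample command above. When no version is passed, the flag is omitted so that COD applies its JDK8 default.

```python
from typing import List, Optional

# Major Java versions that COD accepts for --java-version.
SUPPORTED_JAVA_VERSIONS = (8, 11)


def build_create_database_cmd(
    environment_name: str,
    database_name: str,
    java_version: Optional[int] = None,
) -> List[str]:
    """Build the argument list for `cdp opdb create-database`."""
    cmd = [
        "cdp", "opdb", "create-database",
        "--environment-name", environment_name,
        "--database-name", database_name,
    ]
    if java_version is not None:
        if java_version not in SUPPORTED_JAVA_VERSIONS:
            raise ValueError(f"unsupported Java version: {java_version}")
        cmd += ["--java-version", str(java_version)]
    return cmd


command = build_create_database_cmd("cod7215", "testenv", java_version=11)
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where the CDP CLI is installed and configured.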

For more information, see CDP CLI beta.

COD is available as a Technical Preview feature on Google Cloud Storage (GCS)

COD on Google Cloud Platform (GCP) can now be deployed using Google Cloud Storage (GCS), similar to what is available for Amazon Web Services (AWS) S3 storage and Microsoft Azure blob storage. Using GCS for such a setup requires the COD_ON_GCS entitlement.

COD also now supports a large ephemeral block cache while deploying on GCP. The use of ephemeral storage along with any cloud storage still requires the OPDB_USE_EPHEMERAL_STORAGE entitlement.

COD has removed the COD_ON_GCP entitlement

The COD_ON_GCP entitlement has been removed from COD because it is no longer needed. From this version onwards, customers can create COD clusters on Google Cloud Platform (GCP) without it.

Cloudera Operational Database (COD) version 1.33 provides enhancements to the CDP CLI and the COD UI.

COD drops support of the Cloudera runtime versions CDP Runtime 7.2.8 and CDP Runtime 7.2.9

COD has stopped supporting the Cloudera runtime versions CDP Runtime 7.2.8 and CDP Runtime 7.2.9 because they have reached the end of life.

COD supports faster rolling restarts on COD clusters

The default value of Cloudera Manager > HBase > Configuration > Region Mover Threads is changed to 30. This speeds up the rolling restart functionality for HBase.

For more information, see Rolling Restart.

COD supports rolling runtime upgrades of a COD cluster

COD now supports upgrading the Cloudera Runtime version of the database using the rolling restart mode. This ensures continuous service availability during an upgrade operation. A new CLI parameter --rolling-upgrade | --no-rolling-upgrade is added to the upgrade-database command. Following is a sample command:

cdp opdb upgrade-database --environment <environment-name> --database <database-name> --runtime <runtime-version> [--rolling-upgrade | --no-rolling-upgrade]

For more information, see Performing a rolling Cloudera Runtime upgrade.

Enhancements to the --scale-type CDP CLI option in the create-database command

In CDP CLI, the --scale-type option now supports all three values (MICRO, LIGHT, and HEAVY) for both the --master-node-type and --gateway-node-type.

  • --scale-type LIGHT (--master-node-type LITE, --gateway-node-type LITE)
  • --scale-type HEAVY (--master-node-type HEAVY, --gateway-node-type HEAVY)

If the --scale-type option is not defined, --scale-type LIGHT is used by default for both the --master-node-type and --gateway-node-type. However, you can override the --scale-type for a --gateway-node-type using the --gateway-node-type <value> option.

For more information, see CDP CLI Beta.

Enabling a consolidated view of COD metrics using Grafana dashboards

In CDP CLI, the create-database command now provides a new option --enable-grafana which allows you to enable the Grafana URL under the GRAFANA DASHBOARD option inside your COD database. When you click on the Grafana URL, it takes you to the Grafana dashboard which provides a consolidated view of the COD metrics.

Following is an example of the create-database command:

cdp opdb create-database --environment-name <environment_name> --database-name <database_name> --enable-grafana

For more information, see Monitoring metrics in COD with Grafana.

Replication Manager

This release of the Replication Manager service introduces the following new features:

Replicate HBase data in all the tables in a database

You can replicate HBase data (existing tables and future tables in a database) using the Select Source > Replicate Databases option during the HBase replication policy creation process. You can use this option only if the minimum target Cloudera Manager version is 7.11.0 and the minimum source cluster versions are CDH 5.16.2 (after you upgrade the source cluster Cloudera Manager), CDP Private Cloud Base 7.1.6, or COD 7.2.17.

For more information, see HBase replication policy and Creating HBase replication policies.

Provide a unique replication policy name

You must provide a unique name for the replication policy during the replication policy creation process if the Cloudera Manager API version is 51 or higher.

For other Cloudera Manager API versions, you can continue to use an existing replication policy with an empty name. However, if you edit the replication policy and provide a name for the replication policy, ensure that the name conforms to the validation rules.

For more information, see Managing HDFS replication policy, Managing Hive replication policy, or Managing HBase replication policy.

Load replication policies and their job history on the Replication Policies page

You can choose one of the following options to Load policies faster by delaying to load their job history:

  • Delay loading job history when history is too long
  • Always load job history
  • Never load job history

By default, the replication policies are loaded only partially on the Replication Policies page. As a result, the page might display incomplete statistics about job status, and replication policies with failed jobs might take longer to load. You can change the behavior depending on your requirements.

For more information, see Replication Policies page.