CDP Public Cloud: May 2023 Release Summary

Data Engineering

This release (1.19) of the Cloudera Data Engineering (CDE) service on CDP Public Cloud introduces the new features and improvements described in this topic.

[Preview] Cloudera Data Engineering Sessions

CDE 1.19 introduces Sessions, an interactive, short-lived development environment for running Spark commands that helps you iterate on and build your Spark workloads.

  • The Sessions feature is available through the CDE user interface and the CDE CLI.

  • Python, Scala, and Java are supported session types.

For more information, see Creating Sessions in Cloudera Data Engineering (Preview) and Managing Sessions in Cloudera Data Engineering using the CLI.

[Preview] Lock and unlock

CDE supports locking a CDE Service, which freezes the configuration of the Service and its corresponding Virtual Clusters. When a Service is locked, you cannot edit, add, or delete the Service or its Virtual Clusters. This helps ensure that no changes are made to the Service during planned upgrades. For more information, see Locking and unlocking a CDE Service.

[Preview] Airflow file-based resources using the CDE CLI

CDE supports Airflow file-based resources using the CDE CLI. When creating a pipeline in CDE using the CLI, you can add custom files that are available to tasks. For more information, see Creating an Airflow pipeline with custom files using CDE CLI (Preview).

Support for new AWS regions

CDE now supports the Hong Kong and Jakarta AWS regions. See updated Supported AWS regions.

Support for multiple Spark 3 runtimes

CDE supports Spark 3.3. Note that Spark 3.3 must use Data Lake 7.2.16. For more information, see Cloudera Data Engineering and Data Lake compatibility.

Creating and using multiple profiles using CDE CLI

You can now add a collection of CDE CLI configurations, grouped together as profiles, to the config.yaml file. You can use these profiles when running commands, and you can set configurations either at the profile level or at the global level. For more information, see Creating and using multiple profiles using CDE CLI.

Using spark-submit drop-in migration tool for migrating Spark workloads to CDE

CDE provides a command-line tool, cde-env, to help you migrate your CDP Spark workloads running on CDP Private Cloud Base (“spark-on-YARN”) to CDE without completely rewriting your existing spark-submit command lines. For more information, see Using spark-submit drop-in migration tool for migrating Spark workloads to CDE.

New Virtual Cluster types

CDE now provides a choice between two tiers during Virtual Cluster creation. Administrators will see the following two options:

  • Core (Tier 1) - Batch-based transformation and engineering options.
  • All Purpose (Tier 2) - Develop using interactive sessions and deploy both batch and streaming workloads.

For more information, see Creating virtual clusters.

Default setting for external.table.purge property for migrating Hive tables

When migrating Hive tables to Iceberg in Spark 3, the external.table.purge property is now set to FALSE by default. For more information, see Importing and migrating Iceberg tables in Spark 3.
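
As a minimal, illustrative sketch (the table name sales_hive is hypothetical), you can inspect the property and, if you need purge behavior that differs from the new default, override it explicitly with standard Spark SQL:

  -- Inspect the current table properties on a hypothetical table
  SHOW TBLPROPERTIES sales_hive;

  -- Explicitly set the purge behavior if the default does not suit your workflow
  ALTER TABLE sales_hive SET TBLPROPERTIES ('external.table.purge'='FALSE');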

New exit codes for the CDE CLI

CDE now provides exit codes for the CDE CLI. The exit codes help users identify errors more precisely. For more information, see Cloudera Data Engineering CLI exit codes.

Support for Data Lake

CDE 1.19 supports Data Lake 7.2.14 through 7.2.16. For more information, see Cloudera Data Engineering and Data Lake compatibility.

Data Warehouse

This release of the Cloudera Data Warehouse (CDW) service on CDP Public Cloud introduces these changes.

Azure AKS 1.26 upgrade

In this release, 1.6.4-b161 (released May 30, 2023), when you activate an environment, CDW automatically provisions Azure Kubernetes Service (AKS) 1.26. Kubernetes 1.26 includes breaking changes; as a result, starting earlier workload (Impala, Hive, Hue) versions is not supported and this option is now disabled. To upgrade to AKS 1.26 from version 1.6.3 or earlier, you must back up and restore CDW. To avoid compatibility issues between CDW and AKS, upgrade to version 1.26.

Using the Azure CLI or Azure portal to upgrade the AKS cluster is not supported. Doing so can make the cluster unusable and cause downtime. For more information about upgrading, see Upgrading an Azure Kubernetes Service cluster for CDW.

Azure AKS managed identity now mandatory

You must now specify a user-assigned managed identity for the Azure Kubernetes Service (AKS) cluster when you activate the Azure environment from the CDP CLI in 1.6.4-b161 (released May 30, 2023). In 1.6.3 (released May 5, 2023), specifying the managed identity is also mandatory when you activate the Azure environment from the UI. A managed identity is required because AKS does not support certificate authentication for service principals. The cluster creation operation fails if userAssignedManagedIdentity is not set to your managed identity.

New Backup/Restore CDP CLI commands

New commands in CDP CLI 0.9.87, including Beta DW CLI commands, are now available and support backing up and restoring CDW.

Synchronized metadata across Impala Virtual Warehouses

Using Impala Virtual Warehouses that share a Database Catalog is easier in this CDW release. In past releases, after making changes to data and then refreshing tables or invalidating metadata from your Virtual Warehouse, only the catalog metadata and coordinator metadata for that particular Virtual Warehouse were affected. You had to rerun the commands from each Virtual Warehouse to synchronize metadata across multiple Virtual Warehouses that share a Database Catalog.

This release introduces an enhancement that raises events in the Hive metastore. Catalog daemons process events synchronously across all Virtual Warehouses that share metadata. Metadata is refreshed or invalidated in parallel across all your Virtual Warehouses. You need to run the commands only once, in any one of your Virtual Warehouses.

To get this feature, upgrade your Virtual Warehouse to release 1.6.3-b319, which includes runtime 2023.0.14.0-155 (released May 5, 2023). You must also set the Impala catalogd flagfile key enable_reload_events to true. Newly created Virtual Warehouses use Impala version 2023.0.14.0-155, which has this feature enabled by default. For more information, including how to disable this feature, see Disabling metadata synchronization.
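
As a sketch only, the flagfile entry is a single key-value pair; the key name comes from this release note, while the exact formatting and the place where you edit the catalogd flagfile in CDW may differ:

  enable_reload_events=true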

This enhancement does not synchronize metadata when you refresh tables or invalidate metadata from a Data Hub cluster.

AWS Kubernetes version 1.24 support

New CDW clusters on AWS environments that you activate in this release, 1.6.3-b319 (released May 5, 2023), of Cloudera Data Warehouse use Amazon Elastic Kubernetes Service (EKS) version 1.24.

Selective support for AWS Kubernetes version 1.22 upgrade

If you activated your pre-existing AWS environment in CDW version 1.4.2-b118 (released Aug 4, 2022) or later, you can upgrade to EKS 1.22 in this release, 1.6.3-b319 (released May 5, 2023). AWS environments activated in earlier versions cannot be upgraded to EKS 1.22 because some components are incompatible with EKS 1.22. To determine the activation version of your pre-existing environment: in the Data Warehouse service, expand Environments, search for and locate the environment that you want to view, and click Edit. The CDW version appears in Environment Details.

Changes to the standard required IAM permissions and AWS restricted policy

In this release, as an AWS environment user, you must update the standard required IAM permissions and the AWS restricted policy.

Upgrade and rebuild functionality

The generic Database Catalog and Virtual Warehouse version upgrade process has changed. When you click the Upgrade button to upgrade your Virtual Warehouse or Database Catalog, the upgrade includes the rebuild functionality, which cleans up and redeploys your workload. For more information, see Rebuilding a Database Catalog and Rebuilding a Virtual Warehouse.

New requirements in Azure to provide managed identity

During the Azure environment activation and private AKS activation, you must provide the resource ID of a user-assigned managed identity to create the Azure Kubernetes cluster resource.

Ranger authorization (RAZ) in CDW

In this release, if Ranger Authorization (RAZ) is enabled on your Data Lake, CDW is also RAZ-enabled, primarily to provide authorization in Hue. If you have an AWS environment that predates the option to enable RAZ for Cloudera Data Warehouse, you might want to manually enable RAZ in CDW.

Accessing buckets in a RAZ environment

In an environment that enables Ranger Authorization (RAZ), you must configure permissions to access an S3 bucket. Accessing buckets in a RAZ environment describes how to configure the permissions depending on the AWS account that owns the bucket.

Configuring token-based authentication in CDW

Using a JSON Web Token (JWT), Virtual Warehouse client users can sign on to your Virtual Warehouse for a period of time instead of entering single sign-on (SSO) credentials every time they want to run a query. Use Impyla to access an Impala Virtual Warehouse and a JDBC connection to access a Hive Virtual Warehouse. For more information, see Configuring token-based authentication.

Impala coordinator shutdown in Unified Analytics

Cloudera Data Warehouse has supported Impala coordinator shutdown for some time, but not in Unified Analytics until this release. When you create an Impala Virtual Warehouse and enable Unified Analytics, you can now configure Impala coordinator shutdown.

Anti join support in Unified Analytics

In this release, Unified Analytics enables an optimization that converts a join with a null-filtering condition into a SQL anti join.
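
For example, a query of the following shape, which keeps only the rows from one table that have no match in the other, is the kind of null-filtered join this optimization targets; the table and column names are illustrative only:

  -- Orders with no matching return; the IS NULL filter on the right side
  -- lets the planner rewrite the outer join as an anti join
  SELECT o.order_id
  FROM orders o
  LEFT JOIN order_returns r ON o.order_id = r.order_id
  WHERE r.order_id IS NULL;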

Debugging enhancement: HiveServer OOM dump

In this release, when a HiveServer pod encounters an out-of-memory (OOM) event, a corresponding heap dump is available in the HiveServer dumps directory:

<external-bucket>/clusters/<env-id>/<dbc-id>/warehouse/debug-artifacts/hive/<compute-id>/hive-dumps/<node-name>/heapdump/dump.hprof

For example, the full path to the dump file looks something like this:

sharecdwdev-bucket-7cv9-dwx-external/clusters/env-9d7cv9/warehouse-1675918732-q9m2/warehouse/debug-artifacts/hive/compute-1675918927-q8hs/hive-dumps/ip-192-168-223-136.us-west-2.compute.internal/heapdump/hiveserver2-oom-dump.hprof 

In the example above, ip-192-168-223-136.us-west-2.compute.internal is the name of the HiveServer node on which the pod was running.

Kubernetes dashboard (Preview)

A technical preview of the K8S dashboard is available in this release. You can view the state of resources, such as CPU and memory usage, see the status of pods, and download logs. The dashboard can provide insights into the performance and health of a CDW cluster.

Workload Aware Auto-Scaling (Preview)

Workload Aware Auto-Scaling (WAAS) is available as a technical preview. WAAS allocates Impala Virtual Warehouse resources based on the workload that is running. Instead of the fixed size used by the previous auto-scaling implementation, you choose an executor group set, which is analogous to a range of nodes. Configuring WAAS requires an entitlement; contact your account team if you are interested in previewing the feature.

Iceberg manifest caching

Apache Iceberg provides a mechanism to cache the contents of Iceberg manifest files in memory. The manifest caching feature helps to reduce repeated reads of small Iceberg manifest files by Impala Coordinators and Catalogd. You configure manifest caching in CDW. You can enable or disable caching and set a few other caching properties.

Iceberg Execute Rollback feature from Impala

In the event of a problem with your table, you can reset the table to a good state, as long as a snapshot of the good state is available. You can roll back the table data based on a snapshot ID or a timestamp. The Describe History feature output includes information that helps you use snapshot-related features.
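
A minimal sketch with a hypothetical Iceberg table ice_sales and placeholder snapshot ID and timestamp values; first inspect the table history, then roll back either by snapshot ID or by timestamp:

  -- List snapshots, including their IDs and creation times
  DESCRIBE HISTORY ice_sales;

  -- Roll back to a specific snapshot ID
  ALTER TABLE ice_sales EXECUTE ROLLBACK(3105039243468472326);

  -- Or roll back to the latest snapshot older than the given timestamp
  ALTER TABLE ice_sales EXECUTE ROLLBACK('2023-05-01 10:00:00');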

Incremental rebuild of Iceberg materialized views

In this release, materialized views are updated automatically under certain conditions. After you insert data into a source table of a materialized view, an incremental rebuild, which updates only the changed part of the view, occurs, and query rewriting remains enabled. Incremental rebuilds tend to update the view much faster than full rebuilds.
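
As an illustrative sketch with hypothetical table and view names, the statements below show the standard Hive materialized view lifecycle; whether the rebuild runs incrementally or as a full rebuild depends on the conditions described above:

  CREATE MATERIALIZED VIEW mv_daily_totals AS
    SELECT order_date, SUM(amount) AS total
    FROM orders_src
    GROUP BY order_date;

  -- New data arrives in a source table of the view
  INSERT INTO orders_src VALUES (DATE '2023-05-15', 100.00);

  -- Rebuild the view; only the changed part is recomputed when an
  -- incremental rebuild is possible
  ALTER MATERIALIZED VIEW mv_daily_totals REBUILD;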

Query hint for Impala table cardinality

This release supports a new TABLE_NUM_ROWS query hint to specify a table cardinality for cases where statistics are missing or invalid.

Improvement in compaction and Impala metadata refresh

In this release, handling of the COMMIT_COMPACTION_EVENT has been improved. During compaction of ACID tables, HMS raises this event, and Impala metadata is now refreshed automatically.

Optimize Refresh/Invalidate event processing

Before this release, some metadata consistency issues led to query failures. These failures happened because metadata updates from multiple coordinators could not differentiate between self-generated events and events generated by a different coordinator. This release resolves the issue by adding a coordinator flag to each event; when processing these events, the coordinator flag is checked to decide whether to ignore the event or apply it.

Support for custom hash schema for Kudu range tables

Impala now includes CREATE TABLE and ALTER TABLE syntax to allow a Kudu custom hash schema. HASH syntax within a partition is similar to the table-level syntax except that HASH clauses must follow the PARTITION clause, and commas are not allowed within a partition. You can use the SHOW HASH SCHEMA statement to view the hash schema information for each partition.

  CREATE TABLE t1 (id int, c2 int, PRIMARY KEY(id, c2))
    PARTITION BY HASH(id) PARTITIONS 3 HASH(c2) PARTITIONS 4
    RANGE (c2)
    (
      PARTITION 0 <= VALUES < 10
      PARTITION 10 <= VALUES < 20
        HASH(id) PARTITIONS 2 HASH(c2) PARTITIONS 3
      PARTITION 20 <= VALUES < 30
    )
    STORED AS KUDU;

  ALTER TABLE t1 ADD RANGE PARTITION 30 <= VALUES < 40
    HASH(id) PARTITIONS 3 HASH(c2) PARTITIONS 4;

New shell option hs2_fp_format

This new option sets the printing format specification for floating-point values when using the HS2 protocol. By default, values are formatted using Python's built-in str() method. Use '16G' to match the Beeswax protocol's floating-point output format.

Support for complex types

Previously, a SELECT * statement did not expand complex types, to remain compatible with earlier versions of Impala that did not support complex types in the result set. Queries using SELECT * skipped complex types by default and expanded only to primitive types, even when the table contained complex-typed columns. This release adds a new query option, EXPAND_COMPLEX_TYPES, to include complex types in the SELECT * list.
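
A minimal sketch, assuming a hypothetical table events that contains both primitive and complex-typed columns; the query option name comes from this release note:

  -- Default behavior: SELECT * expands only to primitive-typed columns
  SELECT * FROM events;

  -- Include complex-typed columns (arrays, maps, structs) in the SELECT * list
  SET EXPAND_COMPLEX_TYPES=TRUE;
  SELECT * FROM events;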

Ability to view Impala query details in Hue

You can now view Impala query details, query plan, execution summary, and query metrics on the new Impala Queries tab on the Job Browser page using Hue in CDW, and use this information to tune and optimize your queries. You can also compare two queries and view all query details side-by-side. For more information, see Viewing Impala query details.

Ability to access S3 buckets from Hue with RAZ is GA

Granting fine-grained access to the per-user home directories in Amazon S3 and accessing them from the S3 File Browser in Hue using RAZ is GA. See Accessing S3 bucket from Hue in CDW with RAZ.

Ability to access ADLS Gen2 containers from Hue with RAZ is GA

Granting fine-grained access to the per-user home directories on ADLS Gen2 containers and accessing them from the ABFS File Browser in Hue using RAZ is GA. See Accessing ADLS Gen2 containers from Hue in CDW with RAZ.

The enable_queries_list configuration has been removed from Hue jobbrowser safety valve section

The enable_queries_list configuration in the CDW Hue safety valve was used to display or hide the Queries tab on the Job Browser page. This configuration has been removed. The Queries tab is displayed by default based on the type of Virtual Warehouse, whether it is Hive or Impala. You can override the query_store configuration and hide the Queries tab. For more information, see Hue configurations in Cloudera Data Warehouse.

DataFlow

This release (2.4.1-h1-b1) of Cloudera DataFlow (CDF) on CDP Public Cloud introduces various improvements and bug fixes for Flow Designer and DataFlow enablement. There are no new features.

Machine Learning

Version 2.0.38-H1 introduces the following new features and improvements:

  • Private Cluster - Private clusters can now be created on AWS. For more information, see Private cluster support.
  • Spark - INSERT events are now fired via HMS after a DML action from Spark.

Version 2.0.38-H2 introduces the following new features and improvements:

  • Embedding applications - CML applications can now be embedded in iframes in another domain. See Embed a CML application in an external website.
  • Team collaborator - It is now possible to add a team account as a collaborator on a project. This feature is supported via UI as well as API. For more information, see Adding project collaborators.
  • Project owner - It is now possible for an administrator to change the owner of a project. This feature is supported via UI as well as API. For more information, see Modifying project settings.
  • Load Balancer - You can now select from the environment’s endpoint access gateway subnets in the Subnets for Load Balancer field.

Operational Database

Cloudera Operational Database (COD) 1.30 supports scaling up COD clusters vertically and also provides a UI option to create smaller COD clusters.

COD supports scaling up the clusters vertically

COD now allows you to vertically scale up COD clusters from a Light Duty to a Medium Duty instance type. You can upgrade the instance type of the Master or Gateway nodes of a COD cluster. To learn more about vertical scaling, see Scaling COD instances vertically.

COD UI supports creating a smaller cluster using a predefined Data Lake template

COD now allows you to create a smaller cluster with one Gateway node and one Worker node using the new Micro Duty scale type when creating an operational database through the COD UI. A Micro database is a two-node cluster in which the Gateway node handles the activities of the Master and Leader nodes, removing the need for those dedicated nodes. You can use a Micro database for testing and development purposes. For more information, see Creating a database using COD.

Cloudera Operational Database (COD) 1.31 provides enhancements to the CDP CLI options.

Enhancements to the --scale-type CDP CLI option

In the CDP CLI, the --scale-type (string) option now supports three values: MICRO, LIGHT, and HEAVY. COD has added support for the additional LIGHT and HEAVY values.

  • --scale-type LIGHT corresponds to the former --master-node-type LITE
  • --scale-type HEAVY corresponds to the former --master-node-type HEAVY

Additionally, COD has removed the --master-node-type (string) <LITE,HEAVY> CDP CLI option because it is no longer needed. For more information, see CDP CLI Beta.