CDP Public Cloud: November 2022 Release Summary

Data Engineering

This release (1.18) of the Cloudera Data Engineering Service on CDP Public Cloud introduces new features and improvements:

Updated CDE user interface

The user interface for CDE 1.17 and above has been updated with easy access to commonly used pages, a new Home page, and a Virtual Cluster drop-down menu that allows you to view relevant content for each Virtual Cluster that you select. Only users who have a CDE Service on 1.18 and create new Virtual Clusters on 1.18 will see the changes. Users on older versions will continue to have access to the old UI. The following user interface changes were made:

Left-hand menu displays the following:

  • Home - New landing page that displays Virtual Clusters and convenient quick-access links.
  • Jobs - Displays jobs for the Virtual Cluster that you select from the drop-down menu in the upper left-hand corner.
  • Job Runs - Displays the run history of all jobs within a selected Virtual Cluster.
  • Resources - Displays resources created within a selected Virtual Cluster.
  • Administration - Displays services and Virtual Clusters that can be customized (previously known as the Overview page).

If you’re using a browser in incognito mode, you’ll need to allow all cookies in your browser settings so that you can view the following CDE pages: Pipelines, Spark, and Airflow.

Airflow performance

Airflow scaling improvements include support for 1500 DAGs on AWS and about 300 to 500 DAGs when deploying on Azure. For more information, see Apache Airflow scaling and tuning considerations.

Support for the eu-1 (Germany) and ap-1 (Australia) regional Control Plane

CDE is now supported on the eu-1 (Germany) and ap-1 (Australia) regional Control Planes. The CDP administrator must create a new environment within the eu-1 or ap-1 regional Control Plane before using CDE.

Support for workload secrets using API

CDE now provides a secure way to create and store workload secrets for CDE Spark jobs, a more secure alternative to storing credentials in plain text embedded in your application or job configuration. For more information, see Managing workload secrets with Cloudera Data Engineering Spark jobs using API.

Java Virtual Machine Debugger (Tech preview)

Attaching a remote Java Virtual Machine (JVM) debugger to a CDE Spark job is now supported as a technical preview feature. For more information, see Using Java Virtual Machine (JVM) debugger with Apache Spark jobs in Cloudera Data Engineering (Preview).
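The CDE-specific steps are described in the linked documentation. As general background only, a remote JVM debugger attaches through a standard JDWP agent option passed to the driver (or executor) JVM via the job's Spark configuration; the port and suspend behavior below are placeholder choices, not CDE defaults:

    spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005

With suspend=y, the driver waits for a debugger (for example, IntelliJ IDEA or Eclipse) to attach on the given port before the job starts executing.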

Hive Warehouse Connector tables

Hive Warehouse Connector (HWC) tables are now supported with Spark 3 in CDE.
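The following is a minimal PySpark sketch of querying a Hive table through HWC in a Spark 3 job. It assumes the HWC jar and its pyspark_llap Python package are attached to the CDE job and that the connector configuration (HiveServer2 and metastore endpoints, read mode) is set as described in the HWC documentation; the table name is a placeholder.

    # Sketch only: query a Hive managed table through the Hive Warehouse Connector.
    # Assumes the HWC jar, the pyspark_llap package, and the connector
    # configuration are already attached to the CDE Spark 3 job.
    from pyspark.sql import SparkSession
    from pyspark_llap import HiveWarehouseSession

    spark = SparkSession.builder.appName("hwc-example").getOrCreate()
    hive = HiveWarehouseSession.session(spark).build()

    df = hive.sql("SELECT * FROM default.my_acid_table LIMIT 10")  # placeholder table
    df.show()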

Backup & Restore in object storage

Remote backup storage (object store) is now supported; previously, only backup to and restore from local storage was available. This feature is available through the CLI and API only. For more information, see Backing up Cloudera Data Engineering jobs and Restoring Cloudera Data Engineering jobs from backup.

Limitations for raw Scala code in CDE

Limitations now apply when running raw Scala code in CDE. For details, see Running raw Scala code in Cloudera Data Engineering.

Cloudera Data Warehouse

This release of the Cloudera Data Warehouse (CDW) service on CDP Public Cloud introduces these changes:

AWS EKS Kubernetes version upgrade

The CDW application uses Kubernetes (K8s) clusters to deploy and manage Hive and Impala in the cloud. Kubernetes versions are updated every three months on average, and older minor versions are deprecated as new versions are adopted.

To avoid compatibility issues between CDW and AWS resources, you must upgrade the version of Kubernetes that supports your existing CDW clusters to version 1.21.

AWS environments that you activate using this release of Cloudera Data Warehouse, or later, will use version 1.22.

Important - Check to make sure your AWS environment has been migrated from Helm 2 to Helm 3 before you begin upgrading the Kubernetes version.

If a MIGRATE icon appears in the upper right corner of the environment tile, your AWS environment has not been migrated from Helm 2 to Helm 3. Update the Helm package manager for your environment before you attempt to upgrade its Kubernetes version.

For information about upgrading the AWS EKS Kubernetes version, see Upgrade CDW for EKS upgrade.

Support for Amazon Elastic Kubernetes Service 1.22 cluster

This release 1.5.1-b110 (released November 22, 2022) automatically uses and provisions Amazon Elastic Kubernetes Service (EKS) 1.22 when you activate an environment in CDW. In this release, upgrading a cluster to EKS 1.22 is not supported.

Forward logs to your observability system

In this release, you can forward logs from environments activated in Cloudera Data Warehouse (CDW) to observability and monitoring systems such as Datadog, New Relic, or Splunk. You configure log forwarding for a CDW environment in the UI by providing a fluentd configuration.
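As an illustration only, a fluentd snippet that forwards all matched log events to a Splunk HTTP Event Collector might look like the following; the splunk_hec plugin name, its parameters, and the host and token values are assumptions here, so use the output plugin and settings required by your observability system.

    <match **>
      @type splunk_hec
      hec_host splunk.example.com
      hec_port 8088
      hec_token REPLACE_WITH_HEC_TOKEN
    </match>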

Rebuild a Database Catalog or Virtual Warehouse

When you rebuild a Database Catalog or a Virtual Warehouse, you clean up its resources and redeploy it using your existing DBC or VW image, or the latest available image. You might consider rebuilding to get a feature in a later DBC or VW image, to perform housekeeping, or to troubleshoot a problem.

AWS compute instance type selection is available using the CDP CLI

This release of CDW supports selection of a single AWS compute instance type and additional fallback instance types. From the CDP CLI, you can select the following options:

  • computeInstanceTypes - An array of AWS compute instance types from which you can select a single instance type for the environment of your Virtual Warehouse to use.

  • additionalInstanceTypes - An array of additional AWS instance types to fail over to, in priority order, if the compute instance type is unavailable. You can specify one of the additional instance types as the default if you do not configure a compute instance type.

Use the list-clusters and describe-cluster commands to see cluster settings. For details, see the CDP CLI documentation.
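As an illustration only, a request that pins one compute instance type with two fallbacks could carry a fragment like the following; the instance type values are examples, and the exact command and option names are documented in the CDP CLI reference.

    {
      "computeInstanceTypes": ["r5d.4xlarge"],
      "additionalInstanceTypes": ["r5.4xlarge", "m5d.4xlarge"]
    }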

Cloudera Data Warehouse audit events

You can use the CDP CLI to retrieve audit events that occur in CDW. The CDW audit events are listed here.

For more information about experimental CLI commands for Cloudera Data Warehouse, go to Version Mapping. Click the CDP CLI Reference link for your CDW version. Scroll to Available Commands, and click dw.

Data Analytics Studio (DAS) is deprecated

DAS has been deprecated and is disabled by default. It will be removed in future releases. Cloudera encourages you to use Hue to run Hive LLAP workloads. If you need to use DAS, then you can enable it by setting the das.event-pipeline.enabled property to true in the Database Catalog configurations. For more information, see Enabling Data Analytics Studio in CDW Public Cloud.

Hue Query Processor scan frequency decreased to 5 minutes

The Hue Query Processor scans the event processor pipeline to retrieve the Hive query history and query details and displays them on the Job Browser page. The scan frequency has been decreased from once every 2 milliseconds to once every 5 minutes to optimize resource utilization. As a result, you may notice a delay in viewing the query history and query details on the Job Browser page for queries that finish executing in less than 5 minutes. However, you can still view the query history from the Query history tab below the query editor.

Cloudera Machine Learning

Release notes and fixed issues for version 2.0.34 (released November 29, 2022):

New Features and improvements

  • Suspend/Resume Workspace - Administrators can optimize cloud costs by suspending CML Workspaces for weekends and resume operation for business hours. See Suspend and resume ML workspaces for details.
  • PBJ Runtimes - PBJ Workbench Runtimes are now GA, providing a more consistent experience with the Jupyter ecosystem. Because proprietary code has been eliminated, you can now build a new Runtime from scratch on your own base image with your selected kernel, without needing to build on top of a Cloudera image.
  • Experiment tracking - A new Experiments feature built on MLflow is now available. You can now track your experiments in CML sessions with the preinstalled mlflow library and visualize and compare them in the CML application; see the sketch after this list. This feature is enabled by default.
  • Customizable Scratch Space - You can now configure the amount of ephemeral storage space (also known as scratch space) a CML session, job, application or model can use. This feature helps in better scheduling of CML pods, and provides a safety valve to ensure runaway computations do not consume all available scratch space on the node. By default, each user pod in CML is allowed to use up to 10 GB of scratch space.
  • Iceberg support - Iceberg v2 is supported via Spark Runtime Addon, based on the CDE 1.17-h1 Runtime Addon version.
  • Jobs UI - The Jobs List page and Job Details page now display the job ID and Created At time.
  • Projects UI - When you select a runtime kernel on the project creation page, only the latest standard version of the selected kernel is added to the project.
  • Data tab - The Cloudera Data Visualization application that provides the Data tab experience has been upgraded to v7.0.2. You can review the changes here.
  • Applications - Custom polling endpoints for Applications can now be specified in the UI.
  • Register new runtimes via APIv2 - Site administrators can register new ML Runtimes using the APIv2.
  • Runtime addon management - Site administrators can now also disable or deprecate specific Runtime Addons.
  • Environmental variables - The new project environmental variable PROJECT_OWNER holds the username of the project owner.
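A minimal sketch of experiment tracking in a CML session with the preinstalled mlflow library follows; the experiment, parameter, and metric names are placeholders, and the new PROJECT_OWNER environment variable is recorded as a run tag for illustration.

    # Sketch only: log parameters and a metric to a CML experiment with mlflow.
    # Experiment, parameter, and metric names are placeholders.
    import os
    import mlflow

    mlflow.set_experiment("churn-model-tuning")

    with mlflow.start_run():
        mlflow.set_tag("project_owner", os.environ.get("PROJECT_OWNER", "unknown"))
        mlflow.log_param("learning_rate", 0.01)
        mlflow.log_param("n_estimators", 200)
        mlflow.log_metric("auc", 0.91)

Runs logged this way appear in the CML Experiments view for the project, where they can be visualized and compared.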

Fixed issues

  • Python packages (DSE-21313) - Fixed an issue where Python packages installed via pip install may not be imported correctly until a new session is started.
  • Jobs (DSE-21771) - Fixed an issue where Jobs scheduled with PBJ Runtimes may terminate with a Success state even when there were errors reported in job runs.
  • PBJ Runtimes (DSE-19971) - Fixed an issue where comment blocks may not be rendered correctly in sessions or jobs launched with PBJ Runtimes.
  • Models (DSE-22527) - Fixed an issue where the Model Monitoring Chart was displaying data from other projects or deployments.

Operational Database

Cloudera Operational Database (COD) version 1.25 (November 10, 2022) supports creating and updating an operational database using a custom image.

COD supports custom images for deploying COD clusters

COD now allows you to create or update a database using a custom image. Custom images can be used for various purposes, such as compliance or security requirements. An image catalog is used to hold one or more custom images. You can inherit pre-installed packages or software from the custom image while creating or updating an operational database.

You can also switch the image catalog of an existing operational database. For more information, see Managing custom images in COD.

Management Console

This release of the Management Console service introduces the following changes:

Data Lake scaling

You can now scale up a light duty Data Lake to the medium duty form factor, which has greater resiliency than light duty and can service a larger number of clients. You can trigger the scale-up in the CDP UI or through the CDP CLI. For more information, see Data Lake scaling.

Data Lake backup and restore for GCP

Backing up and restoring a GCP Data Lake is now supported. For more information, see Backup and restore for the Data Lake.