CDP Public Cloud: February 2022 Release Summary

Data Catalog

Data Catalog introduces the following addition:

  • Deleting profiler clusters with multiple scenarios

Data Engineering

This release (1.14) of the Cloudera Data Engineering (CDE) service on CDP Public Cloud introduces the following new features and improvements.

Improved handling of job resources to reduce EFS utilization

Recursive copying of frequently used and large file resources can result in very high I/O throughput and can exhaust cloud storage burst credits, leading to poor performance. To avoid excessive file copying, CDE now uses hard linking in AWS by default.
Additional enhancements are planned to improve efficiency in upcoming releases.
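
For context, a hard link exposes the same file contents under a second path without copying any bytes. A minimal Python sketch of the difference (the paths are hypothetical and must live on the same filesystem):

```python
import os
import shutil

# Copying rewrites every byte of the resource, consuming I/O throughput
# (and, on EFS, burst credits) in proportion to the file size.
shutil.copy("resources/app.jar", "staging/app-copy.jar")

# A hard link adds a second directory entry for the same inode on the same
# filesystem, so no file data is read or written.
os.link("resources/app.jar", "staging/app-link.jar")

# Both names now resolve to the same underlying data.
assert os.path.samefile("resources/app.jar", "staging/app-link.jar")
```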

Apache Iceberg support (Preview)

Apache Iceberg tables are now supported with Spark 3 virtual clusters on AWS. Use tables at petabyte scale without impacting query planning, while benefiting from efficient metadata management, snapshotting, and time-travel.
Run multi-analytic workloads by accessing those same tables in Cloudera Data Warehouse (CDW) with Hive and Impala for BI and SQL analytics (Expected in an upcoming CDW release).
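
As a rough sketch of the workflow, assuming a Spark 3 session with the Iceberg runtime available (the database, table, and snapshot id below are placeholders):

```python
from pyspark.sql import SparkSession

# Assumes a Spark 3 virtual cluster with the Iceberg runtime on the classpath;
# the database and table names are placeholders.
spark = (
    SparkSession.builder.appName("iceberg-preview-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("INSERT INTO db.events VALUES (1, current_timestamp())")

# Every commit creates a snapshot; list them via the snapshots metadata table.
spark.sql("SELECT snapshot_id, committed_at FROM db.events.snapshots").show()

# Time travel: read the table as of a snapshot id taken from the query above.
old = (
    spark.read.option("snapshot-id", 123456789)  # replace with a real snapshot id
    .format("iceberg")
    .load("db.events")
)
old.show()
```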

Remote Shuffle Service (Preview)

You can now store Spark shuffle data on remote servers. This improves resilience in case of executor loss.

Unified diagnostic bundle

A single click now generates one unified bundle containing both service logs and summary status. The bundles are stored securely in the object storage of the environment. A historical list of previously generated bundles is available for access.

Guardrails to prevent submitting jobs that do not fit resource capacity

  • CDE now automatically prevents execution of jobs that do not fit on the available resources.
  • CDE takes into account Kubernetes and system reserved resources, daemonset utilized resources, and Spark overhead factors.
  • The API returns an error: `run failed to start: requested [TYPE AND AMOUNT OF RESOURCE] is more than [THE MAXIMUM AMOUNT OF AVAILABLE RESOURCES OF THAT TYPE] allocatable per cluster node`. You can either reduce the Spark executor and driver CPU and/or memory requirements (see the sketch below), or deploy on a larger cluster.
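
When a job is rejected by the guardrails, right-sizing the request usually resolves it. A minimal PySpark sketch using standard Spark properties (the values are illustrative; in CDE these are typically set on the job or at submission time):

```python
from pyspark.sql import SparkSession

# Illustrative sizes; choose values that fit a single node's allocatable
# capacity after Kubernetes/system reservations and Spark overhead.
spark = (
    SparkSession.builder.appName("right-sized-job")
    .config("spark.driver.memory", "2g")
    .config("spark.driver.cores", "1")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)
```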

Notification email configuration can now be verified

When configuring the optional email alerts feature (Preview) during virtual cluster creation, you can now verify the SMTP settings before creating the virtual cluster.

Streamlined resource creation and re-use during job creation

You can now create a resource on the fly when creating a job. Alternatively, you can select from a list of existing resources, if any, to upload your application or DAG file. This promotes reusability of project artifacts across jobs.

Support for Kubernetes v1.21

CDE now supports Kubernetes 1.21.

Data Hub

Customer managed encryption keys for Azure disks and databases

By default, local Data Hub disks attached to Azure VMs and the PostgreSQL server instance used by the Data Lake and Data Hubs are encrypted with server-side encryption (SSE) using Platform Managed Keys (PMK), but during environment registration you can optionally configure SSE with Customer Managed Keys (CMK). For more information, refer to Adding a customer managed encryption key for Azure.

New supported AWS and Azure instance types

The following new AWS and Azure instance types are supported in Data Hub:

AWS:

  • c5.12xlarge
  • c5a.12xlarge

Azure:

  • NC6sv3
  • NC24sv3

DataFlow

Support for Kubernetes v1.21

Version 1.21 is now the default Kubernetes version for CDF. When CDF is enabled, it creates AKS/EKS clusters based on version 1.21.

Removed disk capacity and usage metrics

In previous versions of CDF, deployments and enabled DataFlow services showed disk capacity and disk usage metrics as part of their system metrics. You could also define KPIs and alerts on these metrics. Due to issues with the underlying metrics collection framework, the following metrics have been removed starting with CDF 1.1.0-h2-b1:

  • Disk Capacity (DF Service Metric)
  • Disk Capacity (Deployment System Metric)
  • Disk Usage (Deployment System Metric)

Documentation updates

Removed disk related metrics documentation.

Management Console

Runtime 7.2.14

Runtime 7.2.14 is now available and can be used for registering an environment with a 7.2.14 Data Lake and creating Data Hub clusters. See Cloudera Runtime.

Customer managed encryption keys on Azure

By default, local Data Lake, FreeIPA, and Data Hub disks attached to Azure VMs and the PostgreSQL server instance used by the Data Lake and Data Hubs are encrypted with server-side encryption (SSE) using Platform Managed Keys (PMK), but you can optionally configure SSE with Customer Managed Keys (CMK). For more information, refer to Adding a customer managed encryption key for Azure.

Support for CDW in ap-1 regional Control Plane

Cloudera Data Warehouse (CDW) is now supported in the ap-1 (Australia) regional Control Plane. To use CDW in this regional Control Plane, your CDP administrator must create a new environment.

Machine Learning

ML Discovery and Exploration

Data Connections and Snippets are now Generally Available. CML workspaces now automatically discover data connections within the CDP environment and offer connection snippets for users. For more information, see ML Discovery and Exploration.
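
The generated snippets follow roughly this shape (the connection name and query are placeholders; the exact snippet depends on the connection CML discovers):

```python
import cml.data_v1 as cmldata

# The connection name is a placeholder; use one discovered in your workspace.
conn = cmldata.get_connection("default-hive-aws")

# Run a query against the connection and fetch the result as a pandas DataFrame.
df = conn.get_pandas_dataframe("SELECT * FROM default.sample_table LIMIT 10")
print(df.head())
```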

Filter ML Runtimes

You can now filter the list of ML Runtimes that can be used in a given project.

Model technical metrics visualization (Preview)

Model technical metrics visualization is now available in CML as a preview feature.

API v2

You can now specify an input data example when you create a model build.
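
A sketch using the cmlapi Python client; note that the input_example field name shown here is an assumption for illustration, so consult the API v2 reference for the exact request schema:

```python
import cmlapi

# Assumes this runs inside a CML session with API v2 access configured.
client = cmlapi.default_client()

body = cmlapi.CreateModelBuildRequest(
    project_id="<project-id>",      # placeholder
    model_id="<model-id>",          # placeholder
    file_path="predict.py",
    function_name="predict",
    # Hypothetical field name: a sample payload documenting the input shape
    # the deployed model expects.
    input_example='{"petal_length": 1.4, "petal_width": 0.2}',
)
build = client.create_model_build(body, project_id="<project-id>", model_id="<model-id>")
```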

Backup and Restore (Preview)

CLI-based Backup and Restore of CML workspaces is now available as a preview feature for AWS only.

Support for Kubernetes v1.21 on Azure

Kubernetes 1.21 is now supported on Azure.

Replication Manager

Creating HDFS and HBase snapshot policies (Preview)

A snapshot is a point-in-time backup of HDFS files or HBase tables, captured as a set of metadata information. You can create snapshot policies for HDFS directories and HBase tables in registered classic clusters and SDX Data Lake clusters to take snapshots at regular intervals. Before you create an HDFS snapshot policy for an HDFS directory, you must enable snapshots for the directory in Cloudera Manager.
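
If you prefer to script this prerequisite, the standard HDFS command can be wrapped as follows (the directory path is a placeholder, and the command must run as an HDFS superuser):

```python
import subprocess

path = "/user/finance/reports"  # placeholder directory

# Enable snapshots for the directory; the command-line equivalent of the
# Cloudera Manager action.
subprocess.run(["hdfs", "dfsadmin", "-allowSnapshot", path], check=True)

# Verify: list the snapshottable directories visible to the current user.
subprocess.run(["hdfs", "lsSnapshottableDir"], check=True)
```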

After a snapshot policy takes a snapshot of the HDFS directory or HBase table, you can perform the following tasks:

  • Restore the snapshot to the original directory or table using the Restore Snapshot option.
  • Restore a directory or table to a different directory or table using the Restore Snapshot as option.

For more information about snapshot policies, see Snapshot policies in Replication Manager.

Note: Creating an HDFS or HBase snapshot policy is a technical preview feature. Access to preview features is provided upon request to customers for trial and evaluation. The components are provided ‘as is’ without warranty or support. Further, Cloudera assumes no liability for the use of preview components, which should be used by customers at their own risk. Contact your Cloudera account team to have this preview feature enabled in your CDP account.

Override default storage location for replicated Hive external tables in the target cluster

Administrators can override the default storage location for replicated Hive external tables in the target cluster when they create a Hive replication policy.

Before you add another path to override the default storage location, ensure that the following steps are complete in the Ranger UI:

  1. Alter the Ranger policy Default: Hive warehouse locations in the cm_s3 service to allow the Hive service to access the updated S3 bucket path locations.
  2. Manually update the Ranger and Sentry permissions.

Cannot suspend HBase replication policies

Newly created HBase replication policies can no longer be suspended. However, you can resume any existing suspended HBase replication policy.

Generate and download diagnostic bundles for replication policies

You can generate and download a diagnostic bundle for an HDFS or Hive replication policy. You can use the bundle to troubleshoot failed replication jobs or to view replication-specific diagnostic data for an HDFS or Hive replication policy.

Operational Database

COD through Phoenix-Spark connector supports Spark transactional tables using Apache OMID

COD supports the Apache OMID transactional framework, which allows Big Data applications to execute ACID transactions on top of Phoenix tables.

Transaction support in COD enables you to perform complex distributed transactions and run atomic database operations: each operation either completes fully or is rolled back. Transactions thereby adhere to the ACID properties.
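
While the Phoenix-Spark connector is used from Spark jobs, a compact way to illustrate an OMID-backed transaction on a Phoenix table is the phoenixdb Python driver against a Phoenix Query Server endpoint (the URL and table are placeholders):

```python
import phoenixdb

# Placeholder endpoint for your COD database's Phoenix Query Server.
conn = phoenixdb.connect("https://cod-gateway:8765/", autocommit=False)
cur = conn.cursor()

# TRANSACTIONAL and TRANSACTION_PROVIDER are standard Phoenix table options.
cur.execute("""
    CREATE TABLE IF NOT EXISTS ledger (
        id BIGINT PRIMARY KEY,
        amount DECIMAL
    ) TRANSACTIONAL=true, TRANSACTION_PROVIDER='OMID'
""")

# The two upserts commit atomically: either both rows land, or neither does.
cur.execute("UPSERT INTO ledger VALUES (1, 100.0)")
cur.execute("UPSERT INTO ledger VALUES (2, -100.0)")
conn.commit()
```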

COD is now bundled with HBase version 2.4.6

COD is now bundled and shipped with HBase version 2.4.6 when the CDP Runtime version is 7.2.14. Upgrade your HBase client version to ensure seamless connectivity.

COD supports custom table coprocessors

COD supports custom table coprocessors, which you can implement by extending the HBase coprocessor interfaces.

You can add table coprocessors so that HBase can run custom code on the server side against the stored data, for example to filter local minimum or maximum values during ingestion without scanning the entire table. You can also use the built-in table coprocessors from upstream HBase releases. For more information, refer to Working with custom table coprocessors.

COD supports RAZ integration from the Runtime version 7.2.11.0

COD supports RAZ integration from the Runtime version 7.2.11.0. You can grant fine-grained access to directories.

The Ranger Authorization Service (RAZ) is a fine-grained authorization service for cloud storage. As a regular individual user or as an HBase user, you can limit authorization in cloud storage down to the directory level. For more information, refer to COD integration with RAZ.

Storefile Tracking (SFT) is available as an optional feature delivered through the Cloudera Operational Database (COD) service

COD now supports Storefile Tracking (SFT) as an optional feature in Runtime 7.2.14.0.

Storefile Tracking (SFT) changes how HBase manages its files to avoid operations that are known to be suboptimal on object stores. COD enables this feature for databases deployed on AWS that use S3 for HBase storage, addressing known performance issues around flushes, compactions, and other HBase operations. For more information, refer to HBase Storefile Tracking.

COD allows you to temporarily disable Kerberos authentication for HBase clients

COD allows you to temporarily disable Kerberos authentication for HBase clients that run on Cloudera legacy products.

If your client applications run on Cloudera legacy products, they usually do not have Kerberos authentication enabled. When you try to connect to any COD instance, the connection fails because COD instances have Kerberos enabled by default. Now, you can disable Kerberos authentication in your COD instances so that HBase or Phoenix clients can connect seamlessly. For more information, refer to Disabling Kerberos authentication for HBase clients.