Release Summaries

CDP Public Cloud: February 2022 Release Summary

Data Catalog introduces the following addition:

  • Support for deleting profiler clusters in multiple scenarios

This release (1.14) of the Cloudera Data Engineering (CDE) service on CDP Public Cloud introduces the following new features and improvements.

Recursive copying of frequently used and large file resources can result in very high I/O throughput and can exhaust cloud storage burst credits, leading to poor performance. To avoid excessive file copying, CDE now uses hard linking in AWS by default.
Additional enhancements are planned to improve efficiency in upcoming releases.
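
The mechanism is easy to picture: a hard link adds a second directory entry for data that already exists, so no bytes are re-read or re-written. The Java sketch below contrasts the two approaches on a local filesystem; the paths are hypothetical, and this illustrates the general technique rather than CDE's internal implementation.

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class LinkVsCopy {
    public static void main(String[] args) throws Exception {
        Path resource = Path.of("/mnt/resources/app.jar"); // hypothetical staged resource
        Path target = Path.of("/mnt/job/app.jar");

        // A hard link creates a new directory entry pointing at the same
        // underlying data, so no I/O (and no burst credits) is spent on bytes.
        Files.createLink(target, resource);

        // The previous behavior, a full copy, re-reads and re-writes every byte:
        // Files.copy(resource, target);
    }
}
```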

Apache Iceberg tables are now supported with Spark 3 virtual clusters on AWS. Use tables at petabyte scale without impacting query planning, while benefiting from efficient metadata management, snapshotting, and time-travel.
Run multi-analytic workloads by accessing those same tables in Cloudera Data Warehouse (CDW) with Hive and Impala for BI and SQL analytics (Expected in an upcoming CDW release).
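
As a rough sketch of what this enables, the Java snippet below creates an Iceberg table through Spark SQL, lists its snapshots, and time-travels to an earlier snapshot by id. The database, table, and snapshot id are assumptions for illustration; on a CDE Spark 3 virtual cluster, the Iceberg runtime and catalog configuration are supplied for you.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IcebergTimeTravel {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("iceberg-demo")
                .getOrCreate();

        spark.sql("CREATE TABLE IF NOT EXISTS db.events (id BIGINT, payload STRING) USING iceberg");
        spark.sql("INSERT INTO db.events VALUES (1, 'first'), (2, 'second')");

        // Every commit produces a snapshot; list them from the metadata table.
        spark.sql("SELECT snapshot_id, committed_at FROM db.events.snapshots").show();

        // Time travel: read the table as of an earlier snapshot id
        // (use a real id from the listing above; this one is made up).
        Dataset<Row> earlier = spark.read()
                .option("snapshot-id", 5469127937695247553L)
                .format("iceberg")
                .load("db.events");
        earlier.show();

        spark.stop();
    }
}
```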

You can now store Spark shuffle data on remote servers. This improves resilience in case of executor loss.

A single click now generates one unified bundle containing both service logs and summary status. The bundles are stored securely in the object storage of the environment. A historical list of previously generated bundles is available for access.

  • CDE now automatically prevents execution of jobs that do not fit within the available resources.
  • CDE takes into account Kubernetes and system reserved resources, resources utilized by daemonsets, and Spark overhead factors.
  • The API returns the error `run failed to start: requested [TYPE AND AMOUNT OF RESOURCE] is more than [THE MAXIMUM AMOUNT OF AVAILABLE RESOURCES OF THAT TYPE] allocatable per cluster node`. You can either reduce the Spark executor and driver CPU and/or memory requirements, or deploy on a larger cluster. The sketch after this list illustrates the arithmetic.
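
A minimal sketch of that fit check with stand-in numbers; the reserved, daemonset, and overhead values below are assumptions for illustration, not the exact values CDE applies.

```java
/** Illustrative arithmetic only; all numbers are assumed, not CDE's. */
public class ResourceFitCheck {
    public static void main(String[] args) {
        double nodeMemGiB = 64.0;     // physical memory on a cluster node
        double kubeReservedGiB = 4.0; // Kubernetes and system reserved (assumed)
        double daemonSetGiB = 2.0;    // daemonset pods present on every node (assumed)
        double allocatableGiB = nodeMemGiB - kubeReservedGiB - daemonSetGiB;

        double executorMemGiB = 56.0; // spark.executor.memory requested by the job
        // Spark pads executor pods with memory overhead; 10% mirrors the
        // open-source default of max(384 MiB, 0.10 * executor memory).
        double podRequestGiB = executorMemGiB * 1.10;

        if (podRequestGiB > allocatableGiB) {
            System.out.printf(
                "run failed to start: requested %.1f GiB memory is more than "
                + "%.1f GiB allocatable per cluster node%n",
                podRequestGiB, allocatableGiB);
        }
    }
}
```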

When configuring the optional email alerts feature (Preview) during virtual cluster creation, you can now verify the SMTP settings before creating the virtual cluster.

You can now create a resource on the fly when creating a job. Alternatively, you can select from a list of existing resources, if any, to upload your application or DAG file. This promotes reusability of project artifacts across jobs.

CDE now supports Kubernetes 1.21.

By default, local Data Hub disks attached to Azure VMs and the PostgreSQL server instance used by the Data Lake and Data Hubs are encrypted with server-side encryption (SSE) using Platform Managed Keys (PMK), but during environment registration you can optionally configure SSE with Customer Managed Keys (CMK). For more information, refer to Adding a customer managed encryption key for Azure.

The following new AWS compute-optimized and Azure GPU-based instance types are supported in Data Hub:

AWS:

  • c5.12xlarge
  • c5a.12xlarge

Azure:

  • NC6s_v3
  • NC24s_v3

Version 1.21 is now the default Kubernetes version for CDF. When CDF is enabled, it creates AKS/EKS clusters based on version 1.21.

In previous versions of CDF, deployments and enabled DataFlow services showed disk capacity and disk usage metrics as part of their system metrics. You were also able to define KPIs and alerts on these metrics. Due to issues with the underlying metrics collection framework, the following metrics have been removed starting with CDF 1.1.0-h2-b1:

  • Disk Capacity (DF Service Metric)
  • Disk Capacity (Deployment System Metric)
  • Disk Usage (Deployment System Metric)

The disk-related metrics documentation has been removed accordingly.

Runtime 7.2.14 is now available and can be used for registering an environment with a 7.2.14 Data Lake and creating Data Hub clusters. See Cloudera Runtime.

By default, local Data Lake, FreeIPA, and Data Hub disks attached to Azure VMs and the PostgreSQL server instance used by the Data Lake and Data Hubs are encrypted with server-side encryption (SSE) using Platform Managed Keys (PMK), but you can optionally configure SSE with Customer Managed Keys (CMK). For more information, refer to Adding a customer managed encryption key for Azure.

Cloudera Data Warehouse (CDW) is now supported in the ap-1 (Australia) regional Control Plane. To use CDW in this regional Control Plane, your CDP administrator must create a new environment.

Data Connections and Snippets are now Generally Available. CML workspaces now automatically discover data connections within the CDP environment and offer connection snippets for users. For more information, see ML Discovery and Exploration.

You can now filter the list of ML Runtimes that can be used in a given project.

Model technical metrics visualization is now available in CML as a preview feature.

You can now specify an input data example when you create a model build.

CLI-based Backup and Restore of CML workspaces is now available as a preview feature for AWS only.

Kubernetes 1.21 is now supported on Azure.

A snapshot is a point-in-time backup of HDFS files or HBase tables, captured as a set of metadata information. You can create snapshot policies for HDFS directories and HBase tables in registered classic clusters and SDX Data Lake clusters to take snapshots at regular intervals. Before you create a snapshot policy for an HDFS directory, you must enable snapshots for the directory in Cloudera Manager.

After a snapshot policy takes a snapshot of the HDFS directory or HBase table, you can perform the following tasks:

  • Restore the snapshot to the original directory or table using the Restore Snapshot option.
  • Restore a directory or table to a different directory or table using the Restore Snapshot as option.

For more information about snapshot policies, see Snapshot policies in Replication Manager.
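
For HBase, these policy actions correspond to operations the standard HBase Admin API exposes; Replication Manager schedules and manages them for you. A minimal Java sketch of the programmatic equivalents, with hypothetical table and snapshot names:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class HBaseSnapshotExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableName table = TableName.valueOf("my_table");

            // Point-in-time snapshot: metadata only, no bulk data copy.
            admin.snapshot("my_table_snap", table);

            // "Restore Snapshot": roll the original table back to the snapshot.
            admin.disableTable(table);
            admin.restoreSnapshot("my_table_snap");
            admin.enableTable(table);

            // "Restore Snapshot as": materialize the snapshot as a new table.
            admin.cloneSnapshot("my_table_snap", TableName.valueOf("my_table_copy"));
        }
    }
}
```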

Note: Creating an HDFS or HBase snapshot policy is a technical preview feature. Access to preview features is provided upon request to customers for trial and evaluation. The components are provided ‘as is’ without warranty or support. Further, Cloudera assumes no liability for the use of preview components, which should be used by customers at their own risk. Contact your Cloudera account team to have this preview feature enabled in your CDP account.

Administrators can override the default storage location for replicated Hive external tables in the target cluster when they create a Hive replication policy.

Before you add another path to override the default storage location, ensure that the following steps are complete in the Ranger UI:

  1. Alter the Ranger policy Default: Hive warehouse locations in the cm_s3 service to allow the Hive service to access the updated S3 bucket path locations.
  2. Manually update the Ranger and Sentry permissions.

You can no longer suspend newly created HBase replication policies. However, you can resume any existing suspended HBase replication policy.

You can generate and download a diagnostic bundle for an HDFS or Hive replication policy. You can use the bundle to troubleshoot failed replication jobs or to view replication-specific diagnostic data for an HDFS or Hive replication policy.

COD supports the Apache OMID transactional framework, which allows big data applications to execute ACID transactions on top of Phoenix tables.

The transaction support in COD enables you to perform complex distributed transactions and run atomic database operations: each operation either completes entirely or has no effect. Transactions ensure adherence to the ACID properties.
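
A minimal sketch of what an atomic multi-row write looks like from a client, assuming the Phoenix JDBC driver and OMID transaction support enabled on the COD database; the connection URL and table are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class OmidTransactionExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("phoenix.transactions.enabled", "true"); // client opts in

        // Hypothetical ZooKeeper quorum; use your COD connection string instead.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181", props)) {
            conn.setAutoCommit(false);
            try (Statement stmt = conn.createStatement()) {
                // Declaring OMID as the provider makes the table transactional.
                stmt.execute("CREATE TABLE IF NOT EXISTS accounts "
                        + "(id BIGINT PRIMARY KEY, balance BIGINT) "
                        + "TRANSACTIONAL=true, TRANSACTION_PROVIDER='OMID'");

                // Both writes commit atomically: either both become visible or neither does.
                stmt.executeUpdate("UPSERT INTO accounts VALUES (1, 900)");
                stmt.executeUpdate("UPSERT INTO accounts VALUES (2, 1100)");
                conn.commit(); // conn.rollback() would discard both writes
            }
        }
    }
}
```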

COD now ships bundled with HBase version 2.4.6 when the CDP Runtime version is 7.2.14.

COD is now bundled with HBase version 2.4.6 for smoother, more reliable operation. Upgrade your HBase client to a matching version to ensure seamless connectivity.

COD supports custom table coprocessors, which you can implement by extending the HBase coprocessor interfaces.

You can add table coprocessors so that HBase runs custom code on the server side against the stored data, for example to track a local minimum or maximum value during ingestion without scanning the entire table. You can also use built-in table coprocessors from the upstream HBase releases. For more information, refer to Working with custom table coprocessors.
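
As an illustration of the pattern, the sketch below is a hypothetical HBase 2.x RegionObserver that tracks a per-region maximum at ingest time, so the question can later be answered without a table scan. The column family, qualifier, and value encoding are assumptions; attaching the class to a table follows the linked documentation.

```java
import java.io.IOException;
import java.util.List;
import java.util.Optional;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessor;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.coprocessor.RegionObserver;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.wal.WALEdit;

/** Tracks the maximum ingested value per region, with no table scan. */
public class MaxValueObserver implements RegionCoprocessor, RegionObserver {

    private volatile long regionMax = Long.MIN_VALUE;

    @Override
    public Optional<RegionObserver> getRegionObserver() {
        return Optional.of(this);
    }

    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, Durability durability) throws IOException {
        // Assumed schema: column family "cf", qualifier "v", value is a long.
        List<Cell> cells = put.get(Bytes.toBytes("cf"), Bytes.toBytes("v"));
        for (Cell cell : cells) {
            long value = Bytes.toLong(CellUtil.cloneValue(cell));
            if (value > regionMax) {
                regionMax = value; // local (per-region) maximum, updated at ingest time
            }
        }
    }
}
```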

COD supports RAZ integration starting with Runtime version 7.2.11.0. You can grant fine-grained access to directories.

The Ranger Authorization Service (RAZ) provides fine-grained authorization for cloud storage. As a regular individual user or as an HBase user, you can scope authorization in cloud storage down to the directory level. For more information, refer to COD integration with RAZ.

COD now supports Storefile Tracking (SFT) as an optional feature in Runtime 7.2.14.0.

Storefile Tracking changes how HBase manages its files to avoid operations that are known to be suboptimal on object stores. COD enables this feature for databases deployed on AWS that use S3 for HBase storage, addressing known performance issues around flushes, compactions, and other HBase operations. For more information, refer to HBase Storefile Tracking.
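
COD turns this on for qualifying databases itself, but for a sense of the mechanics, the sketch below shows how upstream HBase lets you request the file-based store file tracker per table through the table descriptor. The property name follows the upstream HBASE-26067 work, the table name is hypothetical, and on COD you would not normally need to set this yourself.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class SftTableExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // "FILE" selects the file-based tracker, which avoids the
            // rename-on-commit pattern that is slow and non-atomic on S3.
            admin.createTable(TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("sft_table"))
                    .setValue("hbase.store.file-tracker.impl", "FILE")
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                    .build());
        }
    }
}
```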

COD allows you to temporarily disable Kerberos authentication for HBase clients that run on Cloudera legacy products.

If your client applications run on Cloudera legacy products, they usually do not have Kerberos authentication enabled. When such a client tries to connect to a COD instance, the connection fails because COD instances have Kerberos enabled by default. Now, you can disable Kerberos authentication on your COD instances so that HBase or Phoenix clients can connect seamlessly. For more information, refer to Disabling Kerberos authentication for HBase clients.
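
For context, the sketch below shows what a legacy client's connection can look like once Kerberos is disabled on the COD side: the client simply stays on simple authentication, HBase's default. The quorum host and table name are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LegacyClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "cod-zk-host"); // placeholder quorum
        // With Kerberos disabled on the COD instance, the legacy client keeps
        // the default simple authentication instead of configuring "kerberos".
        conf.set("hbase.security.authentication", "simple");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("my_table"))) {
            table.get(new Get(Bytes.toBytes("row1")));
        }
    }
}
```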