CDP Public Cloud: October 2021 Release Summary

DataFlow

Documentation updates

The following documentation was updated to include CLI instructions:

Data Catalog

Data Catalog introduces the following additions:

  • Support for deleting the profiler cluster

  • Updated asset details

  • Bug fixes and refreshed UI

  • Available location paths for the Backup and Restore scripts based on the Cloudera Runtime version.

Data Engineering

Apache Airflow 2

With this release, Apache Airflow 2.1 is the new default managed scheduler in CDE. It comes with governance, security and compute autoscaling enabled out-of-the-box, along with integration with CDE’s job management APIs giving users the flexibility to deploy custom DAGs that tap into Cloudera Data Platform (CDP) data services like Spark in CDE and Hive in CDW. For more information on what’s new in Airflow 2, see the upstream documentation.

[Technical Preview] Airflow pipeline authoring UI

With the CDE Pipeline Authoring UI, any CDE user irrespective of their level of Airflow expertise can create multi-step pipelines with a combination of out-of-the-box operators (CDEOperator, CDWOperator, BashOperator, PythonOperator). Nevertheless, you can still deploy your own customer Airflow DAGs (Directed Acyclic Graphs) as before, or use the Pipeline Authoring UI to bootstrap your projects for further customization. This feature is in Technical Preview and available on new CDE services only. When creating a Virtual Cluster, a new option allows you to enable the Airflow Authoring UI. For best user experience, Cloudera suggests using Google Chrome for this feature.

[Technical Preview] Email alerts

You can now configure email alerts during Virtual Cluster setup and schedule them in custom Airflow DAGs.

Kubernetes update

CDE now supports K8s 1.20.

Data Hub

Encrypting EBS volumes with a CMK generated from AWS CloudHSM

New documentation with steps for encrypting Data Hub’s EBS volumes with a CMK generated from AWS CloudHSM is available. See Encrypting EBS volumes with a CMK generated from AWS CloudHSM.

Data Hub maintenance upgrades

You can perform a maintenance (or “hotfix”) upgrade on certain Data Hub clusters. For more information, see Upgrading Data Hubs.

Data Warehouse

Hive compaction observability

An alert notification about compaction status, issues, and suggested actions appear in the overview of your Database Catalog, which leads to an informative message about the issue. The following list describes a few of more than 25 notifications:

  • Oldest initiated compaction passed threshold

  • Large number of compaction failures

  • More than one host is initiating compaction

  • Upgrade your Virtual Warehouse to pick up this feature.

For more information, see Compaction Observability.

AWS EKS Kubernetes version upgrade

The CDW application uses Kubernetes (K8S) clusters to deploy and manage Hive and Impala in the cloud. Kubernetes versions are updated every 3 months on average. When the version is updated, minor versions are deprecated. Amazon Elastic Kubernetes Service (EKS) is updating to Kubernetes version 1.20 and is ending support for version 1.17.

AWS environments you activate using this release of Cloudera Data Warehouse, and later, will use version 1.20. To avoid compatibility issues between CDW and AWS resources, you must update the version of Kubernetes that supports your existing CDW clusters to version 1.20.

If a MIGRATE icon appears in the upper right corner of the environment tile, your AWS environment has not been migrated from Helm 2 to Helm 3. Update the Helm package manager for your environment before you attempt to upgrade its Kubernetes version.

For information about upgrading the AWS EKS Kubernetes version, see Upgrade CDW for EKS upgrade.

Choosing a CDW version for a Database Catalog or Virtual Warehouse

When creating a Database Catalog or Virtual Warehouse, you can choose the latest version of CDW or an earlier version. Version mapping lists the CDW versions from 2021.0.0-b33 released on August 9, 2021 to the latest release. You can ensure backward compatibility for your scripts, for example, by running your jobs on the same version of CDW. The ability to choose a version of Database Catalog or Virtual Warehouse is available under entitlement CDW_VERSIONED_DEPLOY. For more information about this capability, see Adding a new Database Catalog and Adding a new Virtual Warehouse.

Prevention of ineffective scaling in Impala Warehouses

CDW takes advantage of IMPALA-10244 to detect the case of queries being queued because of a resource bottleneck on the Impala Coordinator. In this case, adding more executor groups does not result in the ability to run more queries, and so, the warehouse does not scale up.

Retaining PostgreSQL backups in the cluster

When you create a Cloudera Data Warehouse cluster using the CDP CLI create-cluster command, any PostgreSQL backup retention period you set on your Cloud provider side is observed by CDP. For example, if you configure backupRetentionDays in Azure or BackupRetentionPeriod in AWS, and then create a cluster using the CDP CLI, the cluster will retain the PostgreSQL backups according to your configuration.

AWS EKS Kubernetes version upgrade

AWS environments that you activate using this release of Cloudera Data Warehouse, and later, will use version Amazon Elastic Kubernetes Service (EKS) version 1.20

Machine Learning

  • API v2 - A new API for operations on projects, jobs, models, and applications is now generally available.

  • ML Workflow Discovery and Exploration (Preview) - This feature accelerates the ML development workflow with preconfigured data connections and readily available snippets. See ML Discovery and Exploration for more information.

  • Kubernetes 1.19 - Kubernetes 1.19 support has been added on AWS.

  • Kubernetes 1.20 - Kubernetes 1.20 is now supported on both AWS and Azure.

Management Console

Runtime 7.2.12

Runtime 7.2.12 is now available and can be used for registering an environment with a 7.2.12 Data Lake and creating Data Hub clusters. See Cloudera Runtime.

Medium duty Data Lakes for GCP

Medium duty Data Lakes for GCP have added an additional gateway node to provide failure resilience for UI and API clients. Load-balanced UI and API access are now available without interruption. For more information see Data Lake scale.

AWS Reference Network Architecture

The purpose of this document is to clearly articulate the networking requirements needed for setting up a functional AWS environment in CDP into which the Data Lakes and compute workloads of different types can be launched and used. See CDP Public Cloud Reference Network Architecture for AWS.

Reduced AWS permissions for the cross-account role

The reduced AWS IAM policy definitions can optionally be used for the cross-account role created for the AWS provisioning credential. See Cross-account access IAM role.

Operational Database

COD supports modified auto-scaling criteria

COD now supports an improved auto-scaling algorithm that considers the latency of the user operations. COD now prioritises user operations over system operations that results in a reduced cost of infrastructure with a minor increase in replication. COD does not trigger scaling events on the skewed latency when you process a large batch of replication edits in a single cell.

COD supports a built-in coprocessor called AggregateImplementation

COD now supports a built-in coprocessor called AggregateImplementation that facilitates aggregation function computations (min, max, sum,avg, median, std) at the region level. This yields better performance because you need not get all the data to perform these calculations. ‘AggregateImplementation’ is enabled by default, and you can use the AggregationClient in HBase to perform RegionServer side aggregation, for example, row count.