CDP Public Cloud: May 2022 Release Summary

Data Hub

Cloudera Runtime 7.2.15

Cloudera Runtime 7.2.15 is now available and can be used for registering an environment with a 7.2.15 Data Lake and creating Data Hub clusters. For more information about the new Runtime version, see Cloudera Runtime. For all supported CDP Public Cloud Runtime versions, see Supported Cloudera Runtime and Cloudera Manager versions. If you need to upgrade your existing CDP environment, refer to Data Lake upgrade and Data Hub upgrade documentation.

Resilient scaling improvements

It is possible to upscale a Data Hub cluster even if there are unhealthy nodes in the stack. An upscale operation will finish successfully even if one of the new nodes fails to come up. Nodes that fail during an upscale will be marked as “zombie” nodes. For more information, see Resize a cluster.

Extended upgrade version support for RAZ-enabled environments

Data Hub major/minor version upgrades for RAZ-enabled clusters are now available for Runtime versions 7.2.10-7.2.12 to 7.2.14+.
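As an illustration only, the version range above can be expressed as a small check. This sketch assumes that "7.2.10-7.2.12 to 7.2.14+" means an inclusive source range of 7.2.10 through 7.2.12 and a target of 7.2.14 or later. The function names are hypothetical and are not part of any CDP tool, so confirm actual upgrade paths in the Data Hub upgrade documentation.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted Runtime version such as '7.2.12' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def is_supported_raz_upgrade(source: str, target: str) -> bool:
    """Check a source/target pair against the range quoted in this release note.

    Assumption: the source range 7.2.10-7.2.12 is inclusive, and "7.2.14+"
    means 7.2.14 or any later version.
    """
    src = parse_version(source)
    dst = parse_version(target)
    return (7, 2, 10) <= src <= (7, 2, 12) and dst >= (7, 2, 14)


if __name__ == "__main__":
    for pair in [("7.2.11", "7.2.15"), ("7.2.9", "7.2.14"), ("7.2.12", "7.2.13")]:
        print(pair, is_supported_raz_upgrade(*pair))
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as "7.2.9" sorting after "7.2.10".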

Support for new AWS instance types

The following new AWS instance types are now supported in Data Hub:

  • m5ad.xlarge
  • m5ad.2xlarge
  • m5ad.4xlarge
  • m5ad.8xlarge
  • m5ad.12xlarge
  • m5ad.16xlarge
  • m5ad.24xlarge
  • m5d.xlarge
  • m5d.2xlarge
  • m5d.4xlarge
  • m5d.8xlarge
  • m5d.12xlarge
  • m5d.16xlarge
  • m5d.24xlarge
  • r5ad.12xlarge
  • r5ad.24xlarge
  • r5d.12xlarge
  • r5d.24xlarge

Data Visualization

The following new features and improvements have been added in May 2022:

  • New CML engine: docker.repository.cloudera.com/cloudera/cdv/cmldataviz:6.4.0-b12

  • New CDSW engine: docker.repository.cloudera.com/cloudera/cdv/cdswdataviz:6.4.0-b12

  • Extended SAML authentication with CDP group syncing for CDW and fixed a backward-compatibility issue with legacy SAML backends. Additionally, user permissions are now fully managed by CDP groups for CDW deployments.

  • Updated the Machine Learning Runtime version of CDV to support additional metadata stores.

  • Users can now clone a dashboard from the View mode.

  • It is now easier to upload files including images, fonts, and JavaScript packages for use in custom styles, emails, or rich text visuals.

  • Direct access to data through SQL is now available on additional connection types. To make the query editor easier to use, users can send a query with the Cmd + Enter keyboard shortcut.

  • Users can now move a variety of “picklist” filter types to the dashboard canvas.

  • A new site setting is available that disables descendant information for dashboards. It is intended for CDV instances where Data tab performance needs to be optimized.

  • In CDV 7.0.0, the Bootstrap and jQuery libraries have received major updates. Bootstrap was upgraded from 3.3.7 to 4.6.0 and jQuery was upgraded from 2.2.4 to 3.6.0. This change may affect customers who created Custom Styles in earlier versions of CDV, because those styles may reference legacy Bootstrap classes in their CSS or legacy jQuery methods in their JavaScript. For more information, see the Custom Styles Migration Guide.

  • In CDV 7.0.0, you can query your data directly through a new SQL tab in the top navigation bar, and a new homepage layout has been introduced so you can quickly add and access recent queries, data connections, and datasets alongside their dashboards and applications. From the SQL tab, you can now create a new dataset or dashboard from a query with a single click. Additionally, when editing a dashboard, updating a SQL-based dataset is even easier with the new Edit Dataset SQL option from the in-visual options menu.

  • Data Extracts are now available as a preview. Data Extracts help with managing the analytical capabilities, performance, concurrency, and security of data access across data connections. For more information, contact your Cloudera representative.

DataFlow

This release (2.0.0-b302) of Cloudera DataFlow (CDF) on CDP Public Cloud introduces the following new features and improvements:

Inbound Connection Endpoints for Flow Deployments

You can now deploy NiFi flows containing processors like ListenHTTP, ListenTCP, ListenUDP, ListenSyslog, and so on, and let CDF take care of exposing a stable and secure endpoint to connect your clients to.

Flow Deployments support for Apache NiFi 1.16

Flow Deployments now support the latest Apache NiFi 1.16 release.

New ReadyFlows

Seven new ReadyFlows are available in the ReadyFlow Gallery:

  • Azure Event Hub to ADLS ReadyFlow
  • Confluent Cloud to S3/ADLS ReadyFlow
  • Confluent Cloud to Snowflake ReadyFlow
  • JDBC to S3/ADLS ReadyFlow
  • ListenHTTP to CDP Kafka ReadyFlow
  • Non-CDP ADLS to CDP ADLS ReadyFlow
  • Non-CDP S3 to CDP S3 ReadyFlow

Single AZ support for DataFlow clusters

CDF Kubernetes nodes are now always deployed in a single Availability Zone (AZ). When enabling CDF, you still have to provide subnets in at least two Availability Zones to satisfy Kubernetes control plane requirements.

Value of javax.net.ssl.trustStore property no longer set to existing trust store on NiFi node

The javax.net.ssl.trustStore system property is no longer set by default to the existing truststore on the NiFi node. Flows that rely on a truststore being available by default through the JVM should be updated to use the SSLContextService. This change may cause errors in older existing processors or custom processors that do not have a property to reference an SSL Context Service.

CDP CLI support for DataFlow

CDF-PC integration into the CDP CLI is now generally available. Make sure you run the latest CDP CLI version to use cdp df commands.

Note: When release 2.0.0-b302 becomes generally available, the CDP CLI version supporting cdp df commands may not be available yet. In this case, use the Beta CDP CLI.

Other features and changes:

  • Flow Deployments on Azure now use Premium SSD disks to store NiFi’s various repositories.

  • Python 3.9 is now installed on NiFi containers, allowing users to run Python scripts through NiFi processors. The requests and urllib3 libraries are also installed by default.

  • Information level alerts are now displayed in the Dashboard to inform users about an action they should perform for a Flow Deployment without changing the Flow Deployment status to Concerning.

  • Deployments can now be restarted from the Deployment Manager to perform actions that require a NiFi process restart.

  • Filters that have been set in the Dashboard now persist after a user navigates to the Deployment Manager and returns to the Dashboard.

  • When enabling DataFlow, you can now specify CIDRs as a comma-separated list to configure Load Balancer and Kubernetes API Server Endpoint Access.
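Before pasting a comma-separated CIDR list into the enablement form, it can help to validate it locally. The following is a minimal sketch using Python's standard ipaddress module. The function name and sample values are hypothetical, and the strict check (rejecting entries with host bits set) may be stricter than what CDF itself requires.

```python
import ipaddress


def parse_cidr_list(value: str) -> list:
    """Split a comma-separated string into validated IPv4/IPv6 networks.

    Raises ValueError if any entry is not a valid CIDR block.
    """
    networks = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas and doubled separators
        # strict=True rejects entries such as 10.0.0.1/16 that have host bits set
        networks.append(ipaddress.ip_network(item, strict=True))
    return networks


if __name__ == "__main__":
    # Sample values only; substitute your own ranges.
    for network in parse_cidr_list("10.0.0.0/16, 192.168.1.0/24"):
        print(network)
```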

Management Console

Cloudera Runtime 7.2.15

Cloudera Runtime 7.2.15 is now available and can be used for registering an environment with a 7.2.15 Data Lake and creating Data Hub clusters. For more information about the new Runtime version, see Cloudera Runtime. For all supported CDP Public Cloud Runtime versions, see Supported Cloudera Runtime and Cloudera Manager versions. If you need to upgrade your existing CDP environment, refer to Data Lake upgrade and Data Hub upgrade documentation.

New permissions were added to the default cross-account AWS policy

The cross-account access IAM role that is used for the CDP credential has been changed to include a set of new permissions required for Cloudera Data Engineering (CDE), Cloudera DataFlow (CDF), and Cloudera Machine Learning (CML). The new AWS permissions are required to simplify the creation of the Kubernetes cluster on AWS. As a result of this change, all customers using or planning to use CDE, CDF, or CML in CDP Public Cloud on AWS must update their existing cross-account permissions to ensure that these three data services can be created, enabled, or updated.

If you are using or planning to use CDE, CDF, or CML, add the following permissions to the cross-account role:

{
  "Effect": "Allow",
  "Action": [
    "ssm:DescribeParameters",
    "ssm:GetParameter",
    "ssm:GetParameters",
    "ssm:GetParameterHistory",
    "ssm:GetParametersByPath"
  ],
  "Resource": [
    "arn:aws:ssm:*:*:parameter/aws/service/eks/optimized-ami/*"
  ]
}
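If you maintain the cross-account policy as a JSON document, one way to apply this change is to append the statement programmatically. The sketch below is illustrative only: the helper name add_statement and the sample existing policy are hypothetical, and the code only edits the JSON locally. It does not call AWS, so you still need to upload the result through your usual IAM workflow.

```python
import copy
import json

# Statement copied from this release note.
EKS_AMI_SSM_STATEMENT = {
    "Effect": "Allow",
    "Action": [
        "ssm:DescribeParameters",
        "ssm:GetParameter",
        "ssm:GetParameters",
        "ssm:GetParameterHistory",
        "ssm:GetParametersByPath",
    ],
    "Resource": [
        "arn:aws:ssm:*:*:parameter/aws/service/eks/optimized-ami/*"
    ],
}


def add_statement(policy: dict, statement: dict) -> dict:
    """Return a copy of `policy` with `statement` appended if not already present."""
    updated = copy.deepcopy(policy)
    statements = updated.setdefault("Statement", [])
    if isinstance(statements, dict):
        # IAM also accepts a single statement object; normalize it to a list.
        statements = [statements]
        updated["Statement"] = statements
    if statement not in statements:
        statements.append(copy.deepcopy(statement))
    return updated


if __name__ == "__main__":
    # Hypothetical existing policy, used only to demonstrate the merge.
    existing = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["ec2:DescribeInstances"], "Resource": "*"}
        ],
    }
    print(json.dumps(add_statement(existing, EKS_AMI_SSM_STATEMENT), indent=2))
```

Because the helper checks for an identical statement before appending, running it twice on the same policy does not add a duplicate.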

Data Lake scaling (Preview)

Data Lake scaling is the process of scaling up a light duty Data Lake to the medium duty form factor, which has greater resiliency than light duty and can service a larger number of clients. You can trigger the scale-up in the CDP UI or through the CDP CLI. See Data Lake scaling (Preview).

Note: You need to contact Cloudera to have this feature enabled.

Support for Replication Manager in ap-1 and eu-1 regional Control Planes

Cloudera Replication Manager is now supported in the ap-1 (Australia) and eu-1 (Germany) regional Control Planes. For the list of all supported services for all supported Control Plane regions, see CDP Control Plane regions.

Bring your own Azure private DNS zone

CDP supports using a private endpoint for Azure Postgres with an existing Azure private DNS zone. The private DNS zone can now be pre-created and provided by you, or created by CDP. Previously, CDP always created the private DNS zone when a private endpoint was created. See updated Azure requirements for using a private endpoint for Azure Postgres and Enabling a private endpoint for Azure Postgres in CDP.

Extended upgrade version support for RAZ-enabled environments

Data Lake major/minor version upgrades for RAZ-enabled environments are now available for Runtime versions 7.2.10-7.2.12 to 7.2.14+.

Replication Manager

Updates to HBase replication policy

If you are using Cloudera Manager version 7.6.0 or higher, you can use the following options while creating an HBase replication policy:

  • Rolling HBase Service Restart on Source - This option appears if you select COD or Data Hub as the source cluster. Select this option to enable an automatic rolling restart of the HBase service on the source cluster after the first-time setup steps for the HBase replication policy are complete. Otherwise, Cloudera Manager performs a full restart of the service.
  • Rolling HBase Service Restart on Destination - Select this option to enable an automatic rolling restart of the HBase service on the target cluster after the first-time setup steps for the HBase replication policy are complete. Otherwise, Cloudera Manager performs a full restart of the service.
  • Validate Policy - Select this option to have Replication Manager verify the policy details after policy creation is complete.

For more information, see Creating HBase replication policy.

A warning message appears when you choose a cluster that is part of an existing cluster pairing. If there are HBase replication policies for the existing pairing, policy creation is not allowed to continue. If there are no policies, the existing pairing is removed and policy creation continues. For more information, see Using HBase replication policy.

Support for Replication Manager in ap-1 and eu-1 regional Control Planes

Cloudera Replication Manager is now supported in the ap-1 (Australia) and eu-1 (Germany) regional Control Planes. For the list of all supported services for all supported Control Plane regions, see CDP Control Plane regions.

Migrating HDFS data from cloud storage to CDH clusters

After you register a CDH cluster as a classic cluster in the Management Console, and register the cloud credentials for AWS or Azure in Replication Manager, you can migrate HDFS data from the cloud storage to the registered classic cluster (CDH cluster) using HDFS replication policies. For more information about CDP CLI for HDFS and Hive replication policies, see Creating HDFS replication policy.

Tracking and monitoring the performance of HDFS and Hive replication policies

You can monitor and track the performance of the following replication policies:

  • HDFS replication policies - Download the CSV reports in the HDFS Replication Report field on the Replication Policies > {HDFS replication policy} > Job History panel.
  • Hive replication policies - Download the CSV reports in the HDFS Replication Report field and Hive Replication Report field on the Replication Policies > {Hive replication policy} > Job History panel.

For more information, see Managing HDFS replication policy and Managing Hive replication policy.

Modifying HDFS and Hive replication policy options

After you create an HDFS or Hive replication policy, you can change its options to meet changing requirements. On the Replication Policies page, you can expand an HDFS or Hive replication policy to edit the policy description, the frequency at which the policy runs (the start time cannot be modified after the policy has started), the YARN queue name to which the replication job is submitted, the maximum bandwidth for each map task, and the maximum map slots or tasks per replication job. To optimize replication policy performance, configure the queue name, maximum bandwidth, and maximum map slots as necessary.

For more information, see Managing HDFS replication policy and Managing Hive replication policy.

Restart HBase service when you use on-premises cluster as source cluster in HBase replication policy

After you create an HBase replication policy, you must restart the HBase service in the on-premises source cluster when the policy status on the Replication Policies page shows Manual restart (src) / restarting (dest) or Manual HBase restart needed on source. After the service restart is complete, the setup continues automatically for the replication policy.

However, if the source cluster Cloudera Manager version is 7.6.0 or lower and you are using an on-premises cluster as the source cluster, you must perform the following steps to complete the HBase replication policy setup:

  1. Restart the HBase service on the on-premises source cluster when the policy status on the Replication Policies page shows Waiting for ‘Continue Setup’ action call.
  2. Click Continue setup for the policy on the Replication Policies page after the service restart is complete. This action informs Replication Manager to continue the replication policy setup.

For more information, see Managing HBase replication policy.

Workload Manager

Analyze your cluster, job, and query costs with the Workload XM Chargeback feature

Easily define customized cost centers with the Chargeback feature. Once they are defined, Workload Manager visually displays a cluster’s current and historical costs. This new feature enables you to create Workload Manager cost centers for your workload clusters based on user or pool resource criteria and on CPU and memory consumption. Once the cost centers are created, you can use your workload cluster’s cost insights to plan and forecast budgets and future workload environments, and to justify current user groups and resources. For more information, see Analyzing Your Workload Cluster Costs with Workload Manager Cost Centers.

Trigger action events across your jobs and queries with the Workload Manager Auto Actions - Notifications feature

This new feature triggers an action event in real time based on your criteria. When a job or query meets your action’s criteria and its conditions exist, the action’s event is triggered. For example, memory exhaustion can cause nodes to crash or jobs to fail. Knowing when available memory falls below a specific threshold enables you to take steps before a potential problem occurs. With the Auto Actions feature, you can create an action that notifies you by email when a job is consuming too much memory, so that you can intervene before a problem occurs. For more information, see Triggering an Action across Jobs and Queries.

Display and monitor the CPU and memory consumption of your cluster and its services with the new resource consumption widgets on the Summary page

Resource Consumption By Services displays the CPU and memory consumption for each service across the time range you selected. Hover over the timeline to display the amount of CPU or memory, as a percentage, that is consumed by each of the cluster’s services.

Resource Consumption By Nodes displays the CPU and memory consumption for each node in the cluster. Hover over the timeline to display the amount of CPU or memory, as a percentage, that is consumed by each node and its services.