CDP Public Cloud: August 2021 Release Summary

Data Engineering

Kubernetes version updated to 1.19

The Kubernetes version has been updated to 1.19.

Dynamic allocation enabled by default

Dynamic allocation is now enabled by default. You can configure the initial number of executors as well as a range of executors per job.

Dynamic allocation scales job executors up and down as needed for running jobs. This can provide large performance benefits by allocating as many resources as needed by the running job, and by returning resources when they are not needed so that concurrent jobs can potentially run faster.

Resources are limited by the job configuration (executor range) as well as the virtual cluster auto-scaling parameters. By default, the executor range is set to match the range of CPU cores configured for the virtual cluster. This improves resource utilization and efficiency by allowing jobs to scale up to the maximum virtual cluster resources available, without manually tuning and optimizing the number of executors per job.

This is a change from the default behavior (static allocation) in older releases. If you restore job configuration from an older release, the restored jobs will use dynamic allocation.

Support for Amazon AWS S3 URLs in jobs

A previous issue with S3 URLs in job configurations has been fixed. You can now specify S3 URLs for your application code and Jar files. For jobs using this functionality, you must also add the following Apache Spark configuration option:

spark.hadoop.fs.s3a.delegation.token.binding=org.apache.knox.gateway.cloud.idbroker.s3a.IDBDelegationTokenBinding

On-demand Python virtual environments

You can now submit a job with a Python requirements.txt file, as follows:

cde spark submit my_job.py --python-requirements /path/to/requirements.txt

This builds the Python virtual environment resource for the user, attaches it to the job, and sets it to be cleaned up when the job run terminates.

Ranger Authorization Service (RAZ) support is available as preview for CDE

Ranger Authorization Service (RAZ) support for CDE is available as a preview feature. To enable RAZ support, ensure that your Data Lake is RAZ enabled. Cloudera does not recommend that you run this preview feature in production environments.

Data Visualization

Data Visualization introduces the following new features and improvements:

  • Display formats are now applied as expected on CSV and XLS downloads when the data connection type is Impala or Arcadia.
  • You can drill-into your data from the context menu to discover more granular information for a particular data point by examining other dimensions of your data.
  • New icons have been added to the username display based on user type — an asterisk appears for admin users and a checkmark appears for LDAP/SSO-signed users.
  • Usability improvements have been implemented to the dashboard builder sidebar.
  • The full dataset name can now be viewed on hover in dashboard builder menus.
  • You can now save a segment from a filter definition of a visual.
  • Creating and editing HTML extension visuals is now easier with syntax highlighting, autocompletion, and side-by-side preview.
  • Additional details on job parameters, including trigger conditionals and active states can now be viewed from the Job Status page.

DataFlow

This is the GA release of the Cloudera DataFlow service. Cloudera DataFlow (CDF) is a CDP Public Cloud service that enables self-serve deployments of Apache NiFi data flows from a central catalog to auto-scaling Kubernetes clusters managed by CDP. Flow deployments can be monitored from a central dashboard with the ability to define KPIs to keep track of critical data flow metrics. CDF eliminates the operational overhead that is typically associated with running Apache NiFi clusters and allows users to fully focus on developing data flows and ensuring they meet business SLAs.

Management Console

Send a diagnostic bundle to Cloudera support

CDP introduces a web interface for sending a diagnostic bundle to Cloudera support. Currently diagnostics can be collected for Data Lake, FreeIPA, and Data Hub. See Send a diagnostic bundle to Cloudera support.

New permission for GCP Logger service account

In addition to the previously documented permissions, if you would like to use a bucket path (gs:///) instead of a bucket (gs://) for the Logs or Backups bucket, you should assign the storage.objects.list permission to the custom role. See [Minimum setup for cloud storage](https://docs.cloudera.com/cdp/latest/requirements-gcp/topics/mc-gcp_minimum_setup_for_cloud_storage.html).

“Create public IPs” is disabled with CCM

The Create public IPs option available on the UI during Azure and GCP environment registration is now disabled by default when CCM is enabled.

New Shared Resources navigation menu item

Management options for provisioning credentials, proxies, and Data Hub cluster templates, recipes, and image catalogs can now be easily accessed from the new Shared Resources item in the navigation menu.

Operational Database

  • HBase CopyTable command now works between 2 COD databases w/o kerberos x-realm trust. Read What’s New.
  • A database can now be created using data from a previously dropped database. Read What’s New.
  • COD can now deploy using high performance storage for customers that experience unacceptable latency associated with object storage. Read What’s New.
  • Replication Manager dramatically simplifies setup of CDP Operational Database (COD) replication across cloud regions. Read What’s New.
  • CDP Operational Database (COD) provides increased visibility into performance with graphs on P99 get latency, block cache hit ratio, compaction queue size. Read What’s New.

Replication Manager

Enhancements to HBase replication policies

The following enhancements are available for an HBase replication policy:

  • You do not need to create a schema similar to the source cluster on the destination cluster.
  • Replication Manager performs the first-time setup configuration steps including HBase service restart on both clusters.
  • Optionally, you can also create or use an existing HBase replication machine user during policy creation and then validate the existing username with the CDP User Management System (UMS). The username and password will be automatically synchronized to the destination cluster’s environment (and to the source’s as well if the source is Cloudera Operational Database (COD).

Other

cdpctl

With the cdpctl command line utility, cloud administrators can verify the compatibility of existing cloud setup with CDP prior to handing over to the CDP administrator. This will greatly reduce the probability of errors during onboarding into an existing cloud account. See What’s New.