CDP Public Cloud: June 2021 Release Summary

Data Engineering

Python virtual environment improvements

Python virtual environments are now built in dedicated pods, and support C-based Python libraries.

Open source CDE/CDW operators for Apache Airflow

You can now use CDE and CDW operators with your existing Apache Airflow deployment. For more information and instructions, see Using CDE with an external Apache Airflow deployment.

CDE jobs in different virtual clusters within the same DAG file

Airflow DAG files can now trigger CDE jobs in different virtual clusters. For more information, see Automating data pipelines using Apache Airflow in Cloudera Data Engineering.

CIDR notation support for IP whitelist

You can now add IP ranges to the whitelist using CIDR notation.

Subnet selection option

You can now select a subnet to use for CDE when enabling a CDE service.

Data Visualization

New features and improvements

  • New CML engine:
  • CDP Data Visualization applications running in CML can be configured with unauthenticated access.
  • When configuring email jobs, you can specify a reply-to email address that is different from the sender’s email address.
  • Users with manage users and roles permissions can now assign roles as expected.
  • You can toggle between filter types when creating a filter for a date or for a numeric field.
  • Field comments are shown by default when viewing a dataset.

Fixed issues

  • The Solr connection type no longer permits expression creation that may cause visual failure.
  • Fixed a bug where the use of advanced site settings along with a non-default metastore resulted in an unexpected error.
  • Disabled legacy connection types by default, which had bugs that could resulted in an unexpected connection error.
  • Smart acceleration now works as expected on the Arcadia connection type when the environmental variable ENABLE_LEGACY_DATACONN is set to true.
  • Fixed a bug where visual titles did not substitute the correct parameter in edit mode.
  • In-product help text is now shown as expected when building a list visual.
  • Fixed a bug where the scrollbar in an application’s navigation bar did not appear as expected on narrow screens.
  • The Search button in an application’s navigation bar now appears as expected on narrow screens.
  • Area graphs that contain a date range now draw as expected.
  • Fixed a bug where editing a shelf element’s expression may have resulted in an unexpected error.
  • Editing an email job without an attachment now works as expected.

Data Warehouse

S3 Guard disabled for custom Database Catalogs and default Database Catalogs on AWS environments

During new AWS environment activation in CDW:

  • If you have not specified a DynamoDB table name during environment registration in Management Console, the S3 Guard feature is disabled for all Database Catalogs and Virtual Warehouses.
  • If you have specified a DynamoDB table name during environment registration in Management Console, the S3 Guard feature is enabled for default Database Catalogs and Virtual Warehouses. However, it is disabled for non-default (custom) Database Catalogs.

S3 buckets created by CDW on AWS environments are now encrypted by default with SSE

When you use the latest cross-account role from Management Console to register an environment, S3 buckets created by CDW during AWS environment activation, are now encrypted by default with SSE-S3. This provides server-side object encryption (SSE), which protects data at rest. Before this enhancement, S3 buckets were created with only object-level encryption for every object in the bucket during environment activation.

Impala Virtual Warehouses can now be recovered after going into an error state

Impala resources are not deleted when the Virtual Warehouse goes into an error state. Consequently, if an Impala Virtual Warehouse goes into an error state, you can now use the Edit option to recover the Virtual Warehouse with its original configurations.

Impala Virtual Warehouses now use node memory more efficiently

Impala coordinators now can use ~13GiB more of available memory on a node to better support concurrent query execution.

Machine Learning

  • Engine Deprecation - Cloudera ML Runtimes are the default and recommended solution to run user workloads. New projects will be created with ML Runtimes configured by default and we recommend migrating existing projects to use ML Runtimes. Legacy Engines are deprecated and will be removed in a future release but workloads running on them remain fully supported.
  • Register customized Runtime - Administrators can register an externally built Runtime to provide Data Scientists with a customized environment.
  • Support user API key rotation - Administrators can rotate keys for all users, or users can rotate their own keys, just by clicking on a button.
  • AMP specification for Runtimes - You can specify which Runtimes to use in AMPs.
  • New base engine released - Engine:14-cml2021.05-1 is now available.
  • Configurable engine image for Jobs - You can specify which engine to use for a Job. Jobs use the Project engine by default.
  • Web session timeouts - Timeout limits for User and Admin User web sessions were changed to be a period of inactivity, instead of a set time limit.

Fixed issues

  • Spark configuration (DSE-16422) - Fixed an issue where user-added Spark configurations in the ~/spark-defaults.conf file may not be populated to /etc/spark/conf/spark-defaults.conf correctly.
  • Hive table (DSE-16108) - Fixed an issue where saving data to a managed Hive table using Hive Warehouse Connector via CDSW from CML may fail with a StreamCorruptedException.
  • API keys (DSE-15678) - Fixed an issue where user API keys may be accessible through non-private projects.
  • Web sessions (DSE-11394) - Fixed an issue where user web session may not be refreshed even when the user is actively using CML.

Management Console

Runtime 7.2.10

Runtime 7.2.10 is now available and can be used for registering an environment with a 7.2.10 Data Lake and creating Data Hub clusters. See Cloudera Runtime.

FreeIPA backup location

During AWS, Azure, or GCP environment registration via CDP web interface or CDP CLI, you can optionally specify a separate cloud storage location (FreeIPA “Backup Location Base”) for FreeIPA backups. If no separate location is specified, FreeIPA backups are stored in Logs Location Base. For more information, see:

FreeIPA HA enabled by default for new AWS and Azure environments

FreeIPA HA is now enabled for all newly created AWS and Azure environments in CDP. The number of nodes used for the FreeIPA server depends on the Data Lake scale selected: light duty uses 2 nodes and medium duty uses 3. The FreeIPA HA toggle button has been removed from the environment registration UI, but, if needed, it is possible to customize FreeIPA node count when registering an AWS or Azure environment via CDP CLI. For more information, see Managing FreeIPA.

S3Guard removal

S3Guard is no longer used with newly registered AWS environments using Runtime version 7.2.2 or newer. Consequently, the ‘‘Enable S3Guard” environment registration option has been removed and there is no need to create a DynamoDB table for your environment when planing to use Runtime version 7.2.2 or newer. Environments created prior to this change continue to use S3Guard.