CDP Public Cloud: April 2022 Release Summary

Data Catalog

Data Catalog introduces the following additions:

  • Profiler cluster launch requires specific instance type for the Master node.

Data Engineering

Support for Iceberg 0.13 (Preview)

  • You can use Cloudera Data Engineering virtual clusters running Spark 3 to interact with the latest version (0.13) of Apache Iceberg tables.
  • CDE now supports row level updates via copy-on-write merge, update, and delete operations. Copy-on-write is helpful for bulk updates in read heavy use-cases.
  • For more information, see Using Apache Iceberg in Cloudera Data Engineering.

In-place upgrades (Preview)

CDE supports upgrades from CDE 1.14 on both AWS and Azure. The upgrades can be triggered by an Admin from CDE UI. Users need to manually pause/backup/restore each Virtual Cluster to account for upgrade failures. Upgrades of CDE core components include: EKS, AKS Services, and Application Services. Upgrades of dependencies include: Helm, K8s versions, and YuniKorn.

Job email alerts

SLA miss and job failure conditions can be configured for email notifications.

Job runtime notices

Active jobs now provide notification to the user when certain conditions are met and jobs are not behaving as expected, making it easier to understand why jobs might be stuck or not making progress. For more information, see Running jobs in Cloudera Data Engineering.

Spark 3.2

CDE now supports Apache Spark 3.2.

Data Hub

Data Hub OS upgrades

OS upgrades of Cloudera Runtime are generally available. See Performing a Data Hub OS upgrade.

Data Warehouse

Metering fix

An important fix related to Cloudera Data Warehouse (CDW) metering has been added. This fix accurately communicates the status of your Virtual Warehouses to the Cloudera Control Plane, necessary for proper functioning of CDW. To include this fix, upgrade your Virtual Warehouses. Cloudera recommends performing this upgrade as soon as possible.

Unified Analytics

This is the GA release of Unified Analytics that brings SQL equivalency without syntax changes to CDW SQL engines. Unified Analytics brings significant optimization equivalency to these engines, unifying common techniques such as subquery processing, join ordering, and materialized views. Unified Analytics is available as a CDW Kubernetes service in AWS. Unified Analytics documentation includes features, limitations, and how to use Unified Analytics.

Ability to enable a private CDW environment in Azure Kubernetes Service (AKS)

You can now activate CDW clusters in private mode which enables you to create all cloud resources without the need of public IP addresses. For more information, see Enabling a private CDW environment in Azure Kubernetes Service.

You can now download the Beeline CLI tarball from CDW and use the Beeline client to connect to a Hive Virtual Warehouse and run queries. The archive file contains all the dependent JARs and libraries that are required to run the Beeline script.

Support for uploading non-UDF JARs

CDW enables administrators to upload additional non-UDF JARs to the Hive classpath that might be required to support dependency JARs, third-party Serde, or any Hive extensions.

Enhanced complex types support for Impala

This release allows you to use:

  • Array in SELECT list
  • UNNEST function for arrays in SELECT list

See Querying arrays for more information.

Ability to run Hive LLAP workloads using Hue without the “CDW_HUE_LLAP” entitlement

You can now use Hue to run Hive workloads in CDW without enabling the “CDW_HUE_LLAP” entitlement. Hue packs the combined features of Data Analytics Studio (DAS) and Hue in this new version. If you do not have the “CDW_HUE_LLAP” entitlement enabled in your environment, then you can get Hue by creating a new Hive Virtual Warehouse.

Ability to select the Hue image version

CDW allows you to select a Hue Runtime version from the Hue Image Version drop-down menu while creating a Virtual Warehouse. The compatible Hue image version is automatically selected based on the Hive or Impala image version that you select. If there is no corresponding Hue version for the selected Hive or Impala version, then the latest Hue version is automatically selected. For example, if you select Hive or Impala version 2022.0.7.0-70, then 2022.0.7.0-70 is selected as the Hue image version.

You can also check the Hue image version from the Virtual Warehouse Details page.

Hue Query Processor runs at the Database Catalog level

The Hue Query Processor indexes and retrieves Hive query history by reading query event data from a Hadoop Compatible File System (HCFS), such as S3, and writing them into a relational database, such as PostgreSQL. Previously, the Hue Query Processor process ran at the Virtual Warehouse level. Therefore, there was one Query Processor pod running for each Virtual Warehouse associated with a Database Catalog. Now, the Query Processor pod runs at the Database Catalog level. This helps in reducing the read cost in the cloud.

A new tab called “Hue query processor” has been added under the Database Catalog > Edit > CONFIGURATIONS section. The “hue-query-processor.json” and “hue-event-processor.json” files are no longer available under the Virtual Warehouse > Edit > CONFIGURATIONS > Hue > Configuration files drop-down menu.

If you are on an older version of Virtual Warehouse, then you must create a new Database Catalog and Virtual Warehouse to activate this feature.

To tune the data refresh rate, see Configuring the Hue Query Processor scan frequency.

Ability to grant fine-grained access to S3 buckets from the S3 File Browser in Hue (Preview)

For better security and ease of use for users, you can create per-user home directories within your S3 bucket and grant fine-grained access to these user directories from the S3 File Browser in Hue. To enable fine-grained access from the Hue S3 File Browser, you must enable Ranger Authorization Service (RAZ) when you register your AWS environment with CDP, and add the path to your S3 bucket in the hue-safety-valve field in CDW. For more information, see Enabling Fine-grained Access Control for Hue in Cloudera Data Warehouse for AWS environments.

Note: You need to contact Cloudera to have this feature enabled.

Ability to grant fine-grained access to ADLS Gen2 containers from the ABFS File Browser in Hue (Preview)

For better security and ease of use for users, you can create per-user home directories within your ADLS Gen2 container and grant fine-grained access to these user directories from the ABFS File Browser in Hue. To enable fine-grained access from the Hue ABFS File Browser, you must enable RAZ when you register your Azure environment with CDP, and add the path to your ADLS Gen2 container in the hue-safety-valve field in CDW. For more information, see Enabling Fine-grained Access Control for Hue in Cloudera Data Warehouse for Azure environments.

Note: You need to contact Cloudera to have this feature enabled.

Performance and security enhancements in Hue

Python 2 has reached the end of life and is no longer supported. Hue in CDW Public Cloud now uses Python 3 which makes use of critical bug fixes and Common Vulnerabilities and Exposures (CVE) fixes for many third-party software dependencies.

The following changes have been made in the Hue codebase in this release of CDW Public Cloud:

  • The Hue container now runs in the UBI8 image, which is upgraded from UBI7.
  • Python libraries have been upgraded from Python 2.7 to Python 3.8.
  • The Django server has been upgraded from version 1.11.29 to 3.2.12.
  • Python modules such as django-auth-ldap, django-axes, djangorestframework-simplejwt, Mako, Markdown, python-ldap, django-babel, django-mako, django-cors-headers, djangorestframework, eventlet, sqlparse, and so on have also been upgraded.
  • Hue now uses Gunicorn as a front-end server. Previously, Hue used the CherryPy server.

These upgrades bring in significant performance improvement and stability in query execution, uploading, and importing files to S3 or ADLS Gen2. Operating System, Python version, and Python module upgrades have resulted in a stable environment and fixed more than 800 security vulnerabilities.

Also, the UBI8 image uses the latest Apache HTTP Server (httpd). This no longer requires you to limit the fully qualified domain name of your Virtual Warehouse to 64 characters.

Machine Learning

ML Discovery and Exploration, SQL and Visualization (Preview)

This feature enables Data Scientists to understand their data using a SQL editor and drag-and-drop Visual Dashboards within CML. Users can start with their pre-configured Data Connections and create Datasets that they can rely on for model development. For more information, see ML Discovery and Exploration.

Model metrics visualization

This feature allows Data Scientists and Machine Learning Engineers to monitor technical metrics relating to their running models, such as resource consumption and request throughput, within Cloudera Machine Learning.

Azure Files - Support for Azure Files NFS

Public Load Balancer

This feature allows MLAdmins to configure a Public Load Balancer for a CML workspace with a fully private (Private EKS API Server endpoint) EKS deployment. This feature is available on AWS only.

Management Console

Changed FreeIPA Azure VM type

The Azure VM type used for the FreeIPA server was changed from Standard_D3_v2 to Standard_DS3_v2 so that FreeIPA nodes can be encrypted at host. Standard_D3_v2 doesn’t support encryption at host while Standard_DS3_v2 does. See updated Overview of Azure resources used by CDP.

Setting IDBroker mappings in a RAZ environment is disabled

If a CDP environment has RAZ enabled, setting IDBroker mappings is disabled during environment creation and when the environment is already running. If your environment has RAZ enabled, you should be using Ranger for authorizing user and group access to the S3 or ADLS Gen 2 cloud storage used by the Data Lake.

Disabling Azure Load Balancers

If a public egress load balancer cannot be used in an environment due to security rules or network configuration, load balancers can be disabled using CDP CLI. See Disable load balancers in an Azure environment.

The Azure Load Balancer is used in multiple places in CDP Data Lakes and Data Hubs. It is used as a frontend for Knox in both Data Lakes and Data Hubs, and for Oozie HA in HA Data Hubs. See Azure Load Balancers in Data Lakes and Data Hubs.

Upgrading classic clusters from CCMv1 to CCMv2

You can now upgrade your CDH, HDP, or CDP Private Cloud Base clusters that were previously registered in CDP from CCMv1 to CCMv2. The upgrade is available for all such classic clusters that were registered with CCMv1. See Upgrading a classic cluster from CCMv1 to CCMv2.