CDP Public Cloud: July 2024 Release Summary

The CDP Public Cloud Release Summary summarizes major features introduced in CDP Public Cloud Management Console, Data Hub, and data services.

Data Engineering

This release (1.22.0) of the Cloudera Data Engineering service on CDP Public Cloud introduces the following changes:

AWS Graviton support (GA)

AWS Graviton is a general-purpose, ARM-based processor family. From the AWS Graviton family, Cloudera Data Engineering supports AWS Graviton 3. AWS Graviton currently delivers the best price performance for cloud workloads running in AWS EC2. With AWS Graviton, you can optimize costs and achieve better performance.

For more information, see AWS Graviton instances in Cloudera Data Engineering.

Support for in-place upgrade with private networking

If you use services that run on a private network (for example, Private EKS), your Cloudera Data Engineering service is eligible for an in-place upgrade.

For more information, see In-place upgrade with Apache Airflow Operators and Libraries.

Kubernetes version upgrade to 1.28

The Kubernetes version that Cloudera Data Engineering uses is upgraded to 1.28. For more information, see Compatibility for Cloudera Data Engineering and Runtime components.

Iceberg version upgrade to 1.4.3

The Iceberg version that Cloudera Data Engineering uses is upgraded to 1.4.3. For more information, see Compatibility for Cloudera Data Engineering and Runtime components.

Support for Spark version 3.5.1

Cloudera Data Engineering supports Spark version 3.5.1. For more information, see Compatibility for Cloudera Data Engineering and Runtime components.

Virtual Cluster (VC)-level Spark configurations (Technical Preview)

With the VC-level Spark configurations option, you can create and update Spark configurations that, by default, apply to all Spark jobs that are run in a VC.

For more information, see Managing Virtual Cluster-level Spark configurations.
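
The idea of VC-level defaults can be sketched in a few lines of Python. This is only an illustration of the concept, not the Cloudera Data Engineering implementation: the function name `effective_spark_conf` and the sample values are hypothetical, and the sketch assumes that a setting supplied with an individual job takes precedence over the VC-level default.

```python
# Illustrative sketch only: models VC-level Spark defaults as a dictionary
# that is merged with per-job settings. The precedence rule (job-level wins)
# is an assumption, not something stated in the release notes.

def effective_spark_conf(vc_conf, job_conf=None):
    """Return the configuration a job would run with under this model."""
    merged = dict(vc_conf)          # start from the VC-level defaults
    merged.update(job_conf or {})   # job-level values override the defaults
    return merged

# Hypothetical VC-level defaults and one job that overrides a single key.
vc_defaults = {
    "spark.executor.memory": "4g",
    "spark.sql.shuffle.partitions": "200",
}
job_overrides = {"spark.executor.memory": "8g"}

effective = effective_spark_conf(vc_defaults, job_overrides)
for key in sorted(effective):
    print(f"{key}={effective[key]}")
```

A job submitted without any overrides would, in this model, simply run with the VC-level values.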

Azure private DNS zones (Technical Preview)

With the Azure private DNS zones feature, you can optimize costs by leveraging existing Azure private DNS zone resources. When creating the Cloudera Data Engineering service, you can use an Azure private DNS zone for the Azure Kubernetes Service (AKS), the Storage Account File Share, and for the database.

For more information, see Azure private DNS zones in a Cloudera Data Engineering service.

Fixed issues

  • Cloudera Data Engineering services upgraded to 1.20.3 and above with SSD instances are unable to run jobs
  • Azure and AWS service backup and restore (B&R) does not restore all the jobs

Data Hub

This release of the Data Hub service introduces the following changes:

AWS RDS certificate rotation

AWS requires the rotation of the SSL/TLS certificates used for secure communication between CDP Public Cloud Data Lakes and certain Data Hubs and the external AWS RDS database instances that they rely on. CDP Public Cloud now provides multiple options to perform the required RDS certificate rotation.

For more information, see Rotating database certificates.

Rolling upgrade support for Streams Messaging clusters

Rolling upgrades for Streams Messaging Data Hub clusters are now available. Rolling upgrades for Data Hub clusters are limited to certain Cloudera Runtime versions. For more information, see Rolling upgrades.

Data Warehouse

The July release of Cloudera Data Warehouse service on CDP Public Cloud introduces the following changes:

What’s new in Cloudera Data Warehouse CDP Public Cloud

General availability of Virtual Warehouse and Database Catalog workload version selections

The Cloudera Data Warehouse UI now provides a list of workload versions that match your cluster from which you can select one during cluster installation. The Database Catalog list contains versions compatible with your Kubernetes version and your cluster environment (DWX version). The Virtual Warehouse list contains versions compatible with your Kubernetes version, your cluster environment (DWX version), and your Database Catalog version.

General availability of Impala workload-aware autoscaling

Workload-aware autoscaling allocates Impala Virtual Warehouse resources based on the workload that is running. You can choose multiple executor group set sizes based on your workload requirements, instead of the single fixed executor group size of the previous autoscaling implementation. This feature is now generally available. See Workload Aware Auto-Scaling in Impala.

Improved Impala Autoscaling Dashboard

You can now use the new Impala Autoscaling Dashboard to monitor Impala autoscaling in a warehouse that uses workload-aware autoscaling or the regular autoscaling. You can access the Impala Autoscaling Dashboard by going to the Virtual Warehouse Details page’s Web UI tab, and clicking the Impala Autoscaler Web UI option. See About the Impala Autoscaling Dashboard.

Ability to forward Prometheus metrics from Cloudera Data Warehouse to an external endpoint

In this release, you can configure Prometheus in Cloudera Data Warehouse to push its metrics to an external endpoint, such as Prometheus, Grafana, Thanos, or some other endpoint. See Forwarding Prometheus metrics from Cloudera Data Warehouse to an endpoint.

Automatically backing up and restoring Cloudera Data Warehouse

This release adds more automation to the backup and restore procedures for AWS and Azure environments and clarifies the documentation of the automatic, semi-automatic, and manual procedures. To get the supported Kubernetes version for this release, you back up your old AWS or Azure environment and start up a new environment using the restoration process. The backup and restore feature saves your environment parameters, making it possible to recreate your environment with the same settings, URL, and connection strings you used in your previous environment.

Ability to configure Impala Statestore high availability

You can now configure high availability for Impala Statestore pods in a Virtual Warehouse, with active and passive modes ensuring continuity and reliability during failovers. See Configuring Impala Statestore high availability.

Downloading the UDF development package from Cloudera Data Warehouse UI

You can now download the Impala UDF development package directly from the Cloudera Data Warehouse UI. See Building and deploying UDFs.

What’s new in Cloudera Data Warehouse on Azure environments

Azure AKS 1.29 upgrade

Cloudera supports the Azure Kubernetes Service (AKS) version 1.29. In 1.9.1-b233 (released July 26, 2024), when you activate an environment, Cloudera Data Warehouse automatically provisions AKS 1.29. To upgrade to AKS 1.29 from an earlier version of Cloudera Data Warehouse, you must back up and restore Cloudera Data Warehouse. To avoid compatibility issues between Cloudera Data Warehouse and AKS, upgrade to version 1.29.

Note: Using the Azure CLI or Azure portal to upgrade the AKS cluster is not supported. Doing so can cause the cluster to become unusable and can cause downtime. For more information about upgrading, see Upgrading an Azure Kubernetes Service cluster for Cloudera Data Warehouse.

Addition of new Azure instance types

This release offers the selection of the Standard_E16pds_v5 Azure Virtual Machine, an AKS Ampere® Altra® Arm-based instance type for an Impala Virtual Warehouse. For more information about using the instance type, see Activating an Azure environment from Cloudera Data Warehouse.

What’s new in Cloudera Data Warehouse on AWS environments

Amazon EKS 1.29 upgrade

Cloudera supports the Amazon Elastic Kubernetes Service (EKS) version 1.29. In 1.9.1-b233 (released July 26, 2024), when you activate an environment, Cloudera Data Warehouse automatically provisions EKS 1.29. To upgrade to EKS 1.29 from an earlier version of Cloudera Data Warehouse, you must back up and restore Cloudera Data Warehouse. To avoid compatibility issues between Cloudera Data Warehouse and EKS, upgrade to version 1.29. See Upgrading Amazon Kubernetes Service (EKS).

Note about the impact of AWS RDS root certificate rotation in 2024

A Cloudera Data Warehouse Cluster RDS does not use certificate verification for connections to the Cloudera Data Warehouse. Therefore, you are not directly impacted by certificate expiration for your Cloudera Data Warehouse Cluster RDS. You can choose either to clear the warnings or to rotate the certificate.

To rotate the certificate for the Cloudera Data Warehouse Cluster RDS, follow the steps outlined by AWS in Rotating your SSL/TLS certificate. There should be no impact on Cloudera Data Warehouse because the Cloudera Data Warehouse Cluster RDS is not expected to restart: Postgres RDS has SupportsCertificateRotationWithoutRestart=true.

For the Datalake RDS, follow the instructions shared by the Datalake account team to update the certificate. There may be some impact on Cloudera Data Warehouse while the Datalake restarts, such as query failures or delays. This could happen because services such as Ranger, Knox, and FreeIPA might be unavailable during this period.

Addition of new AWS instance types

This release offers the selection of the r6gd.4xlarge and r7gd.4xlarge Arm-based instance types for an Impala Virtual Warehouse. For more information about using the instance type, see Activating an AWS environment from Cloudera Data Warehouse.

Ability to use envelope encryption for EKS secrets

Envelope encryption is now added for EKS Secrets through Cloudera Data Warehouse KMS Key by default. See Encrypt Kubernetes secrets with AWS KMS on existing clusters.

What’s new in Iceberg on Cloudera Data Warehouse Public Cloud

Support for Iceberg version 1.4.3

The Apache Iceberg component has been upgraded from 1.3.0 to 1.4.3.

Support for Iceberg data compaction

You can compact Iceberg tables and optimize them for read operations from Hive and Impala. Compaction is an essential table maintenance activity that creates a new snapshot, which contains the table content in a compact form. See Iceberg data compaction.

SQL support for querying Iceberg metadata tables

Apache Iceberg stores extensive metadata for its tables. From Hive and Impala, you can query the metadata tables as you would query a regular table. For example, you can use projections, joins, filters, and so on. See Query metadata tables feature.
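
As a rough illustration of combining a join and a filter on metadata tables, the following Python sketch builds such a query as a string. It assumes the `database.table.metadata_table` naming form and the `history` and `snapshots` column names defined by Apache Iceberg; the database `sales_db`, the table `orders`, and both helper functions are hypothetical, so check the linked documentation for the exact syntax supported by Hive and Impala in your release.

```python
# Illustrative sketch only: builds a SQL string that joins two Iceberg
# metadata tables. Names of the database, table, and helpers are hypothetical.

# A subset of the metadata tables defined by Apache Iceberg.
METADATA_TABLES = {"history", "snapshots", "manifests", "files", "entries", "partitions"}

# Snapshot operation values defined by Apache Iceberg.
SNAPSHOT_OPERATIONS = {"append", "replace", "overwrite", "delete"}


def metadata_table_ref(database, table, metadata_table):
    """Return a reference of the assumed form database.table.metadata_table."""
    if metadata_table not in METADATA_TABLES:
        raise ValueError(f"unknown metadata table: {metadata_table}")
    return f"{database}.{table}.{metadata_table}"


def snapshot_history_query(database, table, operation=None):
    """Join history with snapshots, optionally filtering by snapshot operation."""
    history = metadata_table_ref(database, table, "history")
    snapshots = metadata_table_ref(database, table, "snapshots")
    sql = (
        "SELECT h.made_current_at, s.snapshot_id, s.operation\n"
        f"FROM {history} h\n"
        f"JOIN {snapshots} s ON h.snapshot_id = s.snapshot_id"
    )
    if operation is not None:
        if operation not in SNAPSHOT_OPERATIONS:
            raise ValueError(f"unknown snapshot operation: {operation}")
        sql += f"\nWHERE s.operation = '{operation}'"
    return sql


query = snapshot_history_query("sales_db", "orders", operation="append")
print(query)
```

The resulting string could then be submitted through any Hive or Impala client, in the same way as a query against a regular table.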

Impala support for reading Iceberg equality deletes for NiFi

Cloudera supports row-level deletes, and starting with this release you can read equality deletes from Impala with support added for Apache NiFi. See the Delete data feature.

What’s new in Hue on Cloudera Data Warehouse Public Cloud

General availability (GA) of the SQL AI Assistant

Hue leverages the power of Large Language Models (LLM) to help you generate SQL queries from natural language prompts and also provides options to optimize, explain, and fix queries, promoting efficient and accurate practices for accessing and manipulating data. You can use several AI services and models such as OpenAI’s GPT service, Amazon Bedrock, and Azure’s OpenAI service to run the Hue SQL AI assistant.

Introduction of task server in Hue and significant improvement in the file upload functionality

A new Task Server page has been added to the Hue web interface. The Hue task server enables the following functionalities:

  • It improves the file-upload experience, allowing you to upload multiple files up to 5 GB each in parallel.
  • It helps you to schedule tasks to clean up Hue documents and the /tmp directory, improving cluster maintenance experience and performance.

See About the Hue task server in Cloudera Data Warehouse.

Cloudera Data Warehouse Preview Features

Enabling the Hive Virtual Warehouse to spill to an EBS volume (Preview)

To prevent failures when query data exceeds memory capacity, you can spill data to Amazon gp3 Elastic Block Store (EBS) volumes. You select the Additional LLAP Spill Disk (EBS) option when you create a Hive Virtual Warehouse. CDP automatically provisions the gp3 volume type for spilling Hive queries when you create or reactivate a Hive Virtual Warehouse on the latest CDW environment. For more information about EBS volumes, see Amazon documentation. Using the EBS volume incurs cost.

Note: You cannot enable the option to spill data to an EBS volume after creating a Virtual Warehouse.

Improvements to the shared Hue service (Preview)

  • Name change from Query Editor to Shared Hue Service in the left navigation pane in the CDW UI.
  • Shared Hue service supports upgrade and rebuild operations similar to other CDW components.
  • Added a one-time option to copy saved queries and query history while creating a shared Hue service instance.

For more information, see Deploying shared Hue service in Data Warehouse Public Cloud (Preview).

Ability to log and manage Impala workloads (Preview)

Cloudera Data Warehouse (CDW) gives you the option to enable logging of Impala queries on an existing Virtual Warehouse or while creating a new Impala Virtual Warehouse. By logging the Impala queries in CDW, you gain increased observability of the workloads running on Impala, which you can use to improve the performance of your Impala Virtual Warehouses.

This feature represents a significant enhancement to query profiling capabilities. You can have Impala archive crucial data from each query’s profile into dedicated database tables known as the query history table and live query table. These tables are part of the sys database and are designed to store valuable information that can later be queried using any Impala client, providing a consolidated view of reports from previously executed queries.

For more information, see Impala workload management in Data Warehouse Public Cloud (Preview).

Introducing AI-enhanced UDF development package in Impala (Preview)

  • A built-in AI function, ai_generate_text, enabling direct access to Large Language Models (LLMs) from SQL queries by inputting a prompt and retrieving the response.
  • This integration into existing workflows simplifies the process, reducing complexity and enhancing the user experience, allowing for quicker setup and deployment of UDFs in Impala.

For more information, see Advantages and use cases of Impala AI functions (Preview).

Support for Impala external JDBC data sources (Preview)

Apache Impala now supports reading from external JDBC data sources. An external JDBC table represents a table or a view in a remote RDBMS database or another Impala cluster. Using external JDBC tables, you can connect Impala to a database, such as MySQL, PostgreSQL, or another Impala cluster and read the data in the remote tables.

For more information, see Using Impala to query external JDBC data sources (Preview).

AWS environment permissions support for the EKS start/stop feature (Preview)

AWS permissions have been expanded to support the Elastic Kubernetes Service (EKS) start/stop feature:

  • rds:StartDBInstance
  • rds:StopDBInstance
  • rds:DescribeDBInstances
  • autoscaling:DescribeAutoScalingGroups
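
For readers who maintain their own IAM policies, the four actions above could be expressed as a single policy statement along the following lines. This is a minimal sketch: the action names come from the list above and the `Version`, `Effect`, `Action`, and `Resource` keys are standard IAM policy elements, but the `Sid` value, the wildcard `Resource`, and the helper name `build_policy_statement` are assumptions made for illustration. The policy documented by Cloudera remains the source of truth.

```python
# Illustrative sketch only: renders the actions listed above as an IAM policy
# document. The Sid and the wildcard Resource are assumptions.
import json

EKS_START_STOP_ACTIONS = [
    "rds:StartDBInstance",
    "rds:StopDBInstance",
    "rds:DescribeDBInstances",
    "autoscaling:DescribeAutoScalingGroups",
]


def build_policy_statement(actions, resource="*", sid="CdwEksStartStop"):
    """Return one Allow statement covering the given actions."""
    return {
        "Sid": sid,
        "Effect": "Allow",
        "Action": list(actions),
        "Resource": resource,  # assumption: narrow this to specific ARNs where possible
    }


policy = {
    "Version": "2012-10-17",
    "Statement": [build_policy_statement(EKS_START_STOP_ACTIONS)],
}

print(json.dumps(policy, indent=2))
```

In practice, the statement would be added to the existing policy for the environment rather than used on its own.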

Ability to select an instance type for Virtual Warehouses (Preview)

You can now specify the AWS or Azure instance type, such as r6id.4xlarge or Standard_E16_v3, that you want to use when you create a Virtual Warehouse. You are no longer limited to the instance types that were specified while activating the environment in CDW.

Note:

  • In CDW 1.9.1-b233 (released July 26, 2024), instance type selection is supported only through the CDP CLI. See CDP CLI documentation. You must install and use CDP CLI version 0.9.119.
  • Selecting the instance type while activating an environment has been deprecated and will be removed in future releases.

Impala support for reading Iceberg equality deletes for NiFi (Preview)

Cloudera supports row-level deletes, and starting with this release you can read equality deletes from Impala with support added for Apache NiFi. See the Delete data feature.

Machine Learning

The July 2024 release (2.0.45-b86) of Cloudera Machine Learning on CDP Public Cloud introduces bug fixes only.

For the list of fixed issues, see the Cloudera Machine Learning release notes.

Management Console

AWS RDS certificate rotation

AWS requires the rotation of the SSL/TLS certificates used for secure communication between CDP Public Cloud Data Lakes and certain Data Hubs and the external AWS RDS database instances that they rely on. CDP Public Cloud now provides multiple options to perform the required RDS certificate rotation.

For more information, see Rotating database certificates.

Replication Manager

Fine-grained permission to access CDP Public Cloud Replication Manager

You can choose to restrict user access to view and use CDP Public Cloud Replication Manager with RBAC entitlement. To enable or disable the role-based access control (RBAC) entitlement, contact your Cloudera account team. For more information, see Fine-grained permission to access CDP Public Cloud Replication Manager.

HBase replication policy enhancements

Specify custom username

Starting from CDP Public Cloud 7.2.18.200, you can specify a custom username in the Select Source > Export snapshot user field during the HBase replication policy creation process. The option appears after you choose the Select Source > Perform Initial Snapshot option. Replication Manager uses the specified username on the source cluster to export the initial snapshot to the target.

Replication Manager support for AWS temporary credentials (technical preview)

You can use temporary AWS credentials, through the IDBroker service, to replicate HDFS data, Hive external tables, and HBase data from 7.1.9 SP1 Kerberized CDP Private Cloud Base clusters using Cloudera Manager 7.11.3 CHF7 or higher versions to CDP Public Cloud S3 clusters. You can also use the temporary AWS credentials to replicate HDFS data from S3 buckets to 7.1.9 SP1 Kerberized CDP Private Cloud Base clusters or higher using Cloudera Manager 7.11.3 CHF7 or higher versions.

This is a technical preview feature. It is not recommended for production deployments. Cloudera recommends that you try this feature in development or test environments. To enable this feature, contact your Cloudera account team. For more information, see Add IDBroker to use temporary AWS session credentials.