What's new in Cloudera Data Engineering Private Cloud

This release of Cloudera Data Engineering (CDE) on CDP Private Cloud 1.5.1 includes the following features:

Support for Apache Iceberg V2 (Technical Preview)

Iceberg table format version 2 (v2) is now available in CDE. The latest specifications include the following key updates:
  • UPDATE and DELETE operations follow the Iceberg format v2 row-level position delete specification and enforce snapshot isolation.
  • DELETE, UPDATE, and MERGE operations use merge-on-read by default. Merge-on-read is more efficient for writes than copy-on-write because it does not rewrite existing data files.
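
The difference between the two write modes can be illustrated with a small conceptual model. This is a simplified sketch, not Iceberg's implementation: the names DataFile, Table, delete_copy_on_write, delete_merge_on_read, scan, and make_table are hypothetical and exist only for this illustration. It assumes that a position delete is recorded as a (data file path, row position) pair that readers apply at scan time.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFile:
    """An immutable data file: a path plus its rows in stored order."""
    path: str
    rows: tuple


@dataclass
class Table:
    """Toy table: data files, position deletes, and a rewrite counter."""
    data_files: list
    position_deletes: set = field(default_factory=set)  # {(path, position)}
    rows_rewritten: int = 0


def delete_copy_on_write(table, predicate):
    """Rewrite every data file that contains at least one matching row."""
    new_files = []
    for f in table.data_files:
        kept = tuple(row for row in f.rows if not predicate(row))
        if len(kept) == len(f.rows):
            new_files.append(f)  # untouched file is reused as-is
        else:
            table.rows_rewritten += len(kept)  # surviving rows are copied
            new_files.append(DataFile(f.path + ".rewritten", kept))
    table.data_files = new_files


def delete_merge_on_read(table, predicate):
    """Record position deletes only; data files are left unchanged."""
    for f in table.data_files:
        for pos, row in enumerate(f.rows):
            if predicate(row):
                table.position_deletes.add((f.path, pos))


def scan(table):
    """Read the table, applying position deletes at read time."""
    return [
        row
        for f in table.data_files
        for pos, row in enumerate(f.rows)
        if (f.path, pos) not in table.position_deletes
    ]


def make_table():
    """Build a small two-file table for experimentation."""
    return Table([
        DataFile("f1", ((1, "a"), (2, "b"), (3, "c"))),
        DataFile("f2", ((4, "d"),)),
    ])
```

In this model, deleting one row with copy-on-write copies the other rows of the affected file, while merge-on-read only records one position delete. Both tables return the same rows when scanned, but the merge-on-read reader does the extra filtering work.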

Apache Iceberg V1 and V2 are in technical preview in the Private Cloud 1.5.1 release and are not recommended for production deployments. Cloudera recommends that you try this feature in test or development environments.

For more information, see Using Apache Iceberg in Cloudera Data Engineering.

Ozone integration using Data Connectors

Apache Ozone is an object store available on the CDP Private Cloud Base cluster, which enables you to optimize storage for big data workloads. You can now create and manage CDE clusters with Ozone as a backend storage provider.

For more information, see Using Ozone storage with Cloudera Data Engineering.

Sessions (Technical Preview)

A Cloudera Data Engineering (CDE) Session is an interactive, short-lived development environment for running Spark commands to help you iterate on and build your Spark workloads.

Sessions is in technical preview in this release and is not recommended for production deployments. Cloudera recommends that you try this feature in test or development environments.

For more information, see Creating Sessions in Cloudera Data Engineering.

Updates to the Spark-submit migration tool

A container image is now available in which you can run the migration tool without installing it. This enables you to take advantage of the migration tool on any operating system.

For more information, see Using spark-submit drop-in migration tool.

Upgrading CDE Service with Endpoint Stability

You can seamlessly upgrade an old Cloudera Data Engineering (CDE) service to a new version with endpoint stability. This enables you to access the CDE service of the new version with the previous endpoint. Thus, you can use the existing endpoints without changing configurations at the application level.

The CDE service endpoint migration process lets you migrate your resources, jobs, job run history, Spark job logs, and event logs from your old cluster to the new cluster.

For more information, see Upgrading CDE Service with Endpoint Stability.

Elastic Quota for Virtual Clusters (Technical Preview)

You can configure an elastic quota for a virtual cluster (VC) to define a guaranteed quota (the minimum guaranteed capacity) and a maximum quota (the maximum capacity) of resources (CPU and memory). The guaranteed quota dictates the minimum amount of resources available for allocation to a VC at all times. The resources above the guaranteed quota and within the VC’s maximum quota can be used by any VC on demand if the cluster capacity allows for it.

Elastic quotas allow a VC to acquire unused capacity in the cluster when its guaranteed quota is exhausted. This ensures efficient use of resources in the cluster. At the same time, the maximum quota caps the amount of resources a VC can claim in the cluster at any given time.
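
The interaction between the guaranteed quota and the maximum quota can be sketched as a small calculation. This is an illustrative model only, not the CDE scheduler: VirtualCluster and grantable are hypothetical names, resources are reduced to a single integer (for example, CPU cores), and the model assumes that the unused guaranteed quota of other VCs stays reserved for them.

```python
from dataclasses import dataclass


@dataclass
class VirtualCluster:
    """Toy VC with quotas expressed in one resource unit (e.g. CPU cores)."""
    name: str
    guaranteed: int  # always available to this VC
    maximum: int     # hard upper bound for this VC
    used: int = 0


def grantable(vc, request, cluster_capacity, all_vcs):
    """Return how much of `request` the VC could receive right now."""
    # A VC can never exceed its maximum quota.
    headroom_to_max = vc.maximum - vc.used
    # Unused part of this VC's own guarantee is always available to it.
    own_guarantee_left = max(vc.guaranteed - vc.used, 0)
    # Capacity that is not in use by any VC.
    free = cluster_capacity - sum(o.used for o in all_vcs)
    # Assumption: unused guarantees of other VCs are held back for them.
    reserved_for_others = sum(
        max(o.guaranteed - o.used, 0) for o in all_vcs if o is not vc
    )
    # Anything left beyond the reservations can be borrowed on demand.
    borrowable = max(free - reserved_for_others - own_guarantee_left, 0)
    return max(0, min(request, headroom_to_max, own_guarantee_left + borrowable))
```

For example, in a cluster with 100 units, a VC with a guaranteed quota of 20 and a maximum quota of 60 can grow to 60 while another VC is idle, yet the idle VC's guaranteed 30 units remain available to it.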

For more information, see Creating virtual clusters.

Airflow file-based resources using the CDE CLI

CDE supports Airflow file-based resources using the CDE CLI. By creating a pipeline in CDE using the CLI, you can add custom files that are available for tasks.

For more information, see Creating an Airflow pipeline with custom files using CDE CLI.

Creating a custom Airflow Python environment (Technical Preview)

Cloudera Data Engineering (CDE) supports a custom Python environment in Airflow to manage job dependencies using the airflow-python-env resource type.

For more information, see Creating a custom Airflow Python environment.

Pulse metrics

CDE provides Virtual Cluster-level metrics for tracking job runs, database requests, HTTP server requests, LDAP calls, Kubernetes nodes and other components related to the environment. These metrics are stored in and monitored by Prometheus.

For more information, see Monitoring metrics for Data Engineering.

Gang scheduling support for Spark applications

Previously, YuniKorn scheduled a Spark application only when the application’s minimal resource request was satisfied; otherwise, the application waited in the queue. Applications are queued in hierarchical queues. Gang scheduling is now enabled by default, and each resource queue is assigned the maximum number of applications that can run concurrently with minimum resources guaranteed.
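
The general idea of admitting an application only when its whole minimum resource request fits can be sketched as follows. This is a toy model of all-or-nothing admission, not YuniKorn's implementation: admit_gangs is a hypothetical function, resources are a single integer, and the model assumes that a later, smaller application may be admitted while a larger one waits.

```python
def admit_gangs(queue_capacity, pending):
    """Admit applications whose full minimum request fits in the queue.

    `pending` is a list of (app_name, min_resources) pairs in arrival order.
    Returns (running, waiting) as two lists of application names.
    """
    free = queue_capacity
    running, waiting = [], []
    for name, min_resources in pending:
        if min_resources <= free:
            # The whole gang fits: reserve all of it at once.
            free -= min_resources
            running.append(name)
        else:
            # Never start an application with only part of its minimum.
            waiting.append(name)
    return running, waiting
```

With a queue capacity of 10 and pending requests of 4, 8, and 5 units, the first and third applications run with their full minimum, and the second waits until enough capacity is released.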

Upgrade to Airflow 2.3.4

CDE PVC 1.5.1 now runs with Airflow 2.3.4. This upgrade includes several fixes to improve performance and stability.