February 09, 2022

This release (1.14) of the Cloudera Data Engineering (CDE) service on CDP Public Cloud introduces the new features and improvements described in this topic.

Improved handling of job resources to reduce EFS utilization

  • Recursive copying of frequently used and large file resources can result in very high I/O throughput and can exhaust cloud storage burst credits, leading to poor performance. To avoid excessive file copying, CDE now uses hard linking in AWS by default.
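The difference between copying and hard linking can be illustrated with a short Python sketch. It is not CDE's implementation; the function name `stage_resource` and the file names are hypothetical, and the example only shows the general idea that a hard link adds a second directory entry for the same data instead of rewriting the bytes.

```python
import os
import shutil
import tempfile
from typing import Tuple


def stage_resource(src: str, dst: str) -> str:
    """Make src available at dst, preferring a hard link over a byte copy.

    Hypothetical helper for illustration only. A hard link needs both
    paths on the same filesystem; otherwise fall back to a regular copy.
    """
    try:
        os.link(src, dst)        # new directory entry, no data is rewritten
        return "hardlink"
    except OSError:
        shutil.copy2(src, dst)   # fallback: full copy of the file contents
        return "copy"


def demo() -> Tuple[str, bool]:
    """Stage a small file and report whether both paths share one inode."""
    with tempfile.TemporaryDirectory() as workdir:
        src = os.path.join(workdir, "app.jar")        # hypothetical resource file
        dst = os.path.join(workdir, "run-1-app.jar")  # hypothetical per-run path
        with open(src, "wb") as handle:
            handle.write(b"x" * 1024)
        method = stage_resource(src, dst)
        return method, os.path.samefile(src, dst)


if __name__ == "__main__":
    print(demo())
```

When the link succeeds, both paths refer to the same underlying file, so staging a large resource for many job runs costs almost no additional I/O.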

[Technical Preview] Apache Iceberg support

  • Apache Iceberg tables are now supported with Spark 3 virtual clusters on AWS. Use tables at petabyte scale without impacting query planning, while benefiting from efficient metadata management, snapshotting, and time-travel.
  • Run multi-analytic workloads by accessing the same tables from Cloudera Data Warehouse (CDW) with Hive and Impala for BI and SQL analytics (expected in an upcoming CDW release).

[Technical Preview] Remote Shuffle Service

  • You can now store Spark shuffle data on remote servers. This improves resilience in case of executor loss.
  • This feature is available as a Technical Preview. Contact your Cloudera account representative to enable access to this feature.

Unified diagnostic bundle

  • A single click now generates one unified bundle containing both service logs and summary status.
  • The bundles are stored securely in the object storage of the environment.
  • A historical list of previously generated bundles is available for access.

Guardrails to prevent submitting jobs that do not fit resource capacity

  • CDE now automatically prevents execution of jobs that do not fit on the available resources.
  • CDE takes into account Kubernetes and system reserved resources, daemonset utilized resources, and Spark overhead factors.
  • When a job does not fit, the API returns an error of the following form: run failed to start: requested [***TYPE AND AMOUNT OF RESOURCE***] is more than [***THE MAXIMUM AMOUNT OF AVAILABLE RESOURCES OF THAT TYPE***] allocatable per cluster node
  • To resolve the error, either reduce the CPU or memory requirements of the Spark executors and driver, or deploy on a larger cluster.
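The kind of check described above can be sketched in Python as follows. This is an illustrative approximation, not CDE's actual logic: the class `NodeCapacity`, the function `check_fits`, the 0.1 overhead factor, and the way quantities are formatted in the message are all assumptions. Only the overall wording of the error follows the message shown above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class NodeCapacity:
    """Hypothetical description of one cluster node (CPU cores, GiB of memory)."""
    cpu_cores: float
    memory_gib: float
    reserved_cpu: float = 0.0          # Kubernetes and system reserved CPU
    reserved_memory_gib: float = 0.0   # Kubernetes and system reserved memory
    daemonset_cpu: float = 0.0         # CPU used by daemonsets
    daemonset_memory_gib: float = 0.0  # memory used by daemonsets

    def allocatable(self) -> Tuple[float, float]:
        """Return (cpu, memory) left for job pods on this node."""
        cpu = self.cpu_cores - self.reserved_cpu - self.daemonset_cpu
        mem = self.memory_gib - self.reserved_memory_gib - self.daemonset_memory_gib
        return cpu, mem


def check_fits(pod_cpu: float, pod_memory_gib: float, node: NodeCapacity,
               overhead_factor: float = 0.1) -> None:
    """Raise ValueError if a single driver or executor pod cannot fit on a node.

    overhead_factor is an assumed Spark memory overhead added on top of the
    requested memory; the real factor depends on the Spark configuration.
    """
    alloc_cpu, alloc_mem = node.allocatable()
    needed_mem = pod_memory_gib * (1 + overhead_factor)
    if pod_cpu > alloc_cpu:
        raise ValueError(
            f"run failed to start: requested {pod_cpu} CPU is more than "
            f"{alloc_cpu} allocatable per cluster node")
    if needed_mem > alloc_mem:
        raise ValueError(
            f"run failed to start: requested {needed_mem:.1f}Gi memory is more than "
            f"{alloc_mem:.1f}Gi allocatable per cluster node")


if __name__ == "__main__":
    node = NodeCapacity(8, 32, reserved_cpu=0.5, reserved_memory_gib=2,
                        daemonset_cpu=0.5, daemonset_memory_gib=2)
    check_fits(4, 16, node)  # fits: 4 CPU and 17.6 GiB against 7 CPU and 28 GiB
    print("job fits")
```

In this sketch, an executor that requests 26 GiB on the same node would be rejected, because the assumed overhead pushes the requirement to 28.6 GiB, above the 28 GiB that remain allocatable.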

Notification email configuration can now be verified

When configuring the optional email alerts feature [Technical Preview] during virtual cluster creation, you can now verify the SMTP settings before creating the virtual cluster.
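For readers who want to check SMTP settings outside the CDE user interface, the following Python sketch shows one possible way to do it with the standard `smtplib` module. It is not how CDE performs the verification; the function name `verify_smtp_settings` and its parameters are hypothetical, and the host, port, and credentials must be replaced with your own values.

```python
import smtplib
from typing import Optional, Tuple


def verify_smtp_settings(host: str, port: int,
                         username: Optional[str] = None,
                         password: Optional[str] = None,
                         use_starttls: bool = True,
                         timeout: float = 10.0) -> Tuple[bool, str]:
    """Try to connect (and optionally log in) to an SMTP server.

    Hypothetical helper: returns (True, "ok") when the server accepts the
    settings, or (False, reason) when the connection or login fails.
    """
    try:
        with smtplib.SMTP(host, port, timeout=timeout) as server:
            server.ehlo()
            if use_starttls:
                server.starttls()  # upgrade the connection to TLS
                server.ehlo()      # re-identify over the encrypted channel
            if username is not None and password is not None:
                server.login(username, password)
            server.noop()          # harmless command to confirm the session works
        return True, "ok"
    except (smtplib.SMTPException, OSError) as exc:
        # Covers refused connections, timeouts, TLS errors, and bad credentials.
        return False, str(exc)


if __name__ == "__main__":
    # smtp.example.com and port 587 are placeholder values.
    print(verify_smtp_settings("smtp.example.com", 587, timeout=5.0))
```

No message is sent by this check; it only confirms that the server is reachable and, when credentials are supplied, that they are accepted.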

Streamlined resource creation and re-use during job creation

You can now create a resource on the fly when creating a job. Alternatively, you can select from a list of existing resources, if any, to upload your application or DAG file. This promotes re-usability of project artifacts across jobs.

Kubernetes update

CDE now supports Kubernetes 1.21.