Cloudera Data Engineering service

Cloudera Data Engineering (CDE) is a serverless service for Cloudera Data Platform that allows you to submit batch jobs to auto-scaling virtual clusters. CDE enables you to spend more time on your applications, and less time on infrastructure.

Cloudera Data Engineering allows you to create, manage, and schedule Apache Spark jobs without the overhead of creating and maintaining Spark clusters. With Cloudera Data Engineering, you define virtual clusters with a range of CPU and memory resources, and the cluster scales up and down as needed to run your Spark workloads, helping to control your cloud costs.

The CDE service involves several components:

Environment
A logical subset of your cloud provider account including a specific virtual network. For more information, see Environments.
CDE Service
The long-running Kubernetes cluster and services that manage the virtual clusters. The CDE service must be enabled on an environment before you can create any virtual clusters.
Virtual Cluster
An individual auto-scaling cluster with defined CPU and memory ranges. Virtual Clusters in CDE can be created and deleted on demand. Jobs are associated with clusters.
Jobs
Application code along with defined configurations and resources. Jobs can be run on demand or scheduled. An individual job execution is called a job run.
Resource
A defined collection of files such as a Python file or application JAR, dependencies, and any other reference files required for a job.
Job run
An individual job run.

The CDE service differs from a Data Engineering Data Hub cluster in several ways, including the following:

Table 1. Cloudera Data Engineering Service vs. Data Engineering Data Hub
Feature Cloudera Data Engineering Data Hub DE Template
Cloud providers Amazon AWS, Microsoft Azure Amazon AWS, Microsoft Azure
Compute engines Apache Spark Apache Spark, Apache Hive
Deployment Kubernetes Virtual machines (cloud provider)
Resource management Yunikorn, Kubernetes YARN
Troubleshooting CDE deep analysis, Spark History Server Spark History Server
Portability Public/private cloud Public cloud
Job submission Managed API Apache Livy

Experimental support for Spark Streaming and Spark Structured Streaming

Support for Spark Streaming and Spark Structured Streaming in CDE is an experimental feature.

Spark Streaming and Spark Structured Streaming are supported with the following limitations:

  • Spark 3 is not supported.The native implementation of Spark Streaming checkpoint (Rename-Based Checkpointing, relying on atomic rename) is not eventually consistent in a cloud environment.

  • Hive Warehouse Conector (HWC) is not supported.
  • Metrics in WXM do not report information on Spark Streaming jobs.
  • The visual profiling and CDE deep analysis features are not supported.
  • Checkpointing with Amazon Elastic File System (EFS) is not supported.
  • Spark dynamic allocation is not supported.

    You may use spark.streaming.dynamicAllocation, but this option is only available for Discretized streams (DStreams).

  • S3 checkpoints may have issues with Runtime versions 7.2.8 or lower.
  • While running streaming with Kafka, set the auto.offset.reset parameter to "latest", to avoid out-of-memory errors.