Cloudera Data Engineering Concepts

The Cloudera Data Engineering (CDE) service is designed as a fully managed service for Spark. Among many other features, CDE streamlines Spark job monitoring with an enhanced Job Analysis page that builds upon the Spark UI, and it provides Apache Airflow for orchestrating Spark pipelines.

CDE Jobs

CDE introduces the concept of a CDE job, which can be of two types: Spark and Airflow.

A Spark CDE job is the equivalent of a spark-submit. An Airflow CDE job is an Airflow Directed Acyclic Graph (DAG) that orchestrates any number of Spark CDE jobs and, optionally, Hive CDW queries and more.

While using Airflow is recommended for complex pipelines, it is not mandatory.
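As a minimal sketch, an Airflow CDE job is defined by a DAG file like the one below. The job names, schedule, and the CDEJobRunOperator import path are illustrative assumptions based on the operator that CDE's embedded Airflow provides; adjust them to match your virtual cluster.

  # Minimal Airflow CDE job (DAG) sketch. Job names are placeholders for
  # existing Spark CDE jobs; the operator import path may vary by release.
  from datetime import datetime

  from airflow import DAG
  from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator

  with DAG(
      dag_id="example_spark_pipeline",
      start_date=datetime(2023, 1, 1),
      schedule_interval=None,  # trigger manually; set a cron string to schedule
      catchup=False,
  ) as dag:
      # Each task triggers an existing Spark CDE job by name.
      ingest = CDEJobRunOperator(task_id="ingest", job_name="ingest-spark-job")
      report = CDEJobRunOperator(task_id="report", job_name="report-spark-job")

      ingest >> report  # run the report job only after ingest succeeds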

CDE Resources

CDE Resources allow you to store all files related to Spark and Airflow jobs along with their dependencies (JAR, ZIP, and text/configuration files) in the CDE Virtual Cluster.

Resources can also manage Python environments. In other Spark environments, files may have been pre-populated, or Python packages installed through pip, Anaconda, or a similar installer. In CDE, Resources replace these concepts and make working with multiple environments easier to manage.

Most importantly, Resources improve Spark and Airflow job observability: every past run can be mapped to a specific set of dependencies.
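As a small illustration of how a job consumes a file Resource, a PySpark application can read a configuration file that was uploaded to a Resource and mounted into the job container. The mount path (/app/mount) and the file name (parameters.json) below are assumptions for illustration; check your virtual cluster's documentation for the actual mount location.

  # Sketch: read a config file uploaded to a file Resource mounted by the job.
  import json

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("resource-config-demo").getOrCreate()

  # parameters.json is a hypothetical file stored in the mounted Resource.
  with open("/app/mount/parameters.json") as f:
      params = json.load(f)

  # Use the Resource-provided parameter to drive the Spark job.
  df = spark.read.parquet(params["input_path"])
  df.show()

  spark.stop()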

For examples of using both file-based and Python environment CDE Resources from a CDE Spark application, see CDE CLI Demo.

Other CDE Concepts

CDE Resources and Jobs are all you need to deploy Spark pipelines in CDE. If you are new to CDE, Cloudera recommends that you become familiar with the following concepts:
  • Cloud Environment: A logical subset of your cloud provider account including a specific virtual network. For more information, see Environments.
  • CDE service: The long-running Kubernetes cluster and services that manage the virtual clusters. The CDE service must be enabled in an environment before you can create any virtual clusters.
  • CDE virtual cluster: An individual auto-scaling cluster with defined CPU and memory ranges. Virtual clusters in CDE can be created and deleted on demand. Jobs are associated with virtual clusters.
  • CDE job run: An individual run of a CDE job. All runs are easily accessible from the CDE UI or observable through the CDE CLI and API.

Building Spark CDE Jobs

CDE jobs of type Spark correspond to a spark-submit command and are easy to build. Just like spark-submit commands, they take application files (Spark JAR or Python files) along with any other spark-submit argument, such as the main class, the number of executors, and so on.
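For illustration, a Spark CDE job can point at a PySpark application file as simple as the following (the file contents are a generic Monte Carlo pi estimate, not CDE-specific):

  # pi.py -- a minimal PySpark application of the kind a Spark CDE job runs.
  # The job definition supplies the spark-submit style arguments (number of
  # executors, extra files, and so on); the application is ordinary Spark code.
  import random

  from pyspark.sql import SparkSession

  if __name__ == "__main__":
      spark = SparkSession.builder.appName("pi-estimate").getOrCreate()

      n = 1_000_000

      def inside(_):
          # Sample a random point in the unit square and test the quarter circle.
          x, y = random.random(), random.random()
          return 1 if x * x + y * y <= 1.0 else 0

      count = spark.sparkContext.parallelize(range(n)).map(inside).sum()
      print(f"Rough pi estimate: {4.0 * count / n}")

      spark.stop()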

There are three ways to build a Spark CDE job:
  • Using the CDE web interface. For more information, see Running Jobs in Cloudera Data Engineering.
  • Using the CDE CLI tool. For more information, see Using the Cloudera Data Engineering command line interface.
  • Using CDE REST API endpoints. For more information, see CDE API Jobs.

In addition, you can automate migrations from Oozie on CDP Public Cloud Data Hub, CDP Private Cloud Base, CDH, and HDP to Spark and Airflow CDE jobs with the oozie2cde API. For information about using the API, see Migrating Oozie to CDE with the oozie2cde API.

The CDE CLI and API are functionally equivalent. The CLI requires downloading and installing a binary on your machine, while the API requires submitting requests authenticated with a temporary token.

Generally, the CLI is more suitable for interactive use, while the API is better for integrating CDE pipelines with external systems.
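As a hedged sketch of the API path: you typically exchange your workload credentials for a short-lived access token, then call the virtual cluster's jobs endpoint with that token. The endpoint paths, environment variable names, and response fields below are assumptions for illustration; see CDE API Jobs for the exact contract.

  # Sketch of triggering a CDE job run over the REST API with a temporary token.
  import os

  import requests

  TOKEN_ENDPOINT = os.environ["CDE_TOKEN_ENDPOINT"]  # access-token URL (assumption)
  JOBS_API_URL = os.environ["CDE_JOBS_API_URL"]      # virtual cluster jobs API base URL (assumption)
  WORKLOAD_USER = os.environ["CDP_WORKLOAD_USER"]
  WORKLOAD_PASSWORD = os.environ["CDP_WORKLOAD_PASSWORD"]

  # 1. Exchange workload credentials for a temporary access token.
  token_resp = requests.get(TOKEN_ENDPOINT, auth=(WORKLOAD_USER, WORKLOAD_PASSWORD))
  token_resp.raise_for_status()
  access_token = token_resp.json()["access_token"]

  # 2. Trigger a run of an existing CDE job (the job name is a placeholder).
  headers = {"Authorization": f"Bearer {access_token}"}
  run_resp = requests.post(f"{JOBS_API_URL}/jobs/my-spark-job/run", headers=headers, json={})
  run_resp.raise_for_status()
  print("Started run:", run_resp.json())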