Cloudera Data Engineering concepts

Learn the basic concepts behind the Cloudera Data Engineering service to better understand how to use the command line interface (CLI).

Cloudera Data Engineering has four main concepts:

job

A 'job' is a definition of something that Cloudera Data Engineering can run. For example, a job can hold the information required to run a JAR file on Spark with specific configurations.

job run

A 'job run' is an execution of a job. For example, one run of a Spark job on a Cloudera Data Engineering cluster.

session

A 'session' is an interactive, short-lived development environment for running Spark commands, which helps you iterate on and build your Spark workloads.

resource

A 'resource' refers to a job dependency that must be available to jobs at runtime. Currently the following resource types are supported:

files: a directory of files that you can upload to Cloudera Data Engineering pods at a standard location (/app/mount). This type is typically used for application files (for example, .jar or .py files) and reference files, not for the data that the job run operates on. A single job can reference multiple files resources.
python-env: provides custom Python dependencies to the job as an automatically configured Python virtual environment. A job definition can specify at most one python-env resource.
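
The per-job limits on resource types can be illustrated with a small check. This is only a sketch: the function name check_job_resources and the (name, type) pair representation are hypothetical and are not part of the Cloudera Data Engineering CLI or API.

```python
from collections import Counter

# The two resource types described above.
SUPPORTED_TYPES = {"files", "python-env"}


def check_job_resources(resources):
    """Validate a hypothetical list of (resource_name, resource_type) pairs for one job."""
    counts = Counter(rtype for _, rtype in resources)
    unknown = set(counts) - SUPPORTED_TYPES
    if unknown:
        raise ValueError(f"unsupported resource types: {sorted(unknown)}")
    # Any number of files resources is allowed, but at most one python-env.
    if counts["python-env"] > 1:
        raise ValueError("a job definition can reference at most one python-env resource")
    return counts


# Two files resources plus one python-env resource is a valid combination.
ok = check_job_resources([
    ("app-code", "files"),
    ("reference-data", "files"),
    ("py-deps", "python-env"),
])
print(dict(ok))
```

A job that lists two python-env resources would be rejected by this check with a ValueError.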

In addition, to support jobs with custom requirements, Cloudera Data Engineering also allows users to manage credentials which can be used at job run time. Currently, only custom Docker registry credentials are supported.

Submitting versus running a job

The cde spark submit and cde airflow submit commands automatically create a new job and a new resource, submit the job as a job run, and delete the job and its resources when the job run terminates.

The cde job run command requires that the job and all necessary resources have already been created and uploaded to the Cloudera Data Engineering cluster. Creating resources and jobs ahead of time has two advantages: resources can be reused across jobs, and jobs can be run using only a job name.
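
The difference between the two workflows can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: the Cluster class, its method names, and the resource and job names are hypothetical, and the code does not call the Cloudera Data Engineering CLI or API.

```python
# Toy in-memory model of the two workflows; not the CDE API or CLI.
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for a Cloudera Data Engineering cluster."""
    resources: dict = field(default_factory=dict)  # resource name -> list of file names
    jobs: dict = field(default_factory=dict)       # job name -> list of resource names
    runs: list = field(default_factory=list)       # names of jobs executed, in order

    def create_resource(self, name, files):
        self.resources[name] = list(files)

    def create_job(self, name, resource_names):
        # A job is only a definition; every resource it references must already exist.
        missing = [r for r in resource_names if r not in self.resources]
        if missing:
            raise KeyError(f"missing resources: {missing}")
        self.jobs[name] = list(resource_names)

    def run_job(self, name):
        # A job run is one execution of an existing job, addressed by name only.
        if name not in self.jobs:
            raise KeyError(f"unknown job: {name}")
        self.runs.append(name)
        return len(self.runs)  # hypothetical run id

    def submit(self, app_file):
        # Submit-style workflow: create a temporary job and resource, run once, clean up.
        tmp = f"tmp-{len(self.runs) + 1}"
        self.create_resource(tmp, [app_file])
        self.create_job(tmp, [tmp])
        try:
            return self.run_job(tmp)
        finally:
            del self.jobs[tmp]
            del self.resources[tmp]


cluster = Cluster()

# One-off submission: nothing is left behind on the cluster afterwards.
cluster.submit("etl.py")
print(cluster.jobs, cluster.resources)

# Ahead-of-time creation: one resource shared by two jobs, each run by name.
cluster.create_resource("shared-files", ["etl.py", "lookup.csv"])
cluster.create_job("nightly-etl", ["shared-files"])
cluster.create_job("backfill-etl", ["shared-files"])
cluster.run_job("nightly-etl")
cluster.run_job("backfill-etl")
print(cluster.runs)
```

In this model, the submit path leaves no job or resource behind, while the ahead-of-time path keeps one shared resource that both jobs reference and lets each job be rerun by name.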