Learn about some basic concepts behind the Cloudera Data Engineering (CDE) service to better understand how to use the command line interface (CLI).
CDE has three main concepts:
- A 'job' is a definition of something that CDE can run. For example, the information required to run a jar file on Spark with specific configurations.
- A 'job run' is an execution of a job. For example, one run of a Spark job on a CDE cluster.
- A 'resource' refers to a job dependency that must be available to jobs at runtime. Currently the following resource types are supported:
  - 'files' is a directory of files that you can upload to CDE pods into a standard location (/app/mount). This is typically for application (for example, .py files) and reference files, and not the data that the job run will operate on. Multiple 'files' resources can be referenced in a single job.
  - 'python-env' is used to provide custom Python dependencies to the job as a Python virtual environment which is automatically configured. Up to one 'python-env' resource can be specified per job definition.
  - 'custom-runtime-image' is a custom Docker container image.
In addition, to support jobs with custom requirements, CDE also allows users to manage credentials which can be used at job run time. Currently, only custom Docker registry credentials are supported.
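The relationship between the three concepts can be sketched as a small data model. This is only an illustration: the class names (ResourceType, Resource, Job, JobRun) are hypothetical and are not part of the CDE CLI or API. The only rule it enforces is the one stated above, that a job definition can specify at most one python-env resource while any number of files resources is allowed.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ResourceType(Enum):
    # The three resource types named in the text
    FILES = "files"
    PYTHON_ENV = "python-env"
    CUSTOM_RUNTIME_IMAGE = "custom-runtime-image"


@dataclass(frozen=True)
class Resource:
    """A job dependency that must be available at runtime (hypothetical model)."""
    name: str
    type: ResourceType


@dataclass
class Job:
    """A definition of something that can be run (hypothetical model)."""
    name: str
    resources: List[Resource] = field(default_factory=list)

    def __post_init__(self) -> None:
        # At most one python-env resource per job definition
        envs = [r for r in self.resources if r.type is ResourceType.PYTHON_ENV]
        if len(envs) > 1:
            raise ValueError("a job can reference at most one python-env resource")


@dataclass
class JobRun:
    """One execution of a job (hypothetical model)."""
    job: Job
    run_id: int
```

In this sketch, one Job can be wrapped by many JobRun instances, which mirrors the distinction between a job definition and its individual executions.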
Submitting versus running a job
The cde spark submit and cde airflow submit commands automatically create a new job and a new resource, submit the job as a job run, and delete the job and resources when the job run terminates.
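The create, run, and delete sequence described above can be sketched with a toy in-memory model. ToyCluster and its submit method are hypothetical names used only for illustration; they do not call the CDE CLI or any CDE API, and the temporary naming scheme is an assumption.

```python
from typing import Dict, List


class ToyCluster:
    """In-memory stand-in for a CDE cluster; not the real CDE API."""

    def __init__(self) -> None:
        self.jobs: Dict[str, str] = {}             # job name -> resource name
        self.resources: Dict[str, List[str]] = {}  # resource name -> uploaded files
        self.runs: List[str] = []                  # names of jobs that were run

    def submit(self, app_file: str) -> None:
        """Mimic the submit flow: create a job and resource, run, clean up."""
        suffix = len(self.runs)
        job_name = f"tmp-job-{suffix}"            # assumed naming, illustration only
        resource_name = f"tmp-resource-{suffix}"  # assumed naming, illustration only

        # 1. Create a new resource and a new job
        self.resources[resource_name] = [app_file]
        self.jobs[job_name] = resource_name
        try:
            # 2. Submit the job as a job run
            self.runs.append(job_name)
        finally:
            # 3. Delete the job and resource once the run terminates
            del self.jobs[job_name]
            del self.resources[resource_name]
```

After a call such as ToyCluster().submit("app.py"), the run is recorded but no job or resource remains, which is the behavior that distinguishes a submit from running a job that was defined in advance.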
The cde job run command requires a job and all necessary resources to be created and uploaded to the CDE cluster beforehand. The advantage of creating resources and jobs ahead of time is that resources can be reused across jobs, and that jobs can be run using only a job