CDE concepts
Learn about some basic concepts behind the Cloudera Data Engineering (CDE) service to better understand how you can use the command line interface (CLI).
CDE has four main concepts:
- job
- A 'job' is a definition of something that CDE can run. For example, the information required to run a jar file on Spark with specific configurations.
- job run
- A 'job run' is an execution of a job. For example, one run of a Spark job on a CDE cluster.
- session
- A 'session' is a short-lived, interactive development environment for running Spark commands that helps you iterate on and build your Spark workloads.
- resource
- A 'resource' refers to a job dependency that must be available to jobs at runtime (see the CLI sketch after this list). Currently, the following resource types are supported:
  - files
  - A 'files' resource is a directory of files that you can upload to CDE, which is mounted into CDE pods at a standard location (/app/mount). This is typically used for application files (for example, .jar or .py files) and reference files, not for the data that the job run operates on. Multiple files resources can be referenced in a single job.
  - python-env
  - A 'python-env' resource provides custom Python dependencies to the job as an automatically configured Python virtual environment. Up to one python-env resource can be specified per job definition.
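To make these concepts concrete, the following is a minimal CLI sketch showing how a session and the two resource types might be created. The session, resource, and file names are placeholder examples; adjust them to your environment.

```bash
# Start a short-lived interactive Spark session (names are placeholders).
cde session create --name my-session --type pyspark
cde session interact --name my-session

# Create a 'files' resource and upload an application file into it;
# uploaded files become available to jobs under /app/mount.
cde resource create --name my-files-resource
cde resource upload --name my-files-resource --local-path pyspark-example.py

# Create a 'python-env' resource from a requirements.txt; CDE builds
# the Python virtual environment automatically.
cde resource create --name my-python-env --type python-env
cde resource upload --name my-python-env --local-path requirements.txt
```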
In addition, to support jobs with custom requirements, CDE also allows users to manage credentials which can be used at job run time. Currently, only custom Docker registry credentials are supported.
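For example, a Docker registry credential might be registered as follows. This is a sketch: the credential name, registry server, and username are placeholders, and the CLI prompts for the password interactively.

```bash
# Register a custom Docker registry credential for use at job run time.
cde credential create --name my-docker-creds --type docker-basic \
  --docker-server hub.docker.com --docker-username my-user
```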
Submitting versus running a job
The cde spark submit and cde airflow submit commands automatically create a new job and a new resource, submit the job as a job run, and delete the job and resources when the job run terminates.
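For example, a single command handles the whole lifecycle. In this sketch, pyspark-example.py and my-dag.py are placeholder file names:

```bash
# Transient submission: CDE creates the job and resource, runs the job,
# and deletes both once the run terminates.
cde spark submit pyspark-example.py

# The equivalent one-shot submission for an Airflow DAG.
cde airflow submit my-dag.py
```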
A cde job run requires the job and all necessary resources to be created and uploaded to the CDE cluster beforehand. The advantage of creating resources and jobs ahead of time is that resources can be reused across jobs, and jobs can be run using only a job name.
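A minimal sketch of that workflow, reusing the files resource created earlier (the job and resource names are placeholders):

```bash
# Define the job once, mounting the pre-created files resource.
cde job create --name my-spark-job --type spark \
  --mount-1-resource my-files-resource \
  --application-file pyspark-example.py

# Trigger an execution using only the job name; each execution is
# tracked as a separate job run.
cde job run --name my-spark-job
```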