Running a Spark job using the CLI

The following example demonstrates how to run a Cloudera Data Engineering (CDE) Spark job using the command line interface (CLI).

Make sure that the Spark job has been created and all necessary resources have been created and uploaded.

Using the cde job run requires more preparation on the target environment compared to the cde spark submit command. Whereas cde spark submit is a quick and efficient way of testing a Spark job during development, cde job run is suited for production environments where a job is to be run multiple times, therefore removing resources and job definitions after every job run is neither necessary, nor viable.

To run a Spark job, run the following command:
cde job run --name <job name> [Spark flags...] [--wait] [--variable name=value...]
  • With [Spark flags...] you can override the corresponding job values. Spark flags that can be repeated replace the original list, except for --conf which only adds or replaces values for the given keys.
  • With [--variable] flags you can replace strings in job values. Currently the supported fields are:
    • Spark application name
    • Spark arguments
    • Spark configurations
    For a variable flag name=value any substring {{{name}}} in the value of the supported field gets replaced with value.
  • A custom runtime Docker image can be specified for the job using the --runtime-image-resource-name flag, which has to refer to the name of a custom image resource that has already been created.
  • GPU Acceleration (Technical Preview): Using [--enable-gpu-acceleration] you can accelerate your Spark jobs using GPUs. You can use [--executor-node-selector ""] and [--executor-node-toleration ""] options to configure selectors and tolerations if you want to run the job on specific GPU nodes. When this job is run, this particular job will request GPU resources.
    cde job run --name example-pi \
    --enable-gpu-acceleration \
    --executor-node-selector "" \
    --executor-node-toleration ""

By default the command returns the job run ID as soon as the job has been submitted.

Optionally, you can use the --wait switch to wait until the job run ends and returns a non-zero exit code if the job run was not successful.