Using Custom Spark Runtime Docker Images via API/CLI

This user guide demonstrates how to run Spark jobs using custom Spark runtime Docker images via the API and CLI.

Custom Spark runtime Docker images are used when custom packages and libraries must be installed and available when executing Spark jobs. These can be proprietary software packages, such as RPMs, that need to be compiled to produce the required binaries. Docker images allow you to pre-bake these dependencies into a self-contained Docker image that can be reused across multiple Spark jobs.

Entitlement for custom Docker images

To use custom Spark runtime Docker images, your Cloudera tenant must have the DE_CUSTOM_RUNTIME entitlement enabled. If it is not yet in place, request it through your Cloudera Account Team (primarily your Solution Engineer). They will fulfill the request internally and confirm when the entitlement has been applied. Allow 24 hours for fulfillment.

Docker repository credentials

To pull the base Docker image, you must have credentials and authenticate your Docker client to docker.repository.cloudera.com. To get credentials:

  1. Raise an “Admin” type case with Cloudera Support and request a License Key.
  2. Once the License Key is received, navigate to https://www.cloudera.com/downloads.html and log in with your MyCloudera credentials.
  3. Use the Credentials Generator tool, located about halfway down the page (you will see an orange Sign In button if you are not already logged in).

    As directed on the page, copy and paste the entire contents of your license file into the text box and click the Get Credentials button to generate your username and password.
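
    After your username and password are generated, authenticate your Docker client against the Cloudera registry so the base images can be pulled. A minimal sketch (the placeholders stand for the values returned by the Credentials Generator):

      mac@local:$ docker login docker.repository.cloudera.com --username <generated-username>
      Password: <generated-password>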

  1. Create a custom Docker image.
    Build the custom-spark-dex-runtime image based on the dex-spark-runtime image that matches the Cloudera Data Engineering (DEX) version of your service.

    The relevant dex-spark-runtime images are as follows.

    • Spark 3 Cloudera security hardened images

      <registry-host>/cloudera/dex/dex-spark-runtime-<spark version>-<cdh version>:<CDE version>

      Example: Dockerfile for DEX 1.24.0-b711, Spark 3.3.2, and Cloudera Runtime version 7.1.9.1015

      FROM docker.repository.cloudera.com/cloudera/dex/dex-spark-runtime-3.3.2-7.1.9.1015:1.24.0-b711
      USER root
      RUN apk add --no-cache git
      RUN pip3 install virtualenv-api
      USER ${DEX_UID}
    • Spark 3 Red Hat (insecure and deprecated) images

      <registry-host>/cloudera/dex/dex-spark-runtime-<spark version>-<cdh version>-compat:<CDE version>

      Example: Dockerfile for DEX 1.24.0-b711, Spark 3.3.2, and Cloudera Runtime version 7.1.9.1015

      FROM docker.repository.cloudera.com/cloudera/dex/dex-spark-runtime-3.3.2-7.1.9.1015-compat:1.24.0-b711
      USER root
      RUN yum install -y git && yum clean all && rm -rf /var/cache/yum
      RUN pip2 install virtualenv-api
      RUN pip3 install virtualenv-api
      USER ${DEX_UID}
    • Spark 2 Red Hat (insecure and deprecated) images

      <registry-host>/cloudera/dex/dex-spark-runtime-<spark version>-<cdh version>:<CDE version>

      Example: Dockerfile for DEX 1.24.0-b711, Spark 2.4.8, and Cloudera Runtime version 7.1.9.1015

      FROM docker.repository.cloudera.com/cloudera/dex/dex-spark-runtime-2.4.8-7.1.9.1015:1.24.0-b711
      USER root
      RUN yum install -y git && yum clean all && rm -rf /var/cache/yum
      RUN pip2 install virtualenv-api
      RUN pip3 install virtualenv-api
      USER ${DEX_UID}
  2. Build the Docker image, tag it with your custom registry, and push it to that registry.

    Example:

    mac@local:$ docker build --network=host -t docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.8-7.2.14.0:1.15.0-b117-custom . -f Dockerfile
    mac@local:$ docker push docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.8-7.2.14.0:1.15.0-b117-custom

    Here, the custom registry is docker.my-company.registry.com and the registry namespace is custom-dex.
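
    You can optionally sanity-check the image before or after pushing it. A hedged sketch, assuming the image provides /bin/sh and the packages added in the Dockerfile above:

      mac@local:$ docker run --rm --entrypoint /bin/sh docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.8-7.2.14.0:1.15.0-b117-custom -c 'git --version && pip3 show virtualenv-api'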

  3. Create a custom runtime image resource.
    Register the custom-spark-dex-runtime Docker image as a resource of type custom-runtime-image.
    1. Create a resource for registries that do not require authentication. If you are using a public Docker registry, or if the Docker registry is in the same environment as the Cloudera Data Engineering service (for example, the same AWS account or Azure subscription), you do not need to create credentials.

      CLI

      mac@local:$ cde resource create --name custom-image-resource --image docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.8-7.2.14.0:1.15.0-b117-custom --image-engine spark2 --type custom-runtime-image

      REST API

      curl -X POST -k 'https://<dex-vc-host>/dex/api/v1/resources' \
        -H "Authorization: Bearer ${CDE_TOKEN}" \
        -H 'accept: application/json' \
        -H 'Content-Type: application/json' \
        --data '{
        "customRuntimeImage": {
          "engine": "spark2",
          "image":
      "docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.
      8-7.2.14.0:1.15.0-b117
      -custom"
        },
        "name": "custom-image-resource",
        "type": "custom-runtime-image"
      }'
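
      The REST examples in this guide assume a CDE access token in the CDE_TOKEN environment variable. A hedged sketch of obtaining one (the token endpoint host varies by deployment and should be taken from your virtual cluster details; jq is used here only for convenience):

      mac@local:$ export CDE_TOKEN=$(curl -s -u <workload-user> 'https://<token-endpoint-host>/gateway/authtkn/knoxtoken/api/v1/token' | jq -r '.access_token')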
      

      Once done, skip to step 4 to submit the job.

    2. Create the credentials required to access the registry. Use the following command or API request; the credentials are stored as a secret, and the CLI prompts you for the registry password.

      CLI

      mac@local:$ ./cde credential create --name docker-creds --type docker-basic --docker-server docker-sandbox.infra.cloudera.com --docker-username my-username

      REST API

      curl -X POST -k 'https://<dex-vc-host>/dex/api/v1/credentials' \
        -H "Authorization: Bearer ${CDE_TOKEN}" \
        -H 'accept: application/json' \
        -H 'Content-Type: application/json' \
        --data '{
        "dockerBasic": {
          "password": "password123",
          "server": "docker-sandbox.infra.cloudera.com",
          "username": "my-username"
        },
        "name": "docker-creds",
        "type": "docker-basic"
      }'
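
      To confirm that the credential was stored, you can list the credentials known to the virtual cluster (a quick check, assuming the credential subcommands of a recent CDE CLI):

      mac@local:$ ./cde credential list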
      
    3. Register the custom-spark-dex-runtime Docker image as a resource of type custom-runtime-image by specifying the name of the credential created above.

      CLI

      mac@local:$ ./cde resource create --name custom-image-resource --image docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.8-7.2.14.0:1.15.0-b117-custom --image-engine spark2 --type custom-runtime-image --image-credential docker-creds

      REST API

      curl -X POST -k 'https://<dex-vc-host>/dex/api/v1/resources' \
        -H "Authorization: Bearer ${CDE_TOKEN}" \
        -H 'accept: application/json' \
        -H 'Content-Type: application/json' \
        --data '{
        "customRuntimeImage": {
          "credential": "docker-creds",
          "engine": "spark2",
          "image":
      "docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.
      8-7.2.14.0:1.15.0-b117
      -custom"
        },
        "name": "custom-image-resource",
        "type": "custom-runtime-image"
      }'
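
      To verify the registered image resource, you can describe it with the CLI, or fetch it over REST (the GET-by-name endpoint below is assumed from the resource API's conventions):

      mac@local:$ ./cde resource describe --name custom-image-resource

      curl -X GET -k 'https://<dex-vc-host>/dex/api/v1/resources/custom-image-resource' \
        -H "Authorization: Bearer ${CDE_TOKEN}" \
        -H 'accept: application/json'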
      
  4. Submit a job using the CDE CLI, referencing the custom-spark-dex-runtime image resource.

    SPARK command

    mac@local:$ ./cde --user cdpuser1 spark submit /Users/my-username/spark-examples_2.11-2.4.4.jar --class org.apache.spark.examples.SparkPi 1000 --runtime-image-resource-name=custom-image-resource

    JOB command

    mac@local:$ ./cde --user cdpuser1 resource create --name spark-jar
    mac@local:$ ./cde --user cdpuser1 resource upload --name spark-jar --local-path spark-examples_2.11-2.4.4.jar
    mac@local:$ ./cde --user cdpuser1 job create --name spark-pi-job-cli --type spark --mount-1-resource spark-jar --application-file spark-examples_2.11-2.4.4.jar --class org.apache.spark.examples.SparkPi --user cdpuser1 --arg 22 --runtime-image-resource-name custom-image-resource
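
    Note that job create only registers the job; it does not execute it. A minimal sketch of triggering a run:

    mac@local:$ ./cde --user cdpuser1 job run --name spark-pi-job-cli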
  5. The Spark driver and executor pods should now use this image. You can confirm this by opening a shell into those pods and verifying that the externally installed libraries or files exist.
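
    For example, a hedged verification sketch, assuming kubectl access to the cluster backing the virtual cluster (substitute your actual driver pod name and namespace):

      mac@local:$ kubectl exec -it <driver-pod-name> -n <virtual-cluster-namespace> -- /bin/sh
      $ git --version
      $ pip3 show virtualenv-api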