Using Custom Spark Runtime Docker Images Via API/CLI

This guide shows, with examples, how to run jobs that use a custom Spark runtime Docker image through the CDE CLI and the API.

Steps

  1. Create a custom Docker image.
    Build the custom-spark-dex-runtime image on top of the dex-spark-runtime image that matches your Cloudera Data Engineering version.

    The relevant dex-spark-runtime images are as follows.

    • Spark 3 Cloudera security hardened images

      <registry-host>/cloudera/dex/dex-spark-runtime-<spark version>-<cdh version>:<Cloudera Data Engineering version>

      Example: Dockerfile for DEX 1.24.0-b711, Spark 3.3.2 and Cloudera Runtime version 7.1.9.1015

      FROM docker.repository.cloudera.com/cloudera/dex/dex-spark-runtime-3.3.2-7.1.9.1015:1.24.0-b711
      USER root
      RUN apk add --no-cache git
      RUN pip3 install virtualenv-api
      USER ${DEX_UID}
    • Spark 3 Red Hat (insecure and deprecated) images

      <registry-host>/cloudera/dex/dex-spark-runtime-<spark version>-<cdh version>-compat:<CDE version>

      Example: Dockerfile for DEX 1.24.0-b711, Spark 3.3.2 and Cloudera Runtime version 7.1.9.1015

      FROM docker.repository.cloudera.com/cloudera/dex/dex-spark-runtime-3.3.2-7.1.9.1015-compat:1.24.0-b711
      USER root
      RUN yum install -y git && yum clean all && rm -rf /var/cache/yum
      RUN pip2 install virtualenv-api
      RUN pip3 install virtualenv-api
      USER ${DEX_UID}
    • Spark 2 Red Hat (insecure and deprecated) images

      <registry-host>/cloudera/dex/dex-spark-runtime-<spark version>-<cdh version>:<CDE version>

      Example: Dockerfile for DEX 1.24.0-b711, Spark 2.4.8 and Cloudera Runtime version 7.1.9.1015

      FROM docker.repository.cloudera.com/cloudera/dex/dex-spark-runtime-2.4.8-7.1.9.1015:1.24.0-b711
      USER root
      RUN yum install -y git && yum clean all && rm -rf /var/cache/yum
      RUN pip2 install virtualenv-api
      RUN pip3 install virtualenv-api
      USER ${DEX_UID}
  2. Build the Docker image, tag it for your custom registry, and push it to that registry.

    Example:

    mac@local:$ docker build --network=host -t docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.8-7.2.14.0:1.15.0-b117-custom . -f Dockerfile
    mac@local:$ docker push docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.8-7.2.14.0:1.15.0-b117-custom

    Here, the custom registry is docker.my-company.registry.com and the registry namespace is custom-dex.

  3. Create a custom runtime image resource.

    Register the custom-spark-dex-runtime Docker image as a resource of type custom-runtime-image.

    1. If the registry does not require authentication, create the resource without credentials. This applies to a public Docker registry, and to a registry in the same environment as the Cloudera Data Engineering service, for example, the same AWS account or Azure subscription.
      mac@local:$ cde resource create --name custom-image-resource --image docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.8-7.2.14.0:1.15.0-b117-custom --image-engine spark2 --type custom-runtime-image
      curl -X POST -k 'https://<dex-vc-host>/dex/api/v1/resources' \
        -H "Authorization: Bearer ${CDE_TOKEN}" \
        -H 'accept: application/json' \
        -H 'Content-Type: application/json' \
        --data '{
        "customRuntimeImage": {
          "engine": "spark2",
          "image": "docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.8-7.2.14.0:1.15.0-b117-custom"
        },
        "name": "custom-image-resource",
        "type": "custom-runtime-image"
      }'
      

      Once done, skip to step 4 to submit the job.

    2. If the registry requires authentication, first create the credentials for it. Use the following command or API request. The credentials are stored as a secret.
      mac@local:$ ./cde credential create --name docker-creds --type docker-basic --docker-server docker-sandbox.infra.cloudera.com --docker-username my-username
      curl -X POST -k 'https://<dex-vc-host>/dex/api/v1/credentials' \
        -H "Authorization: Bearer ${CDE_TOKEN}" \
        -H 'accept: application/json' \
        -H 'Content-Type: application/json' \
        --data '{
        "dockerBasic": {
          "password": "password123",
          "server": "docker-sandbox.infra.cloudera.com",
          "username": "my-username"
        },
        "name": "docker-creds",
        "type": "docker-basic"
      }'
      
    3. Register the custom-spark-dex-runtime docker image as a resource of type custom-runtime-image by specifying the name of the credential created earlier.
      mac@local:$ ./cde resource create --name custom-image-resource --image docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.8-7.2.14.0:1.15.0-b117-custom --image-engine spark2 --type custom-runtime-image --image-credential docker-creds
      curl -X POST -k 'https://<dex-vc-host>/dex/api/v1/resources' \
        -H "Authorization: Bearer ${CDE_TOKEN}" \
        -H 'accept: application/json' \
        -H 'Content-Type: application/json' \
        --data '{
        "customRuntimeImage": {
          "credential": "docker-creds",
          "engine": "spark2",
          "image": "docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.8-7.2.14.0:1.15.0-b117-custom"
        },
        "name": "custom-image-resource",
        "type": "custom-runtime-image"
      }'
      
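  4. Submit a job that uses the custom-spark-dex-runtime image resource, using the CDE CLI. Either run the job directly with spark submit, or upload the application file as a resource and create a job from it.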

    mac@local:$ ./cde --user cdpuser1 spark submit /Users/my-username/spark-examples_2.11-2.4.4.jar --class org.apache.spark.examples.SparkPi 1000 --runtime-image-resource-name=custom-image-resource
    
    mac@local:$ ./cde --user cdpuser1 resource create --name spark-jar
    mac@local:$ ./cde --user cdpuser1 resource upload --name spark-jar --local-path spark-examples_2.11-2.4.4.jar
    mac@local:$ ./cde --user cdpuser1 job create --name spark-pi-job-cli --type spark --mount-1-resource spark-jar --application-file spark-examples_2.11-2.4.4.jar --class org.apache.spark.examples.SparkPi --user cdpuser1 --arg 22 --runtime-image-resource-name custom-image-resource
  5. Verify that the Spark driver and executor pods use the custom image: open a shell in one of those pods and check that the additionally installed libraries or files exist.
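
The check in step 5 could be scripted roughly as follows. This is a minimal sketch, assuming you have kubectl access to the cluster that runs the job. The namespace and pod name arguments are placeholders that you must look up yourself, the function names are hypothetical, and git is used as the probe only because the example Dockerfiles above install it.

```shell
#!/usr/bin/env bash
# Sketch only: assumes kubectl is configured for the cluster that runs the job.
# The namespace and pod arguments are placeholders, not names defined by this guide.

# Print the container image(s) that a pod is running, space separated.
pod_images() {
  local namespace="$1" pod="$2"
  kubectl get pod "$pod" --namespace "$namespace" \
    --output jsonpath='{.spec.containers[*].image}'
}

# Succeed if one of the pod's containers runs exactly the expected image.
pod_uses_image() {
  local namespace="$1" pod="$2" expected="$3"
  pod_images "$namespace" "$pod" | tr ' ' '\n' | grep -Fxq -- "$expected"
}

# Succeed if a tool installed by the custom Dockerfile is on the PATH in the pod.
pod_has_tool() {
  local namespace="$1" pod="$2" tool="$3"
  kubectl exec "$pod" --namespace "$namespace" -- sh -c "command -v $tool" >/dev/null
}
```

For example, `pod_uses_image <namespace> <driver-pod> docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.8-7.2.14.0:1.15.0-b117-custom && pod_has_tool <namespace> <driver-pod> git` returns success only when the driver pod runs the custom image and git is installed in it.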

Public Docker registries

For registries that do not require authentication, create the resource without specifying credentials.

mac@local:$ cde resource create --name custom-image-resource --image docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.7-7.1.7.1000:1.18.2-b70-custom --image-engine spark2 --type custom-runtime-image
curl -X POST -k 'https://<dex-vc-host>/dex/api/v1/resources' \
  -H "Authorization: Bearer ${CDE_TOKEN}" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  --data '{
  "customRuntimeImage": {
    "engine": "spark2",
    "image": "docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.7-7.1.7.1000:1.18.2-b70-custom"
  },
  "name": "custom-image-resource",
  "type": "custom-runtime-image"
}'

Once done, skip to step 4 to submit the job.
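
Hand-written JSON inside a shell-quoted string is easy to break with a stray or typographic quote. One way to sidestep that, offered as a sketch rather than a documented procedure, is to generate the request body in a small function and pass the result to curl. The function name build_resource_payload is hypothetical; the field names are copied from the request body in this section. The sketch does not escape special characters, so it assumes the values contain no double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Sketch only: build_resource_payload is a hypothetical helper, not part of the CDE CLI.
# Field names mirror the request body shown in this section.
# Assumes the arguments contain no double quotes or backslashes.
build_resource_payload() {
  local name="$1" image="$2" engine="$3"
  printf '{"customRuntimeImage": {"engine": "%s", "image": "%s"}, "name": "%s", "type": "custom-runtime-image"}\n' \
    "$engine" "$image" "$name"
}

payload="$(build_resource_payload custom-image-resource \
  "docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.7-7.1.7.1000:1.18.2-b70-custom" \
  spark2)"
echo "$payload"

# To send it (left commented out so the sketch runs without a cluster):
# curl -X POST -k 'https://<dex-vc-host>/dex/api/v1/resources' \
#   -H "Authorization: Bearer ${CDE_TOKEN}" \
#   -H 'accept: application/json' \
#   -H 'Content-Type: application/json' \
#   --data "$payload"
```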

Error: Custom image resource with missing or wrong credentials

If a custom image resource is created with missing or incorrect credentials, the image pull fails with an error like the one below, which appears in the logs or in the Kubernetes pod events.

Example

Failed to pull image "docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.7-7.1.7.1000:1.18.2-b70-custom": 
rpc error: code = Unknown desc = Error reading manifest 1.18.2-b70-custom in docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.7-7.1.7.1000: 
errors: denied: requested access to the resource is denied unauthorized: authentication required
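
When scanning logs or pod events for this failure, a simple text match can separate registry authentication failures from other pull errors. The sketch below is only an assumption about how you might do that: the function name is hypothetical, and the patterns are taken solely from the example message above, so other registries may word the error differently.

```shell
#!/usr/bin/env bash
# Sketch only: is_registry_auth_error is a hypothetical helper.
# Reads text on stdin and succeeds if it looks like a registry authentication failure.
is_registry_auth_error() {
  grep -Eq 'unauthorized: authentication required|denied: requested access to the resource is denied'
}

# Sample text abbreviated from the example error in this section.
sample='Failed to pull image "docker.my-company.registry.com/custom-dex/dex-spark-runtime-2.4.7-7.1.7.1000:1.18.2-b70-custom": errors: denied: requested access to the resource is denied unauthorized: authentication required'

if printf '%s\n' "$sample" | is_registry_auth_error; then
  echo "registry authentication failure: check the credential attached to the resource"
fi
```

In practice the input could come from the pod events, for example by piping the output of `kubectl describe pod` for the driver pod into the function.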