Using Custom Spark Runtime Docker Images via API/CLI
This user guide demonstrates how to run Spark jobs using custom Spark runtime Docker images via the API or CLI.
Custom Spark runtime Docker images are used when custom packages and libraries need to be installed and used when executing Spark jobs. These custom packages and libraries can be proprietary software packages, such as RPMs, that must be compiled to generate the required binaries. Docker images allow you to pre-bake these dependencies into a self-contained Docker image that can be reused across multiple Spark jobs.
Entitlement for custom Docker images
To use custom Spark runtime Docker images, your Cloudera tenant must have the
DE_CUSTOM_RUNTIME entitlement enabled. If not yet in place, you
can request it via your Cloudera Account Team (primarily your Solution Engineer).
They will fulfill the request internally and confirm to you when the entitlement has
been applied. Please allow 24 hours for fulfillment.
Docker repository credentials
To pull the base Docker image, you must have credentials and authenticate your Docker
client to docker.repository.cloudera.com. To get credentials:
1. Raise an “Admin” type case with Cloudera Support and request a License Key.
2. Use the Credentials Generator tool, located about half-way down the licensing page (you will see an orange Sign In button if you are not already logged in).
3. As directed on the page, copy and paste the entire contents of your license file into the text box and click the Get Credentials button to generate your username and password.
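With the generated username and password, you can authenticate your Docker client to the Cloudera repository. A minimal example (the username is a placeholder; Docker prompts for the password):

docker login docker.repository.cloudera.com -u <your-username>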
Create a custom Docker image.
Build the custom-spark-dex-runtime image based on the dex-spark-runtime image that matches your current Cloudera Data Engineering (DEX) version. The relevant dex-spark-runtime images are shown in the following examples.
Example: Dockerfile for DEX 1.24.0-b711, Spark 3.3.2, and Cloudera Runtime version 7.1.9.1015

FROM docker.repository.cloudera.com/cloudera/dex/dex-spark-runtime-3.3.2-7.1.9.1015:1.24.0-b711
USER root
RUN apk add --no-cache git
RUN pip3 install virtualenv-api
USER ${DEX_UID}
Example: Dockerfile for DEX 1.24.0-b711, Spark 3.3.2, and Cloudera Runtime version 7.1.9.1015, based on the -compat image (which includes Python 2)
FROM docker.repository.cloudera.com/cloudera/dex/dex-spark-runtime-3.3.2-7.1.9.1015-compat:1.24.0-b711
USER root
RUN yum install -y git && yum clean all && rm -rf /var/cache/yum
RUN pip2 install virtualenv-api
RUN pip3 install virtualenv-api
USER ${DEX_UID}
Example: Dockerfile for DEX 1.24.0-b711, Spark 2.4.8, and Cloudera Runtime version 7.1.9.1015

FROM docker.repository.cloudera.com/cloudera/dex/dex-spark-runtime-2.4.8-7.1.9.1015:1.24.0-b711
USER root
RUN yum install -y git && yum clean all && rm -rf /var/cache/yum
RUN pip2 install virtualenv-api
RUN pip3 install virtualenv-api
USER ${DEX_UID}
Build the Docker image, tag it with your custom registry, and push it to that registry.
In this example, the custom registry is docker.my-company.registry.com and the registry namespace is custom-dex.
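For example, assuming the Dockerfile above and an illustrative image name and tag:

docker build -t docker.my-company.registry.com/custom-dex/dex-spark-runtime-3.3.2-7.1.9.1015:1.24.0-b711-custom .
docker push docker.my-company.registry.com/custom-dex/dex-spark-runtime-3.3.2-7.1.9.1015:1.24.0-b711-custom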
Create a custom runtime image resource.
Register the custom-spark-dex-runtime Docker image as a resource of type custom-runtime-image.
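A minimal sketch using the CDE CLI (the resource name and image reference are illustrative, and flag names may vary between CDE versions):

cde resource create --name custom-image-resource \
  --type custom-runtime-image \
  --image-engine spark3 \
  --image docker.my-company.registry.com/custom-dex/dex-spark-runtime-3.3.2-7.1.9.1015:1.24.0-b711-custom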
Create a resource for registries that do not require authentication. If you are using a public Docker registry, or if the Docker registry is in the same environment as the Cloudera Data Engineering service (for example, the same AWS account or Azure subscription), you do not need to create credentials.
Create a resource for registries that require credentials to access. Use the following command or API request to create the credentials, which are stored as a secret.
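A sketch of the credential creation step using the CDE CLI (the credential name and registry are illustrative; the CLI prompts for the password, and flag names may vary between CDE versions):

cde credential create --name docker-creds \
  --type docker-basic \
  --docker-server docker.my-company.registry.com \
  --docker-username my-username

You can then reference this credential when registering the custom runtime image resource.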
When you run a job that uses this resource, the Spark driver and executor pods should use this image. You can confirm this by opening a shell into those pods and verifying that the externally installed libraries or files exist.
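For example, assuming a job that references the resource created above (the job name, application file, and flag names are illustrative):

cde job create --name custom-image-job --type spark \
  --runtime-image-resource-name custom-image-resource \
  --application-file pi.py
cde job run --name custom-image-job

If you have kubectl access to the cluster, you could then check a driver pod for the packages installed in the custom image (pod and namespace names are placeholders):

kubectl exec -it <driver-pod> -n <cde-namespace> -- git --version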