Using Custom Spark Runtime Docker Images via API/CLI
This user guide demonstrates how to run Spark jobs using custom Spark runtime Docker images via the API or CLI.
Custom Spark runtime Docker images are used when custom packages and libraries need to be installed and used when executing Spark jobs. These custom packages and libraries can be proprietary software packages, such as RPMs, that must be compiled to generate the required binaries. Docker images allow you to pre-bake these dependencies into a self-contained Docker image that can be reused across multiple Spark jobs.
Entitlement for custom Docker images
To use custom Spark runtime Docker images, your Cloudera tenant must have the
DE_CUSTOM_RUNTIME entitlement enabled. If not yet in place, you
can request it via your Cloudera Account Team (primarily your Solution Engineer).
They will fulfill the request internally and confirm to you when the entitlement has
been applied. Please allow 24 hours for fulfillment.
Docker repository credentials
To pull the base Docker image, you must have credentials and authenticate your Docker
client to docker.repository.cloudera.com. To get credentials:
1. Raise an “Admin” type case with Cloudera Support and request a License Key.
2. Use the Credentials Generator tool, located about half-way down the licensing page (you will see an orange Sign In button if you are not already logged in).
3. As directed on the page, copy and paste the entire contents of your license file into the text box and click the Get Credentials button to generate your username and password.
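With the generated username and password, you can authenticate your Docker client to the Cloudera repository. A minimal example (the username is a placeholder; Docker prompts for the password):

docker login docker.repository.cloudera.com -u <your-username>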
Create a custom Docker image.
Build the custom-spark-dex-runtime image based on the dex-spark-runtime image that matches your current Cloudera Data Engineering (DEX) version. The relevant dex-spark-runtime images are shown in the following examples.
Example: Dockerfile for DEX 1.24.0-b711, Spark 3.3.2, and Cloudera Runtime version 7.1.9.1015

FROM docker.repository.cloudera.com/cloudera/dex/dex-spark-runtime-3.3.2-7.1.9.1015:1.24.0-b711
USER root
RUN apk add --no-cache git
RUN pip3 install virtualenv-api
USER ${DEX_UID}
Example: Dockerfile for DEX 1.24.0-b711, Spark 3.3.2, and Cloudera Runtime version 7.1.9.1015, based on the -compat image (which includes Python 2)
FROM docker.repository.cloudera.com/cloudera/dex/dex-spark-runtime-3.3.2-7.1.9.1015-compat:1.24.0-b711
USER root
RUN yum install -y git && yum clean all && rm -rf /var/cache/yum
RUN pip2 install virtualenv-api
RUN pip3 install virtualenv-api
USER ${DEX_UID}
Example: Dockerfile for DEX 1.24.0-b711, Spark 2.4.8, and Cloudera Runtime version 7.1.9.1015

FROM docker.repository.cloudera.com/cloudera/dex/dex-spark-runtime-2.4.8-7.1.9.1015:1.24.0-b711
USER root
RUN yum install -y git && yum clean all && rm -rf /var/cache/yum
RUN pip2 install virtualenv-api
RUN pip3 install virtualenv-api
USER ${DEX_UID}
Build the Docker image, tag it with your custom registry, and push it to that registry.
In this example, the custom registry is docker.my-company.registry.com and the registry namespace is custom-dex.
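For example, assuming the Dockerfile above and an illustrative image name and tag:

docker build -t docker.my-company.registry.com/custom-dex/dex-spark-runtime-3.3.2-7.1.9.1015:1.24.0-b711-custom .
docker push docker.my-company.registry.com/custom-dex/dex-spark-runtime-3.3.2-7.1.9.1015:1.24.0-b711-custom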
Create a custom runtime image resource.
Register the custom-spark-dex-runtime Docker image as a resource of type custom-runtime-image.
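A minimal sketch using the CDE CLI (the resource name and image reference are illustrative, and flag names may vary between CDE versions):

cde resource create --name custom-image-resource \
  --type custom-runtime-image \
  --image-engine spark3 \
  --image docker.my-company.registry.com/custom-dex/dex-spark-runtime-3.3.2-7.1.9.1015:1.24.0-b711-custom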
Create a resource for registries that do not require authentication. If you are using a public Docker registry, or if the Docker registry is in the same environment as the Cloudera Data Engineering service (for example, the same AWS account or Azure subscription), you do not need to create credentials.
Create a resource for registries that require credentials to access. Use the following command or API request to create the credentials, which are stored as a secret.
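A sketch of the credential creation step using the CDE CLI (the credential name and registry are illustrative; the CLI prompts for the password, and flag names may vary between CDE versions):

cde credential create --name docker-creds \
  --type docker-basic \
  --docker-server docker.my-company.registry.com \
  --docker-username my-username

You can then reference this credential when registering the custom runtime image resource.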
When you run a job that uses this resource, the Spark driver and executor pods should use this image. You can confirm this by opening a shell into those pods and verifying that the externally installed libraries or files exist.
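For example, assuming a job that references the resource created above (the job name, application file, and flag names are illustrative):

cde job create --name custom-image-job --type spark \
  --runtime-image-resource-name custom-image-resource \
  --application-file pi.py
cde job run --name custom-image-job

If you have kubectl access to the cluster, you could then check a driver pod for the packages installed in the custom image (pod and namespace names are placeholders):

kubectl exec -it <driver-pod> -n <cde-namespace> -- git --version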