Creating Extended Engine Images
Cloudera Data Science Workbench site administrators and project administrators can add libraries and other dependencies to the Docker image in which their engines run. Site administrators can whitelist specific images for use in projects, and project administrators can select which of these whitelisted images is used for their projects.
Use the following basic MeCab example as a guide to extending the Cloudera Data Science Workbench base engine image with the libraries you want.
Related Resources:
- The Cloudera Engineering Blog post Customizing Docker Images in Cloudera Data Science Workbench walks through an end-to-end example of building and publishing a customized Docker image and using it as an engine in Cloudera Data Science Workbench.
- For an example of how to extend the base engine image to include Conda, see Creating an Extensible Engine With Conda.
Example: MeCab
The following Dockerfile shows how to add MeCab, a Japanese text tokenizer, to the base Cloudera Data Science Workbench engine.
# Dockerfile
FROM docker.repository.cloudera.com/cdsw/engine:3
RUN apt-get update && \
    apt-get install -y -q mecab \
        libmecab-dev \
        mecab-ipadic-utf8 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
RUN cd /tmp && \
    git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git && \
    /tmp/mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -y -n -p /var/lib/mecab/dic/neologd && \
    rm -rf /tmp/mecab-ipadic-neologd
RUN pip install --upgrade pip
RUN pip install mecab-python==0.996
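Once an image has been built from this Dockerfile, it can be useful to confirm that the MeCab binary and the NEologd dictionary actually made it into the image. The following is a minimal sketch of such a check, meant to be run in a terminal inside a container or session started from the image. It is not part of the documented procedure: the function name check_engine and the DICT_DIR variable are hypothetical, and the default dictionary path is simply the value passed to -p in the Dockerfile.

```shell
#!/bin/sh
# Sketch: check that MeCab and its NEologd dictionary are present.
# Run inside a container or session started from the extended image.

# Dictionary directory; the default is taken from the -p option in the Dockerfile.
DICT_DIR="${DICT_DIR:-/var/lib/mecab/dic/neologd}"

# check_engine DIR
# Prints one line per component and returns non-zero if anything is missing.
check_engine() {
  rc=0
  if command -v mecab >/dev/null 2>&1; then
    echo "mecab: found"
  else
    echo "mecab: missing"
    rc=1
  fi
  if [ -d "$1" ]; then
    echo "dictionary: found at $1"
  else
    echo "dictionary: missing at $1"
    rc=1
  fi
  return "$rc"
}

check_engine "$DICT_DIR" || echo "image is missing one or more MeCab components"
```

If both lines report "found", the apt-get and NEologd install steps most likely completed as intended. A "missing" line points at the corresponding RUN instruction as the place to look first.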
To use this image in your Cloudera Data Science Workbench project, perform the following steps.
- Build a new image with the Dockerfile.
docker build --network=host -t <company-registry>/user/cdsw-mecab:latest . -f Dockerfile
- Push the image to your company's Docker registry.
docker push <company-registry>/user/cdsw-mecab:latest
- Whitelist the image, <company-registry>/user/cdsw-mecab:latest. Only a site administrator can do this.
  - Log in as a site administrator.
  - Click Admin.
  - Go to the Engines tab.
  - Add <company-registry>/user/cdsw-mecab:latest to the list of whitelisted engine images.
- Make the whitelisted image available to your project. Only a project administrator can do this.
  - Go to the project Settings page.
  - Click Engines.
  - Select <company-registry>/user/cdsw-mecab:latest from the dropdown list of available Docker images.
Sessions and jobs you run in your project will now have access to this custom image.
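The build and push commands from the first two steps can be wrapped in a small script so that the image name is defined in one place. This is only a sketch under stated assumptions: registry.example.com is a placeholder for <company-registry>, and the REGISTRY and DRY_RUN variables and the run helper are hypothetical names introduced for illustration. With DRY_RUN left at its default of 1, the script prints the docker commands without executing them.

```shell
#!/bin/sh
# Sketch: build and push the extended engine image.
set -eu

# Placeholder registry host (assumption); replace with your company's registry.
REGISTRY="${REGISTRY:-registry.example.com}"
IMAGE="${REGISTRY}/user/cdsw-mecab:latest"

# DRY_RUN=1 (default) only prints commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# run CMD [ARGS...]
# Echoes the command, then executes it unless in dry-run mode.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Same commands as in the steps above, with the image name factored out.
run docker build --network=host -t "$IMAGE" . -f Dockerfile
run docker push "$IMAGE"
```

Running the script as DRY_RUN=0 REGISTRY=<company-registry> sh build.sh (where build.sh is whatever file name you save it under) would perform the actual build and push. The whitelisting and project selection steps still have to be done in the web UI.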