Creating Extended Engine Images

Cloudera Data Science Workbench site administrators and project administrators can add libraries and other dependencies to the Docker image in which their engines run. Site administrators can whitelist specific images for use in projects, and project administrators can select which of these whitelisted images their projects use.

Use the following basic MeCab example as a guide to extending the Cloudera Data Science Workbench base engine image with the libraries you want.

Example: MeCab

The following Dockerfile shows how to add MeCab, a Japanese text tokenizer, to the base Cloudera Data Science Workbench engine.

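# Dockerfile

# Start from the base Cloudera Data Science Workbench engine image.
FROM docker.repository.cloudera.com/cdsw/engine:3
# Install MeCab, its development headers, and the UTF-8 IPA dictionary,
# then clear the apt cache to keep the image small.
RUN apt-get update && \
    apt-get install -y -q mecab \
                          libmecab-dev \
                          mecab-ipadic-utf8 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
# Build the NEologd dictionary and install it under /var/lib/mecab/dic/neologd.
RUN cd /tmp && \
    git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git && \
    /tmp/mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -y -n -p /var/lib/mecab/dic/neologd && \
    rm -rf /tmp/mecab-ipadic-neologd
# Upgrade pip, then install the MeCab Python bindings.
RUN pip install --upgrade pip
RUN pip install mecab-python==0.996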
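
The Dockerfile above installs the mecab command and places the NEologd dictionary under /var/lib/mecab/dic/neologd. Once an image has been built from it, a check along the following lines could confirm that both are present. This is a minimal sketch, assuming it is run in a terminal inside a container or session that uses the extended image; the function name check_mecab and the NEOLOGD_DIR variable are hypothetical and exist only for this illustration.

```shell
#!/usr/bin/env bash
# Hypothetical smoke test for the extended engine image.
# NEOLOGD_DIR defaults to the dictionary path used in the Dockerfile (-p option).
NEOLOGD_DIR="${NEOLOGD_DIR:-/var/lib/mecab/dic/neologd}"

check_mecab() {
    # $1: directory that should contain the NEologd dictionary
    local dic_dir="$1"

    # The mecab binary is installed by the apt-get step.
    if ! command -v mecab >/dev/null 2>&1; then
        echo "mecab binary not found on PATH"
        return 1
    fi

    # The dictionary directory is created by the NEologd install step.
    if [ ! -d "$dic_dir" ]; then
        echo "dictionary directory missing: $dic_dir"
        return 1
    fi

    # Tokenize a sample sentence with the NEologd dictionary.
    echo "すもももももももものうち" | mecab -d "$dic_dir"
}

if check_mecab "$NEOLOGD_DIR"; then
    echo "MeCab smoke test passed"
else
    echo "MeCab smoke test failed"
fi
```

Outside the image, the script reports a failure rather than aborting, so it is safe to try on any machine.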

To use this image in your Cloudera Data Science Workbench project, perform the following steps:
  1. Build a new image with the Dockerfile.
    docker build --network=host -t <company-registry>/user/cdsw-mecab:latest . -f Dockerfile
  2. Push the image to your company's Docker registry.
    docker push <company-registry>/user/cdsw-mecab:latest
  3. Whitelist the image, <company-registry>/user/cdsw-mecab:latest. Only a site administrator can do this.
    1. Log in as a site administrator.
    2. Click Admin.
    3. Go to the Engines tab.
    4. Add <company-registry>/user/cdsw-mecab:latest to the list of whitelisted engine images.
  4. Make the whitelisted image available to your project. Only a project administrator can do this.
    1. Go to the project Settings page.
    2. Click Engines.
    3. Select <company-registry>/user/cdsw-mecab:latest from the dropdown list of available Docker images. Sessions and jobs that you run in your project now have access to this custom image.
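
Steps 1 and 2 above (build and push) can be wrapped in a small script so that the image tag is defined in one place. The following is a sketch under stated assumptions, not a required workflow: REGISTRY, DRY_RUN, and the helper functions run and build_and_push are hypothetical names, and registry.example.com is a placeholder for <company-registry>. By default the script only prints the commands; setting DRY_RUN=0 executes them.

```shell
#!/usr/bin/env bash
# Sketch: build and push the extended engine image (steps 1 and 2).
# REGISTRY and DRY_RUN are hypothetical variables introduced for illustration.
REGISTRY="${REGISTRY:-registry.example.com}"   # placeholder for <company-registry>
IMAGE="${REGISTRY}/user/cdsw-mecab:latest"
DRY_RUN="${DRY_RUN:-1}"                        # 1 = print commands only, 0 = execute

run() {
    # Print the command, and execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

build_and_push() {
    # Same commands as in steps 1 and 2; the push runs only if the build succeeds.
    run docker build --network=host -t "$IMAGE" . -f Dockerfile &&
        run docker push "$IMAGE"
}

build_and_push
```

Run the script from the directory that contains the Dockerfile, for example with REGISTRY set to your registry host and DRY_RUN=0. Whitelisting the image and selecting it for a project (steps 3 and 4) are still done in the web UI as described above.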