End-to-End Example: MeCab

This topic walks you through a simple end-to-end example on how to build and use custom engines.

This section demonstrates how to customize the Cloudera Data Science Workbench base engine image to include the MeCab (a Japanese text tokenizer) library.

This is a sample Dockerfile that adds MeCab to the Cloudera Data Science Workbench base image.

# Dockerfile

FROM docker.repository.cloudera.com/cdsw/engine:5
RUN apt-get update && \
    apt-get install -y -q mecab \
                          libmecab-dev \
                          mecab-ipadic-utf8 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
RUN cd /tmp && \
    git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git && \
    /tmp/mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -y -n -p /var/lib/mecab/dic/neologd && \
    rm -rf /tmp/mecab-ipadic-neologd
RUN pip install --upgrade pip
RUN pip install mecab-python==0.996
To use this image on your Cloudera Machine Learning project, perform the following steps.
  1. Build a new image with the Dockerfile.
    docker build --network=host -t <company-registry>/user/cdsw-mecab:latest . -f Dockerfile
  2. Push the image to your company's Docker registry.
    docker push <your-company-registry>/user/cdsw-mecab:latest
  3. Whitelist the image, <your-company-registry>/user/cdsw-mecab:latest. Only a site administrator can do this.

    Go to Admin > Engines and add <company-registry>/user/cdsw-mecab:latest to the list of whitelisted engine images.



  4. Ask a project administrator to set the new image as the default for your project. Go to the project Settings, click Engines, and select company-registry/user/cdsw-mecab:latest from the dropdown.



    You should now be able to run this project on the customized MeCab engine.