Requirements for using a PBJ Workbench

Learn about the prerequisites and preparation steps for setting up a PBJ Workbench.

PBJ Workbench setup: Python installation

PBJ Runtimes must have Python installed, even if the Runtime is designed to run another kernel in Cloudera AI, for example, an R kernel. The minimum supported Python version is 3.7. Python can be installed with the package manager of the base image or compiled by the user.

The custom PBJ Runtime image must meet the following essential requirements:

  • The Python binary, or a symlink pointing to it, must be located at the following path: /usr/local/bin/python3.
  • The Python binary must be available on the PATH environment variable under the name python, so that running the python command in a terminal launches Python.
  • Running python --version must report Python 3.7 or higher.
  • If the Runtime is configured to run a Python kernel in Cloudera AI, both the python and the /usr/local/bin/python3 commands must launch the same Python process that is registered as a Jupyter kernel.

If the chosen method for installing Python does not place the Python binary under /usr/local/bin/python3, or does not create the python command, create the appropriate symlink files.
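
As a minimal sketch of the symlink step, the following Dockerfile fragment assumes that Python was installed with apt on an Ubuntu base image, which places the binary at /usr/bin/python3, and that /usr/local/bin is on the PATH. Adjust the source path if your installation method puts the binary elsewhere. The last command fails the build if the python command resolves to a version older than 3.7.

```dockerfile
# Assumption: apt installed Python at /usr/bin/python3.
RUN ln -sf /usr/bin/python3 /usr/local/bin/python3 && \
    ln -sf /usr/local/bin/python3 /usr/local/bin/python && \
    python -c 'import sys; assert sys.version_info >= (3, 7), sys.version'
```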

Installing Jupyter dependencies and registering your kernel

  1. Install Jupyter Kernel Gateway version 2.5.2 into the Docker image.

    You might need to modify this example command depending on the filename and path of the pip executable in the image.

    RUN pip3 install "jupyter-kernel-gateway==2.5.2"
  2. Note the path to the jupyter executable installed by the pip package manager. Set the command that runs Jupyter Kernel Gateway in the ML_RUNTIME_JUPYTER_KERNEL_GATEWAY_CMD environment variable in the Docker image:

    ENV ML_RUNTIME_JUPYTER_KERNEL_GATEWAY_CMD="/path/to/jupyter kernelgateway"

    When the Runtime is launched in Cloudera AI, Cloudera AI automatically sets the correct IP address and port configuration for Jupyter Kernel Gateway.

  3. Register the Jupyter kernel.

    Each instance of the PBJ Workbench communicates with the Jupyter kernel installed in the Runtime image by using the Jupyter protocol. Kernels are available for a wide variety of languages and versions. Install the kernel of your choice in the image by following its installation instructions. A kernel named python3 is registered by default when jupyter-kernel-gateway is installed with the pip package manager. To list the installed Jupyter kernels, run the following command in a container created from the image:

    /path/to/jupyter kernelspec list
  4. Define the name of your chosen kernel in the ML_RUNTIME_JUPYTER_KERNEL_NAME environment variable in the Docker image.

    For example, if the name of your kernel is python3, include the following in the Dockerfile:

    ENV ML_RUNTIME_JUPYTER_KERNEL_NAME=python3
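
Taken together, steps 1 to 4 might look like the following Dockerfile fragment. It is a sketch, not a definitive configuration: it assumes that pip3 runs as root and places the jupyter executable at /usr/local/bin/jupyter, and that the default python3 kernel is the one you want. Confirm the path in your own image (for example, with which jupyter) before relying on it.

```dockerfile
# Step 1: install Jupyter Kernel Gateway.
RUN pip3 install "jupyter-kernel-gateway==2.5.2"

# Step 2: assumed install location of the jupyter executable; verify it in your image.
ENV ML_RUNTIME_JUPYTER_KERNEL_GATEWAY_CMD="/usr/local/bin/jupyter kernelgateway"

# Step 3: print the registered kernels to the build log.
# The build fails here if jupyter is not at the assumed path.
RUN /usr/local/bin/jupyter kernelspec list

# Step 4: name of the kernel that the PBJ Workbench connects to.
ENV ML_RUNTIME_JUPYTER_KERNEL_NAME=python3
```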

Adding the cdsw user

User code runs in the image under the user ID and group ID 8536:8536. Associate these IDs with the cdsw name in the image by adding the following command to the Dockerfile:

RUN groupadd --gid 8536 cdsw && \
    useradd -c "CDSW User" --uid 8536 -g cdsw -m -s /bin/bash cdsw
   

Configuring permissions to enable writing Cloudera user settings

All code within the runtime container, including initial setup, executes under the cdsw user. The initial setup includes linking client files for Cloudera Data Services on premises to their standard paths. To enable this process, ensure that the following paths, along with their subfolders, have write permissions for the user ID 8536:

  • /etc
  • /bin
  • /usr/share/java
  • /opt
  • /usr

Additionally, set the permissions for the following directories, along with all their subdirectories, to 777:

  • /etc
  • /etc/alternatives
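
One way to apply these permissions in the Dockerfile is sketched below. It assumes the build runs as root at this point, and it changes only directories, not the files inside them, on the assumption that directory-level write access is what the setup step needs. The mkdir call is there because some of the paths might not exist in a minimal base image.

```dockerfile
# Give user ID 8536 ownership of, and write access to, the listed
# directories and their subdirectories.
# -H makes find follow a listed path that is itself a symlink
# (for example, /bin on recent Ubuntu releases).
RUN mkdir -p /etc/alternatives /usr/share/java /opt && \
    find -H /etc /bin /usr/share/java /opt /usr -type d -exec chown 8536 {} + && \
    find -H /etc /bin /usr/share/java /opt /usr -type d -exec chmod u+w {} + && \
    find /etc -type d -exec chmod 777 {} +
```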

Additional requirements

  • The ML_RUNTIME_METADATA_VERSION environment variable and the corresponding Docker label must be set to 2.
  • To use the PBJ Workbench editor, the ML_RUNTIME_EDITOR environment variable and the corresponding Docker label must be set to PBJ Workbench. If you use a third-party editor, for example, JupyterLab or RStudio, set the ML_RUNTIME_EDITOR environment variable and the Docker label to the desired value.
  • The base image must be Ubuntu.
  • The Bash tool must be installed and must be configured as the default terminal used by the cdsw user.
  • When the PBJ Runtime runs the R kernel, the kernel must be registered with the IRkernel package and bracketed paste mode must be disabled for Bash.
  • The executable that is registered as a Jupyter kernel must be on the PATH environment variable, must be found by the which command, and must be named after the programming language of the kernel. For example, the name of the executable must be:
    • python in case of a Python kernel.
    • R in case of an R kernel.
  • When using a virtual or Conda environment and a Python kernel, Cloudera recommends configuring the PATH environment variable so that the default pip command corresponds to the Python executable registered as the Jupyter kernel.
  • Cloudera AI mounts the project’s filesystem under the path /home/cdsw and overwrites any files placed in that location within the Runtime image. Therefore, custom Runtime images must avoid installing any files or configurations under the home folder of the cdsw user.
  • Once the Runtime image starts up in Cloudera AI, the kernel must be configured to install new packages to user site libraries under /home/cdsw. That way, newly installed packages persist in the project filesystem.
  • The xz-utils package must be installed on the Runtime image.
  • The following binaries must be accessible on the PATH variable: kinit, klist, ktutil, and sshd. The binaries are installed on Ubuntu as part of the following packages: krb5-user and ssh.
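
Several of the requirements above translate directly into Dockerfile instructions. The fragment below is a sketch for an Ubuntu base image that uses the PBJ Workbench editor. It sets only the environment variables, because the keys of the corresponding Docker labels are not given in this section and must be added separately. Setting DEBIAN_FRONTEND for the install command is an assumption made here to keep package configuration prompts from blocking the build.

```dockerfile
# Metadata version and editor (the matching Docker labels must also be set).
ENV ML_RUNTIME_METADATA_VERSION=2 \
    ML_RUNTIME_EDITOR="PBJ Workbench"

# xz-utils, plus krb5-user (kinit, klist, ktutil) and ssh (sshd).
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        xz-utils krb5-user ssh && \
    rm -rf /var/lib/apt/lists/*
```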