Installing Packages and Libraries

Cloudera Data Science Workbench engines are preloaded with a few common packages and libraries for R, Python, and Scala. However, a key feature of Cloudera Data Science Workbench is the ability of different projects to install and use libraries pinned to specific versions, just as you would on your local computer.

You can install additional libraries and packages from the workbench, using either the command prompt or the terminal. Alternatively, you can use a package manager such as Conda to install and maintain packages and their dependencies. For basic usage guidelines, see Using Conda with Cloudera Data Science Workbench.

To install a package from the command prompt:

  1. Launch a session.
  2. At the command prompt in the bottom right, enter the command to install the package. Examples for R and Python follow.

R

# Install from CRAN 
install.packages("ggplot2") 

# Install using devtools 
install.packages('devtools') 
library(devtools) 
install_github("tidyverse/ggplot2")

Python 2

# Installing from console using ! shell operator and pip:
!pip install beautifulsoup4

# Installing from terminal
pip install beautifulsoup4

Python 3

# Installing from console using ! shell operator and pip3:
!pip3 install beautifulsoup4

# Installing from terminal
pip3 install beautifulsoup4

Generally, Cloudera recommends you install all packages locally into your project. This will ensure you have the exact versions you want and that these libraries will not be upgraded when Cloudera upgrades the base engine image. You only need to install libraries and packages once per project. From then on, they are available to any new engine you spawn throughout the lifetime of the project.

Specify the packages you want in a requirements.txt file that lives in your project, then install them using pip/pip3. For example, if you list the following packages in requirements.txt:
beautifulsoup4==4.6.0
seaborn==0.7.1
To install the packages, just run:
!pip3 install -r requirements.txt
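
After installing, it can be useful to confirm that the versions in the session match the pins. The following is a minimal sketch and not part of Cloudera Data Science Workbench: the helper names parse_pins, installed_version, and find_mismatches are hypothetical, the parser only understands simple name==version lines, and the version lookup assumes Python 3.8 or later, where importlib.metadata is in the standard library (an older engine would need a different lookup).

```python
def parse_pins(text):
    """Return {package: version} for simple 'name==version' lines."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and anything that is not an exact pin
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def installed_version(name):
    """Look up the installed version, or None if the package is missing."""
    # Assumption: Python 3.8+, where importlib.metadata is available.
    from importlib import metadata
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {package: (wanted, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

In a session, passing the contents of requirements.txt through parse_pins and then find_mismatches would return an empty dictionary when every pinned package is installed at the requested version.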

Cloudera Data Science Workbench does not currently support customization of system packages that require root access. However, Cloudera Data Science Workbench site administrators and project administrators can add libraries and other dependencies to the Docker image in which their engines run. See Creating Extended Engine Images.

Using Conda with Cloudera Data Science Workbench

Cloudera recommends using pip for package management along with a requirements.txt file (as described in the previous section). However, for users who prefer Conda, the default engine in Cloudera Data Science Workbench includes two environments called python2.7 and python3.6. The environment that matches the version of Python you select when you launch a new session is added to sys.path.

In Python 2 and Python 3 sessions and attached terminals, Cloudera Data Science Workbench automatically sets the CONDA_DEFAULT_ENV and CONDA_PREFIX environment variables to point to Conda environments under /home/cdsw/.conda.
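
To see what these variables are set to in a given session, you could read them from the process environment. This is a small sketch; the function name conda_environment is hypothetical, and a value of None simply means the variable is not set in the current process.

```python
import os


def conda_environment(environ=os.environ):
    """Return the Conda-related variables visible to this process."""
    names = ("CONDA_DEFAULT_ENV", "CONDA_PREFIX")
    return {name: environ.get(name) for name in names}


# Print the values seen by the current session.
print(conda_environment())
```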

However, Cloudera Data Science Workbench does not automatically configure Conda to pin the actual Python version. Therefore, if you use Conda to install a package, you must specify the version of Python. For example, to use Conda to install the feather-format package into the python3.6 environment, run the following command in the Workbench command prompt:
!conda install -y -c conda-forge python=3.6.1 feather-format
To install a package into the python2.7 environment, run:
!conda install -y -c conda-forge python=2.7.11 feather-format

Note that on sys.path, pip packages take precedence over Conda packages.
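
If a package is installed with both tools, one way to see which copy a session would actually import is to inspect the search order and the resolved file. This is an illustrative sketch using only the Python standard library; the helper name module_origin is hypothetical, and json is used only because it is always available.

```python
import importlib.util
import sys


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Entries earlier in sys.path are searched first.
for position, entry in enumerate(sys.path):
    print(position, entry)

# Show where a given module resolves; replace "json" with the package to check.
print(module_origin("json"))
```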

Creating an Extensible Engine With Conda

Cloudera Data Science Workbench also allows you to extend its base engine image to include packages of your choice such as Conda. To create an extended engine:
  1. Add the following lines to a Dockerfile to extend the base engine, push the engine image to your Docker registry, and whitelist the new engine for your project. For more details on this step, see Extensible Engines.
    Python 2
    RUN mkdir -p /opt/conda/envs/python2.7
    RUN conda install -y nbconvert python=2.7.11 -n python2.7
    Python 3
    RUN mkdir -p /opt/conda/envs/python3.6
    RUN conda install -y nbconvert python=3.6.1 -n python3.6
  2. Set the PYTHONPATH environment variable as shown below. You can set this either globally in the site administrator dashboard, or for a specific project by going to the project's Settings > Engine page.
    Python 2
    PYTHONPATH=$PYTHONPATH:/opt/conda/envs/python2.7/lib/python2.7/site-packages
    Python 3
    PYTHONPATH=$PYTHONPATH:/opt/conda/envs/python3.6/lib/python3.6/site-packages
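
Once a session starts with the extended engine, you may want to confirm that the PYTHONPATH setting took effect. The sketch below checks whether a directory appears on sys.path; the helper name is_on_sys_path is hypothetical, and the path shown is the Python 3 value from step 2.

```python
import os
import sys


def is_on_sys_path(directory, paths=None):
    """Return True if directory is one of the interpreter's search paths."""
    if paths is None:
        paths = sys.path
    target = os.path.normpath(directory)
    # Skip empty entries, which stand for the current directory.
    return any(os.path.normpath(p) == target for p in paths if p)


site_packages = "/opt/conda/envs/python3.6/lib/python3.6/site-packages"
print(is_on_sys_path(site_packages))
```

A result of False suggests the variable was not set for the project or the session was started before the setting was saved.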