Install the NVIDIA Driver on GPU Hosts

Cloudera Data Science Workbench does not ship with any of the NVIDIA drivers needed to enable GPUs for general purpose processing. System administrators are expected to install the driver version that is compatible with the CUDA libraries that will be consumed on each host.

Perform this step on all hosts with GPU hardware installed on them.
  1. Stop the CDSW service. Log in to Cloudera Manager, navigate to the CDSW service, and select Actions > Stop.
    The CUDA program actively references the service, so if the service is not stopped, the following error might occur during driver installation: ERROR: An NVIDIA kernel module 'nvidia-drm' appears to already be loaded in your kernel.
  2. Use the NVIDIA UNIX Driver archive to find out which driver is compatible with your GPU card and operating system.
    To download and install the NVIDIA driver, follow the instructions on the download page for that driver. It is crucial that you download the correct version.
    For example, if you use the .run file method (Linux 64-bit), you would download and install the driver as follows:
    wget http://us.download.nvidia.com/.../NVIDIA-Linux-x86_64-<driver_version>.run
    export NVIDIA_DRIVER_VERSION=<driver_version>
    chmod 755 ./NVIDIA-Linux-x86_64-$NVIDIA_DRIVER_VERSION.run
    ./NVIDIA-Linux-x86_64-$NVIDIA_DRIVER_VERSION.run -asq
  3. Once the installation is complete, run the following command to verify that the driver was installed correctly:
    /usr/bin/nvidia-smi
  4. Cloudera recommends installing the NVIDIA Container Toolkit to better leverage GPUs in your system.
    Follow the instructions found on NVIDIA's website. Even without this toolkit installed, most GPU-based workloads will run as expected. However, some GPU functionality, such as running nvidia-smi within a GPU-enabled workload, requires the toolkit to be installed.
  5. Start CDSW. Log in to Cloudera Manager, navigate to the CDSW service, and select Actions > Start.
    Although CDSW starts running at this point, it can take additional time (for example, 20 minutes) for all CDSW processes to start running.
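
The error described in step 1 appears when an NVIDIA kernel module is still loaded at install time. As a minimal sketch, and not part of the Cloudera procedure, the following shell functions check for loaded NVIDIA modules before you run the installer. The function names loaded_nvidia_modules and safe_to_install are hypothetical. The sketch assumes a Linux host where /proc/modules lists loaded modules with underscores in their names (for example, nvidia_drm rather than nvidia-drm).

```shell
#!/bin/sh
# Hypothetical helpers, not part of CDSW or the NVIDIA installer.

# Print the name of every loaded kernel module that starts with "nvidia".
# Reads a /proc/modules-style file; defaults to /proc/modules itself.
loaded_nvidia_modules() {
    modules_file="${1:-/proc/modules}"
    # The first field of each line is the module name.
    awk '$1 ~ /^nvidia/ { print $1 }' "$modules_file"
}

# Succeed (exit status 0) only when no NVIDIA module is loaded.
safe_to_install() {
    [ -z "$(loaded_nvidia_modules "$1")" ]
}
```

After sourcing these functions on a GPU host, running safe_to_install || loaded_nvidia_modules prints any modules that are still loaded. If the list is not empty, confirm that the CDSW service is fully stopped before starting the driver installation.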