Using NVIDIA GPUs for Cloudera Data Science Workbench Projects
Minimum Required Roles: Cloudera Manager Cluster Administrator, CDSW Site Administrator
A GPU is a specialized processor that can be used to accelerate highly parallelized, computationally intensive workloads. Because of their computational power, GPUs are particularly well suited to deep learning workloads. Ideally, CPUs and GPUs should be used in tandem for data engineering and data science workloads. A typical machine learning workflow involves data preparation, model training, model scoring, and model fitting. You can use existing general-purpose CPUs for each stage of the workflow, and optionally accelerate the math-intensive steps with the selective application of special-purpose GPUs. For example, GPUs allow you to accelerate model fitting using frameworks such as TensorFlow, PyTorch, Keras, MXNet, and Microsoft Cognitive Toolkit (CNTK).
By enabling GPU support, data scientists can share GPU resources available on Cloudera Data Science Workbench hosts. Users can request a specific number of GPU instances, up to the total number available on a host, which are then allocated to the running session or job for the duration of the run. Projects can use isolated versions of libraries, and even different CUDA and cuDNN versions, via Cloudera Data Science Workbench's extensible engine feature.
Prerequisite
This topic assumes you have already installed or upgraded to the latest version of Cloudera Data Science Workbench.
Key Points to Note
- Cloudera Data Science Workbench only supports CUDA-enabled NVIDIA GPU cards.
- Cloudera Data Science Workbench does not support heterogeneous GPU hardware in a single deployment.
- Cloudera Data Science Workbench does not include an engine image that supports NVIDIA libraries. Create your own custom CUDA-capable engine image using the instructions described in this topic.
- Cloudera Data Science Workbench does not install or configure the NVIDIA drivers on the Cloudera Data Science Workbench gateway hosts. These depend on your GPU hardware and will have to be installed by your system administrator. The steps provided in this topic are generic guidelines that will help you evaluate your setup.
- The instructions described in this topic require Internet access. If you have an airgapped deployment, you will be required to manually download and load the resources onto your hosts.
- For a list of known issues associated with this feature, refer to Known Issues - GPU Support.
Enabling Cloudera Data Science Workbench to use GPUs
To enable GPU usage on Cloudera Data Science Workbench, perform the following steps to provision the Cloudera Data Science Workbench hosts. As noted in the following instructions, certain steps must be repeated on all gateway hosts that have GPU hardware installed on them.
CDSW | OS & Kernel | NVIDIA Driver | CUDA |
---|---|---|---|
1.7.x (engine 10) | RHEL 7.4 3.10.0-862.9.1.el7.x86_64 | 418.56 | CUDA 10.1 |
1.7.x (engine 10) | RHEL 7.6 3.10.0-957.12.2.el7.x86_64 | 418.56 | CUDA 10.1 |
For more compatibility information across NVIDIA drivers and CUDA, refer to the NVIDIA documentation: CUDA Compatibility.
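As a rough illustration, the combinations listed in the table can be encoded as data and compared against the values reported by a host. This is a minimal sketch: `TESTED` and `is_tested_combination` are hypothetical names that are not part of Cloudera Data Science Workbench, and a combination that is missing from the table is not necessarily unsupported.

```python
# Combinations copied from the table above (CDSW 1.7.x, engine 10).
TESTED = [
    {"os": "RHEL 7.4", "kernel": "3.10.0-862.9.1.el7.x86_64",
     "driver": "418.56", "cuda": "10.1"},
    {"os": "RHEL 7.6", "kernel": "3.10.0-957.12.2.el7.x86_64",
     "driver": "418.56", "cuda": "10.1"},
]


def is_tested_combination(kernel, driver, cuda):
    """Return True if (kernel, driver, cuda) matches a row of the table."""
    return any(
        row["kernel"] == kernel
        and row["driver"] == driver
        and row["cuda"] == cuda
        for row in TESTED
    )


# Hypothetical check for a RHEL 7.6 host running driver 418.56 and CUDA 10.1.
print(is_tested_combination("3.10.0-957.12.2.el7.x86_64", "418.56", "10.1"))
```

On a real host, the kernel string could be taken from `platform.release()`, which returns the same value as `uname -r`.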
Set Up the Operating System and Kernel
Perform this step on all hosts with GPU hardware installed on them.
- Install the kernel-devel package.
sudo yum install -y kernel-devel-`uname -r`
If the previous command fails to find a matching version of the kernel-devel package, list all the kernel/kernel-devel versions that are available from the RHEL/CentOS package repositories, and pick the desired version to install.
You can use a bash script as demonstrated here to do this:
if ! yum install kernel-devel-`uname -r`; then
  # No kernel-devel package matches the running kernel: install the latest pair.
  yum install -y kernel kernel-devel
  retValue=$?
  if [ $retValue -eq 0 ]; then
    echo "Reboot is required since new version of kernel was installed"
  fi
fi
- If you upgraded to a new kernel version in the previous step, run the following command to reboot.
sudo reboot
- Install the Development tools package.
sudo yum groupinstall -y "Development tools"
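The package name used in the first step is simply the running kernel release appended to `kernel-devel-`. The following minimal sketch shows that derivation and the reboot condition; `kernel_devel_package` and `needs_reboot` are hypothetical helper names introduced only for illustration.

```python
import platform


def kernel_devel_package(release=None):
    """Build the kernel-devel package name for a kernel release string.

    With no argument, the running kernel's release is used, which is the
    same value that `uname -r` prints.
    """
    release = release or platform.release()
    return "kernel-devel-" + release


def needs_reboot(running_release, installed_release):
    """A reboot is needed when the installed kernel differs from the running one."""
    return running_release != installed_release


print(kernel_devel_package("3.10.0-957.12.2.el7.x86_64"))
```

If the fallback path installs a newer kernel, the running and installed releases differ until the host is rebooted, which is why the reboot step follows.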
Install the NVIDIA Driver on GPU Hosts
Perform this step on all hosts with GPU hardware installed on them.
Cloudera Data Science Workbench does not ship with any of the NVIDIA drivers needed to enable GPUs for general purpose processing. System administrators are expected to install the driver version that is compatible with the CUDA libraries that will be consumed on each host.
Use the NVIDIA UNIX Driver archive to find out which driver is compatible with your GPU card and operating system. To download and install the NVIDIA driver, make sure you follow the instructions on the respective driver's download page. It is crucial that you download the correct version.
wget http://us.download.nvidia.com/.../NVIDIA-Linux-x86_64-<driver_version>.run
export NVIDIA_DRIVER_VERSION=<driver_version>
chmod 755 ./NVIDIA-Linux-x86_64-$NVIDIA_DRIVER_VERSION.run
./NVIDIA-Linux-x86_64-$NVIDIA_DRIVER_VERSION.run -asq
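Once the installation is complete, run the following command to verify that the driver was installed correctly:
/usr/bin/nvidia-smi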
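Because a deployment must not mix GPU models, it can help to check what the driver reports on each host. The sketch below parses text in the format produced by `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader` and tests whether all GPUs share one model name. The helper names and the sample text are illustrative assumptions; the sample was not captured from a real host.

```python
import csv
import io


def parse_gpu_query(text):
    """Parse CSV text with one `name, driver_version` row per GPU."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    # Blank lines produce empty rows, which are skipped here.
    return [(row[0].strip(), row[1].strip()) for row in reader if len(row) >= 2]


def is_homogeneous(gpus):
    """True when at least one GPU is present and all share one model name."""
    return len({name for name, _driver in gpus}) == 1


# Illustrative sample only.
sample = "Tesla V100-SXM2-16GB, 418.56\nTesla V100-SXM2-16GB, 418.56\n"
gpus = parse_gpu_query(sample)
print(gpus, is_homogeneous(gpus))
```

On a GPU host, the text could be obtained with `subprocess.run([...], capture_output=True, text=True).stdout` and the results from all hosts compared.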
Enable GPU Support in Cloudera Data Science Workbench
Minimum Required Cloudera Manager Role: Cluster Administrator
Depending on your deployment, use one of the following sets of steps to enable Cloudera Data Science Workbench to identify the GPUs installed:
CSD Deployments
- Ensure that the Docker daemon and worker node roles are installed on the GPU node.
You might need to restart CDSW after you install the Docker daemon and worker node roles and before enabling GPU support.
- Go to the CDSW service in Cloudera Manager. Click Configuration. Search for the following property and enable it:
Enable GPU Support
Use the checkbox to enable GPU support for Cloudera Data Science Workbench workloads. When this property is enabled on a host that is equipped with GPU hardware, the GPU(s) will be available for use by Cloudera Data Science Workbench.
- Restart the CDSW service in Cloudera Manager.
- Test whether Cloudera Data Science Workbench is detecting GPUs.
RPM Deployments
- Set the following parameter in /etc/cdsw/config/cdsw.conf on all Cloudera Data Science Workbench hosts. You must make sure that cdsw.conf is consistent across all hosts, irrespective of whether they have GPU hardware installed on them.
NVIDIA_GPU_ENABLE
Set this property to true to enable GPU support for Cloudera Data Science Workbench workloads. When this property is enabled on a host that is equipped with GPU hardware, the GPU(s) will be available for use by Cloudera Data Science Workbench.
- On the master host, run the following command to restart Cloudera Data Science Workbench.
cdsw restart
If you modified cdsw.conf on a worker host, run the following commands to make sure the changes go into effect:
cdsw stop
cdsw join
- Use the following section to test whether Cloudera Data Science Workbench can now detect GPUs.
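Before restarting, it can be useful to confirm that the property is set the same way on every host. The following sketch reads the value from the text of a configuration file, assuming cdsw.conf uses shell-style `KEY=value` lines with optional quotes. `read_conf_value` and `gpu_enabled` are hypothetical helpers, and the `DOMAIN` value in the sample is invented for illustration.

```python
def read_conf_value(text, key):
    """Return the value assigned to `key` in shell-style KEY=value text, or None."""
    value = None
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, raw = line.partition("=")
        if name.strip() == key:
            # Later assignments override earlier ones, as in a shell.
            value = raw.strip().strip('"').strip("'")
    return value


def gpu_enabled(text):
    """True when NVIDIA_GPU_ENABLE is set to true in the given text."""
    value = read_conf_value(text, "NVIDIA_GPU_ENABLE")
    return value is not None and value.lower() == "true"


# Invented sample content, not a complete cdsw.conf.
sample = 'DOMAIN="cdsw.example.com"\n# GPU support\nNVIDIA_GPU_ENABLE=true\n'
print(gpu_enabled(sample))
```

To check a real host, pass the contents of /etc/cdsw/config/cdsw.conf to `gpu_enabled` and compare the result across all hosts.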
Test whether Cloudera Data Science Workbench can detect GPUs
Once Cloudera Data Science Workbench has successfully restarted, and provided the NVIDIA drivers have been installed on its hosts, it should be able to detect the GPUs available on those hosts. Run the following command to check the status of the deployment:
cdsw status
Create a Custom CUDA-capable Engine Image
The base engine image (docker.repository.cloudera.com/cdsw/engine:<version>) that ships with Cloudera Data Science Workbench will need to be extended with CUDA libraries to make it possible to use GPUs in jobs and sessions.
The following sample Dockerfile illustrates an engine on top of which machine learning frameworks such as TensorFlow and PyTorch can be used. This Dockerfile uses a deep learning library from NVIDIA called NVIDIA CUDA Deep Neural Network (cuDNN). For detailed information about compatibility between NVIDIA driver versions and CUDA, refer to the cuDNN installation guide (prerequisites).
When creating the Dockerfile for the custom image, you must delete the Cloudera repository that is inaccessible because of the paywall by running the following:
RUN rm /etc/apt/sources.list.d/*
Make sure you also check the documentation of the machine learning framework that you intend to use to find out which version of cuDNN it needs. As an example, TensorFlow's NVIDIA hardware and software requirements for GPU support are listed in the TensorFlow documentation here. Additionally, the TensorFlow version compatibility matrix for CUDA and cuDNN is documented here.
The following sample Dockerfile uses NVIDIA's official Dockerfiles for CUDA and cuDNN images.
cuda.Dockerfile
FROM docker.repository.cloudera.com/cdsw/engine:10

# Remove the Cloudera repository that is inaccessible because of the paywall.
RUN rm /etc/apt/sources.list.d/*

# Register the NVIDIA CUDA and machine learning apt repositories.
RUN apt-get update && apt-get install -y --no-install-recommends \
    gnupg2 curl ca-certificates && \
    curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add - && \
    echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list && \
    echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list && \
    apt-get purge --autoremove -y curl && \
    rm -rf /var/lib/apt/lists/*

ENV CUDA_VERSION 10.1.243
LABEL com.nvidia.cuda.version="${CUDA_VERSION}"

# The package suffix must match the CUDA release (10-1 for CUDA 10.1).
ENV CUDA_PKG_VERSION 10-1=$CUDA_VERSION-1
RUN apt-get update && apt-get install -y --no-install-recommends \
    cuda-cudart-$CUDA_PKG_VERSION \
    cuda-libraries-$CUDA_PKG_VERSION && \
    ln -s cuda-10.1 /usr/local/cuda && \
    rm -rf /var/lib/apt/lists/*

# Make the CUDA and NVIDIA driver libraries visible to the dynamic linker.
RUN echo "/usr/local/cuda/lib64" >> /etc/ld.so.conf.d/cuda.conf && \
    ldconfig
RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
    echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf

ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64

RUN echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list

ENV CUDNN_VERSION 7.6.5.32
LABEL com.nvidia.cudnn.version="${CUDNN_VERSION}"

# Install cuDNN and hold the package so later upgrades do not replace it.
RUN apt-get update && apt-get install -y --no-install-recommends \
    libcudnn7=$CUDNN_VERSION-1+cuda10.1 && \
    apt-mark hold libcudnn7 && \
    rm -rf /var/lib/apt/lists/*
Build a custom engine image from cuda.Dockerfile using the following sample command:
docker build --network host -t <company-registry>/cdsw-cuda:10 . -f cuda.Dockerfile
Push this new engine image to a public Docker registry so that it can be made available for Cloudera Data Science Workbench workloads. For example:
docker push <company-registry>/cdsw-cuda:10
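When adapting the sample Dockerfile to another CUDA or cuDNN release, the version pins have to stay consistent with each other. For CUDA 10.1.243, NVIDIA's apt packages carry a `10-1` suffix and are pinned as `10-1=10.1.243-1`. The sketch below derives the pins from the version numbers. The pattern is inferred from the values used in this topic and is an assumption for other releases; both function names are hypothetical.

```python
def cuda_pkg_version(cuda_version):
    """Derive the apt pin suffix from a full CUDA version string."""
    major, minor = cuda_version.split(".")[:2]
    return "{}-{}={}-1".format(major, minor, cuda_version)


def cudnn_pin(cudnn_version, cuda_version):
    """Derive the libcudnn7 pin for a cuDNN 7.x version and a CUDA version."""
    major, minor = cuda_version.split(".")[:2]
    return "libcudnn7={}-1+cuda{}.{}".format(cudnn_version, major, minor)


print("cuda-cudart-" + cuda_pkg_version("10.1.243"))
print(cudnn_pin("7.6.5.32", "10.1.243"))
```

Before building, confirm that the derived package versions actually exist in the NVIDIA repositories for your base image.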
Site Admins: Add the Custom CUDA Engine to your Cloudera Data Science Workbench Deployment
Required CDSW Role: Site Administrator
- Sign in to Cloudera Data Science Workbench.
- Click Admin.
- Go to the Engines tab.
- Under Engine Images, add the custom CUDA-capable engine image created in the previous step. This allows project administrators across the deployment to start using this engine in their jobs and sessions.
- Site administrators can also set a limit on the maximum number of GPUs that can be allocated per session or job. From the Maximum GPUs per Session/Job dropdown, select the maximum number of GPUs that can be used by an engine.
- Click Update.
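Two limits therefore bound a GPU request: the number of GPUs on a host, and the Maximum GPUs per Session/Job value set by the site administrator. The following toy model shows how the two limits combine. It is only an illustration with a hypothetical function name, not how Cloudera Data Science Workbench schedules workloads internally.

```python
def validate_gpu_request(requested, gpus_on_host, max_per_session):
    """Return True if a request fits both the host's GPU count and the admin limit."""
    if requested < 0:
        raise ValueError("requested GPU count cannot be negative")
    # The effective ceiling is whichever limit is lower.
    return requested <= min(gpus_on_host, max_per_session)


print(validate_gpu_request(2, gpus_on_host=4, max_per_session=2))
print(validate_gpu_request(3, gpus_on_host=4, max_per_session=2))
```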
Project Admins: Enable the CUDA Engine for your Project
Project administrators can use the following steps to make the CUDA engine the default engine used for workloads within a particular project.
- Navigate to your project's Overview page.
- Click Settings.
- Go to the Engines tab.
- Under Engine Image, select the CUDA-capable engine image from the dropdown.
Test the CUDA Engine
You can use the following simple examples to test whether the new CUDA engine is able to leverage GPUs as expected.
- Go to a project that is using the CUDA engine and click Open Workbench.
- Launch a new session with GPUs.
- Run the following command in the workbench command prompt to verify that the driver was installed correctly:
! /usr/bin/nvidia-smi
- Use any of the following code samples to confirm that the new engine works with common deep learning libraries.
PyTorch
!pip3 install torch
from torch import cuda
assert cuda.is_available()
assert cuda.device_count() > 0
print(cuda.get_device_name(cuda.current_device()))
TensorFlow
!pip3 install tensorflow-gpu==2.1.0
from tensorflow.python.client import device_lib
assert 'GPU' in str(device_lib.list_local_devices())
device_lib.list_local_devices()
Keras
!pip3 install keras
from keras import backend
assert len(backend.tensorflow_backend._get_available_gpus()) > 0
print(backend.tensorflow_backend._get_available_gpus())