You can configure GPU scheduling and isolation on your cluster. Currently, only NVIDIA GPUs are supported in YARN.
- YARN NodeManager must be installed with the NVIDIA drivers.
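Before you begin, you can confirm on each NodeManager host that the driver and its utilities are available. As a quick check (assuming nvidia-smi is on the host's PATH), the following command lists every GPU the driver can see; if it fails, fix the driver installation first:

nvidia-smi -L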
- In Cloudera Manager, navigate to Hosts > Hosts Configuration.
- Search for cgroup.
- Select the Enable Cgroup-based Resource Management checkbox.
- Click Save Changes.
- Navigate to YARN > Configuration.
- Search for cgroup.
- Find the Use CGroups for Resource Management property and enable it for the applicable clusters.
- Find the Always use Linux Container Executor property and enable it for the applicable clusters.
- Search for gpu.
- Find the Enable GPU Usage property and select the NodeManager Default Group checkbox.
- Find the NodeManager GPU Devices Allowed property and define the GPU devices that are managed by YARN in one of the following ways:
  - Use the default value, auto, for auto-detection of all GPU devices. In this case, all GPU devices are managed by YARN.
  - Manually define the GPU devices that are managed by YARN, as shown in the example after this list.
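If you define the devices manually, the value follows the upstream YARN format of comma-separated index:minor_number pairs. The indices below are purely illustrative; read the real minor numbers from the driver on the NodeManager host:

# Print the minor device number the driver reports for each GPU:
nvidia-smi -q | grep "Minor Number"
# Example property value handing the first two GPUs to YARN (illustrative values):
0:0,1:1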
- Find the NodeManager GPU Detection Executable property and define the location of nvidia-smi (see the example after this list if it is installed in a non-default path). By default, this property has no value, which means that YARN checks the following paths to find nvidia-smi:
  - /usr/bin
  - /bin
  - /usr/local/nvidia/bin
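A minimal way to find a non-default location on a NodeManager host, assuming the binary is on the PATH of the shell you are using:

which nvidia-smi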
- Click Save Changes.
- Click the Stale Configuration: Restart needed button at the top of the page.
- Click Restart Stale Services.
  Note that this step restarts all services with stale configurations.
- Select Re-deploy client configuration and click Restart Now.
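Once the services are back up, you can check that the configuration took effect. The following is only a sketch: the path to the distributed shell JAR depends on your installation, and the exact CLI output varies between versions.

# Confirm that the NodeManagers now advertise yarn.io/gpu among their resources:
yarn node -list -showDetails

# Run a single container that requests one GPU and prints what it sees.
# Replace the JAR path with the distributed shell JAR shipped with your installation.
yarn jar /path/to/hadoop-yarn-applications-distributedshell.jar \
    -jar /path/to/hadoop-yarn-applications-distributedshell.jar \
    -shell_command /usr/bin/nvidia-smi \
    -container_resources memory-mb=2048,vcores=1,yarn.io/gpu=1 \
    -num_containers 1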
If the NodeManager fails to start, the following error is displayed:

INFO gpu.GpuDiscoverer (GpuDiscoverer.java:initialize(240)) - Trying to discover GPU information ...
WARN gpu.GpuDiscoverer (GpuDiscoverer.java:initialize(247)) - Failed to discover GPU information from system, exception message:ExitCodeException exitCode=12: continue...
Fix the error by exporting LD_LIBRARY_PATH in yarn-env.sh using the following command:

export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$LD_LIBRARY_PATH
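Before restarting the NodeManager, you can sanity-check the fix on the affected host by running nvidia-smi with the same library path; the directories are the ones from the export above and may differ on your systems:

LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$LD_LIBRARY_PATH nvidia-smi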