Configure GPU scheduling and isolation

You can configure GPU scheduling and isolation on your cluster. Currently, YARN supports only NVIDIA GPUs.

  • Prerequisite: the NVIDIA drivers must be installed on every host that runs a YARN NodeManager.
  1. In Cloudera Manager, navigate to Hosts > Hosts Configuration.
  2. Search for cgroup.
  3. Select the Enable Cgroup-based Resource Management checkbox.
  4. Click Save Changes.
  5. Navigate to YARN > Configuration.
  6. Search for cgroup.
  7. Find the Use CGroups for Resource Management property and enable it for the applicable clusters.
  8. Find the Always use Linux Container Executor property and enable it for the applicable clusters.
  9. Search for gpu.
  10. Find the Enable GPU Usage property and select the NodeManager Default Group checkbox.
  11. Find the NodeManager GPU Devices Allowed property and define the GPU devices that are managed by YARN in one of the following ways:
    • Use the default value, auto, for auto detection of all GPU devices. In this case all GPU devices are managed by YARN.
    • Manually define the GPU devices that are managed by YARN.
  12. Find the NodeManager GPU Detection Executable property and define the location of nvidia-smi. By default, this property has no value, in which case YARN checks the following paths to find nvidia-smi:
    • /usr/bin
    • /bin
    • /usr/local/nvidia/bin
  13. Click Save Changes.
  14. Click the Stale Configuration: Restart needed button at the top of the page.
  15. Click Restart Stale Services.
    Note that this step restarts all services with stale configurations.
  16. Select Re-deploy client configuration and click Restart Now.
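
Before leaving the NodeManager GPU Detection Executable property empty in step 12, it can help to confirm on a NodeManager host that nvidia-smi really is in one of the default paths. The following is a minimal sketch, not part of Cloudera Manager or YARN: the function name find_nvidia_smi is hypothetical, and it only mirrors the three directories listed in step 12.

```shell
#!/bin/sh
# Hypothetical helper (not a YARN or Cloudera tool): print the first
# executable nvidia-smi found in the given directories. With no arguments,
# it checks the default search paths listed in step 12.
find_nvidia_smi() {
    if [ "$#" -eq 0 ]; then
        set -- /usr/bin /bin /usr/local/nvidia/bin
    fi
    for dir in "$@"; do
        if [ -x "$dir/nvidia-smi" ]; then
            printf '%s\n' "$dir/nvidia-smi"
            return 0
        fi
    done
    # Nothing found: signal failure so the caller can react.
    return 1
}

# Example call on a NodeManager host:
#   find_nvidia_smi || echo "nvidia-smi not found in the default paths"
```

If the function reports nothing, set the property explicitly to wherever nvidia-smi is installed on that host. Running the binary it finds with the -L option lists the GPUs the driver can see, which is useful when defining the allowed devices manually in step 11.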

If the NodeManager fails to start, the following error is displayed:

INFO gpu.GpuDiscoverer (GpuDiscoverer.java:initialize(240)) - Trying to discover GPU information ...
WARN gpu.GpuDiscoverer (GpuDiscoverer.java:initialize(247)) - Failed to discover GPU information from system, exception message:ExitCodeException exitCode=12: continue...

To fix the error, export LD_LIBRARY_PATH in yarn-env.sh using the following command:

export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$LD_LIBRARY_PATH
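
If yarn-env.sh can be sourced more than once, or if LD_LIBRARY_PATH may be empty, a slightly more defensive variant of the export above could be used instead. This is only a sketch under those assumptions: the helper name prepend_ld_library_path is hypothetical, and the two directories are the ones from the export command.

```shell
#!/bin/sh
# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH unless it is
# already listed, and avoid leaving a trailing colon when the variable is empty.
prepend_ld_library_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*)
            ;;  # already present, nothing to do
        *)
            LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
            ;;
    esac
    export LD_LIBRARY_PATH
}

# Prepend lib64 first so that lib ends up in front, matching the order
# used in the export command.
prepend_ld_library_path /usr/local/nvidia/lib64
prepend_ld_library_path /usr/local/nvidia/lib
```

The resulting value starts with /usr/local/nvidia/lib:/usr/local/nvidia/lib64, followed by any previous entries. Unlike the plain export, it does not end with an empty entry when LD_LIBRARY_PATH was previously unset; the dynamic linker treats an empty entry as the current directory.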