(Optional) Installing UCX native libraries

The NVIDIA RAPIDS Shuffle Manager is a custom ShuffleManager for Apache Spark that enables fast shuffle block transfers between GPUs on the same host (over PCIe or NVLink) and to remote hosts over the network (over RoCE or InfiniBand). The RAPIDS Shuffle Manager is based on Unified Communication X (UCX).

The NVIDIA RAPIDS Shuffle Manager has been shown to accelerate workloads in which shuffle is the bottleneck when using the RAPIDS Accelerator for Apache Spark. It achieves this through three mechanisms: a GPU shuffle cache that provides fast shuffle writes when shuffle blocks fit in GPU memory, avoiding the cost of the writes to host that the built-in Spark shuffle performs; a spill framework that spills to host memory and disk on demand; and UCX as its transport for fast network and peer-to-peer (GPU-to-GPU) transfers.

Cloudera and NVIDIA recommend using the RAPIDS Shuffle Manager for clusters with InfiniBand or RoCE networking.

  1. Download UCX v1.10.1 for your operating system.
  2. Install the downloaded packages using your operating system package manager.
    For CentOS 7, if you do not have InfiniBand or RoCE networking, you only need to install the following packages:

    ucx-1.10.1-1.el7.x86_64.rpm
    ucx-cuda-1.10.1-1.el7.x86_64.rpm

    If you have InfiniBand or RoCE networking, install the following packages:

    ucx-1.10.1-1.el7.x86_64.rpm
    ucx-cuda-1.10.1-1.el7.x86_64.rpm
    ucx-rdmacm-1.10.1-1.el7.x86_64.rpm
    ucx-ib-1.10.1-1.el7.x86_64.rpm
  3. Validate your UCX environment following the instructions provided in the NVIDIA spark-rapids documentation.
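The package selection in step 2 can be sketched as follows. This is a minimal illustration, assuming CentOS 7 with the RPM files already downloaded into the current directory; the helper names `ucx_rpms` and `install_command`, and the choice of `yum install` with local file paths, are hypothetical and not part of any Cloudera or NVIDIA tooling. The script only prints the command; it does not run it.

```python
import shlex
from typing import List

# Version, release, and architecture suffix of the UCX v1.10.1 RPMs for CentOS 7 (el7).
UCX_SUFFIX = "1.10.1-1.el7.x86_64"

# Packages required on every host.
BASE_PACKAGES = ["ucx", "ucx-cuda"]

# Extra packages required only with InfiniBand or RoCE networking.
RDMA_PACKAGES = ["ucx-rdmacm", "ucx-ib"]


def ucx_rpms(has_ib_or_roce: bool) -> List[str]:
    """Return the RPM file names to install (hypothetical helper)."""
    names = BASE_PACKAGES + (RDMA_PACKAGES if has_ib_or_roce else [])
    return [f"{name}-{UCX_SUFFIX}.rpm" for name in names]


def install_command(has_ib_or_roce: bool) -> str:
    """Build (but do not run) a yum command line for the local RPM files.

    The './' prefix makes yum treat each argument as a local file path
    rather than a repository package name.
    """
    rpms = ["./" + rpm for rpm in ucx_rpms(has_ib_or_roce)]
    return shlex.join(["sudo", "yum", "install", "-y"] + rpms)


if __name__ == "__main__":
    # Set to False on hosts without InfiniBand or RoCE networking.
    print(install_command(has_ib_or_roce=True))
```

Keeping the base and RDMA package lists separate mirrors the two cases in step 2: hosts without InfiniBand or RoCE install only `ucx` and `ucx-cuda`, while RDMA-capable hosts add `ucx-rdmacm` and `ucx-ib`.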