Update shuffle documentation for branch-21.06 and UCX 1.10.1 (#2475)
* Update rapids-shuffle.md for UCX 1.10.1

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>

* Add message around the JUCX 1.11.0 compatibility warning when used with UCX 1.10

* Update minimum requirement. JUCX 1.11.0 requires UCX 1.10+

* Small tweaks

* Remove bullet points

* libnuma1 pulled automatically from apt install
abellina authored May 26, 2021
1 parent bb332c9 commit 3b718f8
120 changes: 69 additions & 51 deletions docs/additional-functionality/rapids-shuffle.md
@@ -38,8 +38,11 @@ in these scenarios:
### System Setup

In order to enable the RAPIDS Shuffle Manager, UCX user-space libraries and its dependencies must
be installed on the host and inside Docker containers (when not running on bare metal). A host has
additional requirements, like the MLNX_OFED driver and `nv_peer_mem` kernel module.

The minimum UCX requirement for the RAPIDS Shuffle Manager is
[UCX 1.10.1](https://github.com/openucx/ucx/releases/tag/v1.10.1).
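
If you are not sure which UCX version is installed on a machine, and `ucx_info` is already on the
`PATH`, a quick way to check is:

```shell
# Prints the UCX library version and build configuration
ucx_info -v
```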

#### Baremetal

@@ -58,24 +61,21 @@ MLNX_OFED driver and `nv_peer_mem` kernel module.
[file a GitHub issue](https://github.com/NVIDIA/spark-rapids/issues) so we can investigate
further.

2. Fetch and install the UCX package for your OS and CUDA version:
[UCX 1.10.1](https://github.com/openucx/ucx/releases/tag/v1.10.1).

RDMA packages have extra requirements that should be satisfied by MLNX_OFED.
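
For example, on an Ubuntu 18.04 host with CUDA 11.0, fetching and installing the published `.deb`
might look like the sketch below (this uses the same package shown in the Docker examples later in
this document; adjust the file name for your OS and CUDA version):

```shell
# Download the UCX 1.10.1 package for Ubuntu 18.04 / MOFED 5.x / CUDA 11.0 and install it
wget https://github.com/openucx/ucx/releases/download/v1.10.1/ucx-v1.10.1-ubuntu18.04-mofed5.x-cuda11.0.deb
sudo apt install -y ./ucx-v1.10.1-ubuntu18.04-mofed5.x-cuda11.0.deb
```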

---
**NOTE:**

Please note that the RAPIDS Shuffle Manager is built against
[JUCX 1.11.0](https://search.maven.org/artifact/org.openucx/jucx/1.11.0/jar). This is the JNI
component of UCX and was published ahead of the native library (UCX 1.11.0). Please disregard the
startup [compatibility warning](https://github.com/openucx/ucx/issues/6694),
as the JUCX usage within the RAPIDS Shuffle Manager is compatible with UCX 1.10.x.

---

#### Docker containers

@@ -94,47 +94,65 @@ essentially turns off all isolation. We are also assuming `--network=host` is sp
the container to share the host's network. We will revise this document to include any new
configurations as we are able to test different scenarios.
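
A minimal launch of such a container might look like the sketch below, where `<ucx-image>` is a
placeholder for an image built as shown later in this section (a fuller example is given in the
Validating UCX Environment section):

```shell
# Privileged container sharing the host's network, dropping into a shell
nvidia-docker run --privileged --network=host -it <ucx-image> /bin/bash
```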

NOTE: A system administrator should have performed Step 1 in [Baremetal](#baremetal) on the host
system if you have RDMA-capable hardware.


<a name="ucx-minimal-dockerfile"></a>

Within the Docker container we need to install UCX and its requirements. The following are
Dockerfile examples for Ubuntu 18.04:

##### Without RDMA:
The following is an example of a Docker container with UCX 1.10.1 and cuda-11.0 support, built
for a setup without RDMA-capable hardware:

```
ARG CUDA_VER=11.0
# Main container: CUDA base image with the UCX 1.10.1 package and its dependencies installed
FROM nvidia/cuda:${CUDA_VER}-devel-ubuntu18.04
RUN apt update
RUN apt-get install -y wget
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/v1.10.1/ucx-v1.10.1-ubuntu18.04-mofed5.x-cuda11.0.deb
RUN apt install -y /tmp/*.deb && rm -rf /tmp/*.deb
```
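
As a sanity check, you could build this image and list the transports UCX detects inside it; the
tag `ucx-cuda11` below is just an illustrative name:

```shell
# Build the image from the directory containing the Dockerfile above
docker build -t ucx-cuda11 .
# `ucx_info -d` lists the devices and transports UCX can use in this environment
nvidia-docker run --rm ucx-cuda11 ucx_info -d | grep -i transport
```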

##### With RDMA:
The following is an example of a Docker container that shows how to install `rdma-core` and
UCX 1.10.1 with `cuda-11.0` support. You can use this as a base layer for containers that your
executors will use.

```
ARG CUDA_VER=11.0
# Throw away image to build rdma_core
FROM ubuntu:18.04 as rdma_core
RUN apt update
RUN apt-get install -y dh-make git build-essential cmake gcc libudev-dev libnl-3-dev libnl-route-3-dev ninja-build pkg-config valgrind python3-dev cython3 python3-docutils pandoc
RUN git clone --depth 1 --branch v33.0 https://github.com/linux-rdma/rdma-core
RUN cd rdma-core && debian/rules binary
# Now start the main container
FROM nvidia/cuda:${CUDA_VER}-devel-ubuntu18.04
COPY --from=rdma_core /*.deb /tmp/
RUN apt update
RUN apt-get install -y cuda-compat-11-0 wget udev dh-make libudev-dev libnl-3-dev libnl-route-3-dev python3-dev cython3
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/v1.10.1/ucx-v1.10.1-ubuntu18.04-mofed5.x-cuda11.0.deb
RUN apt install -y /tmp/*.deb && rm -rf /tmp/*.deb
```
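
The build step is the same `docker build` invocation as in the non-RDMA case (the tag below is
illustrative). Once a container from this image runs with access to `/dev/infiniband` (see the
next section), `ucx_info -d` should list RDMA transports alongside the TCP and CUDA ones:

```shell
# Build the RDMA-enabled image from the directory containing the Dockerfile above
docker build -t ucx-cuda11-rdma .
```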


### Validating UCX Environment

After installing UCX you can utilize `ucx_info` and `ucx_perftest` to validate the installation.

In this section, we are using a docker container built using the sample dockerfile above.
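
As a preview of the kind of checks the steps below walk through, a minimal loopback bandwidth test
inside a single container could look like the sketch below; `-t tag_bw` selects the tag-matching
bandwidth test and `-m cuda` uses GPU memory (assuming a CUDA-enabled UCX build):

```shell
# Terminal 1: start the ucx_perftest server and wait for a peer
ucx_perftest -t tag_bw -m cuda

# Terminal 2: connect over loopback and run 1000 iterations of the bandwidth test
ucx_perftest localhost -t tag_bw -m cuda -n 1000
```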

1. Start the docker container with `--privileged` mode. In this example, we are also adding
`--device /dev/infiniband` to make Mellanox devices available for our test, but this is only
required if you are using RDMA:
```
nvidia-docker run \
--network=host \
@@ -260,7 +278,7 @@ In this section, we are using a docker container built using the sample dockerfile above.
| 3.1.2 | com.nvidia.spark.rapids.spark312.RapidsShuffleManager |
| 3.2.0 | com.nvidia.spark.rapids.spark320.RapidsShuffleManager |
2. Recommended settings for UCX 1.10.1+
```shell
...
--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark301.RapidsShuffleManager \
@@ -335,4 +353,4 @@ for this, other than to trigger a GC cycle on the driver.

Spark has a configuration, `spark.cleaner.periodicGC.interval` (defaults to 30 minutes), that
can be used to periodically trigger garbage collection. If you are experiencing OOM situations, or
performance degradation with several Spark actions, consider tuning this setting in your jobs.
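
For example, lowering the interval for a long-running job could look like this (the 5 minute value
is only illustrative; tune it for your workload):

```shell
...
--conf spark.cleaner.periodicGC.interval=5min \
...
```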
