Update to UCX 1.11.0 #3067

Merged (3 commits), Jul 28, 2021
58 changes: 27 additions & 31 deletions docs/additional-functionality/rapids-shuffle.md
@@ -36,7 +36,7 @@ be installed on the host and inside Docker containers (if not baremetal). A host
requirements, like the MLNX_OFED driver and `nv_peer_mem` kernel module.

The minimum UCX requirement for the RAPIDS Shuffle Manager is
[UCX 1.10.1](https://github.com/openucx/ucx/releases/tag/v1.10.1)
[UCX 1.11.0](https://github.com/openucx/ucx/releases/tag/v1.11.0).
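
A minimal sketch of checking a UCX version string against this minimum. The helper name `ucx_version_ge` is hypothetical, and the sketch assumes a `sort` that supports version ordering (`-V`, as in GNU coreutils); obtaining the installed version string is left to the caller.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 is >= minimum version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
ucx_version_ge() {
  # `sort -C` only checks ordering: it exits 0 when the pair
  # "minimum, installed" is already in ascending version order.
  printf '%s\n%s\n' "$2" "$1" | sort -V -C
}

if ucx_version_ge "1.11.0" "1.11.0"; then
  echo "UCX version satisfies the minimum"
fi
```

The installed version can be read from `ucx_info -v`; extract the version string from its output before passing it to the helper.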

#### Baremetal

@@ -65,52 +65,48 @@ The minimum UCX requirement for the RAPIDS Shuffle Manager is
[file a GitHub issue](https://github.com/NVIDIA/spark-rapids/issues) so we can investigate
further.

2. Fetch and install the UCX package for your OS and CUDA version
[UCX 1.10.1](https://github.com/openucx/ucx/releases/tag/v1.10.1).

RDMA packages have extra requirements that should be satisfied by MLNX_OFED.

---
**NOTE:**
2. Fetch and install the UCX package for your OS from:
[UCX 1.11.0](https://github.com/openucx/ucx/releases/tag/v1.11.0).

NOTE: Please install the artifact built for the newest CUDA 11.x version (for UCX 1.11.0, pick
CUDA 11.2), since CUDA 11 introduced [CUDA Enhanced Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#enhanced-compat-minor-releases).
Starting with UCX 1.12, UCX will stop publishing individual artifacts for each minor version of CUDA.

Please note that the RAPIDS Shuffle Manager is built against
[JUCX 1.11.0](https://search.maven.org/artifact/org.openucx/jucx/1.11.0/jar). This is the JNI
component of UCX and was published ahead of the native library (UCX 1.11.0). Please disregard the
startup [compatibility warning](https://github.com/openucx/ucx/issues/6694),
as the JUCX usage within the RAPIDS Shuffle Manager is compatible with UCX 1.10.x.
Please refer to our [FAQ](../FAQ.md#what-hardware-is-supported) for caveats with
CUDA Enhanced Compatibility.

---
RDMA packages have extra requirements that should be satisfied by MLNX_OFED.

##### CentOS UCX RPM
The UCX packages for CentOS 7 and 8 are divided into different RPMs. For example, UCX 1.10.1
The UCX packages for CentOS 7 and 8 are divided into different RPMs. For example, UCX 1.11.0
available at
https://github.com/openucx/ucx/releases/download/v1.10.1/ucx-v1.10.1-centos7-mofed5.x-cuda11.0.tar.bz2
https://github.com/openucx/ucx/releases/download/v1.11.0/ucx-v1.11.0-centos7-mofed5.x-cuda11.2.tar.bz2
contains:

```
ucx-devel-1.10.1-1.el7.x86_64.rpm
ucx-debuginfo-1.10.1-1.el7.x86_64.rpm
ucx-1.10.1-1.el7.x86_64.rpm
ucx-cuda-1.10.1-1.el7.x86_64.rpm
ucx-rdmacm-1.10.1-1.el7.x86_64.rpm
ucx-cma-1.10.1-1.el7.x86_64.rpm
ucx-ib-1.10.1-1.el7.x86_64.rpm
ucx-devel-1.11.0-1.el7.x86_64.rpm
ucx-debuginfo-1.11.0-1.el7.x86_64.rpm
ucx-1.11.0-1.el7.x86_64.rpm
ucx-cuda-1.11.0-1.el7.x86_64.rpm
ucx-rdmacm-1.11.0-1.el7.x86_64.rpm
ucx-cma-1.11.0-1.el7.x86_64.rpm
ucx-ib-1.11.0-1.el7.x86_64.rpm
```

For a setup without RoCE or Infiniband networking, the only packages required are:

```
ucx-1.10.1-1.el7.x86_64.rpm
ucx-cuda-1.10.1-1.el7.x86_64.rpm
ucx-1.11.0-1.el7.x86_64.rpm
ucx-cuda-1.11.0-1.el7.x86_64.rpm
```

If accelerated networking is available, the package list is:

```
ucx-1.10.1-1.el7.x86_64.rpm
ucx-cuda-1.10.1-1.el7.x86_64.rpm
ucx-rdmacm-1.10.1-1.el7.x86_64.rpm
ucx-ib-1.10.1-1.el7.x86_64.rpm
ucx-1.11.0-1.el7.x86_64.rpm
ucx-cuda-1.11.0-1.el7.x86_64.rpm
ucx-rdmacm-1.11.0-1.el7.x86_64.rpm
ucx-ib-1.11.0-1.el7.x86_64.rpm
```
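
The package selection above can be sketched as a small shell helper. The function names `ucx_tarball_url` and `ucx_rpms` are hypothetical; the URL pattern and RPM names are copied from the CentOS 7 example in this section and may differ for other releases or platforms.

```shell
#!/bin/sh
# Hypothetical helper: print the release tarball URL for CentOS 7.
# $1: UCX version (e.g. 1.11.0), $2: CUDA version of the artifact (e.g. 11.2)
ucx_tarball_url() {
  echo "https://github.com/openucx/ucx/releases/download/v$1/ucx-v$1-centos7-mofed5.x-cuda$2.tar.bz2"
}

# Hypothetical helper: print the RPMs to install, one per line.
# $1: UCX version, $2: pass "rdma" when RoCE or Infiniband networking is available
ucx_rpms() {
  echo "ucx-$1-1.el7.x86_64.rpm"
  echo "ucx-cuda-$1-1.el7.x86_64.rpm"
  if [ "${2:-}" = "rdma" ]; then
    echo "ucx-rdmacm-$1-1.el7.x86_64.rpm"
    echo "ucx-ib-$1-1.el7.x86_64.rpm"
  fi
}

ucx_tarball_url 1.11.0 11.2
ucx_rpms 1.11.0 rdma
```

After downloading and extracting the tarball, the names printed by `ucx_rpms` could be passed to `yum install -y`.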

---
@@ -149,7 +145,7 @@ system if you have RDMA capable hardware.
Within the Docker container we need to install UCX and its requirements. These are Dockerfile
examples for Ubuntu 18.04:

The following are examples of Docker containers with UCX 1.10.1 and cuda-11.0 support.
The following are examples of Docker containers with UCX 1.11.0 and cuda-11.2 support.

| OS Type | RDMA | Dockerfile |
| ------- | ---- | ---------- |
@@ -294,7 +290,7 @@ In this section, we are using a docker container built using the sample dockerfi
| 3.1.2 | com.nvidia.spark.rapids.spark312.RapidsShuffleManager |
| 3.1.3 | com.nvidia.spark.rapids.spark313.RapidsShuffleManager |

2. Settings for UCX 1.10.1+:
2. Settings for UCX 1.11.0+:

Minimum configuration:

@@ -16,14 +16,14 @@
# Sample Dockerfile to install UCX in a CentOS 7 image
#
# The parameters are:
# - CUDA_VER: 11.0.3 to pick up the latest 11.x CUDA base layer
# - UCX_VER and UCX_CUDA_VER: these are used to pick a package matchin a specific UCX version and
# - CUDA_VER: 11.2.2 to pick up the latest 11.2 CUDA base layer
# - UCX_VER and UCX_CUDA_VER: these are used to pick a package matching a specific UCX version and
# CUDA runtime from the UCX github repo.
# See: https://github.com/openucx/ucx/releases/

ARG CUDA_VER=11.0.3
ARG UCX_VER=v1.10.1
ARG UCX_CUDA_VER=11.0
ARG CUDA_VER=11.2.2
ARG UCX_VER=v1.11.0
ARG UCX_CUDA_VER=11.2

FROM nvidia/cuda:${CUDA_VER}-runtime-centos7
ARG UCX_VER
@@ -32,6 +32,6 @@ ARG UCX_CUDA_VER
RUN yum update -y && yum install -y wget bzip2
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/$UCX_VER/ucx-$UCX_VER-centos7-mofed5.x-cuda$UCX_CUDA_VER.tar.bz2
RUN cd /tmp && tar -xvf *.bz2 && \
yum install -y ucx-1.10.1-1.el7.x86_64.rpm && \
yum install -y ucx-cuda-1.10.1-1.el7.x86_64.rpm && \
yum install -y ucx-1.11.0-1.el7.x86_64.rpm && \
yum install -y ucx-cuda-1.11.0-1.el7.x86_64.rpm && \
rm -rf /tmp/*.rpm
@@ -19,18 +19,18 @@
# The parameters are:
# - RDMA_CORE_VERSION: Set to 32.1 to match the rdma-core line in the latest
# released MLNX_OFED 5.x driver
# - CUDA_VER: 11.0.3 to pick up the latest 11.x CUDA base layer
# - UCX_VER and UCX_CUDA_VER: these are used to pick a package matchin a specific UCX version and
# - CUDA_VER: 11.2.2 to pick up the latest 11.2 CUDA base layer
# - UCX_VER and UCX_CUDA_VER: these are used to pick a package matching a specific UCX version and
# CUDA runtime from the UCX github repo.
# See: https://github.com/openucx/ucx/releases/
#
# The Dockerfile first fetches and builds `rdma-core` to satisfy requirements for
# the ucx-ib and ucx-rdma RPMs.

ARG RDMA_CORE_VERSION=32.1
ARG CUDA_VER=11.0.3
ARG UCX_VER=v1.10.1
ARG UCX_CUDA_VER=11.0
ARG CUDA_VER=11.2.2
ARG UCX_VER=v1.11.0
ARG UCX_CUDA_VER=11.2

# Throw away image to build rdma_core
FROM centos:7 as rdma_core
@@ -63,8 +63,8 @@ RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/$UCX_VER/uc
RUN cd /tmp && \
yum install -y *.rpm && \
tar -xvf *.bz2 && \
yum install -y ucx-1.10.1-1.el7.x86_64.rpm && \
yum install -y ucx-cuda-1.10.1-1.el7.x86_64.rpm && \
yum install -y ucx-ib-1.10.1-1.el7.x86_64.rpm && \
yum install -y ucx-rdmacm-1.10.1-1.el7.x86_64.rpm
yum install -y ucx-1.11.0-1.el7.x86_64.rpm && \
yum install -y ucx-cuda-1.11.0-1.el7.x86_64.rpm && \
yum install -y ucx-ib-1.11.0-1.el7.x86_64.rpm && \
yum install -y ucx-rdmacm-1.11.0-1.el7.x86_64.rpm
RUN rm -rf /tmp/*.rpm && rm /tmp/*.bz2
@@ -16,14 +16,14 @@
# Sample Dockerfile to install UCX in a Ubuntu 18.04 image
#
# The parameters are:
# - CUDA_VER: 11.0.3 to pick up the latest 11.x CUDA base layer
# - CUDA_VER: 11.2.2 to pick up the latest 11.2 CUDA base layer
# - UCX_VER and UCX_CUDA_VER: these are used to pick a package matching a specific UCX version and
# CUDA runtime from the UCX github repo.
# See: https://github.com/openucx/ucx/releases/

ARG CUDA_VER=11.0
ARG UCX_VER=v1.10.1
ARG UCX_CUDA_VER=11.0
ARG CUDA_VER=11.2.2
ARG UCX_VER=v1.11.0
ARG UCX_CUDA_VER=11.2

FROM nvidia/cuda:${CUDA_VER}-runtime-ubuntu18.04
ARG UCX_VER
@@ -19,25 +19,24 @@
# The parameters are:
# - RDMA_CORE_VERSION: Set to 32.1 to match the rdma-core line in the latest
# released MLNX_OFED 5.x driver
# - CUDA_VER: 11.0.3 to pick up the latest 11.x CUDA base layer
# - UCX_VER and UCX_CUDA_VER: these are used to pick a package matchin a specific UCX version and
# - CUDA_VER: 11.2.2 to pick up the latest 11.2 CUDA base layer
# - UCX_VER and UCX_CUDA_VER: these are used to pick a package matching a specific UCX version and
# CUDA runtime from the UCX github repo.
# See: https://github.com/openucx/ucx/releases/
#
# The Dockerfile first fetches and builds `rdma-core` to satisfy requirements for
# the ucx-ib and ucx-rdma RPMs.

ARG RDMA_CORE_VERSION=32.1
ARG CUDA_VER=11.0.3
ARG UCX_VER=v1.10.1
ARG UCX_CUDA_VER=11.0
ARG CUDA_VER=11.2.2
ARG UCX_VER=v1.11.0
ARG UCX_CUDA_VER=11.2

# Throw away image to build rdma_core
FROM ubuntu:18.04 as rdma_core
ARG RDMA_CORE_VERSION

RUN apt update
RUN apt-get install -y dh-make wget build-essential cmake gcc libudev-dev libnl-3-dev libnl-route-3-dev ninja-build pkg-config valgrind python3-dev cython3 python3-docutils pandoc
RUN apt update && apt install -y dh-make wget build-essential cmake gcc libudev-dev libnl-3-dev libnl-route-3-dev ninja-build pkg-config valgrind python3-dev cython3 python3-docutils pandoc

RUN wget https://github.com/linux-rdma/rdma-core/releases/download/v${RDMA_CORE_VERSION}/rdma-core-${RDMA_CORE_VERSION}.tar.gz
RUN tar -xvf *.tar.gz && cd rdma-core*/ && dpkg-buildpackage -b -d
@@ -50,6 +49,6 @@ ARG UCX_CUDA_VER
COPY --from=rdma_core /*.deb /tmp/

RUN apt update
RUN apt-get install -y cuda-compat-11-0 wget udev dh-make libudev-dev libnl-3-dev libnl-route-3-dev python3-dev cython3
RUN apt-get install -y wget
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/$UCX_VER/ucx-$UCX_VER-ubuntu18.04-mofed5.x-cuda$UCX_CUDA_VER.deb
RUN apt install -y /tmp/*.deb && rm -rf /tmp/*.deb
2 changes: 1 addition & 1 deletion shuffle-plugin/pom.xml
@@ -60,7 +60,7 @@
<dependency>
<groupId>org.openucx</groupId>
<artifactId>jucx</artifactId>
<version>1.11.0-rc3</version>
<version>1.11</version>
<scope>compile</scope>
</dependency>
</dependencies>