Upgrade to UCX 1.12.1 for 22.06 #5141

Merged 4 commits on Apr 6, 2022
Changes from 3 commits
51 changes: 22 additions & 29 deletions docs/additional-functionality/rapids-shuffle.md
Original file line number Diff line number Diff line change
@@ -43,7 +43,7 @@ be installed on the host and inside Docker containers (if not baremetal). A host
requirements, like the MLNX_OFED driver and `nv_peer_mem` kernel module.

The minimum UCX requirement for the RAPIDS Shuffle Manager is
[UCX 1.11.2](https://github.com/openucx/ucx/releases/tag/v1.11.2).
[UCX 1.12.1](https://github.com/openucx/ucx/releases/tag/v1.12.1).

#### Baremetal

@@ -73,47 +73,40 @@ The minimum UCX requirement for the RAPIDS Shuffle Manager is
further.

2. Fetch and install the UCX package for your OS from:
[UCX 1.11.2](https://github.com/openucx/ucx/releases/tag/v1.11.2).

NOTE: Please install the artifact with the newest CUDA 11.x version (for UCX 1.11.2 please
pick CUDA 11.2) as CUDA 11 introduced [CUDA Enhanced Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#enhanced-compat-minor-releases).
Starting with UCX 1.12, UCX will stop publishing individual artifacts for each minor version of CUDA.

Please refer to our [FAQ](../FAQ.md#what-hardware-is-supported) for caveats with
CUDA Enhanced Compatibility.
[UCX 1.12.1](https://github.com/openucx/ucx/releases/tag/v1.12.1).

RDMA packages have extra requirements that should be satisfied by MLNX_OFED.

##### CentOS UCX RPM
The UCX packages for CentOS 7 and 8 are divided into different RPMs. For example, UCX 1.11.2
The UCX packages for CentOS 7 and 8 are divided into different RPMs. For example, UCX 1.12.1
available at
https://github.com/openucx/ucx/releases/download/v1.11.2/ucx-v1.11.2-centos7-mofed5.x-cuda11.2.tar.bz2
https://github.com/openucx/ucx/releases/download/v1.12.1/ucx-v1.12.1-centos7-mofed5-cuda11.tar.bz2
contains:

```
ucx-devel-1.11.2-1.el7.x86_64.rpm
ucx-debuginfo-1.11.2-1.el7.x86_64.rpm
ucx-1.11.2-1.el7.x86_64.rpm
ucx-cuda-1.11.2-1.el7.x86_64.rpm
ucx-rdmacm-1.11.2-1.el7.x86_64.rpm
ucx-cma-1.11.2-1.el7.x86_64.rpm
ucx-ib-1.11.2-1.el7.x86_64.rpm
ucx-devel-1.12.1-1.el7.x86_64.rpm
ucx-debuginfo-1.12.1-1.el7.x86_64.rpm
ucx-1.12.1-1.el7.x86_64.rpm
ucx-cuda-1.12.1-1.el7.x86_64.rpm
ucx-rdmacm-1.12.1-1.el7.x86_64.rpm
ucx-cma-1.12.1-1.el7.x86_64.rpm
ucx-ib-1.12.1-1.el7.x86_64.rpm
```

For a setup without RoCE or Infiniband networking, the only packages required are:

```
ucx-1.11.2-1.el7.x86_64.rpm
ucx-cuda-1.11.2-1.el7.x86_64.rpm
ucx-1.12.1-1.el7.x86_64.rpm
ucx-cuda-1.12.1-1.el7.x86_64.rpm
```

If accelerated networking is available, the package list is:

```
ucx-1.11.2-1.el7.x86_64.rpm
ucx-cuda-1.11.2-1.el7.x86_64.rpm
ucx-rdmacm-1.11.2-1.el7.x86_64.rpm
ucx-ib-1.11.2-1.el7.x86_64.rpm
ucx-1.12.1-1.el7.x86_64.rpm
ucx-cuda-1.12.1-1.el7.x86_64.rpm
ucx-rdmacm-1.12.1-1.el7.x86_64.rpm
ucx-ib-1.12.1-1.el7.x86_64.rpm
```
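
The two package lists above differ only in the RDMA-related RPMs. A minimal sketch of selecting the right set, assuming a hypothetical helper named `ucx_rpm_list` and the `el7` x86_64 file names from the listing above. It only prints file names and installs nothing:

```shell
# Hypothetical helper (not part of UCX or this project): print the UCX RPMs
# needed on a CentOS 7 host.
#   $1 = UCX version, e.g. 1.12.1
#   $2 = "rdma" if RoCE/InfiniBand networking is available, anything else otherwise
# The body runs in a subshell, so its variables do not leak to the caller.
ucx_rpm_list() (
    ver="$1"
    mode="$2"
    pkgs="ucx ucx-cuda"
    # Accelerated networking additionally needs the rdmacm and ib packages.
    if [ "$mode" = "rdma" ]; then
        pkgs="$pkgs ucx-rdmacm ucx-ib"
    fi
    for p in $pkgs; do
        echo "${p}-${ver}-1.el7.x86_64.rpm"
    done
)
```

For example, `sudo yum install -y $(ucx_rpm_list 1.12.1 rdma)`, run from the directory where the tarball was extracted, would install the accelerated-networking set.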

---
@@ -152,7 +145,7 @@ system if you have RDMA capable hardware.
Within the Docker container we need to install UCX and its requirements. These are Dockerfile
examples for Ubuntu 18.04:

The following are examples of Docker containers with UCX 1.11.2 and cuda-11.2 support.
The following are examples of Docker containers with UCX 1.12.1 and cuda-11.2 support.

| OS Type | RDMA | Dockerfile |
| ------- | ---- | ---------- |
@@ -296,7 +289,7 @@ In this section, we are using a docker container built using the sample dockerfi
| Databricks 9.1 | com.nvidia.spark.rapids.spark312db.RapidsShuffleManager |
| Databricks 10.4 | com.nvidia.spark.rapids.spark321db.RapidsShuffleManager |

2. Settings for UCX 1.11.2+:
2. Settings for UCX 1.12.1+:

Minimum configuration:

@@ -345,9 +338,9 @@ guide for Databricks. The following are extra steps required to enable UCX.
```
#!/bin/bash
sudo apt install -y wget libnuma1 &&
wget https://github.com/openucx/ucx/releases/download/v1.11.2/ucx-v1.11.2-ubuntu18.04-mofed5.x-cuda11.2.deb &&
sudo dpkg -i ucx-v1.11.2-ubuntu18.04-mofed5.x-cuda11.2.deb &&
rm ucx-v1.11.2-ubuntu18.04-mofed5.x-cuda11.2.deb
wget https://github.com/openucx/ucx/releases/download/v1.12.1/ucx-v1.12.1-ubuntu18.04-mofed5-cuda11.deb &&
sudo dpkg -i ucx-v1.12.1-ubuntu18.04-mofed5-cuda11.deb &&
rm ucx-v1.12.1-ubuntu18.04-mofed5-cuda11.deb
```
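
The artifact name changes between the two releases in this hunk: the 1.11.2 package carries a `mofed5.x-cuda11.2` suffix, while the 1.12.1 package uses `mofed5-cuda11` (CUDA major version only). A minimal sketch of building the 1.12.1-style download URL, assuming a hypothetical helper named `ucx_deb_url`. The naming pattern is taken from the URLs in this diff and may not hold for other UCX releases:

```shell
# Hypothetical helper: build the download URL of a UCX .deb artifact using the
# 1.12.1 naming scheme seen in this diff (mofed5, CUDA major version only).
#   $1 = UCX version, e.g. 1.12.1
#   $2 = Ubuntu version, e.g. 18.04
#   $3 = CUDA major version, e.g. 11
ucx_deb_url() (
    ucx_ver="$1"
    ubuntu_ver="$2"
    cuda_major="$3"
    base="https://github.com/openucx/ucx/releases/download"
    echo "${base}/v${ucx_ver}/ucx-v${ucx_ver}-ubuntu${ubuntu_ver}-mofed5-cuda${cuda_major}.deb"
)
```

An init script could then call `wget "$(ucx_deb_url 1.12.1 18.04 11)"` so that the version appears in one place only.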

Save the script in DBFS and add it to the "Init Scripts" list:
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
#
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2021-2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -22,15 +22,15 @@
# See: https://github.com/openucx/ucx/releases/

ARG CUDA_VER=11.2.2
ARG UCX_VER=1.11.2
ARG UCX_CUDA_VER=11.2
ARG UCX_VER=1.12.1
ARG UCX_CUDA_VER=11

FROM nvidia/cuda:${CUDA_VER}-runtime-centos7
ARG UCX_VER
ARG UCX_CUDA_VER

RUN yum update -y && yum install -y wget bzip2
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/v$UCX_VER/ucx-v$UCX_VER-centos7-mofed5.x-cuda$UCX_CUDA_VER.tar.bz2
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/v$UCX_VER/ucx-v$UCX_VER-centos7-mofed5-cuda$UCX_CUDA_VER.tar.bz2
RUN cd /tmp && tar -xvf *.bz2 && \
yum install -y ucx-$UCX_VER-1.el7.x86_64.rpm && \
yum install -y ucx-cuda-$UCX_VER-1.el7.x86_64.rpm && \
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
#
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2021-2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -29,8 +29,8 @@

ARG RDMA_CORE_VERSION=32.1
ARG CUDA_VER=11.2.2
ARG UCX_VER=1.11.2
ARG UCX_CUDA_VER=11.2
ARG UCX_VER=1.12.1
ARG UCX_CUDA_VER=11

# Throw away image to build rdma_core
FROM centos:7 as rdma_core
@@ -59,7 +59,7 @@ COPY --from=rdma_core /tmp/*.rpm /tmp/

RUN yum update -y
RUN yum install -y wget bzip2
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/v$UCX_VER/ucx-v$UCX_VER-centos7-mofed5.x-cuda$UCX_CUDA_VER.tar.bz2
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/v$UCX_VER/ucx-v$UCX_VER-centos7-mofed5-cuda$UCX_CUDA_VER.tar.bz2
RUN cd /tmp && \
yum install -y *.rpm && \
tar -xvf *.bz2 && \
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
#
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2021-2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -22,14 +22,14 @@
# See: https://github.com/openucx/ucx/releases/

ARG CUDA_VER=11.2.2
ARG UCX_VER=1.11.2
ARG UCX_CUDA_VER=11.2
ARG UCX_VER=1.12.1
ARG UCX_CUDA_VER=11

FROM nvidia/cuda:${CUDA_VER}-runtime-ubuntu18.04
ARG UCX_VER
ARG UCX_CUDA_VER

RUN apt update
RUN apt-get install -y wget
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/v$UCX_VER/ucx-v$UCX_VER-ubuntu18.04-mofed5.x-cuda$UCX_CUDA_VER.deb
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/v$UCX_VER/ucx-v$UCX_VER-ubuntu18.04-mofed5-cuda$UCX_CUDA_VER.deb
RUN apt install -y /tmp/*.deb && rm -rf /tmp/*.deb
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
#
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2021-2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -29,8 +29,8 @@

ARG RDMA_CORE_VERSION=32.1
ARG CUDA_VER=11.2.2
ARG UCX_VER=1.11.2
ARG UCX_CUDA_VER=11.2
ARG UCX_VER=1.12.1
ARG UCX_CUDA_VER=11

# Throw away image to build rdma_core
FROM ubuntu:18.04 as rdma_core
@@ -50,5 +50,5 @@ COPY --from=rdma_core /*.deb /tmp/

RUN apt update
RUN apt-get install -y wget
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/v$UCX_VER/ucx-v$UCX_VER-ubuntu18.04-mofed5.x-cuda$UCX_CUDA_VER.deb
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/v$UCX_VER/ucx-v$UCX_VER-ubuntu18.04-mofed5-cuda$UCX_CUDA_VER.deb
RUN apt install -y /tmp/*.deb && rm -rf /tmp/*.deb
10 changes: 6 additions & 4 deletions jenkins/Dockerfile-blossom.ubuntu
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
#
# Copyright (c) 2020-2021, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020-2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -21,15 +21,17 @@
# Arguments:
# CUDA_VER=11.0+
# UBUNTU_VER=18.04 or 20.04
# UCX_CUDA_VER=11 (major CUDA version)
# UCX_VER=1.12.1
###

jlowe marked this conversation as resolved.
ARG CUDA_VER=11.0
ARG UBUNTU_VER=18.04
ARG UCX_VER=1.11.2
FROM nvidia/cuda:${CUDA_VER}-runtime-ubuntu${UBUNTU_VER}
ARG CUDA_VER
ARG UBUNTU_VER
ARG UCX_VER
ARG UCX_VER=1.12.1
ARG UCX_CUDA_VER=11

# Install jdk-8, jdk-11, maven, docker image
RUN apt-get update -y && \
@@ -53,6 +55,6 @@ RUN apt install -y inetutils-ping expect wget libnuma1 libgomp1

RUN mkdir -p /tmp/ucx && \
cd /tmp/ucx && \
wget https://github.com/openucx/ucx/releases/download/v${UCX_VER}/ucx-v${UCX_VER}-ubuntu${UBUNTU_VER}-mofed5.x-cuda${CUDA_VER}.deb && \
wget https://github.com/openucx/ucx/releases/download/v${UCX_VER}/ucx-v${UCX_VER}-ubuntu${UBUNTU_VER}-mofed5-cuda${UCX_CUDA_VER}.deb && \
dpkg -i *.deb && \
rm -rf /tmp/ucx
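
This hunk replaces `cuda${CUDA_VER}` in the artifact name with a separate `UCX_CUDA_VER` argument, since the 1.12.1 artifacts are named by CUDA major version only. As an alternative sketch (not what this PR does), the major version could be derived from `CUDA_VER` with POSIX parameter expansion. `cuda_major` is a hypothetical helper name:

```shell
# Hypothetical helper: reduce a CUDA version string to its major component.
#   $1 = CUDA version, e.g. 11.0 or 11.2.2
cuda_major() {
    # "%%.*" removes the longest suffix that starts at a dot,
    # leaving only the text before the first dot.
    echo "${1%%.*}"
}
```

Keeping `UCX_CUDA_VER` as an explicit argument, as the PR does, leaves the choice visible to whoever builds the image; the derived form trades that for one fewer argument to keep in sync.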
2 changes: 1 addition & 1 deletion shuffle-plugin/pom.xml
Original file line number Diff line number Diff line change
@@ -44,7 +44,7 @@
<dependency>
<groupId>org.openucx</groupId>
<artifactId>jucx</artifactId>
<version>1.11</version>
<version>1.12.1</version>
<scope>compile</scope>
</dependency>
<dependency>