From 269e7a76857df26dca428018d4a22195ebe36ac2 Mon Sep 17 00:00:00 2001
From: Alessandro Bellina
Date: Mon, 21 Jun 2021 14:13:47 -0500
Subject: [PATCH] Add a link for Mellanox documentation on RoCE and a pointer to
 --without-ucx for the MOFED installation script (#2745)

Signed-off-by: Alessandro Bellina
---
 docs/additional-functionality/rapids-shuffle.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/docs/additional-functionality/rapids-shuffle.md b/docs/additional-functionality/rapids-shuffle.md
index 7f2caa98f7e..fc003b720eb 100644
--- a/docs/additional-functionality/rapids-shuffle.md
+++ b/docs/additional-functionality/rapids-shuffle.md
@@ -50,6 +50,16 @@ The minimum UCX requirement for the RAPIDS Shuffle Manager is
 in machines that don't connect their GPUs and NICs to PCIe switches (i.e. directly to the
 root-complex).
 
+Other considerations:
+
+- Please refer to [Mellanox documentation](
+  https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment)
+  on how to configure RoCE networks (lossless/lossy, QoS, and more)
+
+- We recommend that the `--without-ucx` option is passed when installing MLNX_OFED
+  (`mlnxofedinstall`). This is because the UCX included in MLNX_OFED does not have CUDA support,
+  and is likely older than what is available in the UCX repo (see Step 2 below).
+
 If you encounter issues or poor performance, GPUDirectRDMA can be controlled via the UCX environment
 variable `UCX_IB_GPU_DIRECT_RDMA=no`, but please
 [file a GitHub issue](https://github.com/NVIDIA/spark-rapids/issues) so we can investigate
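
Reviewer note: the two settings this patch documents could be exercised roughly as in the dry-run sketch below. It only prints the commands and does not run them. The `MOFED_DIR` path and both function names are hypothetical, and passing the UCX variable through Spark's `spark.executorEnv.*` mechanism is an assumption about how a deployment might set it, not something the patch states.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.

# Hypothetical location of the unpacked MLNX_OFED bundle; adjust as needed.
MOFED_DIR="${MOFED_DIR:-/tmp/MLNX_OFED}"

# Installer invocation that skips the bundled UCX (which lacks CUDA support).
mofed_install_cmd() {
  printf '%s\n' "$MOFED_DIR/mlnxofedinstall --without-ucx"
}

# Assumed way to forward UCX_IB_GPU_DIRECT_RDMA=no to Spark executors
# when troubleshooting GPUDirectRDMA.
gdr_off_conf() {
  printf '%s\n' '--conf spark.executorEnv.UCX_IB_GPU_DIRECT_RDMA=no'
}

mofed_install_cmd
gdr_off_conf
```

Printing rather than executing keeps the sketch safe to run on a machine without MLNX_OFED installed; the real installer needs root and a matching kernel.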