
NCCL error (vendor err) during multi-node training with mixed HCA vendors (Mellanox and Broadcom) #1526

Open
asdfry opened this issue Nov 28, 2024 · 2 comments


asdfry commented Nov 28, 2024

Hello,

I am encountering an NCCL error when running multi-node training between a DGX-H100 server equipped with a Mellanox HCA and a Kaytus server equipped with a Broadcom HCA. (The logs include a "vendor err" message.)

Both servers use RoCE for communication, have ACS disabled, and have HPC-X loaded.
For distributed training, I am using Hugging Face's Accelerate library.

I will attach the NCCL logs for your reference. Could you please take a look? 😥
Additionally, is multi-node training between servers with different HCA vendors unsupported?

Thank you in advance for your support!

nccl-log.zip

@asdfry asdfry changed the title NCCL error ('vendor err') during multi-node training with mixed HCA vendors (Mellanox and Broadcom) NCCL error (vendor err) during multi-node training with mixed HCA vendors (Mellanox and Broadcom) Nov 28, 2024

sjeaugey commented Nov 28, 2024

It looks like the NICs simply fail to talk to each other:

```
pnode5:2674921:2676633 [0] ib_plugin.c:1105 NCCL WARN NET/IB : Got completion from peer 10.2.8.2<55466> with error 12, opcode 0, len 0, vendor err 129 (Recv) localGid fe80::966d:aeff:fe5d:9950 remoteGid fe80::6e92:cfff:fe87:ca50
```

NCCL picked GIDs fe80::966d:aeff:fe5d:9950 and fe80::6e92:cfff:fe87:ca50 to try to communicate, but those can't talk to each other (they are IPv6 link-local addresses; not sure we should use those).

Maybe run show_gids on each node and check whether you should force a different one with NCCL_IB_GID_INDEX?
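As a sketch of that check (assuming the `show_gids` helper from the RDMA tooling is available on both nodes; the device name `mlx5_0` and index `3` below are examples only — pick whichever entry is a RoCE v2 GID mapping to a routable address rather than an fe80:: link-local one):

```shell
# List the GID table on each node and note which index holds a
# RoCE v2 GID with a routable IPv4/IPv6 address.
show_gids

# Alternatively, read a GID entry directly from sysfs
# (device/port/index names will differ per system):
cat /sys/class/infiniband/mlx5_0/ports/1/gids/3

# Then force NCCL to use that GID index on all ranks:
export NCCL_IB_GID_INDEX=3
```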

> Additionally, is multi-node training between servers with different HCA vendors unsupported?

It should work in theory, but I can't say that we test that frequently, so if it was broken, I'm not sure we'd notice.


asdfry commented Dec 3, 2024

I remapped the IPs assigned to the NICs on the Kaytus server to match the GPU order, and ran the script with NCCL_IB_GID_INDEX=3 as you suggested. This resolved the vendor err issue.

However, training now freezes partway through. Each server has eight 400G NICs, but it seems we are not utilizing the bandwidth effectively. Could you advise on what aspects I should look into to resolve this?

Any suggestions on network configuration or other possible issues would be greatly appreciated.

logs.zip
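One common way to isolate network throughput from the training job (not a step suggested in the thread itself) is NVIDIA's nccl-tests benchmark; a sketch, where the host names `node1`/`node2` are placeholders:

```shell
# Build nccl-tests with MPI support and run an all-reduce benchmark
# across both nodes to measure achievable bus bandwidth without the
# training workload in the way.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make MPI=1

# 2 nodes x 8 GPUs; NCCL_DEBUG=INFO shows which NICs and GIDs each
# ring/channel actually uses.
mpirun -np 16 -H node1:8,node2:8 \
  -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 \
  ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
```

If the benchmark also stalls or reports low bus bandwidth, the problem is in the fabric or NCCL configuration rather than the training script.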
