You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am encountering an NCCL error while trying to perform multi-node training with a DGX-H100 server equipped with a Mellanox HCA and a Kaytus server equipped with a Broadcom HCA. (The logs include a "vendor err" message.)
Both servers are using RoCE communication, have ACS functionality disabled, and HPC-X has been loaded.
For distributed training, I am using Hugging Face's Accelerate library.
I will attach the NCCL logs for your reference. Could you please take a look? 😥
Additionally, is multi-node training between servers with different HCA vendors unsupported?
The text was updated successfully, but these errors were encountered:
asdfry
changed the title
NCCL error ('vendor err') during multi-node training with mixed HCA vendors (Mellanox and Broadcom)
NCCL error (vendor err) during multi-node training with mixed HCA vendors (Mellanox and Broadcom)
Nov 28, 2024
It looks like the NICs simply fail to talk to each other:
pnode5:2674921:2676633 [0] ib_plugin.c:1105 NCCL WARN NET/IB : Got completion from peer 10.2.8.2<55466> with error 12, opcode 0, len 0, vendor err 129 (Recv) localGid fe80::966d:aeff:fe5d:9950 remoteGid fe80::6e92:cfff:fe87:ca50
NCCL picked gids fe80::966d:aeff:fe5d:9950 and fe80::6e92:cfff:fe87:ca50 to try to communicate, but those can't talk to each other (they are IPv6 link-level addresses, not sure we should use those).
Maybe run show_gids on each node and check whether you should force a different one with NCCL_IB_GID_INDEX?
Additionally, is multi-node training between servers with different HCA vendors unsupported?
It should work in theory, but I can't say that we test that frequently, so if it was broken, I'm not sure we'd notice.
I remapped the IPs assigned to the NICs on the Kaytus server to match the GPU order, and ran the script with NCCL_IB_GID_INDEX=3 as you suggested. This resolved the vendor err issue.
However, the training now freezes at some point during the process. Each server has eight 400G NICs, but it seems like we're not utilizing the bandwidth effectively. Could you advise on what aspects I should look into to resolve this?
Any suggestions on network configuration or other possible issues would be greatly appreciated.
Hello,
I am encountering an NCCL error while trying to perform multi-node training with a DGX-H100 server equipped with a Mellanox HCA and a Kaytus server equipped with a Broadcom HCA. (The logs include a "vendor err" message.)
Both servers are using RoCE communication, have ACS functionality disabled, and HPC-X has been loaded.
For distributed training, I am using Hugging Face's Accelerate library.
I will attach the NCCL logs for your reference. Could you please take a look? 😥
Additionally, is multi-node training between servers with different HCA vendors unsupported?
Thank you in advance for your support!
nccl-log.zip
The text was updated successfully, but these errors were encountered: