
NCCL error (vendor err) during multi-node training with mixed HCA vendors (Mellanox and Broadcom) #1526

Open
asdfry opened this issue Nov 28, 2024 · 2 comments


asdfry commented Nov 28, 2024

Hello,

I am encountering an NCCL error when running multi-node training between a DGX-H100 server equipped with a Mellanox HCA and a Kaytus server equipped with a Broadcom HCA. (The logs include a "vendor err" message.)

Both servers use RoCE for communication, have ACS disabled, and have HPC-X loaded.
For distributed training, I am using Hugging Face's Accelerate library.

I will attach the NCCL logs for your reference. Could you please take a look? 😥
Additionally, is multi-node training between servers with different HCA vendors unsupported?

Thank you in advance for your support!

nccl-log.zip

@asdfry asdfry changed the title NCCL error ('vendor err') during multi-node training with mixed HCA vendors (Mellanox and Broadcom) NCCL error (vendor err) during multi-node training with mixed HCA vendors (Mellanox and Broadcom) Nov 28, 2024

sjeaugey commented Nov 28, 2024

It looks like the NICs simply fail to talk to each other:

```
pnode5:2674921:2676633 [0] ib_plugin.c:1105 NCCL WARN NET/IB : Got completion from peer 10.2.8.2<55466> with error 12, opcode 0, len 0, vendor err 129 (Recv) localGid fe80::966d:aeff:fe5d:9950 remoteGid fe80::6e92:cfff:fe87:ca50
```

NCCL picked GIDs fe80::966d:aeff:fe5d:9950 and fe80::6e92:cfff:fe87:ca50 to try to communicate, but those can't talk to each other (they are IPv6 link-local addresses; not sure we should use those).

Maybe run show_gids on each node and check whether you should force a different one with NCCL_IB_GID_INDEX?
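As a sketch of that check (assuming the `show_gids` helper from the RDMA tooling is available on both nodes; the device name `mlx5_0` and index `3` below are examples only — pick whichever entry is a RoCE v2 GID mapping to a routable address rather than an fe80:: link-local one):

```shell
# List the GID table on each node and note which index holds a
# RoCE v2 GID with a routable IPv4/IPv6 address.
show_gids

# Alternatively, read a GID entry directly from sysfs
# (device/port/index names will differ per system):
cat /sys/class/infiniband/mlx5_0/ports/1/gids/3

# Then force NCCL to use that GID index on all ranks:
export NCCL_IB_GID_INDEX=3
```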

> Additionally, is multi-node training between servers with different HCA vendors unsupported?

It should work in theory, but I can't say that we test that frequently, so if it was broken, I'm not sure we'd notice.


asdfry commented Dec 3, 2024

I remapped the IPs assigned to the NICs on the Kaytus server to match the GPU order, and ran the script with NCCL_IB_GID_INDEX=3 as you suggested. This resolved the vendor err issue.

However, training now freezes partway through. Each server has eight 400G NICs, but it seems we are not utilizing the bandwidth effectively. Could you advise on what aspects I should look into to resolve this?

Any suggestions on network configuration or other possible issues would be greatly appreciated.

logs.zip
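One common way to isolate network throughput from the training job (not a step suggested in the thread itself) is NVIDIA's nccl-tests benchmark; a sketch, where the host names `node1`/`node2` are placeholders:

```shell
# Build nccl-tests with MPI support and run an all-reduce benchmark
# across both nodes to measure achievable bus bandwidth without the
# training workload in the way.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make MPI=1

# 2 nodes x 8 GPUs; NCCL_DEBUG=INFO shows which NICs and GIDs each
# ring/channel actually uses.
mpirun -np 16 -H node1:8,node2:8 \
  -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 \
  ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
```

If the benchmark also stalls or reports low bus bandwidth, the problem is in the fabric or NCCL configuration rather than the training script.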
