Issues: NVIDIA/nccl
When the number of nodes increases, the bandwidth performance of alltoall is unstable
#1531 opened Dec 5, 2024 by fj1425fj
Error Using Different GPUs for Two Containers on the Same Node
#1529 opened Dec 2, 2024 by cyberpunk-admin
NCCL error ("vendor err") during multi-node training with mixed HCA vendors (Mellanox and Broadcom)
#1526 opened Nov 28, 2024 by asdfry
local access violation work queue error when upgrading to v2.20.3-1
#1524 opened Nov 26, 2024 by gangxie112
Why group calls (ncclGroupStart() and ncclGroupEnd()) are invoked in ncclSend() and ncclRecv()
#1521 opened Nov 21, 2024 by ZhiyiHu1999
Is it safe or recommended to use multiple communicators for real distributed training
#1520 opened Nov 19, 2024 by ZhiyiHu1999
NCCL socketStartConnect: Connect to x.x.x.x<xxxx> failed : Software caused connection abort
#1515 opened Nov 16, 2024 by 913871734
torch.distributed.DistBackendError: NCCL error in ProcessGroupNCCL.cpp:1275
#1514 opened Nov 14, 2024 by shenshaowei