This work is great, but when running on three GPUs with three prompts, I get the following error. How do I fix this?
Rank 1 is running.
Rank 0 is running.
Rank 2 is running.
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00, 2.27it/s]
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00, 1.92it/s]
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00, 1.99it/s]
0%| | 0/30 [00:00<?, ?it/s]Found 34 attns
Found 22 convs
Found 34 attns
Found 22 convs
0%| | 0/30 [00:00<?, ?it/s][rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for 600149 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for 600152 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for 600152 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff49360ed87 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7ff4947b66e6 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7ff4947b9c3d in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7ff4947ba839 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7ff4de4e0bf4 in /.conda/envs/V_I_vc2/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7ff4e0071609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7ff4dfe3c353 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for 600149 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f56b609bd87 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f56b72436e6 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f56b7246c3d in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f56b7247839 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f5700f6dbf4 in /.conda/envs/V_I_vc2/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f5702afe609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f57028c9353 in /lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for 600435 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for 600435 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f80a5aadd87 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f80a6c556e6 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f80a6c58c3d in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f80a6c59839 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f80f097fbf4 in /.conda/envs/V_I_vc2/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f80f2510609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f80f22db353 in /lib/x86_64-linux-gnu/libc.so.6)
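To make the log above easier to read, here is a minimal sketch that pulls the fields out of the watchdog lines and summarizes them per rank. The function name `summarize_timeouts` and the regular expression are illustrative assumptions, not part of PyTorch or this project; the pattern is written only against the message format shown in this log. For an all-gather, the output element count divided by the input element count should match the number of participating ranks, so the sketch reports that ratio as well.

```python
import re

# Matches the "Watchdog caught collective operation timeout" message
# exactly as it appears in the log above (format assumed from this log only).
TIMEOUT_PATTERN = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\)"
)


def summarize_timeouts(log_text: str) -> dict:
    """Return {rank: details} for every rank that reported a timeout."""
    summary = {}
    for match in TIMEOUT_PATTERN.finditer(log_text):
        numel_in = int(match["numel_in"])
        numel_out = int(match["numel_out"])
        # Each rank logs the message more than once; later hits overwrite
        # earlier ones with identical values.
        summary[int(match["rank"])] = {
            "seq": int(match["seq"]),
            "op": match["op"],
            "timeout_ms": int(match["timeout_ms"]),
            # For ALLGATHER this ratio is the implied number of ranks.
            "implied_ranks": numel_out // numel_in,
        }
    return summary


if __name__ == "__main__":
    sample = (
        "[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught "
        "collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, "
        "NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for "
        "600149 milliseconds before timing out."
    )
    for rank, info in sorted(summarize_timeouts(sample).items()):
        print(rank, info)
```

Applied to the full log, this shows all three ranks stuck on the same operation (`SeqNum=1`, `ALLGATHER`) with an implied rank count of 3. In other words, the very first collective never completed on any rank, which points at communication setup rather than a failure partway through sampling.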
This seems to be an issue related to NVIDIA GPU communication. May I ask a few questions?
Does this issue occur only when using 3 GPUs? Does it also happen when using 2 or 4 GPUs?
It might be that the NCCL port is occupied. Could you try changing the master_port in the configuration file config.json to see if that resolves the issue?
Does this issue also occur when running on multiple GPUs with a single prompt?
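To check the port suggestion before editing anything by hand, something like the following sketch may help. It uses only the Python standard library. The helper names (`port_is_free`, `pick_free_port`, `set_master_port`) are hypothetical, and the sketch assumes that `master_port` is a top-level key in config.json; adjust it if the file is laid out differently.

```python
import json
import socket
from pathlib import Path


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port (binding to port 0 means 'any')."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def set_master_port(config_path: str, port: int) -> None:
    """Rewrite the (assumed top-level) "master_port" key in the config file."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["master_port"] = port
    path.write_text(json.dumps(config, indent=2))


if __name__ == "__main__":
    config_file = "config.json"  # path assumed; adjust to your checkout
    current = json.loads(Path(config_file).read_text()).get("master_port")
    if current is None or not port_is_free(int(current)):
        new_port = pick_free_port()
        set_master_port(config_file, new_port)
        print(f"master_port {current} unavailable, switched to {new_port}")
    else:
        print(f"master_port {current} looks free")
```

Note that a port reported as free can still be taken by another process before the run starts, so this is a quick check rather than a guarantee. If the port turns out to be free and the timeout persists, the other questions above (GPU count, single prompt) are the next things to try.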