Multi gpus problem #14

Open
YisuiTT opened this issue Jul 29, 2024 · 2 comments

YisuiTT commented Jul 29, 2024

This work is great, but when running on three GPUs with three prompts, I get the following error. How can I fix it?

Rank 1 is running.
Rank 0 is running.
Rank 2 is running.
Loading pipeline components...: 100%|██████████| 5/5 [00:02<00:00, 2.27it/s]
Loading pipeline components...: 100%|██████████| 5/5 [00:02<00:00, 1.92it/s]
Loading pipeline components...: 100%|██████████| 5/5 [00:02<00:00, 1.99it/s]
0%| | 0/30 [00:00<?, ?it/s]Found 34 attns
Found 22 convs
Found 34 attns
Found 22 convs
0%| | 0/30 [00:00<?, ?it/s][rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for 600149 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for 600152 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for 600152 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff49360ed87 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7ff4947b66e6 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7ff4947b9c3d in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7ff4947ba839 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7ff4de4e0bf4 in /.conda/envs/V_I_vc2/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7ff4e0071609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7ff4dfe3c353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for 600149 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f56b609bd87 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f56b72436e6 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f56b7246c3d in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f56b7247839 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f5700f6dbf4 in /.conda/envs/V_I_vc2/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f5702afe609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f57028c9353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for 600435 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for 600435 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f80a5aadd87 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f80a6c556e6 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f80a6c58c3d in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f80a6c59839 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f80f097fbf4 in /.conda/envs/V_I_vc2/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f80f2510609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f80f22db353 in /lib/x86_64-linux-gnu/libc.so.6)
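
The timeout above fires on the very first collective (SeqNum=1, OpType=ALLGATHER), which usually means the ranks never complete an NCCL handshake rather than that the model code itself is at fault. One way to narrow this down, assuming PyTorch with the NCCL backend, is a bare all_gather across the same three GPUs, independent of this repository (the file name and launch command below are only an example):

# nccl_check.py - minimal NCCL sanity check (hypothetical helper, not part of this repo).
# Launch with: torchrun --nproc_per_node=3 nccl_check.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")    # torchrun supplies RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(rank)                # one GPU per rank on a single node

    # Mirror the collective that times out above, with a tiny tensor.
    local = torch.full((4,), float(rank), device="cuda")
    gathered = [torch.empty_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)

    print(f"Rank {rank} gathered: {[t.tolist() for t in gathered]}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this minimal script also hangs, the problem lies in the GPU/driver/NCCL setup rather than in this project.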

Yuanshi9815 (Owner) commented

This seems to be an issue related to NVIDIA GPU communication. May I ask:

  • Does this issue occur specifically when using 3 GPUs? Does it also happen with 2 or 4 GPUs?
  • The NCCL port might be occupied. Could you try changing the master_port in the configuration file config.json to see if that resolves the issue? (See the sketch after this list.)
  • Does the issue also occur with multiple GPUs and a single prompt?
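
If changing the port alone does not help, it may also be worth enabling NCCL's own logging and raising the collective timeout, so the first allgather produces a clearer error instead of the watchdog tearing all ranks down after 10 minutes. A minimal sketch, assuming the process group is created directly with torch.distributed and the launcher has already set RANK and WORLD_SIZE (the actual call site in this repository may differ):

import os
from datetime import timedelta
import torch.distributed as dist

# NCCL debug output must be enabled before the process group is created.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ.setdefault("MASTER_PORT", "29501")  # example of picking a different, unused port

# Raise the collective timeout above the default 10 minutes.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))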


YisuiTT commented Jul 31, 2024

Sorry for the late reply. Thank you for the suggestions, but unfortunately I have tried all of the above and still get NCCL errors.
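
One further thing that may be worth ruling out, although it is only an assumption and not confirmed anywhere in this thread, is a faulty peer-to-peer or InfiniBand path between the GPUs. NCCL can be forced onto its shared-memory/socket fallback with environment variables set before torch.distributed is initialized (or exported in the shell before launching):

import os

# Assumed diagnostic settings, not taken from this repository.
os.environ["NCCL_P2P_DISABLE"] = "1"  # disable direct GPU peer-to-peer transfers
os.environ["NCCL_IB_DISABLE"] = "1"   # disable the InfiniBand transport on a single node
os.environ["NCCL_DEBUG"] = "INFO"     # log which transport each rank actually selects

If the run completes with these set, the underlying issue is the GPU interconnect on the machine rather than this repository.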
