[QUESTION] NCCL timeout error in the second iteration #1140

zmtttt · 2024-09-13T07:47:58Z

I am running GPT-3 on a single machine with 4 GPUs.
The first iteration runs without any errors,
but the run gets stuck at the second step of the second iteration and then fails.
The errors are as follows:

[iteration] datetime: 2024-09-13 07:04:42
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=33, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 607565 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=257, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608700 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1032, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608832 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1796, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608843 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1032, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608832 milliseconds before timing out.

Has anyone run into the same problem? Thanks a lot.
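
For anyone triaging a log like the one above, it can help to group the watchdog lines by rank to see which collective each rank was waiting on. The following is a minimal sketch, not part of Megatron-LM or PyTorch: the function name `summarize_timeouts` and the `SAMPLE_LOG` constant are hypothetical, and the regular expression assumes the exact line format shown in this report. It only organizes the log; it does not diagnose the cause of the hang.

```python
import re
from collections import defaultdict

# Matches the watchdog lines in the format shown in this report (an assumption;
# other PyTorch versions may word these lines differently).
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"Timeout\(ms\)=(?P<timeout>\d+)\) ran for (?P<elapsed>\d+) milliseconds"
)

# A few lines copied from the log in this report.
SAMPLE_LOG = """\
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=33, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 607565 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=257, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608700 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1032, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608832 milliseconds before timing out.
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1032, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608832 milliseconds before timing out.
"""


def summarize_timeouts(log_text):
    """Group timed-out collectives by rank.

    Returns {rank: sorted list of (seq_num, op_type, elapsed_ms)}.
    Repeated lines (such as the what() line) are collapsed into one entry.
    """
    by_rank = defaultdict(set)
    for m in WATCHDOG_RE.finditer(log_text):
        by_rank[int(m["rank"])].add((int(m["seq"]), m["op"], int(m["elapsed"])))
    return {rank: sorted(events) for rank, events in sorted(by_rank.items())}


if __name__ == "__main__":
    for rank, events in summarize_timeouts(SAMPLE_LOG).items():
        for seq, op, elapsed in events:
            print(f"rank {rank}: {op} SeqNum={seq} ran for {elapsed} ms")
```

Ranks that never appear in the summary did not report a timeout in the captured output, which may be worth checking when deciding which process to inspect first.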

zmtttt closed this as completed Oct 9, 2024
