[QUESTION] NCCL timeout error during the second iteration #1141

zmtttt · 2024-09-13T07:49:09Z

I am using one machine with 4 GPUs to run GPT-3.
The first iteration runs without any errors,
but the run gets stuck at the second iteration (the second step) and then fails.
The errors are as follows:

[iteration] datetime: 2024-09-13 07:04:42
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=33, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 607565 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=257, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608700 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1032, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608832 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1796, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608843 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1032, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608832 milliseconds before timing out.
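
To make a log like the one above easier to compare across ranks, here is a minimal sketch that extracts the rank, sequence number, operation type, and elapsed time from each watchdog line. The regular expression and the names `WATCHDOG_RE` and `parse_watchdog_lines` are illustrative assumptions written for this exact log format. They are not part of PyTorch, NCCL, or Megatron-LM, and the sketch only summarizes the log; it does not diagnose the cause of the timeout.

```python
import re

# Hypothetical pattern matching the watchdog lines quoted in this issue.
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"Timeout\(ms\)=(?P<timeout>\d+)\) "
    r"ran for (?P<elapsed>\d+) milliseconds"
)


def parse_watchdog_lines(log_text):
    """Return one dict per watchdog timeout line found in log_text."""
    events = []
    for m in WATCHDOG_RE.finditer(log_text):
        events.append({
            "rank": int(m["rank"]),
            "seq": int(m["seq"]),
            "op": m["op"],
            "timeout_ms": int(m["timeout"]),
            "elapsed_ms": int(m["elapsed"]),
        })
    return events


# Two lines copied from the log in this issue.
SAMPLE = (
    "[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation "
    "timeout: WorkNCCL(SeqNum=33, OpType=ALLREDUCE, Timeout(ms)=600000) ran for "
    "607565 milliseconds before timing out.\n"
    "[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation "
    "timeout: WorkNCCL(SeqNum=257, OpType=ALLREDUCE, Timeout(ms)=600000) ran for "
    "608700 milliseconds before timing out.\n"
)

if __name__ == "__main__":
    for event in parse_watchdog_lines(SAMPLE):
        print(event)
```

Sorting the parsed events by rank shows which ranks reported a timeout and at which sequence number, which can help narrow down where the ranks stopped making progress together.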

Has anyone run into the same problem? Thanks a lot.

wlu1998 · 2024-10-16T07:57:19Z

I have run into the same problem. May I ask whether it has been resolved and, if so, how?
