[BUG] Computation wrapped by jit_fuser gets stuck when a standalone thread schedules the pipeline-parallel collective communication operations
#1028 · Open · xinyu-yang opened this issue · Aug 23, 2024 · 0 comments
Describe the bug
I would like to implement a scheduler that controls the order of the collective communication operations in pipeline parallelism. I implemented a scheduling class built around a priority queue that exposes two main methods, enqueue and dequeue.
Whenever a "group.comm" function such as "group.isend" or "group.irecv" is called, I put it into the queue together with the operator, args, and kwargs. Meanwhile, a separate thread continuously schedules and dequeues the collective communication operations and issues them.
For now, I simply schedule the pipeline-parallel collective communication ops in their original (enqueue) order. However, the process gets stuck at the first computation function decorated with "jit_fuser"; in our case, that is the "bias_dropout_add_fused_train" function.
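The scheduler described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual code: the class name CommScheduler, the enqueue signature, and the priority field are all assumptions; the real implementation would enqueue torch.distributed calls such as group.isend/group.irecv.

```python
# Hypothetical sketch of the described scheduler (assumed names and API):
# comm ops are enqueued with their args/kwargs, and a standalone worker
# thread dequeues and issues them in scheduled (here: enqueue) order.
import queue
import threading


class CommScheduler:
    """Queues collective-communication ops and replays them on a worker thread."""

    def __init__(self):
        # Entries are (priority, sequence, op, args, kwargs); the monotonically
        # increasing sequence number preserves enqueue order among equal priorities.
        self._queue = queue.PriorityQueue()
        self._seq = 0
        self._lock = threading.Lock()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, op, *args, priority=0, **kwargs):
        """Called in place of a direct group.isend / group.irecv invocation."""
        with self._lock:
            self._queue.put((priority, self._seq, op, args, kwargs))
            self._seq += 1

    def _run(self):
        # Dequeue forever, issuing each comm op on this standalone thread.
        while True:
            _priority, _seq, op, args, kwargs = self._queue.get()
            op(*args, **kwargs)
```

Note that with this design every communication call actually executes on the worker thread, not on the thread running the model's forward/backward computation, which may matter for backends with threading constraints.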
Note that the process gets stuck with the NCCL backend but works fine with the GLOO backend.
I have no idea what causes this. Thanks to anyone who can help or provide any hints! 😊