[BUG] Computation wrapped by jit_fuser gets stuck when a standalone thread schedules the pipeline-parallel collective communication operations
#1028 · Open · xinyu-yang opened this issue · Aug 23, 2024 · 0 comments
Describe the bug
I would like to implement a scheduler that controls the order of the collective communication operations in pipeline parallelism. I implemented a scheduling class built around a priority queue that exposes two main methods, enqueue and dequeue.
Whenever a "group.comm" function such as "group.isend" or "group.irecv" is called, I put it into the queue together with the operator, args, and kwargs. Meanwhile, a separate thread continuously schedules and dequeues the collective communication operations and issues them.
For now, I simply schedule the pipeline-parallel collective communication ops in their original (enqueue) order. However, the process gets stuck at the first computation function decorated with "jit_fuser"; in our case, that is the "bias_dropout_add_fused_train" function.
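The scheduler described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual code: the class name CommScheduler, the enqueue signature, and the priority field are all assumptions; the real implementation would enqueue torch.distributed calls such as group.isend/group.irecv.

```python
# Hypothetical sketch of the described scheduler (assumed names and API):
# comm ops are enqueued with their args/kwargs, and a standalone worker
# thread dequeues and issues them in scheduled (here: enqueue) order.
import queue
import threading


class CommScheduler:
    """Queues collective-communication ops and replays them on a worker thread."""

    def __init__(self):
        # Entries are (priority, sequence, op, args, kwargs); the monotonically
        # increasing sequence number preserves enqueue order among equal priorities.
        self._queue = queue.PriorityQueue()
        self._seq = 0
        self._lock = threading.Lock()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, op, *args, priority=0, **kwargs):
        """Called in place of a direct group.isend / group.irecv invocation."""
        with self._lock:
            self._queue.put((priority, self._seq, op, args, kwargs))
            self._seq += 1

    def _run(self):
        # Dequeue forever, issuing each comm op on this standalone thread.
        while True:
            _priority, _seq, op, args, kwargs = self._queue.get()
            op(*args, **kwargs)
```

Note that with this design every communication call actually executes on the worker thread, not on the thread running the model's forward/backward computation, which may matter for backends with threading constraints.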
Note that the process gets stuck with the NCCL backend but works fine with the GLOO backend.
I have no idea what causes this. Thanks to anyone who can help or provide any hints! 😊