[BUG] Computation wrapped by jit_fuser get stuck after using a standalone thread to schedule the collective comm operations of pipeline parallel. #1028

Open
xinyu-yang opened this issue Aug 23, 2024 · 0 comments


Describe the bug
I would like to implement a scheduler that controls the order of the collective communication operations used in pipeline parallelism. I implemented a scheduling class backed by a priority queue that provides two main methods, enqueue and dequeue.

Whenever a "group.comm"-style function such as "group.isend" or "group.irecv" is called, I just put the call into the queue, including the operator, args, and kwargs. At the same time, a separate thread continuously schedules and dequeues the queued collective communication operations.
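To make the setup concrete, here is a minimal sketch of the kind of scheduler described above. It is not my actual code: the class name `CommScheduler`, the `priority` argument, and the `shutdown` helper are made up for illustration, and `executed.append` stands in for real calls such as "group.isend" so that the sketch runs without torch or a process group.

```python
import itertools
import queue
import threading


class CommScheduler:
    """Hypothetical scheduler: queues calls and runs them on a worker thread."""

    _STOP = object()  # sentinel that tells the worker thread to exit

    def __init__(self):
        self._queue = queue.PriorityQueue()
        # Monotonic counter used as a tie-breaker, so entries with equal
        # priority come out in enqueue order and ops are never compared.
        self._seq = itertools.count()
        self.errors = []  # exceptions raised by scheduled ops
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, op, *args, priority=0, **kwargs):
        """Store the operator with its args and kwargs; return a completion event."""
        done = threading.Event()
        self._queue.put((priority, next(self._seq), op, args, kwargs, done))
        return done

    def dequeue(self):
        """Block until an entry is available, then return the next one."""
        _priority, _seq, op, args, kwargs, done = self._queue.get()
        return op, args, kwargs, done

    def _run(self):
        # Worker loop: keep dequeuing and executing until the sentinel arrives.
        while True:
            op, args, kwargs, done = self.dequeue()
            if op is self._STOP:
                done.set()
                return
            try:
                op(*args, **kwargs)
            except Exception as exc:  # keep the worker alive on failures
                self.errors.append(exc)
            finally:
                done.set()

    def shutdown(self):
        """Run everything still queued, then stop and join the worker."""
        self.enqueue(self._STOP, priority=float("inf"))
        self._worker.join()


# Stand-in for group.isend / group.irecv: record the order of execution.
executed = []
sched = CommScheduler()
events = [sched.enqueue(executed.append, name)
          for name in ("isend_0", "irecv_0", "isend_1")]
for event in events:
    event.wait()
sched.shutdown()
```

With equal priorities the sequence counter makes the queue behave as FIFO, which corresponds to replaying the operations in their original enqueue order.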

Currently, I just schedule the pipeline-parallel collective communication ops in their original (enqueue) order. However, the process hangs at the first computation function decorated with "jit_fuser". In our case, that is the "bias_dropout_add_fused_train" function.

Note that the process hangs when I use the NCCL backend but works fine when I use the GLOO backend.

I have no idea what causes this problem. Thanks to anyone who can help or provide any hints! 😊
