[data][train] Fix deadlocks caused by streaming_split #42601
Conversation
The branch was force-pushed from c63f8b7 to d9aeb87.
Hmm, sorry, but I don't quite understand the deadlock situation in the PR description and the proposed fix. Doesn't `SplitCoordinator` explicitly require all the consumers to read at the same time? Is the deadlock situation in the PR description somehow different?

Update:

```python
for batch in it.iter_batches():
    all_reduce()
```

We suspect it's because …
late LGTM
Hi @raulchen, thanks for pushing this fix -- this actually fixed a NCCL timeout error that we were seeing when doing multi-node distributed training. The behavior there was that sometimes, randomly at the start of a train epoch, we would hit a NCCL timeout error because all of the ranks except one were trying to allreduce the gradients.

I'm also confused by the deadlock explanation, though. Have you / the team thought more about how exactly this would have created a deadlock with gradient synchronization? We iterate over our data using the …

If it's helpful, we only started seeing this issue when we scaled up the model size (probably because gradient synchronization took longer).
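To make the suspected circular wait concrete, here is a small, self-contained simulation using plain Python threads (no Ray or NCCL involved; all names are illustrative, not from this PR). One barrier stands in for the gradient all-reduce that every rank must join; a second barrier stands in for the pre-fix iterator behavior, where a split that had exhausted its blocks still waited for all other splits to finish:

```python
import threading

NUM_RANKS = 2
BATCHES_PER_RANK = [2, 3]  # uneven splits: rank 0 runs out of data first

# Stand-in for the NCCL all-reduce: every rank must reach it, or it times out.
all_reduce = threading.Barrier(NUM_RANKS, timeout=2)
# Stand-in for the pre-fix iterator behavior: a finished split waits for all others.
wait_for_all_splits = threading.Barrier(NUM_RANKS, timeout=5)

def worker(rank: int) -> None:
    for step in range(BATCHES_PER_RANK[rank]):
        try:
            all_reduce.wait()  # gradient sync: blocks until every rank arrives
        except threading.BrokenBarrierError:
            print(f"rank {rank}: all_reduce timed out at step {step} (NCCL-timeout analogue)")
            return
    try:
        # Circular wait: this rank waits for the other splits to finish iterating,
        # while the other ranks wait for this rank inside all_reduce.
        wait_for_all_splits.wait()
    except threading.BrokenBarrierError:
        print(f"rank {rank}: gave up waiting for the other splits to finish")

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Rank 0 finishes its two batches and then blocks waiting for the other split, while rank 1 blocks in its third all-reduce waiting for rank 0; neither can make progress until a timeout fires, which matches the shape of the NCCL timeout described above.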
Why are these changes needed?
Fix a deadlock issue for training jobs. The issue happens in the following situation:

- The output blocks of `streaming_split` are assigned to multiple splits (`output_split_idx`).
- When one split has finished reading all blocks, it won't stop the iteration until all the other splits have also finished, because of [this](https://github.com/ray-project/ray/blob/fae8d2ff814377eb027d63d73a23d5c5bf3b02bd/python/ray/data/_internal/execution/streaming_executor_state.py#L288).
- This is usually fine. But when the unfinished splits are waiting for the finished splits (e.g., there is a gradient synchronization), there will be a deadlock due to circular dependencies.

This PR lets the finished splits finish iteration immediately without waiting for the others.
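For reference, a minimal `streaming_split` consumption sketch is shown here (the dataset, batch size, and helper names are illustrative; the per-batch gradient all-reduce is omitted). Before this fix, the consumer whose split ran out of blocks first stayed blocked inside `iter_batches()` until the other split also finished, which is what closed the cycle when the other consumers were themselves blocked in a collective:

```python
import ray

ray.init()

# Two splits over a small dataset; with equal=False, one consumer can run
# out of blocks before the other.
ds = ray.data.range(100)
it0, it1 = ds.streaming_split(n=2, equal=False)

@ray.remote
def consume(it, rank):
    # In a real training loop, a gradient all-reduce would sit inside this loop.
    num_rows = 0
    for batch in it.iter_batches(batch_size=10):
        num_rows += len(batch["id"])  # ray.data.range() produces an "id" column
    return rank, num_rows

# Both splits must be consumed concurrently.
print(ray.get([consume.remote(it0, 0), consume.remote(it1, 1)]))
```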
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- For any new APIs (e.g., a new method in Tune), I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.