
[data][train] Fix deadlocks caused by streaming_split #42601

Merged
merged 8 commits into ray-project:master from streaming-split-output-queue
Jan 26, 2024

Conversation

raulchen
Contributor

@raulchen raulchen commented Jan 23, 2024

Why are these changes needed?

Fix a deadlock issue for training jobs. The issue happens in the following situation:

  • The output blocks of streaming_split are assigned to multiple splits (output_split_idx).
  • When one split has finished reading all of its blocks, it won't stop its iteration until all the other splits have finished, because of this check: https://github.com/ray-project/ray/blob/fae8d2ff814377eb027d63d73a23d5c5bf3b02bd/python/ray/data/_internal/execution/streaming_executor_state.py#L288.
  • This is usually fine. But when the unfinished splits are waiting for the finished splits (e.g., there is a per-batch gradient synchronization), a deadlock arises due to the circular dependency.

This PR lets a finished split end its iteration immediately, without waiting for the others; a toy sketch of the circular wait is shown below.
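To make the circular wait concrete, here is a self-contained toy sketch in plain Python threads (not Ray code; the worker count, batch counts, and timeouts are illustrative only). One "worker" stands in for a split that has exhausted its blocks and, under the old behavior, keeps blocking inside the iterator until all splits finish; the other stands in for an unfinished split that blocks in a per-batch gradient sync.

```python
import threading

NUM_WORKERS = 2

# Stand-in for a per-batch gradient sync (all_reduce): every worker must join it.
grad_sync = threading.Barrier(NUM_WORKERS)
# Stand-in for "all splits have finished iterating"; never set in this toy.
all_splits_done = threading.Event()

# Uneven split output: worker 0 sees 2 batches, worker 1 sees 3.
BATCHES = [2, 3]


def worker(rank: int) -> None:
    for _ in range(BATCHES[rank]):
        try:
            grad_sync.wait(timeout=2)  # blocks until every worker reaches the sync
        except threading.BrokenBarrierError:
            print(f"rank {rank}: stuck in gradient sync (circular wait)")
            return
    # Old behavior: a finished split still blocked inside the iterator,
    # waiting for the other splits, which are themselves blocked above.
    if all_splits_done.wait(timeout=2):
        print(f"rank {rank}: iterator returned")
    else:
        print(f"rank {rank}: stuck waiting for the other splits (circular wait)")


threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Running this, both workers report being stuck: neither side can make progress until the other does, which is exactly the dependency this PR removes.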

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Hao Chen <chenh1024@gmail.com>

fix

Signed-off-by: Hao Chen <chenh1024@gmail.com>

separate queues

Signed-off-by: Hao Chen <chenh1024@gmail.com>

debug

Signed-off-by: Hao Chen <chenh1024@gmail.com>

fix

Signed-off-by: Hao Chen <chenh1024@gmail.com>

fix

Signed-off-by: Hao Chen <chenh1024@gmail.com>

debug

Signed-off-by: Hao Chen <chenh1024@gmail.com>

refine

Signed-off-by: Hao Chen <chenh1024@gmail.com>

fix

Signed-off-by: Hao Chen <chenh1024@gmail.com>

Revert "fix"

This reverts commit c63f8b71f150b0dc0add60b2817ce2241abd41ac.

Revert "refine"

This reverts commit 225db8279d128e1d00a359b42a5b7b5b93c57cfb.

fix

Signed-off-by: Hao Chen <chenh1024@gmail.com>
@raulchen raulchen changed the title from "[data] Use multiple queues to store output blocks with split index" to "[data][train] Fix deadlocks caused by streaming_split" on Jan 25, 2024
@stephanie-wang
Contributor

Hmm sorry but I don't quite understand the deadlock situation in the PR description and the proposed fix. Doesn't SplitCoordinator explicitly require all the consumers to read at the same time? Is the deadlock situation in the PR description somehow different?

@raulchen
Contributor Author

Update:
The issue happens when there is a gradient sync per batch. E.g.,

for batch in it.iter_batches():
    all_reduce()

We suspect it's because streaming_split returns a different number of rows for each train worker, which makes the all_reduce calls misaligned.
We'll merge this PR first, as it doesn't make sense to have this dependency in the first place.
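To illustrate the suspected misalignment, here is a self-contained toy using plain torch.distributed with the gloo backend (not Ray code; the batch counts, port, and 5-second timeout are arbitrary choices made so the demo terminates): rank 0 performs three all_reduce calls while rank 1 performs only two, so rank 0's third call can never be matched.

```python
import datetime

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank: int, world_size: int) -> None:
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29511",
        rank=rank,
        world_size=world_size,
        # Short timeout so the demo terminates instead of hanging forever.
        timeout=datetime.timedelta(seconds=5),
    )
    num_batches = 3 if rank == 0 else 2  # uneven number of batches per worker
    for i in range(num_batches):
        grad = torch.ones(1)  # stand-in for a gradient tensor
        try:
            dist.all_reduce(grad)  # blocks until every rank joins this call
            print(f"rank {rank}: all_reduce #{i + 1} done")
        except RuntimeError as err:
            # Rank 0's third call has no matching call on rank 1.
            print(f"rank {rank}: all_reduce #{i + 1} failed: {err}")
            return
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```

In a real training loop the mismatch shows up as a collective timeout (e.g., the NCCL timeout reported below) rather than a clean error.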

@raulchen raulchen merged commit df3dd96 into ray-project:master Jan 26, 2024
9 checks passed
@raulchen raulchen deleted the streaming-split-output-queue branch January 26, 2024 23:03
raulchen added a commit to raulchen/ray that referenced this pull request Jan 26, 2024
[data][train] Fix deadlocks caused by streaming_split (#42601)

Fix a deadlock issue for training jobs. The issue happens in the following situation:
* The output blocks of `streaming_split` are assigned to multiple splits (`output_split_idx`).
* When one split has finished reading all of its blocks, it won't stop its iteration until all the other splits have finished, because of [this](https://github.com/ray-project/ray/blob/fae8d2ff814377eb027d63d73a23d5c5bf3b02bd/python/ray/data/_internal/execution/streaming_executor_state.py#L288).
* This is usually fine. But when the unfinished splits are waiting for the finished splits (e.g., there is a per-batch gradient synchronization), a deadlock arises due to the circular dependency.

This PR lets a finished split end its iteration immediately, without waiting for the others.

---------

Signed-off-by: Hao Chen <chenh1024@gmail.com>
Contributor

@c21 c21 left a comment


late LGTM

architkulkarni pushed a commit that referenced this pull request Jan 29, 2024
Pick #42601 into 2.9.2; this PR fixes a potential deadlock issue for training jobs.

---------

Signed-off-by: Hao Chen <chenh1024@gmail.com>
@genesis-jamin

Hi @raulchen, thanks for pushing this fix -- it actually fixed an NCCL timeout error that we were seeing when doing multi-node distributed training. The behavior there was that sometimes, seemingly at random, at the start of a training epoch we would hit an NCCL timeout because all of the ranks except one were trying to all-reduce the gradients.

I'm also confused by the deadlock explanation, though. Have you or the team thought more about how exactly this would have created a deadlock with gradient synchronization? We iterate over our data using the iter_torch_batches function, and I thought this would call into streaming_split with equal=True (so each worker gets an equal number of rows).

If it's helpful, we only started seeing this issue when we scaled up the model size (probably because gradient synchronization took longer).
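For reference, a minimal sketch of the streaming_split(..., equal=True) pattern mentioned above, assuming the Ray Data API: the dataset, split count, and batch size are placeholders, and plain threads stand in for the two train workers, since the returned iterators must be consumed concurrently.

```python
import threading

import ray

ds = ray.data.range(1000)
# With equal=True, each split is supposed to yield the same number of rows.
it0, it1 = ds.streaming_split(2, equal=True)


def consume(rank: int, it) -> None:
    rows = 0
    for batch in it.iter_batches(batch_size=100):
        rows += len(batch["id"])  # ray.data.range() produces a single "id" column
    print(f"worker {rank} read {rows} rows")


# The two iterators must be consumed at the same time, one per worker.
threads = [
    threading.Thread(target=consume, args=(i, it)) for i, it in enumerate((it0, it1))
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```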
