[data][train] Fix deadlocks caused by streaming_split #42601
Conversation
The branch was force-pushed from c63f8b7 to d9aeb87.
Hmm, sorry, but I don't quite understand the deadlock situation in the PR description and the proposed fix. Doesn't `SplitCoordinator` explicitly require all the consumers to read at the same time? Is the deadlock situation in the PR description somehow different?

Update:

```python
for batch in it.iter_batches():
    all_reduce()
```

We suspect it's because …
late LGTM
Hi @raulchen, thanks for pushing this fix -- this actually fixed a NCCL timeout error that we were seeing when doing multi-node distributed training. The behavior there was that sometimes, randomly at the start of a train epoch, we would hit a NCCL timeout error because all of the ranks except one were trying to allreduce the gradients.

I'm also confused by the deadlock explanation, though. Have you / the team thought more about how exactly this would have created a deadlock with gradient synchronization? We iterate over our data using the …

If it's helpful, we only started seeing this issue when we scaled up the model size (probably because gradient synchronization took longer).
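To make the suspected circular wait concrete, here is a small, self-contained simulation using plain Python threads (no Ray or NCCL involved; all names are illustrative, not from this PR). One barrier stands in for the gradient all-reduce that every rank must join; a second barrier stands in for the pre-fix iterator behavior, where a split that had exhausted its blocks still waited for all other splits to finish:

```python
import threading

NUM_RANKS = 2
BATCHES_PER_RANK = [2, 3]  # uneven splits: rank 0 runs out of data first

# Stand-in for the NCCL all-reduce: every rank must reach it, or it times out.
all_reduce = threading.Barrier(NUM_RANKS, timeout=2)
# Stand-in for the pre-fix iterator behavior: a finished split waits for all others.
wait_for_all_splits = threading.Barrier(NUM_RANKS, timeout=5)

def worker(rank: int) -> None:
    for step in range(BATCHES_PER_RANK[rank]):
        try:
            all_reduce.wait()  # gradient sync: blocks until every rank arrives
        except threading.BrokenBarrierError:
            print(f"rank {rank}: all_reduce timed out at step {step} (NCCL-timeout analogue)")
            return
    try:
        # Circular wait: this rank waits for the other splits to finish iterating,
        # while the other ranks wait for this rank inside all_reduce.
        wait_for_all_splits.wait()
    except threading.BrokenBarrierError:
        print(f"rank {rank}: gave up waiting for the other splits to finish")

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Rank 0 finishes its two batches and then blocks waiting for the other split, while rank 1 blocks in its third all-reduce waiting for rank 0; neither can make progress until a timeout fires, which matches the shape of the NCCL timeout described above.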
Why are these changes needed?
Fix a deadlock issue for training jobs. The issue happens in the following situation:

- The output blocks of `streaming_split` are assigned to multiple splits (`output_split_idx`).
- When one split has finished reading all blocks, it won't stop the iteration until all the other splits have also finished, because of [this](https://github.com/ray-project/ray/blob/fae8d2ff814377eb027d63d73a23d5c5bf3b02bd/python/ray/data/_internal/execution/streaming_executor_state.py#L288).
- This is usually fine. But when the unfinished splits are waiting for the finished splits (e.g., there is a gradient synchronization), there will be a deadlock due to circular dependencies.

This PR lets the finished splits finish iteration immediately without waiting for the others.
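For reference, a minimal `streaming_split` consumption sketch is shown here (the dataset, batch size, and helper names are illustrative; the per-batch gradient all-reduce is omitted). Before this fix, the consumer whose split ran out of blocks first stayed blocked inside `iter_batches()` until the other split also finished, which is what closed the cycle when the other consumers were themselves blocked in a collective:

```python
import ray

ray.init()

# Two splits over a small dataset; with equal=False, one consumer can run
# out of blocks before the other.
ds = ray.data.range(100)
it0, it1 = ds.streaming_split(n=2, equal=False)

@ray.remote
def consume(it, rank):
    # In a real training loop, a gradient all-reduce would sit inside this loop.
    num_rows = 0
    for batch in it.iter_batches(batch_size=10):
        num_rows += len(batch["id"])  # ray.data.range() produces an "id" column
    return rank, num_rows

# Both splits must be consumed concurrently.
print(ray.get([consume.remote(it0, 0), consume.remote(it1, 1)]))
```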
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- For any new APIs (e.g., a new method in Tune), I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.