Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Smarter stealing with dependencies #7024

Merged
merged 30 commits into from
Sep 29, 2022

Conversation

hendrikmakait
Copy link
Member

@hendrikmakait hendrikmakait commented Sep 8, 2022

  • Tests added / passed
  • Passes pre-commit run --all-files

@github-actions
Copy link
Contributor

github-actions bot commented Sep 8, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±  0         15 suites  ±0   6h 5m 45s ⏱️ - 34m 8s
  3 146 tests +11    3 060 ✔️ +12    85 💤 ±0  1  - 1 
23 287 runs  +88  22 381 ✔️ +92  905 💤  - 3  1  - 1 

For more details on these failures, see this check.

Results for commit b5afd77. ± Comparison against base commit 8f36aa5.

♻️ This comment has been updated with latest results.

distributed/tests/test_steal.py Outdated Show resolved Hide resolved
distributed/tests/test_steal.py Outdated Show resolved Hide resolved
([[1, 1, 1, 1], []], [[1, 1, 1], [1]]), # but don't move too many
(
[[0, 0], [0, 0], [0, 0], []], # no one clearly saturated
pytest.param([[1], []], [[1], []], id="don't move unnecessarily"),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better readability/lookup in VS Code, I made the in-line comments parametrization IDs. There are no changes to the actual test params.

distributed/tests/test_steal.py Outdated Show resolved Hide resolved
distributed/tests/test_steal.py Outdated Show resolved Hide resolved
@hendrikmakait hendrikmakait force-pushed the balanced-stealing branch 4 times, most recently from 2817b9b to f1ed606 Compare September 16, 2022 05:18
@hendrikmakait hendrikmakait marked this pull request as ready for review September 16, 2022 11:38
@hendrikmakait hendrikmakait self-assigned this Sep 16, 2022
)


def _run_dependency_balance_test(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything below is code for enabling the test cases above

@hendrikmakait
Copy link
Member Author

hendrikmakait commented Sep 16, 2022

A/B test: https://github.com/coiled/coiled-runtime/actions/runs/3068174742

N=2, so be cautious with the results. Notably, test_vorticity looks much better, yet test_basic_sum much worse. The remainder of the changes might be noise.

AB Test results

@fjetter
Copy link
Member

fjetter commented Sep 16, 2022

@hendrikmakait you set the worker-saturation parameter as well https://github.com/coiled/coiled-runtime/blob/hendrik/AB_stealing_objective/AB_environments/AB_sample.dask.yaml which explains the huge impact

@hendrikmakait
Copy link
Member Author

@hendrikmakait you set the worker-saturation parameter as well https://github.com/coiled/coiled-runtime/blob/hendrik/AB_stealing_objective/AB_environments/AB_sample.dask.yaml which explains the huge impact

🤦

@fjetter
Copy link
Member

fjetter commented Sep 16, 2022

FYI I retriggered the computation already with updated info

distributed/stealing.py Outdated Show resolved Hide resolved
@hendrikmakait
Copy link
Member Author

A/B test: https://github.com/coiled/coiled-runtime/actions/runs/3135725432 (ignore AB_sample)
n: 7
baseline: 2022.9.1

With n==7, there appear to be no noticeable effects of the change.

Screenshot 2022-09-27 at 16 00 47

Screenshot 2022-09-27 at 16 00 59

@hendrikmakait
Copy link
Member Author

I could not reproduce the flake of test_balance_prefers_busier_with_dependency locally in 1000 runs.
A possible cause might be described in this comment:

# NOTE: There is a timing issue that workers may already start executing
# tasks before we call balance, i.e. the workers will reject the
# stealing request and we end up with a different end result.
# Particularly tests with many input tasks are more likely to fail since
# the test setup takes longer and allows the workers more time to
# schedule a task on the threadpool

@hendrikmakait
Copy link
Member Author

hendrikmakait commented Sep 29, 2022

I could not reproduce the flake of test_balance_prefers_busier_with_dependency locally in 1000 runs. A possible cause might be described in this comment:

# NOTE: There is a timing issue that workers may already start executing
# tasks before we call balance, i.e. the workers will reject the
# stealing request and we end up with a different end result.
# Particularly tests with many input tasks are more likely to fail since
# the test setup takes longer and allows the workers more time to
# schedule a task on the threadpool

The culprit is worker-saturation (see #7085). To reproduce this locally, make sure to enable root task withholding by setting worker-saturation as hinted at by Tests / test (ubuntu-latest, 3.9, queue, ci1). It's easy to miss that one.

@hendrikmakait
Copy link
Member Author

CI flakes:

@crusaderky crusaderky merged commit 4052350 into dask:main Sep 29, 2022
@hendrikmakait
Copy link
Member Author

It looks like test_balance flakes after merging with main (likely due to AMM being enabled). I'm investigating and filing a follow-up PR.

gjoseph92 pushed a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants