[dask] disable work stealing for training tasks #6794

Merged
merged 1 commit into from
Mar 29, 2021

Conversation

jameslamb
Contributor

I recently found an issue with lightgbm.dask's use of distributed.Client.submit(). LightGBM training relies on running exactly one training task per worker, but lightgbm.dask was not passing workers to client.submit() to guarantee this (microsoft/LightGBM#4132).

After discovering this, I came over here to check if XGBoost suffered from a similar issue, since this bug was in a piece of lightgbm.dask that has not been changed since dask-lightgbm, and dask-lightgbm was based on dask-xgboost 😂 .

It seems like xgboost.dask is already setting workers=[worker_addr] to prevent this situation. However, I think xgboost.dask would also benefit from setting allow_other_workers=False when submitting dispatched_train() calls. This prevents Dask work stealing from moving the training tasks to another worker. I think that Dask moving a dispatched_train() task to another worker during training would probably cause training to fail or produce incorrect results, although to be honest I'm not sure how to test that in XGBoost.
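For illustration, here is a minimal sketch of the submission pattern described above: one task per worker, each pinned with workers=[addr] and allow_other_workers=False. This is not the xgboost.dask code. fake_train and submit_one_task_per_worker are hypothetical names, and the in-process LocalCluster is only an assumption to keep the example small.

```python
from distributed import Client, LocalCluster, get_worker


def fake_train(rank):
    """Hypothetical stand-in for a per-worker training task."""
    # Report which worker actually executed this task.
    return rank, get_worker().address


def submit_one_task_per_worker(client, func):
    """Submit exactly one pinned task to every worker and gather the results."""
    worker_addrs = sorted(client.scheduler_info()["workers"])
    futures = [
        client.submit(
            func,
            rank,
            workers=[addr],             # restrict the task to this worker
            allow_other_workers=False,  # do not let it run anywhere else
            pure=False,                 # generate a fresh key instead of hashing inputs
        )
        for rank, addr in enumerate(worker_addrs)
    ]
    return worker_addrs, client.gather(futures)


if __name__ == "__main__":
    # In-process cluster, for illustration only.
    with LocalCluster(
        n_workers=2, threads_per_worker=1, processes=False, dashboard_address=None
    ) as cluster:
        with Client(cluster) as client:
            addrs, results = submit_one_task_per_worker(client, fake_train)
            for rank, ran_on in results:
                print(rank, ran_on == addrs[rank])
```

Each task reports the address of the worker it ran on, so comparing that against the address it was pinned to is a cheap way to check placement. It does not exercise work stealing under load.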

Notes for reviewers

I've checked the blame and the argument allow_other_workers has been in the signature of distributed.Client.submit() for at least three years, so adding it shouldn't introduce any significant compatibility issues with older versions of distributed.

https://github.com/dask/distributed/blame/1b32bd30201ef6ced5029180143d2c37b393b586/distributed/client.py#L1234-L1240
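As a complement to checking the blame, the signature can also be inspected against whichever version of distributed is installed. A minimal sketch; submit_parameter_default is a hypothetical helper, and the result reflects only the locally installed version, not older releases.

```python
import inspect

from distributed import Client


def submit_parameter_default(name):
    """Return (is_present, default) for a parameter of Client.submit()."""
    params = inspect.signature(Client.submit).parameters
    if name not in params:
        return False, None
    return True, params[name].default


if __name__ == "__main__":
    print(submit_parameter_default("allow_other_workers"))
```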

Thanks for your time and consideration!

Member

@trivialfis trivialfis left a comment

I think the default is already False. But setting it explicitly guards against future changes to the default and improves code readability.

Thanks for the PR!

@trivialfis trivialfis merged commit f01af43 into dmlc:master Mar 29, 2021
@jameslamb jameslamb deleted the fix/dask-submit branch June 1, 2021 04:14