[dask] disable work stealing for training tasks #6794
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
I recently found an issue with
lightgbm.dask
's uses ofdistributed.Client.submit()
. Essentially, LightGBM training relies on running exactly one training task per worker, and it was not settingworkers
onclient.submit()
to guarantee this (microsoft/LightGBM#4132).After discovering this, I came over here to check if XGBoost suffered from a similar issue, since this bug was in a piece of
lightgbm.dask
that has not been changed sincedask-lightgbm
, anddask-lightgbm
was based ondask-xgboost
😂 .It seems like
xgboost.dask
is already settingworkers=[worker_addr]
to prevent this situation. However, I think thatxgboost.dask
would also benefit from settingallow_other_workers=False
when submittingdispatched_train()
calls. This can be used to disable the training tasks from being moved to another worker by Dask work stealing. I think that Dask moving adispatched_train()
task to another worker during training would probably cause training to fail or produce incorrect results, although to be honest I'm not sure how to test that in XGBoost.Notes for reviewers
I've checked the blame and the argument
allow_other_workers
has been in the signature ofdistributed.Client.submit()
for at least three years, so adding it shouldn't introduce any significant compatibility issues with older versions ofdistributed
.https://github.com/dask/distributed/blame/1b32bd30201ef6ced5029180143d2c37b393b586/distributed/client.py#L1234-L1240
Thanks for your time and consideration!