[dask] make Dask training resilient to worker restarts during network setup #3775
Closing this, as we use #2302 to track feature requests. Leave a comment below if you'd like to contribute this feature, and we'll be happy to re-open it!
jameslamb referenced this issue on Jan 17, 2021
…twork (fixes #3753) (#3766)
* starting work
* fixed port-binding issue on localhost
* minor cleanup
* updates
* getting closer
* definitely working for LocalCluster
* it works, it works
* docs
* add tests
* removing testing-only files
* linting
* Apply suggestions from code review
* remove duplicated code
* remove unnecessary listen()

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Summary
LightGBM distributed training requires that all workers participating in training know about each other (https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html#preparation). This information is given via the parameter `machines`. LightGBM's Dask module generates a list of workers by checking where pieces of the input data are stored:
LightGBM/python-package/lightgbm/dask.py, line 192 at 706f2af
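For context, here is a rough sketch (not LightGBM's actual implementation) of how worker addresses can be discovered from a persisted Dask collection and joined into a `machines`-style string. The `Client()` / LocalCluster setup and the base port `12400` are only illustrative assumptions.

```python
import dask.array as da
from dask.distributed import Client, futures_of, wait

# Assumes a LocalCluster started just for demonstration.
client = Client()

# Persist some example data so that each chunk lives on a specific worker.
data = da.random.random((1_000, 10), chunks=(100, 10)).persist()
wait(data)

# Map each chunk's key to the address(es) of the worker(s) currently holding it.
key_to_workers = client.who_has(futures_of(data))

# Deduplicate worker addresses, e.g. "tcp://127.0.0.1:46123".
worker_addresses = sorted({w for ws in key_to_workers.values() for w in ws})

# Build a "machines"-style string: one "host:port" entry per worker,
# using consecutive ports from an illustrative base port.
base_port = 12400
machines = ",".join(
    f"{addr.split('://')[-1].split(':')[0]}:{base_port + i}"
    for i, addr in enumerate(worker_addresses)
)
print(machines)  # e.g. "127.0.0.1:12400,127.0.0.1:12401"
```

The important point for this issue is that the worker list is a snapshot of where the data lives at one moment in time.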
The Dask module then uses the addresses in that list to set up the `machines` parameter. This step can take a few seconds to a few minutes (#3766 (comment)), and if one of the workers is restarted between when that list is generated and when training starts, you can get a hard-to-understand KeyError.

Improving this experience would mean one or both of these:
* changes to `python-package/lightgbm/dask.py` to reduce the risk of this, maybe by adding error handling for the case where workers are restarted (a rough sketch of what that could look like follows this list)
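For illustration, here is a hypothetical sketch of what such error handling could look like. `check_workers_still_present` is an invented helper name and is not part of `lightgbm.dask`.

```python
from dask.distributed import Client


def check_workers_still_present(client: Client, worker_addresses: list) -> None:
    """Raise a descriptive error if any expected worker has disappeared.

    `worker_addresses` is the list of addresses collected while inspecting
    where the input data was stored.
    """
    current_workers = set(client.scheduler_info()["workers"].keys())
    missing = [addr for addr in worker_addresses if addr not in current_workers]
    if missing:
        raise RuntimeError(
            "These Dask workers held training data but are no longer connected "
            f"to the scheduler (they may have restarted): {missing}. "
            "Please re-run training so the worker list can be rebuilt."
        )
```

A check like this could run right before the `machines` parameter is built, turning the confusing KeyError into an actionable message.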
Motivation
Adding this feature would make the Dask interface a bit more stable, reducing the need for users to retry training or try to debug low-level Dask details.
Notes for Reviewers
I added "during network setup" to the title here, because I think that making it possible for LightGBM training to continue if a worker disappears during training is a much bigger task and not limited to Dask. @guolinke, I'd love your thoughts on this when you have time.
References
This issue was originally posted as dask/dask-lightgbm#24, and more background is available there. Moving it here as part of dask/community#104.