[dask] make Dask training resilient to worker restarts during network setup #3775
Closing this, as we use #2302 to track feature requests. Leave a comment below if you'd like to contribute this feature, and we'll be happy to re-open it!
jameslamb referenced this issue on Jan 17, 2021
…twork (fixes #3753) (#3766)
* starting work
* fixed port-binding issue on localhost
* minor cleanup
* updates
* getting closer
* definitely working for LocalCluster
* it works, it works
* docs
* add tests
* removing testing-only files
* linting
* Apply suggestions from code review
* remove duplicated code
* remove unnecessary listen()

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Summary
LightGBM distributed training requires that all workers participating in training know about each other (https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html#preparation). This information is given via the parameter `machines`. LightGBM's Dask module generates a list of workers by checking where pieces of the input data are stored:
LightGBM/python-package/lightgbm/dask.py, line 192 at 706f2af
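For context, here is a rough sketch (not LightGBM's actual implementation) of how worker addresses can be discovered from a persisted Dask collection and joined into a `machines`-style string. The `Client()` / LocalCluster setup and the base port `12400` are only illustrative assumptions.

```python
import dask.array as da
from dask.distributed import Client, futures_of, wait

# Assumes a LocalCluster started just for demonstration.
client = Client()

# Persist some example data so that each chunk lives on a specific worker.
data = da.random.random((1_000, 10), chunks=(100, 10)).persist()
wait(data)

# Map each chunk's key to the address(es) of the worker(s) currently holding it.
key_to_workers = client.who_has(futures_of(data))

# Deduplicate worker addresses, e.g. "tcp://127.0.0.1:46123".
worker_addresses = sorted({w for ws in key_to_workers.values() for w in ws})

# Build a "machines"-style string: one "host:port" entry per worker,
# using consecutive ports from an illustrative base port.
base_port = 12400
machines = ",".join(
    f"{addr.split('://')[-1].split(':')[0]}:{base_port + i}"
    for i, addr in enumerate(worker_addresses)
)
print(machines)  # e.g. "127.0.0.1:12400,127.0.0.1:12401"
```

The important point for this issue is that the worker list is a snapshot of where the data lives at one moment in time.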
The Dask module then uses the addresses in that list to set up the `machines` parameter. This step can take a few seconds to a few minutes (#3766 (comment)), and if one of the workers is restarted between when that list is generated and when training starts, you can get a hard-to-understand KeyError.

Improving this experience would mean one or both of these:
* changes to `python-package/lightgbm/dask.py` to reduce the risk of this, maybe by adding error handling for the case where workers are restarted (a rough sketch of what that could look like follows this list)
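For illustration, here is a hypothetical sketch of what such error handling could look like. `check_workers_still_present` is an invented helper name and is not part of `lightgbm.dask`.

```python
from dask.distributed import Client


def check_workers_still_present(client: Client, worker_addresses: list) -> None:
    """Raise a descriptive error if any expected worker has disappeared.

    `worker_addresses` is the list of addresses collected while inspecting
    where the input data was stored.
    """
    current_workers = set(client.scheduler_info()["workers"].keys())
    missing = [addr for addr in worker_addresses if addr not in current_workers]
    if missing:
        raise RuntimeError(
            "These Dask workers held training data but are no longer connected "
            f"to the scheduler (they may have restarted): {missing}. "
            "Please re-run training so the worker list can be rebuilt."
        )
```

A check like this could run right before the `machines` parameter is built, turning the confusing KeyError into an actionable message.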
Motivation
Adding this feature would make the Dask interface a bit more stable, reducing the need for users to retry training or try to debug low-level Dask details.
Notes for Reviewers
I added "during network setup" to the title here, because I think that making it possible for LightGBM training to continue if a worker disappears during training is a much bigger task and not limited to Dask. @guolinke, I'd love your thoughts on this when you have time.
References
This issue was originally posted as dask/dask-lightgbm#24, and more background is available there. Moving it here as part of dask/community#104.