This repository has been archived by the owner on Jul 9, 2021. It is now read-only.

KeyError in build_network_params #24

Closed
pseudotensor opened this issue Nov 24, 2020 · 10 comments

Comments

@pseudotensor

2020-11-21 09:05:51,923 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |   File "/data/jon/h2oai.fullcondatest3/h2oaicore/models.py", line 1739, in dask_fit
2020-11-21 09:05:51,924 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |     func(X, y, **kwargs_dask)
2020-11-21 09:05:51,925 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/dask_lightgbm/core.py", line 187, in fit
2020-11-21 09:05:51,926 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |     model = train(client, X, y, params, model_factory, sample_weight, **kwargs)
2020-11-21 09:05:51,926 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/dask_lightgbm/core.py", line 131, in train
2020-11-21 09:05:51,927 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |     results = client.gather(futures_classifiers)
2020-11-21 09:05:51,927 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py", line 1974, in gather
2020-11-21 09:05:51,928 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |     asynchronous=asynchronous,
2020-11-21 09:05:51,928 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py", line 824, in sync
2020-11-21 09:05:51,929 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |     self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
2020-11-21 09:05:51,929 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/utils.py", line 339, in sync
2020-11-21 09:05:51,930 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |     raise exc.with_traceback(tb)
2020-11-21 09:05:51,930 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/utils.py", line 323, in f
2020-11-21 09:05:51,931 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |     result[0] = yield future
2020-11-21 09:05:51,931 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/tornado/gen.py", line 735, in run
2020-11-21 09:05:51,932 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |     value = future.result()
2020-11-21 09:05:51,932 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py", line 1833, in _gather
2020-11-21 09:05:51,933 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |     raise exception.with_traceback(traceback)
2020-11-21 09:05:51,933 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |   File "/home/jenkins/minicondadai/lib/python3.6/site-packages/dask_lightgbm/core.py", line 60, in _train_part
2020-11-21 09:05:51,934 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   |   File "/home/jenkins/minicondadai/lib/python3.6/site-packages/dask_lightgbm/core.py", line 36, in build_network_params
2020-11-21 09:05:51,934 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   | KeyError: 'tcp://172.16.2.192:43141'
2020-11-21 09:05:51,935 C:  3% D:42.6GB  M:76.9GB  NODE:LOCAL2      20010  DATA   | ].

@pseudotensor
Author

This doesn't always happen, so it is probably a race condition.

@pseudotensor
Author

Even for the same experiment, the port in the KeyError keeps changing:

2020-11-21 09:14:54,078 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/dask_lightgbm/core.py", line 187, in fit
2020-11-21 09:14:54,078 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   |     model = train(client, X, y, params, model_factory, sample_weight, **kwargs)
2020-11-21 09:14:54,079 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/dask_lightgbm/core.py", line 131, in train
2020-11-21 09:14:54,079 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   |     results = client.gather(futures_classifiers)
2020-11-21 09:14:54,080 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py", line 1974, in gather
2020-11-21 09:14:54,080 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   |     asynchronous=asynchronous,
2020-11-21 09:14:54,081 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py", line 824, in sync
2020-11-21 09:14:54,081 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   |     self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
2020-11-21 09:14:54,082 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/utils.py", line 339, in sync
2020-11-21 09:14:54,082 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   |     raise exc.with_traceback(tb)
2020-11-21 09:14:54,083 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/utils.py", line 323, in f
2020-11-21 09:14:54,083 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   |     result[0] = yield future
2020-11-21 09:14:54,084 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/tornado/gen.py", line 735, in run
2020-11-21 09:14:54,084 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   |     value = future.result()
2020-11-21 09:14:54,085 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py", line 1833, in _gather
2020-11-21 09:14:54,085 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   |     raise exception.with_traceback(traceback)
2020-11-21 09:14:54,086 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jenkins/minicondadai/lib/python3.6/site-packages/dask_lightgbm/core.py", line 60, in _train_part
2020-11-21 09:14:54,086 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jenkins/minicondadai/lib/python3.6/site-packages/dask_lightgbm/core.py", line 36, in build_network_params
2020-11-21 09:14:54,087 C:  3% D:42.4GB  M:80.1GB  NODE:LOCAL2      20010  DATA   | KeyError: 'tcp://172.16.2.192:37135'

@pseudotensor
Author

Maybe related:

2020-11-21 09:19:24,512 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |   File "/data/jon/h2oai.fullcondatest3/h2oaicore/models.py", line 1739, in dask_fit
2020-11-21 09:19:24,512 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |     func(X, y, **kwargs_dask)
2020-11-21 09:19:24,513 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/dask_lightgbm/core.py", line 187, in fit
2020-11-21 09:19:24,514 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |     model = train(client, X, y, params, model_factory, sample_weight, **kwargs)
2020-11-21 09:19:24,514 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/dask_lightgbm/core.py", line 131, in train
2020-11-21 09:19:24,515 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |     results = client.gather(futures_classifiers)
2020-11-21 09:19:24,515 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py", line 1974, in gather
2020-11-21 09:19:24,516 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |     asynchronous=asynchronous,
2020-11-21 09:19:24,516 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py", line 824, in sync
2020-11-21 09:19:24,517 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |     self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
2020-11-21 09:19:24,517 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/utils.py", line 339, in sync
2020-11-21 09:19:24,518 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |     raise exc.with_traceback(tb)
2020-11-21 09:19:24,518 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/utils.py", line 323, in f
2020-11-21 09:19:24,519 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |     result[0] = yield future
2020-11-21 09:19:24,519 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/tornado/gen.py", line 735, in run
2020-11-21 09:19:24,520 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |     value = future.result()
2020-11-21 09:19:24,520 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py", line 1833, in _gather
2020-11-21 09:19:24,521 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |     raise exception.with_traceback(traceback)
2020-11-21 09:19:24,521 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jenkins/minicondadai/lib/python3.6/site-packages/dask_lightgbm/core.py", line 71, in _train_part
2020-11-21 09:19:24,522 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jenkins/minicondadai/lib/python3.6/site-packages/lightgbm/sklearn.py", line 805, in fit
2020-11-21 09:19:24,522 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jenkins/minicondadai/lib/python3.6/site-packages/lightgbm/sklearn.py", line 600, in fit
2020-11-21 09:19:24,523 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jenkins/minicondadai/lib/python3.6/site-packages/lightgbm/engine.py", line 228, in train
2020-11-21 09:19:24,523 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jenkins/minicondadai/lib/python3.6/site-packages/lightgbm/basic.py", line 1709, in __init__
2020-11-21 09:19:24,523 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jenkins/minicondadai/lib/python3.6/site-packages/lightgbm/basic.py", line 1840, in set_network
2020-11-21 09:19:24,524 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jenkins/minicondadai/lib/python3.6/site-packages/lightgbm/basic.py", line 45, in _safe_call
2020-11-21 09:19:24,524 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   | lightgbm.basic.LightGBMError: Binding port 12401 failed
2020-11-21 09:19:24,525 C:  7% D:41.8GB  M:75.1GB  NODE:LOCAL2      20010  DATA   | ].
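
One way to narrow down a "Binding port ... failed" error is to check whether the port is already taken on the machine before training starts. The following is a minimal sketch using only the Python standard library; `is_port_free` is a hypothetical helper, not part of LightGBM or dask_lightgbm, and the check is inherently racy because another process can grab the port between the probe and the real bind.

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if `port` can be bound on `host` right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use".
            return False
        return True


# 12400 and 12401 are the ports that appear in the logs in this thread.
for port in (12400, 12401):
    print(port, "free" if is_port_free(port) else "in use")
```

If a port shows as in use, a leftover LightGBM process from an earlier failed run on the same host is one possible cause worth ruling out.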

@pseudotensor
Author

2020-11-21 09:21:42,478 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |   File "/data/jon/h2oai.fullcondatest3/h2oaicore/models.py", line 1739, in dask_fit
2020-11-21 09:21:42,479 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |     func(X, y, **kwargs_dask)
2020-11-21 09:21:42,479 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/dask_lightgbm/core.py", line 187, in fit
2020-11-21 09:21:42,480 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |     model = train(client, X, y, params, model_factory, sample_weight, **kwargs)
2020-11-21 09:21:42,480 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/dask_lightgbm/core.py", line 131, in train
2020-11-21 09:21:42,481 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |     results = client.gather(futures_classifiers)
2020-11-21 09:21:42,481 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py", line 1974, in gather
2020-11-21 09:21:42,482 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |     asynchronous=asynchronous,
2020-11-21 09:21:42,482 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py", line 824, in sync
2020-11-21 09:21:42,483 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |     self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
2020-11-21 09:21:42,483 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/utils.py", line 339, in sync
2020-11-21 09:21:42,484 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |     raise exc.with_traceback(tb)
2020-11-21 09:21:42,484 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/utils.py", line 323, in f
2020-11-21 09:21:42,485 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |     result[0] = yield future
2020-11-21 09:21:42,485 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/tornado/gen.py", line 735, in run
2020-11-21 09:21:42,486 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |     value = future.result()
2020-11-21 09:21:42,486 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py", line 1833, in _gather
2020-11-21 09:21:42,487 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |     raise exception.with_traceback(traceback)
2020-11-21 09:21:42,487 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/dask_lightgbm/core.py", line 60, in _train_part
2020-11-21 09:21:42,488 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |     network_params = build_network_params(worker_addresses, get_worker().address, local_listen_port, time_out)
2020-11-21 09:21:42,488 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |   File "/home/jon/minicondadai/lib/python3.6/site-packages/dask_lightgbm/core.py", line 36, in build_network_params
2020-11-21 09:21:42,488 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   |     'local_listen_port': addr_port_map[local_worker_ip],
2020-11-21 09:21:42,489 C:  6% D:41.8GB  M:78.1GB  NODE:LOCAL2      20010  DATA   | KeyError: 'tcp://172.16.2.210:44155'
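
The traceback shows the failing lookup: the current worker's address is used as a key into a map built from a list of worker addresses. A minimal sketch of that pattern follows. It is reconstructed for illustration from the two source lines visible in the traceback and is not the actual dask_lightgbm code; in particular, the consecutive port assignment and the `12400` base port are assumptions, and the `40001`/`40002` worker ports are made up.

```python
from urllib.parse import urlparse


def build_network_params_sketch(worker_addresses, local_worker_address,
                                local_listen_port, time_out):
    """Illustrative reconstruction; not the dask_lightgbm implementation."""
    # Assumption: one LightGBM port per worker address, assigned in list order.
    addr_port_map = {
        addr: local_listen_port + i for i, addr in enumerate(worker_addresses)
    }
    return {
        "machines": ",".join(
            f"{urlparse(addr).hostname}:{port}"
            for addr, port in addr_port_map.items()
        ),
        # Raises KeyError if this worker's address is not in the list,
        # e.g. because the list was collected before the worker restarted.
        "local_listen_port": addr_port_map[local_worker_address],
        "time_out": time_out,
        "num_machines": len(addr_port_map),
    }


known = ["tcp://172.16.2.210:40001", "tcp://172.16.2.192:40002"]

# Address present in the list: the lookup succeeds.
params = build_network_params_sketch(known, known[1], 12400, 120)

# Address missing from the list (44155 is the port from the log above):
# the same kind of KeyError as in the traceback.
try:
    build_network_params_sketch(known, "tcp://172.16.2.210:44155", 12400, 120)
    raised = False
except KeyError:
    raised = True
```

Under these assumptions, any worker whose address differs from the one recorded in the list, even only by port, fails with a KeyError on its own address.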

@pseudotensor
Author

I also see this error: ('LightGBM Failure: Socket send error, code: 32',)
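
If the code in that message is a POSIX errno value (an assumption; the message itself does not say), it can be decoded with the standard library. On Linux, 32 is EPIPE ("Broken pipe"), which usually means the other end of the socket was closed while data was being sent.

```python
import errno
import os

code = 32  # value from the "Socket send error, code: 32" message
print(errno.errorcode.get(code), "-", os.strerror(code))
```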

@pseudotensor
Author

[LightGBM] [Info] Binding port 12400 succeeded
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 200 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 260 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 338 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 439 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 570 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 741 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 963 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 1251 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 1626 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 2113 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 2746 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 3569 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 4639 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 6030 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 7838 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 10189 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 13245 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 17218 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 22383 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 29097 milliseconds
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 12400...
[LightGBM] [Info] Binding port 12400 succeeded

I also get horrible delays like this while LightGBM keeps retrying the connection.
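
To put a number on the delay, the wait times in the log can simply be added up. The snippet below uses only the values printed above; the roughly 1.3x growth per retry is inferred from those values, not taken from the LightGBM source.

```python
# Wait times (ms) copied from the "Connecting to rank 1 failed" lines above.
waits_ms = [200, 260, 338, 439, 570, 741, 963, 1251, 1626, 2113,
            2746, 3569, 4639, 6030, 7838, 10189, 13245, 17218, 22383, 29097]

total_ms = sum(waits_ms)
# Ratio between each wait and the previous one.
ratios = [b / a for a, b in zip(waits_ms, waits_ms[1:])]

print(f"retries: {len(waits_ms)}")
print(f"total wait: {total_ms / 1000:.1f} s")
print(f"growth per retry: {min(ratios):.3f} to {max(ratios):.3f}")
```

The 20 retries add up to 125,455 ms, so this worker spent a little over two minutes waiting before the port was bound again.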

@ffineis

ffineis commented Nov 29, 2020

I'd try boosting worker memory. I was getting a similar KeyError during make system-test. My rough understanding of what was going on for me is that the dask nanny was restarting workers that were consuming too much memory and spinning up new ones with different listen ports, for example:

scheduler_1  | distributed.core - INFO - Removing comms to tcp://172.18.0.3:34695
worker_1     | distributed.nanny - WARNING - Restarting worker
worker_1     | distributed.worker - INFO -       Start worker at:     tcp://172.18.0.3:33639
worker_1     | distributed.worker - INFO -          Listening to:     tcp://172.18.0.3:33639

So the worker addresses originally assigned by dask no longer match get_worker().address (in dask_lightgbm/core.py) once a worker has been restarted. Again, this is just my understanding; I could be wrong.

Boosting worker memory seemed to clear up the KeyError, e.g. dask-worker --memory-limit 3e9.
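
The restart hypothesis above can be checked by comparing the set of worker addresses before and after a training attempt. The following is a small sketch; `find_stale_workers` is a hypothetical helper, and the two address lists are taken from the log lines above. With a live `dask.distributed` client, the lists could come from the keys of `client.scheduler_info()["workers"]`, which is not exercised here.

```python
def find_stale_workers(addresses_at_submit, addresses_now):
    """Return (gone, new): addresses that disappeared and addresses that appeared."""
    before, now = set(addresses_at_submit), set(addresses_now)
    return sorted(before - now), sorted(now - before)


# Addresses taken from the scheduler/worker log lines above.
at_submit = ["tcp://172.18.0.3:34695"]
after_restart = ["tcp://172.18.0.3:33639"]

gone, new = find_stale_workers(at_submit, after_restart)
if gone or new:
    print(f"worker set changed: gone={gone}, new={new}")
```

A non-empty `gone` list means at least one worker was replaced, which is the situation in which a lookup keyed on the old address list would fail.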

@pseudotensor
Author

Thanks, although I'm not setting any memory limit, so the default "auto" applies, which as far as I understand means no limit.

@pseudotensor
Author

I still hit this:

2021-01-15 07:41:33,624 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |     model = train(client, X, y, params, model_factory, sample_weight, **kwargs)
2021-01-15 07:41:33,624 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |   File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/dask_lightgbm/core.py", line 168, in train
2021-01-15 07:41:33,625 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |     results = client.gather(futures_classifiers)
2021-01-15 07:41:33,625 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |   File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/distributed/client.py", line 1982, in gather
2021-01-15 07:41:33,626 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |     asynchronous=asynchronous,
2021-01-15 07:41:33,626 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |   File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/distributed/client.py", line 832, in sync
2021-01-15 07:41:33,627 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |     self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
2021-01-15 07:41:33,627 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |   File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/distributed/utils.py", line 339, in sync
2021-01-15 07:41:33,627 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |     raise exc.with_traceback(tb)
2021-01-15 07:41:33,628 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |   File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/distributed/utils.py", line 323, in f
2021-01-15 07:41:33,628 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |     result[0] = yield future
2021-01-15 07:41:33,629 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |   File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/tornado/gen.py", line 735, in run
2021-01-15 07:41:33,629 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |     value = future.result()
2021-01-15 07:41:33,629 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |   File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/distributed/client.py", line 1841, in _gather
2021-01-15 07:41:33,630 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |     raise exception.with_traceback(traceback)
2021-01-15 07:41:33,630 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |   File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/dask_lightgbm/core.py", line 61, in _train_part
2021-01-15 07:41:33,631 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |     network_params = build_network_params(worker_addresses, get_worker().address, local_listen_port, time_out)
2021-01-15 07:41:33,631 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |   File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/dask_lightgbm/core.py", line 37, in build_network_params
2021-01-15 07:41:33,631 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   |     'local_listen_port': addr_port_map[local_worker_ip],
2021-01-15 07:41:33,632 C: 12% D:802.9GB M:204.0GB NODE:LOCAL2      7243   DATA   | KeyError: 'tcp://10.10.0.20:44284'

pseudotensor referenced this issue in microsoft/LightGBM Jan 16, 2021
…twork (fixes #3753) (#3766)

* starting work

* fixed port-binding issue on localhost

* minor cleanup

* updates

* getting closer

* definitely working for LocalCluster

* it works, it works

* docs

* add tests

* removing testing-only files

* linting

* Apply suggestions from code review

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* remove duplicated code

* remove unnecessary listen()

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@SfinxCZ
Collaborator

SfinxCZ commented Jan 16, 2021

Closing this issue, since this repo is no longer maintained and the code itself is now part of the LightGBM library (https://github.com/microsoft/LightGBM). If this issue is still relevant, please recreate it in the LightGBM repository.
