[dask] make random port search more resilient to random collisions (fixes #4057) #4133
Conversation
Linking #4132 (comment), the finding that inspired this pull request.
python-package/lightgbm/dask.py
@@ -371,6 +383,18 @@ def _train(
    _find_random_open_port,
    workers=list(worker_addresses)
)
# handle the case where _find_random_open_port() produces duplicates
retries_left = 10
while _worker_map_has_duplicates(worker_address_to_port) and retries_left > 0:
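For context, a minimal sketch of what a duplicate check like this could look like, assuming `worker_address_to_port` maps a worker address such as `'tcp://127.0.0.1:36385'` to the port chosen for it (the real helper lives in `python-package/lightgbm/dask.py` and may differ in its details):

```python
# Hedged sketch -- not necessarily the exact LightGBM implementation.
from urllib.parse import urlparse

def _worker_map_has_duplicates(worker_map):
    """Return True if two workers on the same host were assigned the same port."""
    seen = set()
    for worker_address, port in worker_map.items():
        host_port = (urlparse(worker_address).hostname, port)
        if host_port in seen:
            return True
        seen.add(host_port)
    return False
```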
I'm afraid that multiple re-runs of the same function still have a high probability of producing the same values, especially on the same physical machine.
If a is omitted or None, the current system time is used. If randomness sources are provided by the operating system, they are used instead of the system time (see the os.urandom() function for details on availability).
https://docs.python.org/3/library/random.html#random.seed
I believe a more reliable way to handle the case of duplicate ports would be to resolve it manually, by simply incrementing ports until the conflicts are resolved.
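A rough sketch of that incrementing idea, purely for illustration (the function name and details here are assumptions, not the actual patch):

```python
# Illustrative sketch of "increment until conflicts are resolved" -- not LightGBM code.
from urllib.parse import urlparse

def resolve_duplicates_by_incrementing(worker_map):
    """Bump duplicated ports upward until every (host, port) pair in the map is unique.

    Note: this only removes duplicates within the map itself; an incremented port
    could still be occupied by some other process on that host.
    """
    seen = set()
    resolved = {}
    for worker_address, port in sorted(worker_map.items()):
        host = urlparse(worker_address).hostname
        while (host, port) in seen:
            port += 1
        seen.add((host, port))
        resolved[worker_address] = port
    return resolved
```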
ok, I'll change this to something like that. I was worried about making this a lot more complicated, but maybe it's unavoidable.
I'm sorry for this, I'll try to come up with a solution as well. FWIW the "randomness" isn't related to Python's `random.seed`, since we're asking the OS for an open port and it decides which one to give us. I believe the collisions happen when one worker has completed the function and freed its port just as another worker is asking for one, and the OS returns the same port (kinda troll). I'll see if we can keep the port held for a bit, or something like that.
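For reference, "asking the OS for an open port" here means binding a socket to port 0 and letting the operating system pick a free one, roughly like the sketch below (not necessarily the exact code in `lightgbm.dask`):

```python
import socket

def _find_random_open_port() -> int:
    """Ask the operating system for a currently-free port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Binding to port 0 tells the OS to choose any available port.
        s.bind(('', 0))
        port = s.getsockname()[1]
    # The socket is closed when the 'with' block exits, which frees the port again --
    # that window is exactly where another worker can be handed the same port.
    return port
```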
Alright, I tried a different approach in 05303c8. I think this will be more reliable. Instead of re-running `_find_random_open_port()` all at once for all workers again, the process will be like:

- run `_find_random_open_port()` for every worker
- if any duplicate IP-port pairs were found, run `_find_random_open_port()` again, but only for those workers whose ports need to be changed to eliminate the duplicates
I think that pattern (only re-running for the workers that need to be changed to resolve duplicates) should give us confidence that duplicates will be resolved, because the system time will change each time that function is run.
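In code, that pattern might look roughly like the following sketch (the real helper appears later in this diff as `_possibly_fix_worker_map_duplicates`; the name, error handling, and details below are simplified assumptions, and it reuses the `_find_random_open_port` sketched above):

```python
# Simplified sketch: re-run the port search only for workers whose (host, port)
# pair collides with one that has already been accepted.
from urllib.parse import urlparse

def _fix_worker_map_duplicates(client, worker_map, max_tries=10):
    for _ in range(max_tries):
        seen = set()
        workers_to_retry = []
        for worker_address, port in worker_map.items():
            host_port = (urlparse(worker_address).hostname, port)
            if host_port in seen:
                workers_to_retry.append(worker_address)
            else:
                seen.add(host_port)
        if not workers_to_retry:
            return worker_map
        # Ask only the colliding workers for a new random open port.
        new_ports = client.run(_find_random_open_port, workers=workers_to_retry)
        worker_map.update(new_ports)
    # The real code's behavior when retries are exhausted may differ.
    raise RuntimeError("could not find unique ports for all workers")
```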
I think this approach, which still relies on `_find_random_open_port()`, is preferable to just incrementing and checking whether the new port is open, because it should (on average) find a new open port more quickly than the incrementing approach. Consider the case where, for example, LightGBM tries to put multiple workers in a `LocalCluster` on port 8887 (1 less than the default port for Jupyter). Jupyter uses exactly that "increment by one until you find an open port" approach, so if someone has multiple Jupyter sessions running it's possible that ports 8888, 8889, 8890, and 8891 are all occupied (for example), which would mean LightGBM would need 5 attempts to find a new open port (if 8892 is open).
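To make that comparison concrete, here is a toy illustration (not LightGBM code; the occupied ports are the hypothetical Jupyter sessions from the example above):

```python
# Toy comparison of "increment by one" vs. "pick a random port" when a
# contiguous block of ports is already occupied.
import random

occupied = {8888, 8889, 8890, 8891}  # e.g. several running Jupyter sessions

def attempts_incrementing(start=8888):
    port, attempts = start, 1
    while port in occupied:
        port += 1
        attempts += 1
    return attempts  # 5 attempts: 8888 -> 8889 -> 8890 -> 8891 -> 8892

def attempts_random(low=1024, high=65535, seed=42):
    rng = random.Random(seed)
    attempts = 1
    while rng.randint(low, high) in occupied:
        attempts += 1
    return attempts  # almost always 1, since only 4 of ~64k candidate ports are taken

print(attempts_incrementing(), attempts_random())
```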
I think the existence of systems like this is why Dask also searches randomly (instead of incrementing) if the default port it prefers for its scheduler is occupied when you run `dask-scheduler`. You can see that in the logs for LightGBM's Dask tests that use `distributed.LocalCluster`, for example from this build:
/opt/conda/envs/test-env/lib/python3.9/site-packages/distributed/node.py:151: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 37211 instead
warnings.warn(
/opt/conda/envs/test-env/lib/python3.9/site-packages/distributed/node.py:151: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 35045 instead
warnings.warn(
/opt/conda/envs/test-env/lib/python3.9/site-packages/distributed/node.py:151: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 35051 instead
warnings.warn(
/opt/conda/envs/test-env/lib/python3.9/site-packages/distributed/node.py:151: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 38031 instead
/opt/conda/envs/test-env/lib/python3.9/site-packages/distributed/node.py:151: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 41941 instead
@jmoralez I don't think you need to do any new work. Your review on the change I just pushed is welcome if you see anything wrong with it, but I'm fairly confident it will get the job done.
assert retry_msg not in capsys.readouterr().out

# should handle worker maps with duplicates
map_without_duplicates = {
map_with_duplicates
fixed in 75cfbbb
}
patched_map = lgb.dask._possibly_fix_worker_map_duplicates(
    client=client,
    worker_map=map_without_duplicates
map_with_duplicates
fixed in 75cfbbb
Thank you!
@jmoralez Are you OK to merge this?

@StrikerRUS Yes

This pull request has been automatically locked since there has not been any recent activity after it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Changes in this PR

`lightgbm.dask` will now check for the situation where it found duplicate IP-port pairs. If such duplicates are found, `lightgbm.dask` will re-run the random search up to 10 times.

Background

#4057 documents a class of error that distributed training with `lightgbm.dask` can fail with. If you do not provide `machines` or `local_listen_port` at training time, LightGBM will randomly search for ports to use on each worker.

To limit the overhead introduced by this random search, a function `_find_random_open_port()` is run once, at the same time, on every worker, using `distributed.Client.run()` (added in #3823). If you have multiple Dask worker processes on the same physical host (i.e. if you're using `distributed.LocalCluster` or `nprocs > 1`), there is a small but non-zero probability that this setup will choose the same random port for multiple workers on that host. When that happens, LightGBM training will fail with an error like the one shown in #4057.

Based on #4112 (comment), I think my original assumptions about how likely such conflicts were (#4057 (comment)) were not correct.
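Putting the pieces together, the retry behavior described above amounts to roughly the following flow (a simplified sketch using the helpers shown in the diff and sketches earlier in this thread; the actual code in `python-package/lightgbm/dask.py` differs in details, and later commits in this PR refine it to re-run the search only for the workers whose ports collide):

```python
# Simplified sketch of the port-assignment flow described in this PR -- not the exact implementation.
from dask.distributed import Client

def _assign_open_ports(client: Client, worker_addresses, max_retries=10):
    # Ask every worker for a random open port, all at once.
    worker_address_to_port = client.run(
        _find_random_open_port,
        workers=list(worker_addresses)
    )
    # If two workers on the same host happened to receive the same port,
    # re-run the random search (up to 10 times) until the map is duplicate-free.
    retries_left = max_retries
    while _worker_map_has_duplicates(worker_address_to_port) and retries_left > 0:
        worker_address_to_port = client.run(
            _find_random_open_port,
            workers=list(worker_addresses)
        )
        retries_left -= 1
    return worker_address_to_port
```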
How this improves lightgbm

- Removes a source of instability that could cause distributed training with Dask to fail.
- Improves the stability of LightGBM's tests, which should reduce the maintainer effort needed to re-run failed builds.