
[dask] flaky unit test test_find_random_open_port #4458

Closed
jameslamb opened this issue Jul 9, 2021 · 7 comments

@jameslamb
Collaborator

Description

The unit test tests/python_package_test/test_dask.py::test_find_random_open_port can occasionally fail at random. Specifically, lgb.dask._find_random_open_port() is called once on each of the two workers in a Dask cluster, and if it finds the same port for both workers, the test fails.
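
For context, a "find a random open port" helper typically works by binding a socket to port 0 and letting the OS pick a free port. A minimal sketch of that pattern (not necessarily the exact lightgbm.dask implementation) looks like:

import socket

def _find_random_open_port() -> int:
    """Ask the OS for a currently-free TCP port (sketch only)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 = "pick any free port"
        port = s.getsockname()[1]
    # the socket is closed here, so the OS is free to hand the
    # same port to the next caller that asks
    return port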

I think what we're seeing is the problem mentioned in #4133 (comment).

the collisions happen when a worker has completed the function and freed the port and another one is just asking for it and the OS just returns the same one

I think that when we dealt with this problem before in #4133, it was patched within lightgbm.dask directly.

# if any duplicates were found, search for new ports one by one
for worker in workers_that_need_new_ports:
    _log_info(f"Searching for a LightGBM training port for worker '{worker}'")
    host = urlparse(worker).hostname
    retries_remaining = 100
    while retries_remaining > 0:
        retries_remaining -= 1
        new_port = client.submit(
            _find_random_open_port,
            workers=[worker],
            allow_other_workers=False,
            pure=False
        ).result()
        if new_port not in host_to_port[host]:
            worker_map[worker] = new_port
            host_to_port[host].add(new_port)
            break

    if retries_remaining == 0:
        raise LightGBMError(
            "Failed to find an open port. Try re-running training or explicitly setting 'machines' or 'local_listen_port'."
        )

But I didn't consider the fact that this particular unit test might suffer from the same problem.

In other words, this unit test asserts that such collisions should never happen, but in fact they DO happen, and lgb.dask was changed to explicitly handle that case.

I think such failures should be very rare, but that we should still try to eliminate them or make them even rarer. I see two options:

  1. Change this test from "should never produce conflicts" to "should rarely produce conflicts" (e.g., run client.run(lgb.dask._find_random_open_port) 5 times and check that no more than 1 conflict was found); see the sketch after this list.
  2. Remove test_find_random_open_port and trust other Dask unit tests (like test_network_params_not_required_but_respected_if_given and test_possibly_fix_worker_map) to do the work.
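
A rough sketch of what option 1 might look like, assuming the existing cluster fixture and imports (Client, lgb) already used in test_dask.py; the num_conflicts name is just for illustration:

def test_find_random_open_port(cluster):
    with Client(cluster) as client:
        num_conflicts = 0
        for _ in range(5):
            worker_address_to_port = client.run(lgb.dask._find_random_open_port)
            found_ports = list(worker_address_to_port.values())
            # count a conflict whenever two workers got the same port
            num_conflicts += len(found_ports) - len(set(found_ports))
        # collisions are possible, but should be rare
        assert num_conflicts <= 1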

Reproducible example

This is hard to reproduce because it is random. Most recently, the unit tests on #4454 for one job failed with the following error.

cluster = LocalCluster(1a80f5c1, 'tcp://127.0.0.1:43799', workers=2, threads=4, memory=110.07 GiB)

    def test_find_random_open_port(cluster):
        with Client(cluster) as client:
            for _ in range(5):
                worker_address_to_port = client.run(lgb.dask._find_random_open_port)
                found_ports = worker_address_to_port.values()
                # check that found ports are different for same address (LocalCluster)
>               assert len(set(found_ports)) == len(found_ports)
E               assert 1 == 2
E                +  where 1 = len({50065})
E                +    where {50065} = set(dict_values([50065, 50065]))
E                +  and   2 = len(dict_values([50065, 50065]))

../tests/python_package_test/test_dask.py:455: AssertionError

Environment info

LightGBM version or commit hash: latest master (0d1d12f)

Command(s) you used to install LightGBM

This test most recently failed on the CUDA Version / cuda 11.2.2 source (linux, gcc, Python 3.7) CI job in #4454 (https://github.com/microsoft/LightGBM/pull/4454/checks?check_run_id=3012592901), but I think it could happen randomly in any Python job where that test is run.

Additional Comments

Opened from #4454 (comment).

@jameslamb jameslamb changed the title [dask] flaky unit tests [dask] flaky unit tests test_find_random_open_port Jul 9, 2021
@jameslamb jameslamb changed the title [dask] flaky unit tests test_find_random_open_port [dask] flaky unit test test_find_random_open_port Jul 9, 2021
@jameslamb
Collaborator Author

@StrikerRUS will you look at the two options I provided in the description and let me know if you have a preference?

I have a preference for option 1,

Change this test from "should never produce conflicts" to "should rarely produce conflicts"

because I think that will make such failures even rarer but still be a defense against problems where lightgbm.dask._find_random_open_port() suddenly starts finding the same port more frequently.

@StrikerRUS
Collaborator

I'm OK with either of the options you've provided above.

@jmoralez
Collaborator

I recently thought about maybe defining a function that finds n ports, where n is the number of processes running on each worker, and doing client.submit(get_n_ports, worker['nprocs'], workers=[worker['address']]) for each worker. That function wouldn't close the sockets until it has gotten all n, so there would be no collisions. What do you guys think?
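
A rough sketch of that idea (the get_n_ports name is hypothetical, taken from the comment above rather than from any merged code):

import socket

def get_n_ports(n: int) -> list:
    """Find n distinct open ports, keeping every socket open until all n are reserved."""
    sockets = []
    try:
        for _ in range(n):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.bind(("", 0))  # port 0 = let the OS choose a free port
            sockets.append(s)
        return [s.getsockname()[1] for s in sockets]
    finally:
        # only release the ports once all of them have been found,
        # so the OS cannot hand the same port out twice
        for s in sockets:
            s.close()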

@jameslamb
Collaborator Author

hey @jmoralez , sorry it took so long to get back to you.

If you want to attempt that, I'd support it! I hadn't taken that on yet because it seemed fairly complex in exchange for eliminating a somewhat-rare collision.

worker['address'] wouldn't be quite enough. For example, on a LocalCluster with three workers they might have addresses 127.0.0.1:35651, 127.0.0.1:35652, and 127.0.0.1:35653. The code would have to drop the ports from those addresses to know they're on the same host. Like this:

worker_host = urlparse(address).hostname

Then, the code would have to run get_n_ports() once per host. To achieve "once per host", the code would have to run get_n_ports() on only one of the worker processes running on each host. Something roughly like the sketch below.
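
A sketch only, assuming a get_n_ports(n) helper like the one described above and using client.scheduler_info() to enumerate workers:

from collections import defaultdict
from urllib.parse import urlparse

# group worker addresses by host so get_n_ports() runs only once per machine
workers_by_host = defaultdict(list)
for address in client.scheduler_info()["workers"]:
    workers_by_host[urlparse(address).hostname].append(address)

host_to_ports = {}
for host, addresses in workers_by_host.items():
    # submit to exactly one worker process on this host, asking for
    # as many ports as there are worker processes on that host
    host_to_ports[host] = client.submit(
        get_n_ports,
        len(addresses),
        workers=[addresses[0]],
        allow_other_workers=False,
        pure=False
    ).result()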


I might still do option 1 from the description, changing this test from "should never produce conflicts" to "should rarely produce conflicts", just because it'll be easy and could help reduce test failures until that more permanent solution is merged.

@jmoralez
Collaborator

Haha, that's ok. I'll try this and let you know if I get it working. I agree on changing that test to allow some conflicts in the meantime.

@jameslamb
Collaborator Author

I think @jmoralez 's fix in #4498 is a better and more reliable long-term fix than the test changes I proposed in the description of this issue, so closing it for now. Hopefully we won't see any more conflicts like this 🤞

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023