[dask] flaky unit test test_find_random_open_port #4458

jameslamb · 2021-07-09T18:26:14Z

Description

The unit tests tests/python_package_test/test_dask.py::test_find_random_open_port can occasionally fail randomly. Specifically, lgb.dask._find_random_open_port() is called once on each of the two workers in a Dask cluster, and if it finds the same port for both workers, the test fails.

I think what we're seeing is the problem mentioned in #4133 (comment).

the collisions happen when a worker has completed the function and freed the port and another one is just asking for it and the OS just returns the same one

I think that when we dealt with this problem before in #4133, it was patched within lightgbm.dask directly.

LightGBM/python-package/lightgbm/dask.py

Lines 345 to 367 in 0d1d12f

    
           # if any duplicates were found, search for new ports one by one 
        
           for worker in workers_that_need_new_ports: 
        
               _log_info(f"Searching for a LightGBM training port for worker '{worker}'") 
        
               host = urlparse(worker).hostname 
        
               retries_remaining = 100 
        
               while retries_remaining > 0: 
        
                   retries_remaining -= 1 
        
                   new_port = client.submit( 
        
                       _find_random_open_port, 
        
                       workers=[worker], 
        
                       allow_other_workers=False, 
        
                       pure=False 
        
                   ).result() 
        
                   if new_port not in host_to_port[host]: 
        
                       worker_map[worker] = new_port 
        
                       host_to_port[host].add(new_port) 
        
                       break 
        
               if retries_remaining == 0: 
        
                   raise LightGBMError( 
        
                       "Failed to find an open port. Try re-running training or explicitly setting 'machines' or 'local_listen_port'." 
        
                   )

But I didn't consider that fact that this particular unit test might suffer from the same problem.

In other words, this unit test is saying such collisions should never happen but in fact they DO happen and lgb.dask was changed to explicitly handle that case.

I think such failures should be very rare, but that we should still try to eliminate them or make them even rarer. I see two options:

Change this test from "should never produce conflicts" to "should rarely produce conflicts" (e.g., run client.run(lgb.dask._find_random_open_ports) 5 times and check that no more than 1 conflict was found)
Remove test_find_random_open_port and trust other Dask unit tests (like test_network_params_not_required_but_respected_if_given and test_possibly_fix_worker_map) to do the work.

Reproducible example

This is hard to reproduce because it is random. Most recently, the unit tests on #4454 for one job failed with the following error.

cluster = LocalCluster(1a80f5c1, 'tcp://127.0.0.1:43799', workers=2, threads=4, memory=110.07 GiB)

    def test_find_random_open_port(cluster):
        with Client(cluster) as client:
            for _ in range(5):
                worker_address_to_port = client.run(lgb.dask._find_random_open_port)
                found_ports = worker_address_to_port.values()
                # check that found ports are different for same address (LocalCluster)
>               assert len(set(found_ports)) == len(found_ports)
E               assert 1 == 2
E                +  where 1 = len({50065})
E                +    where {50065} = set(dict_values([50065, 50065]))
E                +  and   2 = len(dict_values([50065, 50065]))

../tests/python_package_test/test_dask.py:455: AssertionError

Environment info

LightGBM version or commit hash: latest master (0d1d12f)

Command(s) you used to install LightGBM

This test most recently failed on CUDA Version / cuda 11.2.2 source (linux, gcc, Python 3.7) CI job in #4454 (https://github.com/microsoft/LightGBM/pull/4454/checks?check_run_id=3012592901), but I think it could happen randomly in any Python jobs where that test is run.

Additional Comments

Opened from #4454 (comment).

The text was updated successfully, but these errors were encountered:

jameslamb · 2021-07-09T18:29:16Z

@StrikerRUS will you look at the two options I provided in the description and let me know if you have a preference?

I have a preference for option 1,

Change this test from "should never produce conflicts" to "should rarely produce conflicts"

because I think that will make such failures even rarer but still be a defense against problems where lightgbm.dask._find_random_open_port() suddenly starts finding the same port more frequently.

StrikerRUS · 2021-07-09T20:08:38Z

I'm OK with any option of ones you've provided above.

jmoralez · 2021-07-13T02:03:21Z

I recently thought about maybe defining a function that finds n ports where n is the number of processes running on each worker and doing client.submit(get_n_ports, worker['nprocs'], workers=[worker['address']] for each worker. That function wouldn't close the sockets until it has gotten all n so there would be no collisions. What do you guys think?

jameslamb · 2021-07-25T04:54:54Z

hey @jmoralez , sorry it took so long to get back to you.

If you want to attempt that, I'd support it! I hadn't taken that on yet because it seemed fairly complex in exchange for eliminating a somewhat-rare collision.

worker['address'] wouldn't be quite enough. For example, on a LocalCluster with three workers they might have addresses 127.0.0.1:35651, 127.0.0.1:35652, and 127.0.0.1:35653. The code would have have to drop the ports from those addresses to know they're on the same host. Like this:

LightGBM/python-package/lightgbm/dask.py

Line 327 in fdc582e

worker_host = urlparse(address).hostname

Then, the code would have to run get_n_ports() once per host. To achieve "once per host", the code would have to run run get_n_ports() on only one of the worker processes running on each host.

I might still do the option Change this test from "should never produce conflicts" to "should rarely produce conflicts" just because it'll be easy and could help to reduce test failures until that more permanent solution to the problem is merged.

jmoralez · 2021-07-28T04:21:52Z

Haha, that's ok. I'll try this and let you know if I get it working. I agree on changing that test to allow some conflicts in the meantime.

jameslamb · 2021-08-03T22:25:39Z

I think @jmoralez 's fix in #4498 is a better and more reliable long-term fix than the test changes I proposed in the description of this issue, so closing it for now. Hopefully we won't see any more conflicts like this 🤞

github-actions · 2023-08-23T14:26:41Z

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

jameslamb added bug dask labels Jul 9, 2021

jameslamb mentioned this issue Jul 9, 2021

[ci] add support for 8.0 and 8.6 CUDA archs #4454

Merged

jameslamb changed the title ~~[dask] flaky unit tests~~ [dask] flaky unit tests test_find_random_open_port Jul 9, 2021

jameslamb changed the title ~~[dask] flaky unit tests test_find_random_open_port~~ [dask] flaky unit test test_find_random_open_port Jul 9, 2021

jmoralez mentioned this issue Aug 1, 2021

[dask] find all needed ports in each host at once (fixes #4458) #4498

Merged

jameslamb closed this as completed in 5fe27d5 Aug 3, 2021

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[dask] flaky unit test test_find_random_open_port #4458

[dask] flaky unit test test_find_random_open_port #4458

jameslamb commented Jul 9, 2021

jameslamb commented Jul 9, 2021

StrikerRUS commented Jul 9, 2021

jmoralez commented Jul 13, 2021

jameslamb commented Jul 25, 2021

jmoralez commented Jul 28, 2021

jameslamb commented Aug 3, 2021

github-actions bot commented Aug 23, 2023