
[tests][dask] fix hangs after exception in worker #4772

Closed
wants to merge 3 commits

Conversation

@jmoralez (Collaborator) commented Nov 5, 2021

As described in #4771, with dask==2021.10.0 and distributed==2021.10.0, when a worker raises an exception during a test, for example via

def f(part):
    raise Exception('foo')

the tests that run afterwards hang indefinitely.

This PR adds a client.restart() call to restart the workers and clear the raised exception, since as of #4159 all of LightGBM's Dask unit tests share the same cluster.

@StrikerRUS (Collaborator)

With client.restart() added, only one job timed out.

../tests/python_package_test/test_dask.py .............................. [ 14%]
........................................................................ [ 25%]
........................................................................ [ 37%]
......s...............s...............s...............s................. [ 48%]
..........................
##[error]The operation was canceled.

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=11463&view=logs&j=e3619e06-e539-5956-1ae7-e8c2c29d8e91&t=4b3d6e5d-f8c3-577f-235f-b68c5ed4d3ab&l=1238

But I guess this means that the workaround cannot be treated as reliable enough and that it introduces flakiness into our tests.

@jameslamb (Collaborator)

But I guess this means that the workaround cannot be treated as reliable enough

I agree, it seems like this doesn't totally fix the issue 😕

we'll have to keep investigating in #4771

@jmoralez (Collaborator, Author) commented Nov 5, 2021

Seems like it should work; I don't see what could be hanging with fresh workers. I can maybe try increasing the timeout in the restart or add a sleep after it. What do you guys think?

@jameslamb (Collaborator)

can maybe try increasing the timeout in the restart or add a sleep after it

Sure, try it! We'd probably learn something from that exercise either way.

Feel free to use this PR to test strategies. One trick I like to use, when I need to use CI to test, is to have a single commit that removes all the other CI jobs that aren't relevant to what I'm working on, and then bring those CI configs back in later with git revert.

So for you, that might be like

git rm .appveyor.yml
git rm .github/workflows/r_*.yml
git add .
git commit -m "remove unrelated CI configs"

@jmoralez (Collaborator, Author) commented Nov 5, 2021

Wow this job really broke

@jmoralez (Collaborator, Author) commented Nov 5, 2021

Closing this, as client.restart() doesn't seem like the way to go.

@jmoralez closed this Nov 5, 2021
@jameslamb (Collaborator)

Wow this job really broke

whoa! Ok thanks for investigating. We'll have to try to find another approach.

@github-actions (bot)

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023