Generalize UCX errors on connect() and correct pytest fixtures #6434

Merged
merged 3 commits into from
May 26, 2022

Member

@pentschev pentschev commented May 24, 2022

Several different errors can occur when connecting to a remote endpoint. By generalizing the exception we catch, we leave it to Distributed to raise the appropriate error for known communication issues.
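
A minimal sketch of the idea, for illustration only: it does not use the real UCX-Py or Distributed classes. `TransportError`, its subclasses, `create_endpoint` and the local `CommClosedError` are all hypothetical stand-ins. The point is that `connect()` catches the transport's base exception instead of listing individual subclasses, and re-raises a single `OSError` subclass for the caller to handle.

```python
import asyncio


class TransportError(Exception):
    """Hypothetical base class for every error the transport layer can raise."""


class TransportUnreachable(TransportError):
    """Hypothetical: the remote endpoint cannot be reached."""


class TransportConnectionReset(TransportError):
    """Hypothetical: the peer reset the connection."""


class CommClosedError(OSError):
    """Stand-in for the single error type callers are expected to handle."""


async def create_endpoint(address: str) -> str:
    """Hypothetical low-level connect that may fail in several ways."""
    if address.endswith(":unreachable"):
        raise TransportUnreachable(address)
    if address.endswith(":reset"):
        raise TransportConnectionReset(address)
    return f"endpoint<{address}>"


async def connect(address: str) -> str:
    """Catch the transport's base exception rather than listing subclasses."""
    try:
        return await create_endpoint(address)
    except TransportError as exc:
        # Any transport-level failure surfaces as one OSError subclass, so the
        # caller decides how to react (retry, time out, report).
        raise CommClosedError(f"could not connect to {address}: {exc!r}") from exc


def try_connect(address: str) -> str:
    """Small synchronous helper showing what a caller observes."""
    try:
        return asyncio.run(connect(address))
    except OSError as exc:
        return type(exc).__name__
```

With this shape, a new failure mode added to the transport later is still caught, as long as it derives from the same base class.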

Additionally, the UCX pytest fixture was not being used properly. It has been moved to utils_test.ucx_loop and added as an argument to every pytest function that does UCX testing.

Closes #6429

  • Tests added / passed
  • Passes pre-commit run --all-files

Various different errors may happen when trying to connect to a remote
endpoint. By generalizing the exception we catch, we leave the
responsibility of raising the appropriate error depending on known
communication issues to Distributed.
@charlesbluca
Member

Looks like things timed out on distributed/tests/test_worker.py::test_protocol_from_scheduler_address[Worker]?

@quasiben
Member

rerun tests

@quasiben
Member

with #6428 in we can rerun and hopefully resolve UCX issues

@quasiben
Member

rerun tests

Member

@jacobtomlinson jacobtomlinson left a comment

This looks good. Does this change mean we need to bump minimum versions anywhere?

@pentschev
Member Author

I can reproduce the CI hang locally, but only when running the entire gpuCI test set (i.e., not when running distributed/tests/test_worker.py::test_protocol_from_scheduler_address alone). I'm investigating.

@pentschev
Member Author

> This looks good. Does this change mean we need to bump minimum versions anywhere?

No, it has just been generalized so that the exception is (hopefully) always an OSError. It's possible that the earlier version assumption was incorrect because of differences between conda packages and local builds, for the same reasons I mentioned in #6429 (comment), but I'm not sure about that.

@github-actions
Contributor

github-actions bot commented May 24, 2022

Unit Test Results

15 files ±0, 15 suites ±0, duration 6h 7m 23s (-26m 20s)
2 812 tests (+2): 2 727 passed (-1), 82 skipped (+3), 2 failed (-1), 1 errored (+1)
20 848 runs (+15): 19 900 passed (+3), 944 skipped (+12), 3 failed (-1), 1 errored (+1)

For more details on these failures and errors, see this check.

Results for commit e95fb9f. ± Comparison against base commit d32f4b0.

♻️ This comment has been updated with latest results.

The fixture was not being properly used, it must be added to each
function that requires it as an argument.

Fixture was moved to `utils_test.ucx_loop` and the argument added to all
pytest functions that do UCX testing.
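
The fixture behaviour described in the commit message above can be sketched as follows. This is an illustration under stated assumptions, not the real `utils_test.ucx_loop`: `fresh_event_loop` and `ucx_loop_sketch` are hypothetical names, and the setup shown (creating and closing an event loop) is only a placeholder for whatever the real fixture does. What it demonstrates is standard pytest behaviour: a fixture that is not autouse runs only for tests that list its name as a parameter.

```python
import asyncio
import contextlib

import pytest


@contextlib.contextmanager
def fresh_event_loop():
    """Hypothetical helper: create a new event loop and close it afterwards."""
    loop = asyncio.new_event_loop()
    try:
        yield loop
    finally:
        loop.close()


@pytest.fixture
def ucx_loop_sketch():
    """Hypothetical fixture; a placeholder, not the real utils_test.ucx_loop."""
    with fresh_event_loop() as loop:
        yield loop


def test_with_fixture(ucx_loop_sketch):
    # Listing the fixture name as a parameter is what makes pytest run it.
    assert not ucx_loop_sketch.is_closed()


def test_without_fixture():
    # No parameter, so pytest never runs ucx_loop_sketch for this test.
    pass
```

Defining the fixture in a module and not requesting it in the test signature has no effect on that test, which is why the argument has to be added to each function that needs it.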
@pentschev pentschev changed the title from "Generalize UCX errors on connect()" to "Generalize UCX errors on connect() and correct pytest fixtures" May 25, 2022
@quasiben
Member

The failures here are for the known failing test test_stress_scatter_death. @fjetter, as this does add new functions to utils_test.py, would you care to review? If not, this is ready to be merged.

@pentschev
Member Author

To add to @quasiben's comment above, here is a more detailed list of failing tests:

  • distributed/dashboard/tests/test_scheduler_bokeh.py::test_basic
  • distributed/dashboard/tests/test_worker_bokeh.py::test_services_kwargs
  • distributed/dashboard/tests/test_scheduler_bokeh.py::test_memory_by_key
  • distributed/tests/test_stress.py::test_stress_scatter_death (3 instances)

I believe none of them are related to this PR, but please let me know if you think they are related somehow.

@jacobtomlinson
Member

test_stress_scatter_death is a known issue but I'm just going to restart CI to see if the others pop up again.

@pentschev
Member Author

Failed tests now are:

  • distributed/tests/test_stress.py::test_stress_scatter_death (all 3 instances);
  • distributed/diagnostics/tests/test_progress.py::test_group_timing.

It still seems that neither of the errors is related to this PR, but let me know if there's reason to believe otherwise.

@jacobtomlinson
Member

Yeah I agree they are unrelated. Thanks for handling this @pentschev.

@pentschev pentschev deleted the ucx-generalize-errors-on-connect branch October 10, 2022 12:47
Development

Successfully merging this pull request may close these issues.

test_ucx_unreachable failed on gpuCI
4 participants