Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NetCDF thread safety take two #5095

Merged
merged 67 commits into from
Feb 20, 2023
Merged

NetCDF thread safety take two #5095

merged 67 commits into from
Feb 20, 2023

Conversation

trexfeathers
Copy link
Contributor

@trexfeathers trexfeathers commented Dec 6, 2022

🚀 Pull Request

Description

Taking the lead from #5061, I have minimised the footprint of this change, making it hopefully easier for future maintenance and better suited to a patch release (our goal).

I still get very occasional segfaults and invalid ID NetCDF errors. These were guaranteed before, but I would now estimate they fail a full test run ~ 1 in 20 times. This of course makes them very difficult to investigate - I don't know if there are any commits in this PR's sequence where there was genuinely no problem, or whether it was just luck. The only working theory I have is that they show up when certain combinations of tests are being run on separate processors, but I don't know how I would test such a theory.

I'm putting this up now because I need help with what's left:

  • Investigate remaining intermittent errors (see above) and decide if they are a problem.
  • Write test(s) specifically aimed at thread safety.
  • Arrange for some real world user testing.

Consult Iris pull request check list

@pp-mo
Copy link
Member

pp-mo commented Feb 14, 2023

Ok for your latest !
I'm totally 👍 on the change to a context managerdask.config.set -- I probably should have spotted that + raised it myself.
Still looking forward to some notes about when they fail...

@trexfeathers
Copy link
Contributor Author

trexfeathers commented Feb 15, 2023

Test results

(☟ code) (environment ☞) netcdf_segs v3.4.0
netcdf_segs ✔️ x5 ✔️ x5
v3.4.0 test_realise_data(), test_comparison() - seg-fault * ✔️ x5

Notes

  • Needed to use pytest -k to identify the specific seg-faulting test.
  • test_thread_safety.py obviously doesn't exist in v3.4.0 so I had to bring that in manually when running with the v3.4.0 code base.

@trexfeathers
Copy link
Contributor Author

trexfeathers commented Feb 15, 2023

Discoveries from investigating tests

  • More, smaller threads increase the chance of thread collision. So tests using tiny_chunks fail more reliably than _multisource tests.
  • Tests involving da.store() rather than da.compute() - so those for saving and streaming - NEVER fail, even when streaming from multiple sources. This suggests that Dask's [current] strategy is to serialise the WHOLE operation whenever file writing is involved.
    Good to have those tests in there should the situation ever change in future.
  • Inserting a sleep(random()) line into the NetCDFDataProxy does not produce more thread collisions. In fact it looks like it decreases them, possibly by desynchronising some moments during computation that are naturally synched up.

@trexfeathers
Copy link
Contributor Author

Closes #5016

Copy link

@valeriupredoi valeriupredoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very very many thanks for tackling this @trexfeathers and everyone else who reviewed and tested - I am approving this from the side of us doing testing with ESMValCore, tests that show this fixes the issue completely; of course, you guys decide when to merge, but it'd be very cool to have this in a new iris release soon 🍻 🥳

@trexfeathers trexfeathers linked an issue Feb 20, 2023 that may be closed by this pull request
@trexfeathers
Copy link
Contributor Author

Do you need anything else @pp-mo?

@pp-mo
Copy link
Member

pp-mo commented Feb 20, 2023

Do you need anything else @pp-mo?

No, all good now the tests have passed.
Huge applause ! 👏 👏 🥇

@pp-mo pp-mo merged commit 11da71b into SciTools:main Feb 20, 2023
trexfeathers added a commit to trexfeathers/iris that referenced this pull request Feb 21, 2023
* Unpin netcdf4.

* Temporarily enable GHA on this branch.

* Temporarily enable GHA on this branch.

* Temporarily enable GHA on this branch.

* Experiment to disable wheel CI on forks.

* Disable segfaulting routines.

* More temporary changes to get CI passing.

* More temporary changes to get CI passing.

* Finessed segfault skipping.

* Bring in changed from SciTools#5061.

* Re-instate test_load_laea_grid.

* Adaptations to get the tests passing.

* Use typing.Mapping instead.

* Get doctests passing.

* CF only resolve non-url filenames.

* Confirm thread safety fixes.

* Remove dummy assignment.

* Restored plot_nemo What's New entry.

* _add_aux_factories temporarily release global lock.

* Remove per-file locking.

* Remove remaining test workarounds.

* Remove remaining comments.

* Correct use of CFReader context manager.

* Refactor for easier future maintenance.

* Rename netcdf _thread_safe, add header.

* Full use of ThreadSafeAggregators.

* Full use of ThreadSafeAggregators.

* Remove remaining imports of NetCDF4.

* Test to ensure netCDF4 is via _thread_safe module.

* More refined netcdf._thread_safe classes.

* _thread_safe docstrings.

* Restore original NetCDF code where possible.

* Revert changes to 2.3.rst.

* Update lockfiles.

* Additions to _thread_safe.py

* Remove temporary CI shims.

* New locking stategy for NetCDFDataProxy.

* NetCDFDataProxy simpler use of netCDF4 lock.

* Update lock files.

* Go back to using a Threading Lock.

* Remove superfluous pass commands in test_cf.py.

* Rename _thread_safe to _thread_safe_nc.

* Rename thread safe classes to be 'Wrappers'.

* Better contained getattr and setattr pattern.

* Explicitly name netCDF4 module in _thread_safe_nc docstring.

* Better docstring for _ThreadSafeWrapper.

* Better comment about THREAD_SAFE_FLAG.

* list() wrapping within _GLOBAL_NETCDF4_LOCK, to account for generators.

* More accurate thread_safe docstrings in netcdf.saver.

* Split netcdf integration tests into multiple modules.

* Tests for non-thread-safe NetCDF behaviour.

* Docstring accuracy.

* Correct use of dask config set (context manager).

* Update dependencies.

* Review - don't need the first-class import of iris.tests.

* Better name for the loading test.

* Better selection of data to load.

* What's New entry.

* Improve tests.

* Update lock files.

* Increase chunking on test_save.

---------

Co-authored-by: Patrick Peglar <patrick.peglar@metoffice.gov.uk>
@trexfeathers trexfeathers deleted the netcdf_segs branch February 24, 2023 13:54
This was referenced Mar 27, 2023
tkknight added a commit to tkknight/iris that referenced this pull request Apr 22, 2023
* upstream/main: (23 commits)
  Lockfiles and pydata-sphinx-theme fix (SciTools#5188)
  Allow smarter weights (cubes, coordinates, cell measures, or ancillary variables) for aggregation (SciTools#5084)
  removed cell measure mask check and error (SciTools#5181)
  Updated environment lockfiles (SciTools#5177)
  Lazy weighted RMS calculation (SciTools#5017)
  Add coverage badge to README.md (SciTools#5176)
  Add coverage testing (SciTools#4765)
  Whats new updates for v3.4.1 .
  NetCDF thread safety take two (SciTools#5095)
  Updated environment lockfiles (SciTools#5163)
  Plugin support (SciTools#5144)
  Expand scope of common contributor links (SciTools#5159)
  Replace apparently retired UDUNITS documentation link. (SciTools#5153)
  [pre-commit.ci] pre-commit autoupdate (SciTools#5150)
  Fixing typo's in Gitwash. (SciTools#5145)
  add readme #showyourstripes (SciTools#5141)
  [pre-commit.ci] pre-commit autoupdate (SciTools#5143)
  Iris ❤ Xarray docs page. (SciTools#5025)
  [pre-commit.ci] pre-commit autoupdate (SciTools#5136)
  Updated citation (SciTools#5116)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Investigate segfaults with NetCDF4 >1.6.0
3 participants