
Obtaining fresh data from the disk when reopening a NetCDF file a second time #4862

Closed
cjauvin opened this issue Feb 4, 2021 · 2 comments

@cjauvin
Contributor

cjauvin commented Feb 4, 2021

I have a program where I open a .nc file, do something with it, and want to reopen it later, after an external program has modified it. My issue is that the caching mechanism gives me the already-opened version of the file, not the refreshed version on disk. To demonstrate this behavior, let's say you have two files, bla.nc and bla_mod.nc, with different content:

import shutil
import xarray as xr

a = xr.open_dataset("bla.nc")

# Simulate an external process modifying bla.nc while this script is running
shutil.copy("bla_mod.nc", "bla.nc")

# a.close()  # this is the only thing that WOULD make it work!

b = xr.open_dataset("bla.nc")

# Here I would expect b to differ from a, but it does not

I understand that in an ideal world the file SHOULD be closed (or that I should use a context manager), and that doing so would make this work, but let's say it is not closed (perhaps we forgot, or we're simply being lazy).
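
For completeness, a minimal sketch of the pattern that does work: close the first handle before reopening, here via a context manager (loading the data into memory first so it remains usable afterwards):

import shutil
import xarray as xr

# Close the first handle before reopening: load the data we need while
# the file is still open, then let the context manager close it.
with xr.open_dataset("bla.nc") as a:
    a.load()

# Simulate an external process modifying bla.nc
shutil.copy("bla_mod.nc", "bla.nc")

b = xr.open_dataset("bla.nc")  # b now reflects the fresh contents on disk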

At first I thought that I could use the cache parameter of open_dataset for this purpose, but after studying the code, I discovered that it is connected to a different caching mechanism than the one at play here.

After some experiments to better understand the code, I came to the conclusion that the only way my particular use case could be supported (that is, without an explicit close or a context manager, which is itself debatable, I admit) is for the underlying netCDF4._netCDF4.Dataset file object to be explicitly closed, as it is when it is flushed out of the cache.

Since I cannot really see a case where a user calling open_dataset a second time would not want the fresh version on disk, it made me think that a fix would be to explicitly flush the cache immediately after the CachingFileManager for a particular dataset has been created, as I do here:

master...cjauvin:netcdf-caching-bug

Because I admit that this looks weird at first sight (why close an object immediately after having created it?), I imagine that a better option would be to add a boolean option to CachingFileManager to make the behavior opt-in (something like flush_and_close_file_if_already_present); a rough sketch follows.
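
To make the idea concrete, here is a rough sketch of what I have in mind; this is not the actual xarray internals, and the names and structure (opener, cache, key) are simplified for illustration:

class CachingFileManager:
    def __init__(self, opener, *args, cache, key,
                 flush_and_close_file_if_already_present=False, **kwargs):
        self._opener = opener
        self._args = args
        self._kwargs = kwargs
        self._cache = cache
        self._key = key
        if flush_and_close_file_if_already_present:
            # Evict any file object already cached under this key and
            # close it, so the next acquire() reopens the file from disk.
            stale = self._cache.pop(self._key, None)
            if stale is not None:
                stale.close()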

I think this subtle change would result in a more coherent experience with the exact use case that I present, but admittedly, I didn't study the overall code deeply enough to be certain that it couldn't result in unwanted side effects for some other backends.

@shoyer
Member

shoyer commented Feb 7, 2021

I think this is basically the same issue as #4240.

shoyer added a commit to shoyer/xarray that referenced this issue Feb 7, 2021
This means that explicitly opening a file multiple times with
``open_dataset`` (e.g., after modifying it on disk) now reopens the file
from scratch, rather than reusing a cached version.

If users want to reuse the cached file, they can reuse the same xarray
object. We don't need this for handling many files in Dask (the original
motivation for caching), because in those cases only a single
CachingFileManager is created.

I think this should fix some long-standing usability issues: pydata#4240, pydata#4862

Conveniently, this also obviates the need for some messy reference
counting logic.
@shoyer
Member

shoyer commented Feb 7, 2021

Thanks for raising this issue, and especially for the in-depth explanation!

I have a tentative fix in #4879. Rather than closing existing files, it simply caches files separately each time open_dataset() is called.
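
The core idea, in a simplified sketch (not the actual implementation; see #4879 for the real code), is to make each CachingFileManager contribute a token unique to that instance to its cache key, so two managers opened on the same path never share a cached file object:

import uuid

class CachingFileManager:
    def __init__(self, opener, *args, **kwargs):
        self._opener = opener
        self._args = args
        self._kwargs = kwargs
        # Unique per manager, so each open_dataset() call gets its own
        # cache entry, even for the same file path and arguments.
        self._id = uuid.uuid4()

    def _make_key(self):
        return (self._opener, self._args,
                tuple(sorted(self._kwargs.items())), self._id)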

dcherian added a commit that referenced this issue Oct 18, 2022
* Cache files for different CachingFileManager objects separately

This means that explicitly opening a file multiple times with
``open_dataset`` (e.g., after modifying it on disk) now reopens the file
from scratch, rather than reusing a cached version.

If users want to reuse the cached file, they can reuse the same xarray
object. We don't need this for handling many files in Dask (the original
motivation for caching), because in those cases only a single
CachingFileManager is created.

I think this should fix some long-standing usability issues: #4240, #4862

Conveniently, this also obviates the need for some messy reference
counting logic.

* Fix whats-new message location

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add id to CachingFileManager

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* restrict new test to only netCDF files

* fix whats-new message

* skip test on windows

* Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks"

This reverts commit e637165.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert "Fix whats-new message location"

This reverts commit 6bc80e7.

* fixups

* fix syntax

* tweaks

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix types for mypy

* add uuid

* restore ref_counts

* doc tweaks

* close files inside test_open_mfdataset_list_attr

* remove unused itertools

* don't use refcounts

* re-enable ref counting

* cleanup

* Apply typing suggestions from code review

Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>

* fix import of Hashable

* ignore __init__ type

* fix whats-new

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>
Co-authored-by: dcherian <deepak@cherian.net>
keewis pushed a commit to keewis/xarray that referenced this issue Oct 19, 2022