Add option to choose mfdataset attributes source. #3498
Conversation
I think I'm done. Can someone look at it? The default is 0, which preserves the current behaviour.
xarray/backends/api.py
Outdated
@@ -825,6 +826,10 @@ def open_mfdataset(
- 'override': if indexes are of same size, rewrite indexes to be
  those of the first object with that dimension. Indexes for the same
  dimension must have the same size in all objects.
master_file : int or str, optional
This is netCDF4's documentation for master_file: "file to use as 'master file', defining all the variables with an aggregation dimension and all global attributes."
Let's make it clear that, unlike netCDF4, we are only using this for attributes.
Do you suggest using a different keyword, maybe attrs_file? Or just clarifying the difference in the docs? I don't mind either way.
@dcherian Thanks for the review!
I was initially thinking of just adding a line to the docstring, but we should think about renaming this to something like attrs_from?
So I've renamed it to attrs_file to avoid confusion with netCDF4. Thanks for pointing that out. I am open to any name as long as the option is there.
@dcherian can we mark this as resolved? The attrs_file keyword now only accepts a file name (see the other conversation below).
Thanks @juseg. I've left a few comments. I see that this is your first PR. Welcome to xarray, and thanks for contributing 👏
Co-Authored-By: Deepak Cherian <dcherian@users.noreply.github.com>
Unlike netCDF4's master_file this is only used for attributes.
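For readers following along, here is a minimal usage sketch of the option discussed in this thread (the file names are hypothetical; the keyword shown is the renamed attrs_file, which only controls where the combined dataset's global attributes come from):

import xarray as xr

# Hypothetical file paths; both files are assumed to share variables along
# a common dimension but carry different global attributes.
paths = ["simulation_1990.nc", "simulation_2000.nc"]

# By default the combined dataset takes its global attributes from the
# first file; attrs_file selects a different source, here the last file.
ds = xr.open_mfdataset(paths, combine="by_coords", attrs_file=paths[-1])
print(ds.attrs)  # global attributes copied from "simulation_2000.nc"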
This will add a new kwarg in open_mfdataset.
Thanks for this @juseg. The only problem I see is that a scalar number to specify the file only makes sense if the input is a 1D list, but open_mfdataset also accepts nested lists of files. On the other hand, specifying a particular file path or object makes sense in all cases, so perhaps the easiest way to avoid ambiguity would be to restrict the option to that? (The default would just be left as-is.)
@TomNicholas Thanks for bringing the discussion live again! I'm not sure what happens in those cases, but I'm confident the default behaviour is unchanged, i.e. the attributes file is 0, whatever that 0 means (see my first commit). If this is an issue I would suggest discussing it in a separate thread, as I think it is independent from my changes. On the other hand I am eager to keep the file number option because (1) …
I'm not sure we should merge changes if we're unsure how they will behave in certain circumstances.
If we kept just the string specifier, you could still solve the problem of preserving the history:
files_to_open = ['filepath1', 'filepath2']
ds = open_mfdataset(files_to_open, attrs_file=files_to_open[-1])
But then the option would always have clear and well-defined behaviour, even in more complex cases like nested lists of files.
@TomNicholas I've had a closer look at the code. Nested lists of file paths are processed by lines 881 to 882 (at commit f2b2f9f), using the method defined at lines 15 to 46 (at commit f2b2f9f).
In Python 3.7+ the list of file paths keeps its original order, because dictionaries preserve insertion order. Unfortunately the current code uses a dictionary, which means that in Python 3.6 and earlier the order is not guaranteed to be preserved. This also implies that the current default (lines 900 and 964 at commit f2b2f9f) is ambiguous for nested lists on those versions. On the other hand the …
Or should we just stick to file paths as you suggest? And leave the default as is (i.e. ambiguous for Python 3.6 and earlier)?
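To make the ordering point above concrete, here is a hypothetical sketch (not the actual open_mfdataset code) of why an integer index is ambiguous once nested input lists are flattened through a dictionary:

# Hypothetical illustration only; this is not the helper used by xarray.
nested = [["a.nc", "b.nc"], ["c.nc", "d.nc"]]

# Flattening through a dict keeps insertion order on Python 3.7+
# (and on CPython 3.6 only as an implementation detail), giving
# ["a.nc", "b.nc", "c.nc", "d.nc"].
flat = list({path: None for group in nested for path in group})

# An integer default such as 0 therefore means "whichever file comes first
# after flattening", while an explicit path always names the same file.
by_index = flat[0]   # order-dependent
by_path = "d.nc"     # unambiguous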
Thanks @juseg.
I think for Python 3.6 and above the order is preserved, isn't it?
Yes, this is what I was thinking of.
We could do this, and that's how we would solve it in general, but I don't really think it's worth the effort/complexity.
I think so - if we do this then users can still easily pick the attributes from the file of their choosing (solving the original issue), and if someone wants to be able to choose the …
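If the option is restricted to file paths, selecting "the Nth file" remains a one-liner on the user side. A sketch with hypothetical nested input (the concat_dim names are made up for the example):

import xarray as xr

# Hypothetical 2D nested input combined along two dimensions.
nested = [["run1_t1.nc", "run1_t2.nc"],
          ["run2_t1.nc", "run2_t2.nc"]]

# Flatten the nesting yourself and pass the chosen file's path explicitly.
flat = [path for group in nested for path in group]
ds = xr.open_mfdataset(nested, combine="nested",
                       concat_dim=["run", "time"],
                       attrs_file=flat[-1])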
Index behaviour is ambiguous for nested lists on older Python versions. The default remains index 0, which is backward-compatible but also ambiguous in this case (see docstring and pull request #3498).
I left a few comments about passing Path objects, but other than that this looks good to me.
Co-Authored-By: keewis <keewis@users.noreply.github.com>
I think there's nothing left to do here, thanks @juseg!
* upstream/master:
  allow passing any iterable to drop when dropping variables (pydata#3693)
  Typo on DataSet/DataArray.to_dict documentation (pydata#3692)
  Fix mypy type checking tests failure in ds.merge (pydata#3690)
  Explicitly convert result of pd.to_datetime to a timezone-naive type (pydata#3688)
  ds.merge(da) bugfix (pydata#3677)
  fix docstring for combine_first: returns a Dataset (pydata#3683)
  Add option to choose mfdataset attributes source. (pydata#3498)
  How do I add a new variable to dataset. (pydata#3679)
  Add map_blocks example to whats-new (pydata#3682)
  Make dask names change when chunking Variables by different amounts. (pydata#3584)
  raise an error when renaming dimensions to existing names (pydata#3645)
  Support swap_dims to dimension names that are not existing variables (pydata#3636)
  Add map_blocks example to docs. (pydata#3667)
  add multiindex level name checking to .rename() (pydata#3658)
* upstream/master:
  Add an example notebook using apply_ufunc to vectorize 1D functions (pydata#3629)
  Use encoding['dtype'] over data.dtype when possible within CFMaskCoder.encode (pydata#3652)
  allow passing any iterable to drop when dropping variables (pydata#3693)
  Typo on DataSet/DataArray.to_dict documentation (pydata#3692)
  Fix mypy type checking tests failure in ds.merge (pydata#3690)
  Explicitly convert result of pd.to_datetime to a timezone-naive type (pydata#3688)
  ds.merge(da) bugfix (pydata#3677)
  fix docstring for combine_first: returns a Dataset (pydata#3683)
  Add option to choose mfdataset attributes source. (pydata#3498)
  How do I add a new variable to dataset. (pydata#3679)
  Add map_blocks example to whats-new (pydata#3682)
  Make dask names change when chunking Variables by different amounts. (pydata#3584)
  raise an error when renaming dimensions to existing names (pydata#3645)
  Support swap_dims to dimension names that are not existing variables (pydata#3636)
  Add map_blocks example to docs. (pydata#3667)
  add multiindex level name checking to .rename() (pydata#3658)
* upstream/master: (23 commits)
  Feature/align in dot (pydata#3699)
  ENH: enable `H5NetCDFStore` to work with already open h5netcdf.File a… (pydata#3618)
  One-off isort run (pydata#3705)
  hardcoded xarray.__all__ (pydata#3703)
  Bump mypy to v0.761 (pydata#3704)
  remove DataArray and Dataset constructor deprecations for 0.15 (pydata#3560)
  Tests for variables with units (pydata#3654)
  Add an example notebook using apply_ufunc to vectorize 1D functions (pydata#3629)
  Use encoding['dtype'] over data.dtype when possible within CFMaskCoder.encode (pydata#3652)
  allow passing any iterable to drop when dropping variables (pydata#3693)
  Typo on DataSet/DataArray.to_dict documentation (pydata#3692)
  Fix mypy type checking tests failure in ds.merge (pydata#3690)
  Explicitly convert result of pd.to_datetime to a timezone-naive type (pydata#3688)
  ds.merge(da) bugfix (pydata#3677)
  fix docstring for combine_first: returns a Dataset (pydata#3683)
  Add option to choose mfdataset attributes source. (pydata#3498)
  How do I add a new variable to dataset. (pydata#3679)
  Add map_blocks example to whats-new (pydata#3682)
  Make dask names change when chunking Variables by different amounts. (pydata#3584)
  raise an error when renaming dimensions to existing names (pydata#3645)
  ...
Add a master_file keyword argument to open_mfdataset to choose the source of global attributes in a multi-file dataset.
- Passes black . && mypy . && flake8
- Fully documented, including whats-new.rst for all changes and api.rst for new API