Update signature open_dataset for API v2 #4547

Merged — 54 commits, Nov 6, 2020
Changes from 48 commits

Commits (54)
f961606
add in api.open_dataset dispatching to stub apiv2
aurghs Sep 25, 2020
fb166fa
remove in apiv2 check for input AbstractDataStore
aurghs Sep 25, 2020
0221eec
bugfix typo
aurghs Sep 25, 2020
36a02c7
add kwarg engines in _get_backend_cls needed by apiv2
aurghs Sep 25, 2020
cfb8cb8
add alpha support for h5netcdf
aurghs Sep 25, 2020
4256bc8
style: clean not used code, modify some variable/function name
aurghs Sep 28, 2020
1bc7391
Add ENGINES entry for cfgrib.
Sep 28, 2020
748fe5a
Define function open_backend_dataset_cfgrib() to be used in apiv2.py.
Sep 28, 2020
fb368fe
Apply black to check formatting.
Sep 28, 2020
80e111c
Apply black to check formatting.
Sep 28, 2020
e15ca6b
add dummy zarr apiv2 backend
aurghs Sep 28, 2020
025cc87
Merge branch 'master' into backend-read-refactor
aurghs Sep 28, 2020
4b19399
align apiv2.open_dataset to api.open_dataset
aurghs Sep 28, 2020
572595f
remove unused extra_coords in open_backend_dataset_*
aurghs Sep 29, 2020
d6e632e
Merge remote-tracking branch 'origin/cfgrib_refactor' into backend-re…
aurghs Sep 29, 2020
74aba14
remove extra_coords in open_backend_dataset_cfgrib
aurghs Sep 29, 2020
d6280ec
transform zarr maybe_chunk and get_chunks in classmethod
aurghs Sep 29, 2020
c0e0f34
make alpha zarr apiv2 working
aurghs Sep 29, 2020
6431101
refactor apiv2.open_dataset:
aurghs Sep 29, 2020
50d1ebe
move dataset_from_backend_dataset out of apiv2.open_dataset
aurghs Sep 30, 2020
383d323
remove blank lines
aurghs Sep 30, 2020
457a09c
remove blank lines
aurghs Sep 30, 2020
2803fe3
style
aurghs Sep 30, 2020
08db0bd
Re-write error messages
alexamici Sep 30, 2020
1f11845
Fix code style
alexamici Sep 30, 2020
93303b1
Fix code style
alexamici Sep 30, 2020
bc2fe00
remove unused import
aurghs Sep 30, 2020
d694146
replace warning with ValueError for not supported kwargs in backends
aurghs Oct 8, 2020
56f4d3f
change zarr.ZarStore.get_chunks into a static method
aurghs Oct 8, 2020
df23b18
group `backend_kwargs` and `kwargs` in `extra_tokes` argument in apiv…
aurghs Oct 8, 2020
a04e6ac
remove in open_backend_dayaset_${engine} signature kwarags and the re…
aurghs Oct 8, 2020
de29a4c
black
aurghs Oct 8, 2020
be8c23b
Change signature of open_dataset function in apiv2 to include explici…
Oct 12, 2020
feb486c
Set an alias for chunks='auto'.
Oct 19, 2020
8f6af46
Allign empty rows with previous version.
Oct 19, 2020
c9088d3
reverse changes in chunks management
aurghs Oct 21, 2020
fe8099c
move check on decoders from backends to open_dataset (apiv2)
aurghs Oct 21, 2020
fed8b3e
update documentation
aurghs Oct 22, 2020
6fec3ea
Change signature of open_dataset function in apiv2 to include explici…
Oct 12, 2020
231895e
Set an alias for chunks='auto'.
Oct 19, 2020
b88b567
Allign empty rows with previous version.
Oct 19, 2020
be51bc7
reverse changes in chunks management
aurghs Oct 21, 2020
5aa533d
move check on decoders from backends to open_dataset (apiv2)
aurghs Oct 21, 2020
7e75f1c
update documentation
aurghs Oct 22, 2020
2047d46
Merge branch 'change-signature-open_dataset' of github.com:bopen/xarr…
aurghs Oct 23, 2020
3057abb
change defaut value for decode_cf in open_dataset. The function bahav…
aurghs Oct 27, 2020
842fc29
Review docstring of open_dataset function.
Oct 27, 2020
ff1181c
bugfix typo
aurghs Oct 29, 2020
bdcf0fe
- add check on backends signatures
aurghs Nov 2, 2020
61be8a8
- black isort
aurghs Nov 2, 2020
c0b290a
- add type declaration in plugins.py
aurghs Nov 4, 2020
c217031
Fix the type hint for ENGINES
alexamici Nov 6, 2020
8530ff0
Drop special case and simplify resolve_decoders_kwargs
alexamici Nov 6, 2020
73328ac
isort
alexamici Nov 6, 2020
154 changes: 97 additions & 57 deletions xarray/backends/apiv2.py
@@ -1,3 +1,4 @@
import inspect
import os

from ..core.utils import is_remote_uri
@@ -23,7 +24,7 @@ def dataset_from_backend_dataset(
chunks,
cache,
overwrite_encoded_chunks,
extra_tokens,
**extra_tokens,
):
if not (isinstance(chunks, (int, dict)) or chunks is None):
if chunks != "auto":
@@ -73,17 +74,34 @@ def dataset_from_backend_dataset(
# Ensure source filename always stored in dataset object (GH issue #2550)
if "source" not in ds.encoding:
if isinstance(filename_or_obj, str):
ds.encoding["source"] = filename_or_obj
ds2.encoding["source"] = filename_or_obj

return ds2


def resolve_decoders_kwargs(decode_cf, engine, **decoders):
signature = inspect.signature(ENGINES[engine]).parameters
if decode_cf is False:
for d in decoders:
if d in signature and d != "use_cftime":
Review comment (Member):
Do we need this special case d != "use_cftime"? Does it break any tests if we simply remove it?

(My guess is that the existing code may not bother to set use_cftime = False, but only because the value of use_cftime is ignored if decode_times = False.)

Reply (Collaborator, Author):
I have forgotten to check it. You are right, we can remove it.

decoders[d] = decode_cf
return {k: v for k, v in decoders.items() if v is not None}
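
As the review thread above suggests, the special case can simply be dropped, which is what the later commit "Drop special case and simplify resolve_decoders_kwargs" does. A minimal sketch of the simplified helper, assuming the same ENGINES mapping from plugins.py; this is an illustration, not the exact final code:

import inspect

def resolve_decoders_kwargs(decode_cf, engine, **decoders):
    # Keyword parameters accepted by the selected backend's open function.
    signature = inspect.signature(ENGINES[engine]).parameters
    if decode_cf is False:
        # Disable every decoder the backend supports; no special case for use_cftime.
        for d in decoders:
            if d in signature:
                decoders[d] = False
    # Drop decoders left as None so the backend defaults still apply.
    return {k: v for k, v in decoders.items() if v is not None}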


def open_dataset(
filename_or_obj,
*,
engine=None,
chunks=None,
cache=None,
decode_cf=None,
mask_and_scale=None,
decode_times=None,
decode_timedelta=None,
use_cftime=None,
concat_characters=None,
decode_coords=None,
drop_variables=None,
backend_kwargs=None,
**kwargs,
):
@@ -94,70 +112,50 @@ def open_dataset(
filename_or_obj : str, Path, file-like or DataStore
Strings and Path objects are interpreted as a path to a netCDF file
or an OpenDAP URL and opened with python-netCDF4, unless the filename
ends with .gz, in which case the file is gunzipped and opened with
ends with .gz, in which case the file is unzipped and opened with
scipy.io.netcdf (only netCDF3 supported). Byte-strings or file-like
objects are opened by scipy.io.netcdf (netCDF3) or h5py (netCDF4/HDF).
group : str, optional
Path to the netCDF4 group in the given file to open (only works for
netCDF4 files).
decode_cf : bool, optional
Whether to decode these variables, assuming they were saved according
to CF conventions.
mask_and_scale : bool, optional
If True, replace array values equal to `_FillValue` with NA and scale
values according to the formula `original_values * scale_factor +
add_offset`, where `_FillValue`, `scale_factor` and `add_offset` are
taken from variable attributes (if they exist). If the `_FillValue` or
`missing_value` attribute contains multiple values a warning will be
issued and all array values matching one of the multiple values will
be replaced by NA. mask_and_scale defaults to True except for the
pseudonetcdf backend.
decode_times : bool, optional
If True, decode times encoded in the standard NetCDF datetime format
into datetime objects. Otherwise, leave them encoded as numbers.
autoclose : bool, optional
If True, automatically close files to avoid OS Error of too many files
being open. However, this option doesn't work with streams, e.g.,
BytesIO.
concat_characters : bool, optional
If True, concatenate along the last dimension of character arrays to
form string arrays. Dimensions will only be concatenated over (and
removed) if they have no corresponding variable and if they are only
used as the last dimension of character arrays.
decode_coords : bool, optional
If True, decode the 'coordinates' attribute to identify coordinates in
the resulting dataset.
engine : {"netcdf4", "scipy", "pydap", "h5netcdf", "pynio", "cfgrib", \
"pseudonetcdf", "zarr"}, optional
engine : str, optional
Engine to use when reading files. If not provided, the default engine
is chosen based on available dependencies, with a preference for
"netcdf4".
"netcdf4". Options are: {"netcdf4", "scipy", "pydap", "h5netcdf",\
"pynio", "cfgrib", "pseudonetcdf", "zarr"}.
chunks : int or dict, optional
If chunks is provided, it is used to load the new dataset into dask
arrays. ``chunks={}`` loads the dataset with dask using a single
chunk for all arrays. When using ``engine="zarr"``, setting
``chunks='auto'`` will create dask chunks based on the variable's zarr
chunks.
lock : False or lock-like, optional
Resource lock to use when reading data from disk. Only relevant when
using dask or another form of parallelism. By default, appropriate
locks are chosen to safely read and write files with the currently
active dask scheduler.
cache : bool, optional
If True, cache data loaded from the underlying datastore in memory as
If True, cache data is loaded from the underlying datastore in memory as
NumPy arrays when accessed to avoid reading from the underlying data-
store multiple times. Defaults to True unless you specify the `chunks`
argument to use dask, in which case it defaults to False. Does not
change the behavior of coordinates corresponding to dimensions, which
always load their data from disk into a ``pandas.Index``.
drop_variables: str or iterable, optional
A variable or list of variables to exclude from being parsed from the
dataset. This may be useful to drop variables with problems or
inconsistent values.
backend_kwargs: dict, optional
A dictionary of keyword arguments to pass on to the backend. This
may be useful when backend options would improve performance or
allow user control of dataset processing.
decode_cf : bool, optional
Setting ``decode_cf=False`` will disable ``mask_and_scale``,
``decode_times``, ``decode_timedelta``, ``concat_characters``,
``decode_coords``.
mask_and_scale : bool, optional
If True, array values equal to `_FillValue` are replaced with NA and other
values are scaled according to the formula `original_values * scale_factor +
add_offset`, where `_FillValue`, `scale_factor` and `add_offset` are
taken from variable attributes (if they exist). If the `_FillValue` or
`missing_value` attribute contains multiple values, a warning will be
issued and all array values matching one of the multiple values will
be replaced by NA. mask_and_scale defaults to True except for the
pseudonetcdf backend. This keyword may not be supported by all the backends.
decode_times : bool, optional
If True, decode times encoded in the standard NetCDF datetime format
into datetime objects. Otherwise, leave them encoded as numbers.
This keyword may not be supported by all the backends.
decode_timedelta : bool, optional
If True, decode variables and coordinates with time units in
{"days", "hours", "minutes", "seconds", "milliseconds", "microseconds"}
into timedelta objects. If False, they remain encoded as numbers.
If None (default), assume the same value of decode_time.
This keyword may not be supported by all the backends.
use_cftime: bool, optional
Only relevant if encoded dates come from a standard calendar
(e.g. "gregorian", "proleptic_gregorian", "standard", or not
@@ -167,12 +165,38 @@ def open_dataset(
``cftime.datetime`` objects, regardless of whether or not they can be
represented using ``np.datetime64[ns]`` objects. If False, always
decode times to ``np.datetime64[ns]`` objects; if this is not possible
raise an error.
decode_timedelta : bool, optional
If True, decode variables and coordinates with time units in
{"days", "hours", "minutes", "seconds", "milliseconds", "microseconds"}
into timedelta objects. If False, leave them encoded as numbers.
If None (default), assume the same value of decode_time.
raise an error. This keyword may not be supported by all the backends.
concat_characters : bool, optional
If True, concatenate along the last dimension of character arrays to
form string arrays. Dimensions will only be concatenated over (and
removed) if they have no corresponding variable and if they are only
used as the last dimension of character arrays.
This keyword may not be supported by all the backends.
decode_coords : bool, optional
If True, decode the 'coordinates' attribute to identify coordinates in
the resulting dataset. This keyword may not be supported by all the
backends.
drop_variables: str or iterable, optional
A variable or list of variables to exclude from the dataset parsing.
This may be useful to drop variables with problems or
inconsistent values.
backend_kwargs:
Additional keyword arguments passed on to the engine open function.
**kwargs: dict
Additional keyword arguments passed on to the engine open function.
For example:

- 'group': path to the netCDF4 group in the given file to open given as
a str, supported by "netcdf4", "h5netcdf", "zarr".

- 'lock': resource lock to use when reading data from disk. Only
relevant when using dask or another form of parallelism. By default,
appropriate locks are chosen to safely read and write files with the
currently active dask scheduler. Supported by "netcdf4", "h5netcdf",
"pynio", "pseudonetcdf", "cfgrib".

See engine open function for kwargs accepted by each specific engine.


Returns
-------
@@ -202,12 +226,25 @@ def open_dataset(
if engine is None:
engine = _autodetect_engine(filename_or_obj)

decoders = resolve_decoders_kwargs(
decode_cf,
engine=engine,
mask_and_scale=mask_and_scale,
decode_times=decode_times,
decode_timedelta=decode_timedelta,
concat_characters=concat_characters,
use_cftime=use_cftime,
decode_coords=decode_coords,
)

backend_kwargs = backend_kwargs.copy()
overwrite_encoded_chunks = backend_kwargs.pop("overwrite_encoded_chunks", None)

open_backend_dataset = _get_backend_cls(engine, engines=ENGINES)
backend_ds = open_backend_dataset(
filename_or_obj,
drop_variables=drop_variables,
**decoders,
**backend_kwargs,
**{k: v for k, v in kwargs.items() if v is not None},
)
@@ -218,7 +255,10 @@ def open_dataset(
chunks,
cache,
overwrite_encoded_chunks,
{**backend_kwargs, **kwargs},
drop_variables=drop_variables,
**decoders,
**backend_kwargs,
**kwargs,
)

return ds
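
For orientation, a minimal usage sketch of the keyword-only signature documented above. The module path matches this PR's experimental apiv2 module; the file name, group name, and chunk sizes are purely hypothetical, and h5netcdf plus dask are assumed to be installed:

from xarray.backends import apiv2

# Decoder flags are explicit keywords; engine-specific options such as `group`
# travel through **kwargs and are forwarded to the backend open function.
ds = apiv2.open_dataset(
    "example.nc",            # hypothetical NetCDF file
    engine="h5netcdf",
    chunks={"time": 10},     # hypothetical chunking along a "time" dimension
    decode_times=True,
    backend_kwargs={},
    group="forecast",        # hypothetical netCDF4 group, passed via **kwargs
)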
8 changes: 0 additions & 8 deletions xarray/backends/cfgrib_.py
@@ -76,7 +76,6 @@ def get_encoding(self):
def open_backend_dataset_cfgrib(
filename_or_obj,
*,
decode_cf=True,
mask_and_scale=True,
decode_times=None,
concat_characters=None,
@@ -93,13 +92,6 @@ def open_backend_dataset_cfgrib(
time_dims=("time", "step"),
):

if not decode_cf:
mask_and_scale = False
decode_times = False
concat_characters = False
decode_coords = False
decode_timedelta = False

store = CfGribDataStore(
filename_or_obj,
indexpath=indexpath,
8 changes: 0 additions & 8 deletions xarray/backends/h5netcdf_.py
@@ -328,7 +328,6 @@ def close(self, **kwargs):
def open_backend_dataset_h5necdf(
filename_or_obj,
*,
decode_cf=True,
mask_and_scale=True,
decode_times=None,
concat_characters=None,
@@ -343,13 +342,6 @@ def open_backend_dataset_h5necdf(
phony_dims=None,
):

if not decode_cf:
mask_and_scale = False
decode_times = False
concat_characters = False
decode_coords = False
decode_timedelta = False

store = H5NetCDFStore.open(
filename_or_obj,
format=format,
8 changes: 0 additions & 8 deletions xarray/backends/zarr.py
@@ -684,7 +684,6 @@ def open_zarr(

def open_backend_dataset_zarr(
filename_or_obj,
decode_cf=True,
mask_and_scale=True,
decode_times=None,
concat_characters=None,
@@ -700,13 +699,6 @@ def open_backend_dataset_zarr(
chunk_store=None,
):

if not decode_cf:
mask_and_scale = False
decode_times = False
concat_characters = False
decode_coords = False
decode_timedelta = False

store = ZarrStore.open_group(
filename_or_obj,
group=group,
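
All three backend open functions above drop the local `if not decode_cf:` shortcut; it is now resolved once in apiv2.open_dataset via resolve_decoders_kwargs. A hedged illustration of the intended equivalence described in the docstring, with a hypothetical file name:

from xarray.backends import apiv2

# With the shortcut handled centrally, these two calls are meant to be equivalent:
ds_a = apiv2.open_dataset("example.nc", engine="h5netcdf", decode_cf=False, backend_kwargs={})
ds_b = apiv2.open_dataset(
    "example.nc",
    engine="h5netcdf",
    mask_and_scale=False,
    decode_times=False,
    decode_timedelta=False,
    concat_characters=False,
    decode_coords=False,
    backend_kwargs={},
)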