Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update signature open_dataset for API v2 #4547

Merged
merged 54 commits into from
Nov 6, 2020
Merged

Update signature open_dataset for API v2 #4547

merged 54 commits into from
Nov 6, 2020

Conversation

aurghs
Copy link
Collaborator

@aurghs aurghs commented Oct 28, 2020

Proposal for the new API of open_dataset(). It is implemented in apiv2.py and it doesn't modify the current behavior of api.open_dataset().
It is something in between the first and second alternative suggested at #4490 (comment), see the related quoted text:

Describe alternatives you've considered

For the overall approach:

  1. We could keep the current design, with separate keyword arguments for decoding options, and just be very careful about passing around these arguments. This seems pretty painful for the backend refactor, though.
  2. We could keep the current design only for the user facing open_dataset() interface, and then internally convert into the DecodingOptions() struct for passing to backend constructors. This would provide much needed flexibility for backend authors, but most users wouldn't benefit from the new interface. Perhaps this would make sense as an intermediate step?

Instead of a class for the decoders, I have added a function: resolve_decoders_kwargs.
resolve_decoders_kwargs performs two tasks:

  • If decode_cf is False, it sets to False all the decoders supported by the backend (using inspect).
  • It filters out the None decoder keywords.

So xarray manages the keyword decode_cf and passes on only the non-default decoders to the backend. If the user sets to a non-None value a decoder not supported by the backend, the backend will raise an error.

With this implementation drop_variable should be always supported by the backend. But I think this could be implemented easely by all the backends. I wouldn't group it with the decoders: to me, it seems to be more a filter rather than a decoder.

The behavior decode_cf is unchanged.

PRO:

  • the user doesn't need to import and instantiate a class.
  • users get the argument completion on open_dataset.
  • the backend defines directly in open_backend_dataset_${engine} API the accepted decoders.
  • xarray manages decode_cf, not the backends.

Missing points:

  • deccode_cf should be renamed decode. Probably, the behavior of decode should be modified for two reason:
    • currently If decode_cf is False, it sets the decoders to False, but there is no check on the other values. The accepted values should be: None (it keeps decoders default values), True (it sets all the decoders to True), False (it sets all the decoders to False).
    • currently we can set both a decoder and decode_cf without any warning. , but the
  • Deprecate backend_kwargs (or kwargs).
  • Separate mask_and_scale?

I think that we need a different PR for the three of them.

aurghs and others added 30 commits September 25, 2020 19:07
# Conflicts:
#	xarray/backends/api.py
…ad-refactor

# Conflicts:
#	xarray/backends/apiv2.py
- to be used in apiv2 without instantiate the object
- modify signature
- move default setting inside backends
signature = inspect.signature(ENGINES[engine]).parameters
if decode_cf is False:
for d in decoders:
if d in signature and d != "use_cftime":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this special case d != "use_cftime"? Does it break any tests if we simply remove it?

(My guess is that the existing code may not bother to set use_cftime = False, but only because the value of use_cftime is ignored if decode_times = False.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have forgotten to check it. You are right, we can remove it.

Comment on lines +222 to 245
decoders = resolve_decoders_kwargs(
decode_cf,
engine=engine,
mask_and_scale=mask_and_scale,
decode_times=decode_times,
decode_timedelta=decode_timedelta,
concat_characters=concat_characters,
use_cftime=use_cftime,
decode_coords=decode_coords,
)

backend_kwargs = backend_kwargs.copy()
overwrite_encoded_chunks = backend_kwargs.pop("overwrite_encoded_chunks", None)

open_backend_dataset = _get_backend_cls(engine, engines=ENGINES)
open_backend_dataset = _get_backend_cls(engine, engines=plugins.ENGINES)[
"open_dataset"
]
backend_ds = open_backend_dataset(
filename_or_obj,
drop_variables=drop_variables,
**decoders,
**backend_kwargs,
**{k: v for k, v in kwargs.items() if v is not None},
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For completeness, let me mention an alternative design that would not need inspect.signature. Instead of using a single function open_backend_dataset, we could use a "Loader" class with two separate methods, one for decoding a dataset and another for returning a raw dataset, e.g.,

backend_cls = _get_backend_cls(engine)
loader = backend_loader(filename_or_obj, **backend_kwargs)
if decode:
    ds = loader.load_decoded(**decoders)
else:
    ds = loader.load_raw()

Is this better than the current approach in this PR (using inspect), or the current approach (on master) of calling the single open_backend_dataset() function with decode_cf=False? Honestly I'm not sure. It is most explicit, but also probably a little more annoying to write.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that in this way we are going to complicate the interface without a real advantage. But I'm not sure about it. @alexamici, @jhamman what do you think about it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoyer the main reason we proposed the use inspect is to raise an appropriate error message when a backend doesn't support a specific decoding option.

Your proposal simplifies the mangling of the decode options, but I'd still use inspect in the same as we do now before calling load_decode.

Personally I'd favour keeping the current implementation.

Comment on lines +27 to +30
raise TypeError(
f'All the parameters in {engine["open_dataset"]!r} signature should be explicit. '
"*args and **kwargs is not supported"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it correctly implements my suggestion from last week 👍 .

I don't know how annoying backend authors would find this restriction to be.

Copy link
Collaborator

@alexamici alexamici Nov 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current implementation a backend developer can add the "signature" key explicitly to the engine dictionary to skip running the inspection.

This allows also providing a "open_dataset" as a C-function.

I like the current implementation.

@shoyer
Copy link
Member

shoyer commented Nov 5, 2020

As discussed, let's stick with the current PR

@aurghs @alexamici you should have "merge" permissions now!

@alexamici alexamici changed the title Update signature open dataset Update signature open_dataset for API v2 Nov 6, 2020
@alexamici
Copy link
Collaborator

I fixed the type ints and the two test failures are unrelated to the changes.

I'm going to test my shiny new merge rights :)

@alexamici alexamici merged commit ba989f6 into pydata:master Nov 6, 2020
Comment on lines +20 to +30
if "signature" not in engine:
parameters = inspect.signature(engine["open_dataset"]).parameters
for name, param in parameters.items():
if param.kind in (
inspect.Parameter.VAR_KEYWORD,
inspect.Parameter.VAR_POSITIONAL,
):
raise TypeError(
f'All the parameters in {engine["open_dataset"]!r} signature should be explicit. '
"*args and **kwargs is not supported"
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a lot of indentation! I think it might be a bit easier to read if we get rid of one level of indentation using

    if "signature" in engine:
        continue

nothing urgent, though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are completely right!

There is some style fix that I really need to do. I will do a new pull request.

@aurghs aurghs deleted the change-signature-open_dataset branch February 11, 2021 01:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants