Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datasets more robust to non-string keys #2174

Merged
merged 16 commits into from
May 27, 2018
Merged

Conversation

max-sixty
Copy link
Collaborator

@max-sixty max-sixty commented May 23, 2018

I don't think this is the most efficient way of doing this, though it does work. Any ideas for a more efficient implementation?

@@ -8,6 +8,7 @@ testpaths=xarray/tests
[flake8]
max-line-length=79
ignore=
W503
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is line break before operator, which has been changed in PEP8

try:
is_coord_key = any([
isinstance(key, basestring),
key in set(self.dims).union(self.coords)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would hesitate to do this fix -- it means array[2] could pull out a coordinate if 2 is the name of a dimension. More generally, DataArray.__getitem__ is only meant as a convenient short-cut for getting coordinates, which we might even want to deprecate entirely. DataArray.coords.__getitem__ is the reliable way to get a coordinate out.

Instead, I think the better option would be to avoid making use of ** unpacking below, in the expression self.isel(**self._item_key_to_dict(key)). This would be a little more involved, but I think the right fix would be to make most of these functions accept a positional dictionaries instead of **kwargs, as suggested in #1231.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, completely agree, thanks

@max-sixty
Copy link
Collaborator Author

I added an initial attempt at allowing for supplying a dict; on isel only at first

Let me know ppl's thoughts:

  • Arg name?
  • Combine or raise if supplied with kwargs?
  • Is there an encapsulation leak? Should we have a user API layer which is then translated to a more robust representation (i.e. kwargs to a dict), and those be separate methods? My initial inclination is that the current approach is good and managable

@max-sixty
Copy link
Collaborator Author

DataArray.__getitem__ is only meant as a convenient short-cut for getting coordinates, which we might even want to deprecate entirely

👍

@max-sixty
Copy link
Collaborator Author

@shoyer I see you're online - let me know if you have any thoughts re the changes. Am on a burst of open-source work atm

@shoyer
Copy link
Member

shoyer commented May 25, 2018

Arg name?

I'd like a convention of matching names to make it clear in docstrings that these are the same thing, e.g.,

  • something_dict/**something, or
  • something/**something_kwargs

Probably the last is best since users will never have cause to type the longer argument name ending with _kwargs.

In practice, I suspect these first arguments are almost always going to be used positionally. We might even imagine making them positional only argument if PEP 570 is ever accepted. So I don't think it matters too much.

Combine or raise if supplied with kwargs?

Let's raise (like your current implementation) if both are provided. Combining could be error prone and seems unnecessarily complicated.

This is the same logic I implemented in the somewhat misleadingly named utils.combine_pos_and_kw_args() utility function (used by Dataset.reindex()).

Is there an encapsulation leak? Should we have a user API layer which is then translated to a more robust representation (i.e. kwargs to a dict), and those be separate methods? My initial inclination is that the current approach is good and managable

I agree -- I think the current version is manageable.

For internal use, we should always use the positional argument, but there's not much harm in leaving **kwargs around.

@max-sixty
Copy link
Collaborator Author

sel, isel & reindex are done

sel_points etc are deprecated so we can leave those

Any others on the 'first priority' list?

@max-sixty
Copy link
Collaborator Author

AppVeyor failure unrelated

@@ -1404,12 +1406,16 @@ def isel(self, drop=False, **indexers):
Dataset.sel
DataArray.isel
"""

indexers = combine_pos_and_kw_args(indexers, indexers_kwargs, 'isel')
assert isinstance(drop, bool)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should either be dropped for now, or raise TypeError. assert isn't appropriate for external APIs.

"""Conform this object onto a new set of indexes, filling in
missing values with NaN.

Parameters
----------
**indexers : dict
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't start with **

@@ -18,7 +18,9 @@
from .formatting import format_item
from .options import OPTIONS
from .pycompat import OrderedDict, basestring, iteritems, range, zip
from .utils import decode_numpy_dict_values, ensure_us_time_resolution
from .utils import (
combine_pos_and_kw_args, decode_numpy_dict_values,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're at it, maybe we should rename combine_pos_and_kw_args to something clearer, maybe either_dict_or_kwargs?

values will be filled in with NaN, and any mis-matched dimension
names will simply be ignored.
**indexers_kwargs : {dim: indexer, ...}
The keyword arguments form of ``indexers``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add something like: Only indexers or indexer_kwargs may be provided.

@shoyer
Copy link
Member

shoyer commented May 25, 2018

Looking through Dataset/DataArray methods that support **kwargs, I think this may be the full list:

  • stack
  • shift
  • roll
  • set_index
  • reorder_levels
  • rolling
  • resample (not yet, we still support old behavior for the first positional arguments with a warning)

@shoyer
Copy link
Member

shoyer commented May 25, 2018

But I think just supporting this for isel/sel/reindex for now would be enough to be valuable. Note that currently it's on Dataset.reindex() but not DataArray.reindex().

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me!

@@ -1444,6 +1452,15 @@ def sel(self, method=None, tolerance=None, drop=False, **indexers):

Parameters
----------
----------
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete

@@ -183,7 +183,7 @@ def is_full_slice(value):
return isinstance(value, slice) and value == slice(None)


def combine_pos_and_kw_args(pos_kwargs, kw_kwargs, func_name):
def either_dict_or_kwargs(pos_kwargs, kw_kwargs, func_name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to test every use, but we should at least add unit test for this helper function -- I don't think we have any test coverage currently.

@shoyer shoyer merged commit 8470500 into pydata:master May 27, 2018
@shoyer
Copy link
Member

shoyer commented May 27, 2018

thanks @maxim-lian !

@max-sixty
Copy link
Collaborator Author

thanks for working through the feedback, as ever, @shoyer !
It's like free coding classes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants