Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dimensions: type as str | Iterable[Hashable]? #6142

Closed
mathause opened this issue Jan 5, 2022 · 16 comments
Closed

dimensions: type as str | Iterable[Hashable]? #6142

mathause opened this issue Jan 5, 2022 · 16 comments
Labels
API design plan to close May be closeable, needs more eyeballs topic-typing

Comments

@mathause
Copy link
Collaborator

mathause commented Jan 5, 2022

What happened?

We generally type dimensions as:

dims: Hashable | Iterable[Hashable]

However, this is in conflict with passing a tuple of independent dimensions to a method - e.g. da.mean(("x", "y")) because a tuple is also hashable.

Also mypy requires an isinstance(dims, Hashable) check when typing a function. We use an isinstance(dims, str) check in many places to wrap a single dimension in a list. Changing this to isinstance(dims, Hashable) will change the behavior for tuples.

What did you expect to happen?

In the community call today we discussed to change this to

dims: str | Iterable[Hashable]

i.e. if a single dim is passed it has to be a string and wrapping it in a list is a convenience function. Special use cases with Hashable types should be wrapped in a Iterable by the user. This probably best reflects the current state of the repo (dims = [dims] if isinstance(dims, str) else dims).

The disadvantage could be that it is a bit more difficult to explain in the docstrings?

@shoyer - did I get this right from the discussion?


Other options

  1. Require str as dimension names.

This could be too restrictive. @keewis mentioned that tuple dimension names are already used somwehere in the xarray repo. Also we discussed in another issue or PR (which I cannot find right know) that we want to keep allowing Hashable.

  1. Disallow passing tuples (only allow tuples if a dimension is a tuple), require lists to pass several dimensions.

This is too restrictive in the other direction and will probably lead to a lot of downstream troubles. Naming a single dimension with a tuple will be a very rare case, in contrast to passing several dimension names as a tuple.

  1. Special case tuples. We could potentially check if dims is a tuple and if there are any dimension names consisting of a tuple. Seems more complicated and potentially brittle for probably small gains (IMO).

Minimal Complete Verifiable Example

No response

Relevant log output

No response

Anything else we need to know?

  • We need to check carefully where general Hashable are really allowed. E.g. dims of a DataArray are typed as

dims: Union[Hashable, Sequence[Hashable], None] = None,

but tuples are not actually allowed:

import xarray as xr
xr.DataArray([1], dims=("x", "y"))
# ValueError: different number of dimensions on data and dims: 1 vs 2
xr.DataArray([1], dims=[("x", "y")])
# TypeError: dimension ('x', 'y') is not a string
  • We need to be careful typing functions where only one dim is allowed, e.g. xr.concat, which should probably set dim: Hashable (and make sure it works).
  • Do you have examples for other real-world hashable types except for str and tuple? (Would be good for testing purposes).

Environment

N/A

@mathause mathause added bug needs triage Issue that has not been reviewed by xarray team member API design design question topic-typing and removed bug needs triage Issue that has not been reviewed by xarray team member design question labels Jan 5, 2022
@mathause mathause changed the title Dimension names: Hashable vs. tuples dimensions: type as str | Iterable[Hashable]? Jan 5, 2022
@shoyer
Copy link
Member

shoyer commented Jan 5, 2022

@shoyer - did I get this right from the discussion?

Yes, this was my suggestion.

@max-sixty
Copy link
Collaborator

Thanks for writing this out. I realize something clever about the proposal:

Our very first line of code in the docs passes dims as a tuple dims=("x", "y"): https://xarray.pydata.org/en/stable/getting-started-guide/quick-overview.html#create-a-dataarray

This will still work as two dims, since having str | Iterable[Hashable] rather than Hashable | Iterable[Hashable] means it will fail the first check. So if someone really wants a single dim as ("x", "y"), they can pass (("x", "y"),).

Does this mean that this would be confusion free?

@mathause
Copy link
Collaborator Author

mathause commented Jan 7, 2022

This will still work as two dims, since having str | Iterable[Hashable] rather than Hashable | Iterable[Hashable] means it will fail the first check. So if someone really wants a single dim as ("x", "y"), they can pass (("x", "y"),).

Yes, I think that is the gist of the proposal. I also think that it is quite elegant.

Does this mean that this would be confusion free?

I think it would be relatively unambiguous but not necessarily entirely confusion free for the user ("Why is the type of a single dim more restricted than of several dimensions?"). Maybe this warrants an entry in our FAQ or somewhere? Also we need to go over our code and update our typing and make sure stuff like xr.DataArray([1], dims=[("x", "y")] works as proposed here.

@TomNicholas
Copy link
Member

We need to be careful typing functions where only one dim is allowed, e.g. xr.concat, which should probably set dim: Hashable (and make sure it works).

Sorry, I don't understand this - if I want to type hint concat so that no-one can pass more than one dim, isn't my only option to type hint it as str? Because if I type hint it as Hashable they could pass ("x", "y"), which is two dims? Maybe I'm not following properly...

@mathause
Copy link
Collaborator Author

mathause commented Jan 7, 2022

I agree this is confusing, what is meant here is to use a tuple as the name for one dimension. An example:

ds = xr.Dataset({"x": ([("a", "b")], [1])})
print(ds)
print(f"{ds.x.dims = }")
<xarray.Dataset>
Dimensions:  (('a', 'b'): 1)
Dimensions without coordinates: ('a', 'b')
Data variables:
    x        (('a', 'b')) int64 1

ds.x.dims = (('a', 'b'),)

I found the issue again where we discussed that we want to keep dim: Hashable: #4821

@headtr1ck
Copy link
Collaborator

In case you are missing a use case for tuples as dimension/variable names:

We have a system with two detectors, lets call them "a" and "b".
Both take 2D images with dimensions "x" and "y" but different sizes.

Now instead of calling the dimensions "x_a", "y_a", "x_b", "y_b" we call them ("x", "a") etc.
This makes looking for dimensions etc. more consistent (Even better with a custom hashable dimension class).

@headtr1ck
Copy link
Collaborator

I was just looking through the code a bit and I must say that the usage of dimension arguments is far from harmonized in the code....
Apart from arguments randomly being called dim or dims, there are so many inconsistencies that I even wonder if it is possible to use anything but strings in "production".

  • some parts of the code claim that hashable is allowed but then check explicitly for str else tuple(dims) which would straight up fail with a custom hashable class.
  • Some parts do the exact opposite, they check for iterable and not str first.

At least internally most objects use tuple[Hashable, ...] for dims.

The lazy shortcut to allow a single str is making this more complicated than one might think (but I have to admit, I use it all the time).

@max-sixty
Copy link
Collaborator

there are so many inconsistencies that I even wonder if it is possible to use anything but strings in "production".

I would have thought this is indeed the case. I think we're quite open to adjusting this so that it works, but it would probably needs some work, including defining common patterns — e.g. the title of this issue is a good approach

Apart from arguments randomly being called dim or dims

I also think we'd be up for reforming this! Generally this is the first arg and so only used positionally, but it still makes sense to have them consistent.

@headtr1ck
Copy link
Collaborator

headtr1ck commented May 19, 2022

Only remotely related to this issue, but would it help to type the dimension (or variable names etc) Mappings with covariant Hashable keys to prevent the issue that {"a":1} is incompatible with Mapping[Hashable, Any]?

Something like:

Hashable_co = TypeVar("Hashable_co", bound=Hashable, covariant=True)
Mapping[Hashable_co, Any]

Same for lists etc. (Or use Sequences which are already covariant, as opposed to lists)

I am not an expert on variance in types and don't know if this would break things, but at least in my tests mypy did not complain when using dict[str, str].

See https://mypy-play.net/?mypy=latest&python=3.10&gist=b2c27a5ca4aad9e8ce63cd74dc4032b0

@max-sixty
Copy link
Collaborator

Only remotely related to this issue, but would it help to type the dimension (or variable names etc) Mappings with covariant Hashable keys to prevent the issue that {"a":1} is incompatible with Mapping[Hashable, Any]?

I spent a non-trivial amount of time on this a year ago or so — my understanding is that Mapping was not covariant in its keys, exactly as you point out.

But I'm also not sure we lose anything with Mapping[Any, foo] — the key needs to be Hashable anyway. I don't have a strong objection, but do we gain anything?

e.g. https://mypy-play.net/?mypy=latest&python=3.10&gist=a3ea9f6fce5a678803bd7563a54cd9ca

See https://mypy-play.net/?mypy=latest&python=3.10&gist=b2c27a5ca4aad9e8ce63cd74dc4032b0

Very cool! I didn't even know about this.

@headtr1ck
Copy link
Collaborator

But I'm also not sure we lose anything with Mapping[Any, foo] — the key needs to be Hashable anyway. I don't have a strong objection, but do we gain anything?

Wow, totally forgot about this. Then you are right, Any as key is totally ok.

Although if I am not mistaken, one could define a custom Mapping class that allows non-hashable keys, this is quite a niche use case and there are more urgent things to type, like this issue.

@headtr1ck
Copy link
Collaborator

I have encountered another problem that is related to this:
DataArray.argmins dim argument is: Hashable | Sequence[Hashable] | None, where None is a special case that will apply over all dimensions.

So I tried to overload this:

    @overload
    def argmin(
        self,
        dim: Hashable,
        *,
        axis: int | None = None,
        keep_attrs: bool | None = None,
        skipna: bool | None = None,
    ) -> DataArray:
        ...

    @overload
    def argmin(
        self,
        dim: Sequence[Hashable] | None = None,
        axis: int | None = None,
        keep_attrs: bool | None = None,
        skipna: bool | None = None,
    ) -> dict[Hashable, DataArray]:
        ...

but mypy complains:

Overloaded function signatures 1 and 2 overlap with incompatible return types [misc]

I suspect that this is due to the fact that None is actually Hashable and therefore a valid dimension.

Unfortunately, using this proposal of str | Sequence[Hashable] won't work since str is actually a Sequence of Hashables... (I know that this proposal is meant for functions that always expect multiple dims and the str is only a lazy mans shortcut)

@max-sixty
Copy link
Collaborator

Ah, that's a good case @headtr1ck . So str | Iterable[Hashable] works when one is checked before the other, but not when they're required to be disjoint, like an overload.

I don't have an immediate solution in mind unfortunately. I'm not sure having different return types is a great design in general, though I can see the logic for it in this case...

@headtr1ck
Copy link
Collaborator

See python/typing#256 as reference for this problem.

@max-sixty
Copy link
Collaborator

Should we close?

I think this is mostly done. I see it as a big success!

Except for the overload case, which I don't think is actionable atm, is there anything remaining here?

@max-sixty max-sixty added the plan to close May be closeable, needs more eyeballs label Sep 25, 2024
@headtr1ck
Copy link
Collaborator

I think even the overload case is widely accepted to just type ignore it.

The next step would be to become generic in :)

We can close this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API design plan to close May be closeable, needs more eyeballs topic-typing
Projects
None yet
Development

No branches or pull requests

5 participants