Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

disallow boolean coordinates? #4892

Open
mathause opened this issue Feb 11, 2021 · 2 comments
Open

disallow boolean coordinates? #4892

mathause opened this issue Feb 11, 2021 · 2 comments

Comments

@mathause
Copy link
Collaborator

mathause commented Feb 11, 2021

Today I stumbled over a small pitfall, which I think could be avoided:

I am working with arrays that have axes labeled with categorical values and I ended up using True/False as labels for some binary categories:

test = xarray.DataArray(
     numpy.ones((3,2)), 
     dims=["binary","ternary"],
     coords={"ternary":[3,7,9],"binary":[False,True]}
)

now came the big surprise, when I wanted to reduce over selections of the data:

test.sel(ternary=[9,3,7]) # does exactly what I expect and gives me the correctly permuted 3x2 array
test.sel(binary=[True,False]) # does not do what I expect

Instead of using the coordinate values like with the ternary category, it uses the list as boolean mask and hence I get a 3x1 array at the binary=False coordinate.

I assume that this behavior is reasonable in most cases - And I for sure will stop using bools as binary category labels.
That said in the above case the conceptually identical call results in completely different outcome.

My (radical) proposal would be: forbid binary coordinates in general to avoid such confusion.

Curious about your thoughts! Hth,

Marti

Originally posted by @martinitus in #4861

@max-sixty
Copy link
Collaborator

It's a good issue!

I would have thought .sel should default to using labels. I recognize that boolean indexing is really helpful — and think we should try and support it more (e.g. #1887).

And I recognize .sel delegates to .isel in some cases — e.g. where there are no indexes on a dimension.

But here, I propose we prioritize the labels above the boolean indexing in .sel. People can still use .isel if they want bool indexing in the method call syntax:

In [4]: test = xr.DataArray(
   ...:      np.ones((3,2)),
   ...:      dims=["ternary","binary"],
   ...:      coords={"ternary":[3,7,9],"binary":[False,True]}
   ...: )

In [5]: test.sel(binary=[True,False])
Out[5]:
<xarray.DataArray (ternary: 3, binary: 1)>
array([[1.],
       [1.],
       [1.]])
Coordinates:
  * ternary  (ternary) int64 3 7 9
  * binary   (binary) bool False

In [6]: test.isel(binary=[True,False])
Out[6]:
<xarray.DataArray (ternary: 3, binary: 1)>
array([[1.],
       [1.],
       [1.]])
Coordinates:
  * ternary  (ternary) int64 3 7 9
  * binary   (binary) bool False

I would also be OK removing the .sel delegation to .isel if it causes complications. IIUC the main use case is a single method call to index labels and dimensions without labels. But that's easy to replace with two method calls, I think?

@martinitus
Copy link

I don't know the internals of delegation between .sel and .isel. But from the user side I would expect that boolean indexing requires me to use .isel naturally. I mean, I have to provide a boolean mask that fits the shape of the array, i.e. it is naturally index based and should only be used with .isel irrespective of the coordinate types.

While that probably be a breaking change for some people, I think it makes a quite complicated topic slightly easier to document, and figure out intentions in written code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants