sel along 1D non-index coordinates #3925

TomNicholas · 2020-04-02T02:23:56Z

As a user, I find not being able to select along one-dimensional non-dimensional coordinates actually comes up fairly often. I think it's quite common to use multiple coordinates to be able to choose between plotting in different coordinate systems (or units) easily.

I've tried to close #2028 in the simplest (but also least efficient) way which was suggested by @shoyer (suggestion 1 here).

This should be temporary anyway: it will get superseded by the explicit indexes refactor. If there is another approach which would achieve the same functionality as this PR but actually bring us closer to #1603 then I would be happy to take a stab at that instead.

I don't really know what to do about the failing test in groupby arithmetic - I think it's caused here but I'm not sure what to replace the triple error type catching (?!) with.

Closes slice using non-index coordinates #2028
Tests added
Passes isort -rc . && black . && mypy . && flake8
Fully documented, including whats-new.rst for all changes and api.rst for new API

pep8speaks · 2020-04-02T02:24:04Z

Hello @TomNicholas! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file xarray/core/indexing.py:

Line 219:12: E131 continuation line unaligned for hanging indent
Line 220:14: E131 continuation line unaligned for hanging indent
Line 223:10: E111 indentation is not a multiple of four
Line 223:10: E117 over-indented

max-sixty · 2020-04-02T17:14:41Z

I think the functionality is useful, but I'm concerned it would make the API more confusing. In particular, one maxim we've generally held is that changing the number of dimensions doesn't change functionality. (ofc it changes whether it's possible to create an index, though the indexing API doesn't change)

What are others' thoughts?

TomNicholas · 2020-04-02T17:44:39Z

In particular, one maxim we've generally held is that changing the number of dimensions doesn't change functionality.

Sorry, could you explain what you mean here? How would this PR violate that?

max-sixty · 2020-04-02T18:01:11Z

Sorry, could you explain what you mean here? How would this PR violate that?

Of course. I mean that if someone changes the dimensionality of a non-index coord from 1D to 2D, then running .sel over it will stop working.

Whereas now, neither would work. Which in some ways is worse, but also less surprising. I think these are always difficult trade-offs...

TomNicholas · 2020-04-02T18:08:32Z

Gotcha, thanks.

Hmm, this seems like a grey area... I think selecting along 1D non-dimension coords is probably way more common than along 2D coords, but I'm biased.

changing the number of dimensions doesn't change functionality

BTW I can immediately think of at least one other place where this rule is broken (#3774) - where going from 0D coords to 1D coords of length 1 changes whether combine_by_coords will accept them.

shoyer · 2020-04-05T19:30:33Z

xarray/core/indexing.py

+            if new_idx is not None:
+                new_indexes[dim] = new_idx
+
+        elif dim in data_obj.coords and len(data_obj.coords[dim] == 1):


I think this should be data_obj.coords[dim].ndim == 1 instead of len(data_obj.coords[dim] == 1)? The later is checking that the coordinate's first dimension has length 1.

shoyer · 2020-04-05T19:31:43Z

xarray/core/indexing.py

+            coords_dtype = data_obj.coords[dim].dtype
+            label = maybe_cast_to_coords_dtype(label, coords_dtype)
+            idxr, new_idx = convert_label_indexer(index, label, dim,
+                                                  method, tolerance)
+            pos_indexers[dim] = idxr
+            if new_idx is not None:
+                new_indexes[dim] = new_idx


Can you refactor so this code block doesn't need to be repeated?

Every index is a 1D coordinate, so I think you could do this by switching around the order of these if checks, e.g.,

if dim in data_obj.coords and data_obj.coords[dim].ndim == 1: if dim in data_obj.indexes: index = data_obj.indexes[dim] else: index = data_obj.coords[dim].to_index() (dim,) = data_obj[dim].dims coords_dtype = data_obj.coords[dim].dtype label = maybe_cast_to_coords_dtype(label, coords_dtype) idxr, new_idx = convert_label_indexer(index, label, dim, method, tolerance) pos_indexers[dim] = idxr if new_idx is not None: new_indexes[dim] = new_idx

shoyer · 2020-04-05T19:45:07Z

I think this is generally a good idea!

In the future, creating an index explicit would just mean:

Repeated lookups are more efficient, due to caching in a hash-table.
The coordinate values are immutable, to ensure that the index values can be cached.

One minor concern I have here is about efficiency: building an pd.Index and its hash-table from scratch can be quite expensive. If we're only doing a single lookup this is fine, but if it's being done in a loop this could be surprisingly slow, and we would do far better sticking with pure NumPy operations.

Here's an microbenchmark that hopefully illustrates the issue:

import pandas as pd
import numpy as np

def lookup_preindexed(needle, index):
  return index.get_loc(needle)

def lookup_newindex(needle, haystack):
  return lookup_preindexed(needle, pd.Index(haystack))

def lookup_numpy(needle, haystack):
  return (haystack == needle).argmax()

haystack = np.random.permutation(np.arange(1000000))
index = pd.Index(haystack)
%timeit lookup_newindex(0, haystack)  # 56.1 ms per loop
%timeit lookup_preindexed(0, index)  # 696 ns per loop
%timeit lookup_numpy(0, haystack)  # 517 µs per loop

pandas is 1000x faster than NumPy if the index is pre-existing, but 100x slower if the index is new. That's a 1e5 fold slow-down!

I think users will appreciate the flexibility, but if there's some way we warn users that they really should set the index ahead of time when they are doing repeating indexing that could also be welcome. Figuring out how to save the state for counting the number of times a new index is created could be pretty messy, though. I guess we could stuff it into Variable.encoding and issue a warning whenever the same variable has been converted into an index at least 100 times.

shoyer · 2020-04-05T19:53:44Z

Related to my microbenchmark, it might also be worth considering pure NumPy versions of common indexing operations, to avoid the need to repeatedly create hash-tables. But that could be quite a bit of work to do comprehensively.

max-sixty · 2020-04-06T14:14:34Z

OK! May be a sizeable change but I update on @shoyer 's view, let's do it.

benbovy · 2021-03-09T16:23:17Z

pandas is 1000x faster than NumPy if the index is pre-existing, but 100x slower if the index is new. That's a 1e5 fold slow-down!

I think users will appreciate the flexibility, but if there's some way we warn users that they really should set the index ahead of time when they are doing repeating indexing that could also be welcome.

I think it's a good use case for some kind of EphemeralIndex (or BasicIndex or NumpyIndex) once the explicit index refactoring is done, along with some good documentation on which index to choose for which purpose.

benbovy · 2021-09-23T10:37:07Z

In #5692 it is possible to perform selection using non-dimension coordinates with an index, although there's no easy way yet to set an index for such coordinates (this will be done in a follow-up PR by updating the API of set_index).

TomNicholas · 2021-09-23T18:20:50Z

FYI I probably am not going to work on this PR again - especially as I remember getting quite confused by the indexing internals, which have now changed. If we can just wait for this feature to be enabled via the indexing refactor then I would rather just do that. Perhaps I should close this?

ivanakcheurov · 2022-06-05T20:18:26Z

In #5692 it is possible to perform selection using non-dimension coordinates with an index, although there's no easy way yet to set an index for such coordinates (this will be done in a follow-up PR by updating the API of set_index).

@benbovy, please could you give an example how it is possible?
I would like sel based on a non-dim coordinate to be as fast as sel based on the dim itself as per the following timings:

# sel based on a non-dim coordinate 
# (using this coordinate directly .sel(product_id=26) would result in error "'no index found for coordinate product_id")
%timeit xds.sel(product=xds.product_id==26)
1.54 ms ± 64.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# sel based on the dim itself
%timeit xds.sel(product='GN91 Glove Medium')
499 µs ± 16.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit xds.where(xds.product_id==26, drop=True)
4.17 ms ± 39 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

mathause · 2022-09-07T14:31:58Z

With #6971 on the way I guess we can close this.

TomNicholas added 2 commits April 2, 2020 01:18

sel handles 1D non-dimension coords

5b037e5

basic test of selecting slices with 1d nonindex coord

558721e

TomNicholas added topic-indexing enhancement labels Apr 2, 2020

shoyer reviewed Apr 5, 2020

View reviewed changes

dcherian mentioned this pull request Apr 15, 2020

slice using non-index coordinates #2028

Closed

dcherian mentioned this pull request May 5, 2020

0.16.0 release #4031

Closed

23 tasks

mathause closed this Sep 7, 2022

benbovy mentioned this pull request Sep 7, 2022

Add documentation on custom indexes #6975

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sel along 1D non-index coordinates #3925

sel along 1D non-index coordinates #3925

TomNicholas commented Apr 2, 2020

pep8speaks commented Apr 2, 2020

max-sixty commented Apr 2, 2020

TomNicholas commented Apr 2, 2020

max-sixty commented Apr 2, 2020

TomNicholas commented Apr 2, 2020

shoyer Apr 5, 2020

shoyer Apr 5, 2020

shoyer commented Apr 5, 2020

shoyer commented Apr 5, 2020

max-sixty commented Apr 6, 2020

benbovy commented Mar 9, 2021

benbovy commented Sep 23, 2021

TomNicholas commented Sep 23, 2021 •

edited

Loading

ivanakcheurov commented Jun 5, 2022

mathause commented Sep 7, 2022

sel along 1D non-index coordinates #3925

sel along 1D non-index coordinates #3925

Conversation

TomNicholas commented Apr 2, 2020

pep8speaks commented Apr 2, 2020

max-sixty commented Apr 2, 2020

TomNicholas commented Apr 2, 2020

max-sixty commented Apr 2, 2020

TomNicholas commented Apr 2, 2020

shoyer Apr 5, 2020

Choose a reason for hiding this comment

shoyer Apr 5, 2020

Choose a reason for hiding this comment

shoyer commented Apr 5, 2020

shoyer commented Apr 5, 2020

max-sixty commented Apr 6, 2020

benbovy commented Mar 9, 2021

benbovy commented Sep 23, 2021

TomNicholas commented Sep 23, 2021 • edited Loading

ivanakcheurov commented Jun 5, 2022

mathause commented Sep 7, 2022

TomNicholas commented Sep 23, 2021 •

edited

Loading