Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Micro optimizations to improve indexing #9002

Merged
merged 20 commits into from
Jun 11, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion doc/whats-new.rst
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,8 @@ Performance

- Small optimization to the netCDF4 and h5netcdf backends (:issue:`9058`, :pull:`9067`).
By `Deepak Cherian <https://github.com/dcherian>`_.
- Small optimizations to help reduce indexing speed of datasets (:pull:`9002`).
By `Mark Harfouche <https://github.com/hmaarrfk>`_.


Breaking changes
Expand Down Expand Up @@ -2903,7 +2905,7 @@ Bug fixes
process (:issue:`4045`, :pull:`4684`). It also enables encoding and decoding standard
calendar dates with time units of nanoseconds (:pull:`4400`).
By `Spencer Clark <https://github.com/spencerkclark>`_ and `Mark Harfouche
<http://github.com/hmaarrfk>`_.
<https://github.com/hmaarrfk>`_.
- :py:meth:`DataArray.astype`, :py:meth:`Dataset.astype` and :py:meth:`Variable.astype` support
the ``order`` and ``subok`` parameters again. This fixes a regression introduced in version 0.16.1
(:issue:`4644`, :pull:`4683`).
Expand Down
66 changes: 59 additions & 7 deletions xarray/core/indexes.py
Original file line number Diff line number Diff line change
Expand Up @@ -575,13 +575,24 @@ class PandasIndex(Index):

__slots__ = ("index", "dim", "coord_dtype")

def __init__(self, array: Any, dim: Hashable, coord_dtype: Any = None):
# make a shallow copy: cheap and because the index name may be updated
# here or in other constructors (cannot use pd.Index.rename as this
# constructor is also called from PandasMultiIndex)
index = safe_cast_to_index(array).copy()
def __init__(
self,
array: Any,
dim: Hashable,
coord_dtype: Any = None,
*,
fastpath: bool = False,
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit wary of adding fastpath everywhere. It makes things a bit hard to reason about. Is this one that consequential? We do first check for isinstance(pd.Index) in safe_cast_to_index...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anecdotally, I remember thinking that indexes are created and copied a few too many times. I believe this was done for correctness in a big refactor, but reducing any useless copies would be a big win since we would call _replace less.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anecdotally, I remember thinking that indexes are created and copied a few too many times.

this in general is my feeling after my repeated deep diving into the codebase.

if fastpath:
index = array
else:
index = safe_cast_to_index(array)

if index.name is None:
# make a shallow copy: cheap and because the index name may be updated
# here or in other constructors (cannot use pd.Index.rename as this
# constructor is also called from PandasMultiIndex)
index = index.copy()
index.name = dim

self.index = index
Expand All @@ -596,7 +607,7 @@ def _replace(self, index, dim=None, coord_dtype=None):
dim = self.dim
if coord_dtype is None:
coord_dtype = self.coord_dtype
return type(self)(index, dim, coord_dtype)
return type(self)(index, dim, coord_dtype, fastpath=True)

@classmethod
def from_variables(
Expand Down Expand Up @@ -642,6 +653,11 @@ def from_variables(

obj = cls(data, dim, coord_dtype=var.dtype)
assert not isinstance(obj.index, pd.MultiIndex)
# Rename safely
dcherian marked this conversation as resolved.
Show resolved Hide resolved
# make a shallow copy: cheap and because the index name may be updated
# here or in other constructors (cannot use pd.Index.rename as this
# constructor is also called from PandasMultiIndex)
obj.index = obj.index.copy()
obj.index.name = name

return obj
Expand Down Expand Up @@ -1773,6 +1789,36 @@ def check_variables():
return not not_equal


def _apply_indexes_fast(indexes: Indexes[Index], args: Mapping[Any, Any], func: str):
dcherian marked this conversation as resolved.
Show resolved Hide resolved
# This function avoids the call to indexes.group_by_index
# which is really slow when repeatidly iterating through
# an array. However, it fails to return the correct ID for
# multi-index arrays
indexes_fast, coords = indexes._indexes, indexes._variables

new_indexes: dict[Hashable, Index] = {k: v for k, v in indexes_fast.items()}
new_index_variables: dict[Hashable, Variable] = {}
for name, index in indexes_fast.items():
coord = coords[name]
if hasattr(coord, "_indexes"):
index_vars = {n: coords[n] for n in coord._indexes}
dcherian marked this conversation as resolved.
Show resolved Hide resolved
else:
index_vars = {name: coord}
index_dims = {d for var in index_vars.values() for d in var.dims}
index_args = {k: v for k, v in args.items() if k in index_dims}

if index_args:
new_index = getattr(index, func)(index_args)
if new_index is not None:
new_indexes.update({k: new_index for k in index_vars})
new_index_vars = new_index.create_variables(index_vars)
new_index_variables.update(new_index_vars)
else:
for k in index_vars:
new_indexes.pop(k, None)
return new_indexes, new_index_variables


def _apply_indexes(
indexes: Indexes[Index],
args: Mapping[Any, Any],
Expand Down Expand Up @@ -1801,7 +1847,13 @@ def isel_indexes(
indexes: Indexes[Index],
indexers: Mapping[Any, Any],
) -> tuple[dict[Hashable, Index], dict[Hashable, Variable]]:
return _apply_indexes(indexes, indexers, "isel")
# TODO: remove if clause in the future. It should be unnecessary.
# See failure introduced when removed
# https://github.com/pydata/xarray/pull/9002#discussion_r1590443756
if any(isinstance(v, PandasMultiIndex) for v in indexes._indexes.values()):
dcherian marked this conversation as resolved.
Show resolved Hide resolved
return _apply_indexes(indexes, indexers, "isel")
else:
return _apply_indexes_fast(indexes, indexers, "isel")


def roll_indexes(
Expand Down
23 changes: 13 additions & 10 deletions xarray/core/indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -296,15 +296,16 @@ def slice_slice(old_slice: slice, applied_slice: slice, size: int) -> slice:


def _index_indexer_1d(old_indexer, applied_indexer, size: int):
assert isinstance(applied_indexer, integer_types + (slice, np.ndarray))
if isinstance(applied_indexer, slice) and applied_indexer == slice(None):
# shortcut for the usual case
return old_indexer
if isinstance(old_indexer, slice):
if isinstance(applied_indexer, slice):
indexer = slice_slice(old_indexer, applied_indexer, size)
elif isinstance(applied_indexer, integer_types):
indexer = range(*old_indexer.indices(size))[applied_indexer] # type: ignore[assignment]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These kinds of # type: ignore[assignment] errors usually means indexer has changed type, which usually leads to requiring another costly isinstance again somewhere downstream.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is due to the ambiguity between slice and int for these indexers. I can look into it if you want.

However. It starts to touch some tuple to lists conversation that was identified as something I shouldn't adjust too much due to potential conflicts with an active in development branch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following patch would
diff --git a/xarray/core/indexing.py b/xarray/core/indexing.py
index e29d77da..44560dbd 100644
--- a/xarray/core/indexing.py
+++ b/xarray/core/indexing.py
@@ -9,7 +9,7 @@ from contextlib import suppress
 from dataclasses import dataclass, field
 from datetime import timedelta
 from html import escape
-from typing import TYPE_CHECKING, Any, Callable, overload
+from typing import TYPE_CHECKING, Any, Callable, overload, List

 import numpy as np
 import pandas as pd
@@ -295,20 +295,24 @@ def slice_slice(old_slice: slice, applied_slice: slice, size: int) -> slice:
     return slice(start, stop, step)


-def _index_indexer_1d(old_indexer, applied_indexer, size: int):
+def _index_indexer_1d(
+    old_indexer: slice | np.ndarray[Any, np.dtype[np.generic]],
+    applied_indexer: slice | int | np.integer | np.ndarray[Any, np.dtype[np.generic]],
+    size: int
+) -> slice | int | np.integer | np.ndarray[Any, np.dtype[np.generic]]:
+    """Helper routine to slice non integers into an entry for OuterIndexer"""
     if isinstance(applied_indexer, slice) and applied_indexer == slice(None):
         # shortcut for the usual case
         return old_indexer
     if isinstance(old_indexer, slice):
         if isinstance(applied_indexer, slice):
-            indexer = slice_slice(old_indexer, applied_indexer, size)
+            return slice_slice(old_indexer, applied_indexer, size)
         elif isinstance(applied_indexer, integer_types):
-            indexer = range(*old_indexer.indices(size))[applied_indexer]  # type: ignore[assignment]
+            return range(*old_indexer.indices(size))[applied_indexer]
         else:
-            indexer = _expand_slice(old_indexer, size)[applied_indexer]
+            return _expand_slice(old_indexer, size)[applied_indexer]
     else:
-        indexer = old_indexer[applied_indexer]
-    return indexer
+        return old_indexer[applied_indexer]


 class ExplicitIndexer:
@@ -630,12 +634,14 @@ class LazilyIndexedArray(ExplicitlyIndexedNDArrayMixin):

     def _updated_key(self, new_key: ExplicitIndexer) -> BasicIndexer | OuterIndexer:
         iter_new_key = iter(expanded_indexer(new_key.tuple, self.ndim))
-        full_key = []
+        full_key : List[slice | int | np.integer | np.ndarray[Any, np.dtype[np.generic]]] = []
+        outer_indexer = True
         for size, k in zip(self.array.shape, self.key.tuple):
             if isinstance(k, integer_types):
                 full_key.append(k)
             else:
                 full_key.append(_index_indexer_1d(k, next(iter_new_key), size))
+
         full_key_tuple = tuple(full_key)

         if all(isinstance(k, integer_types + (slice,)) for k in full_key_tuple):

would move the error to

xarray/core/indexing.py:648: error: Argument 1 to "BasicIndexer" has incompatible type "tuple[slice | int | integer[Any] | ndarray[Any, dtype[generic]], ...]"; expected "tuple[int | integer[Any] | slice, ...]"  [arg-type]

While I don't disagree that:

which usually leads to requiring another costly isinstance again somewhere downstream.

I would rather keep this for a future change once the branch backend-indexing has been merge in.

else:
indexer = _expand_slice(old_indexer, size)[applied_indexer] # type: ignore[assignment]
indexer = _expand_slice(old_indexer, size)[applied_indexer]
else:
indexer = old_indexer[applied_indexer]
return indexer
Expand Down Expand Up @@ -591,7 +592,7 @@ def __getitem__(self, key: Any):
class LazilyIndexedArray(ExplicitlyIndexedNDArrayMixin):
"""Wrap an array to make basic and outer indexing lazy."""

__slots__ = ("array", "key")
__slots__ = ("array", "key", "_shape")

def __init__(self, array: Any, key: ExplicitIndexer | None = None):
"""
Expand All @@ -614,6 +615,14 @@ def __init__(self, array: Any, key: ExplicitIndexer | None = None):
self.array = as_indexable(array)
self.key = key

shape: _Shape = ()
for size, k in zip(self.array.shape, self.key.tuple):
if isinstance(k, slice):
shape += (len(range(*k.indices(size))),)
elif isinstance(k, np.ndarray):
shape += (k.size,)
self._shape = shape

def _updated_key(self, new_key: ExplicitIndexer) -> BasicIndexer | OuterIndexer:
iter_new_key = iter(expanded_indexer(new_key.tuple, self.ndim))
full_key = []
Expand All @@ -630,13 +639,7 @@ def _updated_key(self, new_key: ExplicitIndexer) -> BasicIndexer | OuterIndexer:

@property
def shape(self) -> _Shape:
shape = []
for size, k in zip(self.array.shape, self.key.tuple):
if isinstance(k, slice):
shape.append(len(range(*k.indices(size))))
elif isinstance(k, np.ndarray):
shape.append(k.size)
return tuple(shape)
return self._shape

def get_duck_array(self):
if isinstance(self.array, ExplicitlyIndexedNDArrayMixin):
Expand Down
Loading