
groupby+map performance regression on MultiIndex dataset #7376

Closed

ravwojdyla opened this issue Dec 14, 2022 · 11 comments
ravwojdyla commented Dec 14, 2022

What happened?

We have upgraded to the 2022.12.0 version and noticed a significant performance regression (orders of magnitude) in code that involves a groupby+map. This appears to have been an issue since the 2022.6.0 release, which I understand included a number of changes, including to the groupby code paths (release notes).

What did you expect to happen?

Fix the performance regression.

Minimal Complete Verifiable Example

import contextlib
import os
import time
from collections.abc import Iterator

import numpy as np
import pandas as pd
import xarray as xr


@contextlib.contextmanager
def log_time(label: str) -> Iterator[None]:
    """Logs execution time of the context block"""
    t_0 = time.time()
    yield
    print(f"{label} took {time.time() - t_0} seconds")


def main() -> None:
    m = 100_000
    with log_time("creating df"):
        df = pd.DataFrame(
            {
                "i1": [1] * m + [2] * m + [3] * m + [4] * m,
                "i2": list(range(m)) * 4,
                "d3": np.random.randint(0, 2, 4 * m).astype(bool),
            }
        )

        ds = df.to_xarray().set_coords(["i1", "i2"]).set_index(index=["i1", "i2"])

    with log_time("groupby"):

        def per_grp(da: xr.DataArray) -> xr.DataArray:
            return da

        (ds.assign(x=lambda ds: ds["d3"].groupby("i1").map(per_grp)))


if __name__ == "__main__":
    main()

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

xarray current main (`2022.12.1.dev7+g021c73e1`); the regression affects all versions since 2022.6.0 (inclusive).

> creating df took 0.10657930374145508 seconds
> groupby took 129.5521149635315 seconds


xarray 2022.3.0:

> creating df took 0.09968900680541992 seconds
> groupby took 0.19161295890808105 seconds

Anything else we need to know?

No response

Environment

Environment of the version installed from source (2022.12.1.dev7+g021c73e1):

INSTALLED VERSIONS

commit: None
python: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:25:29) [Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 22.1.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2022.12.1.dev7+g021c73e1
pandas: 1.5.2
numpy: 1.23.5
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.5.1
pip: 22.3.1
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None

ravwojdyla added the "bug" and "needs triage" labels on Dec 14, 2022
ravwojdyla (Author) commented:

I also recorded py-spy flamegraphs and exported them in speedscope format at https://gist.github.com/ravwojdyla/3b791debd3f97707d84748446dc07e39; you can view them at https://www.speedscope.app/


ravwojdyla commented Dec 14, 2022

3ead17e (#5692) seems to be the commit that introduced this regression (cc: @benbovy)


ravwojdyla commented Dec 14, 2022

I also want to point out that the stack traces/profiles look very different between 2022.3.0 and main/latest. It looks like

return _maybe_cast_to_cftimeindex(index)

might be a fairly expensive operation. Separately, there seems to be quite a bit of time spent in the copy -> copy_indexes path (deep copy?).

dcherian added the "topic-performance" label and removed the "needs triage" label on Dec 14, 2022

ravwojdyla commented Dec 14, 2022

FYI, this might warrant a separate issue(?), but assigning a new DataArray, e.g. ds.assign(foo=~ds["d3"]), is also several times slower since 2022.6.0 (same commit), e.g. ~7x slower on 4M elements with the same keys as above.
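For reference, this assign slowdown alone can be timed with the log_time helper and ds from the MVCE above (a small sketch; the ~7x figure quoted here was measured on 4M elements, while the MVCE ds has 400k):

# Time only the assign of a coordinate-less DataArray onto the multi-indexed dataset.
with log_time("assign"):
    ds.assign(foo=~ds["d3"])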


benbovy commented Dec 14, 2022

Thanks for the report @ravwojdyla.

Since #5692, each multi-index level has its own coordinate variable, so copying takes a bit more time as we need to create more variables. Not sure what's happening with _maybe_cast_to_cftimeindex, though.

The real issue here, however, is the same as in #6836: in your example, .groupby("i1") creates 400,000 groups whereas it should create only 4.
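A minimal way to see this, reusing the ds from the MVCE above (a hedged sketch; the 400,000 figure comes from the explanation above and applies to the affected versions):

# Count the groups produced when grouping over the "i1" level of the multi-index.
# Expected: 4 groups (the unique values of i1). Per the comment above, the
# affected versions instead end up creating 400,000 groups here.
n_groups = len(ds["d3"].groupby("i1").groups)
print(n_groups)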


ravwojdyla commented Dec 14, 2022

👋 @benbovy thanks for the update. Looking at #5692, it must have been a huge effort, thank you for your work on that! Coming back to this issue: in the example above, 2022.6.0 is about 600x slower, and in our internal code it would not finish in a reasonable time, which forced us to downgrade to 2022.3.0. Are you aware of any workarounds for this issue with the current code (assuming I would like to preserve the MultiIndex)?


benbovy commented Dec 14, 2022

> Are you aware of any workarounds for this issue with the current code (assuming I would like to preserve the MultiIndex)?

Unfortunately I don't know of any workaround that would preserve the MultiIndex. Depending on how you use the multi-index, you could instead set two single indexes for "i1" and "i2" respectively (this is supported now via set_xindex()). I think groupby will work well in that case. If you really need a multi-index, you could still build it afterwards from the groupby result.
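A minimal sketch of that workaround, assuming the ds and per_grp from the MVCE above (untested against the affected versions; set_xindex() and drop_indexes() require xarray >= 2022.11.0):

# Drop the "index" multi-index but keep "i1" and "i2" as (unindexed) coordinates.
ds_flat = ds.reset_index("index")

# Set two independent single indexes (PandasIndex by default) along the same dimension.
ds_flat = ds_flat.set_xindex("i1").set_xindex("i2")

# groupby("i1") should now only create 4 groups.
result = ds_flat.assign(x=lambda d: d["d3"].groupby("i1").map(per_grp))

# Rebuild the (i1, i2) multi-index afterwards if it is still needed
# (dropping the single indexes first so they don't clash with the new multi-index).
result = result.drop_indexes(["i1", "i2"]).set_index(index=["i1", "i2"])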

ravwojdyla (Author) commented:

Thanks @benbovy! Are you also aware of the issue with a plain assign being slower on a MultiIndex (comment above: #7376 (comment))? Do you know what the issue could be there, by any chance?


benbovy commented Dec 15, 2022

> Thanks @benbovy! Are you also aware of the issue with a plain assign being slower on a MultiIndex (comment above: #7376 (comment))? Do you know what the issue could be there, by any chance?

I see that in ds.assign(foo=~ds["d3"]), the coordinates of ~ds["d3"] are dropped (#2087), which triggers re-indexing of the multi-index when aligning ds with ~ds["d3"]. This is quite an expensive operation.

It is not clear to me what a clean fix would be (see, e.g., #2180), but we could probably optimize the alignment logic so that no re-indexing is performed when all unindexed dimension sizes match the indexed dimension sizes (as in your example).
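A small illustration of what is described above (hypothetical, reusing the ds from the MVCE; behavior as stated in this comment):

# The unary op drops the indexed coordinates (#2087), so assign() has to align
# the two objects and re-index the multi-index, which is the expensive part.
inverted = ~ds["d3"]
print("index" in ds["d3"].coords)  # True: the multi-index coordinate is present
print("index" in inverted.coords)  # False per #2087: the coordinates were dropped
ds.assign(foo=inverted)  # aligning ds with `inverted` triggers the re-indexing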

ravwojdyla (Author) commented:

@benbovy thanks for the context and the PR #7382, exciting to see the improvement!


benbovy commented Aug 22, 2023

Closed in #7382 and #7830

benbovy closed this as completed on Aug 22, 2023