
groupby+map performance regression on MultiIndex dataset #7376

Closed

ravwojdyla opened this issue Dec 14, 2022 · 11 comments
ravwojdyla commented Dec 14, 2022

What happened?

We have upgraded to the 2022.12.0 version and noticed a significant performance regression (orders of magnitude) in code that involves a groupby+map. This appears to have been an issue since the 2022.6.0 release, which I understand included a number of changes, including to the groupby code paths (release notes).

What did you expect to happen?

Fix the performance regression.

Minimal Complete Verifiable Example

import contextlib
import os
import time
from collections.abc import Iterator

import numpy as np
import pandas as pd
import xarray as xr


@contextlib.contextmanager
def log_time(label: str) -> Iterator[None]:
    """Logs execution time of the context block"""
    t_0 = time.time()
    yield
    print(f"{label} took {time.time() - t_0} seconds")


def main() -> None:
    m = 100_000
    with log_time("creating df"):
        df = pd.DataFrame(
            {
                "i1": [1] * m + [2] * m + [3] * m + [4] * m,
                "i2": list(range(m)) * 4,
                "d3": np.random.randint(0, 2, 4 * m).astype(bool),
            }
        )

        ds = df.to_xarray().set_coords(["i1", "i2"]).set_index(index=["i1", "i2"])

    with log_time("groupby"):

        def per_grp(da: xr.DataArray) -> xr.DataArray:
            return da

        (ds.assign(x=lambda ds: ds["d3"].groupby("i1").map(per_grp)))


if __name__ == "__main__":
    main()

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

xarray current main (`2022.12.1.dev7+g021c73e1`); the regression affects all versions since 2022.6.0 (inclusive).

> creating df took 0.10657930374145508 seconds
> groupby took 129.5521149635315 seconds


xarray 2022.3.0:

> creating df took 0.09968900680541992 seconds
> groupby took 0.19161295890808105 seconds

Anything else we need to know?

No response

Environment

Environment of the version installed from source (2022.12.1.dev7+g021c73e1):

INSTALLED VERSIONS

commit: None
python: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:25:29) [Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 22.1.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2022.12.1.dev7+g021c73e1
pandas: 1.5.2
numpy: 1.23.5
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.5.1
pip: 22.3.1
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None

ravwojdyla added the "bug" and "needs triage" labels on Dec 14, 2022
ravwojdyla (Author) commented:

I also recorded py-spy flamegraphs and exported them in speedscope format at https://gist.github.com/ravwojdyla/3b791debd3f97707d84748446dc07e39; you can view them at https://www.speedscope.app/


ravwojdyla commented Dec 14, 2022

3ead17e (#5692) seems to be the commit that introduced this regression (cc: @benbovy)


ravwojdyla commented Dec 14, 2022

I also want to point out that the stack traces/profiles look very different between 2022.3.0 and main/latest. It looks like

return _maybe_cast_to_cftimeindex(index)

might be a fairly expensive operation. Separately, there seems to be quite a bit of time spent in the copy -> copy_indexes path (deep copy?).

dcherian added the "topic-performance" label and removed the "needs triage" label on Dec 14, 2022

ravwojdyla commented Dec 14, 2022

FYI, this might warrant a separate issue(?), but assigning a new DataArray, e.g. ds.assign(foo=~ds["d3"]), is also several times slower since 2022.6.0 (same commit), e.g. ~7x slower on 4M elements with the same keys as above.
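For reference, this assign slowdown alone can be timed with the log_time helper and ds from the MVCE above (a small sketch; the ~7x figure quoted here was measured on 4M elements, while the MVCE ds has 400k):

# Time only the assign of a coordinate-less DataArray onto the multi-indexed dataset.
with log_time("assign"):
    ds.assign(foo=~ds["d3"])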


benbovy commented Dec 14, 2022

Thanks for the report @ravwojdyla.

Since #5692, each multi-index level has its own coordinate variable, so copying takes a bit more time as we need to create more variables. Not sure what's happening with _maybe_cast_to_cftimeindex, though.

The real issue here, however, is the same as in #6836: in your example, .groupby("i1") creates 400,000 groups whereas it should create only 4.
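A minimal way to see this, reusing the ds from the MVCE above (a hedged sketch; the 400,000 figure comes from the explanation above and applies to the affected versions):

# Count the groups produced when grouping over the "i1" level of the multi-index.
# Expected: 4 groups (the unique values of i1). Per the comment above, the
# affected versions instead end up creating 400,000 groups here.
n_groups = len(ds["d3"].groupby("i1").groups)
print(n_groups)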


ravwojdyla commented Dec 14, 2022

👋 @benbovy thanks for the update. Looking at #5692, it must have been a huge effort, thank you for your work on that! Coming back to this issue: in the example above, 2022.6.0 is about 600x slower, and in our internal code it would not finish in a reasonable time, which forced us to downgrade to 2022.3.0. Are you aware of any workarounds for this issue with the current code (assuming I would like to preserve the MultiIndex)?


benbovy commented Dec 14, 2022

> Are you aware of any workarounds for this issue with the current code (assuming I would like to preserve the MultiIndex)?

Unfortunately I don't know of any workaround that would preserve the MultiIndex. Depending on how you use the multi-index, you could instead set two single indexes for "i1" and "i2" respectively (this is supported now via set_xindex()). I think groupby will work well in that case. If you really need a multi-index, you could still build it afterwards from the groupby result.
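A minimal sketch of that workaround, assuming the ds and per_grp from the MVCE above (untested against the affected versions; set_xindex() and drop_indexes() require xarray >= 2022.11.0):

# Drop the "index" multi-index but keep "i1" and "i2" as (unindexed) coordinates.
ds_flat = ds.reset_index("index")

# Set two independent single indexes (PandasIndex by default) along the same dimension.
ds_flat = ds_flat.set_xindex("i1").set_xindex("i2")

# groupby("i1") should now only create 4 groups.
result = ds_flat.assign(x=lambda d: d["d3"].groupby("i1").map(per_grp))

# Rebuild the (i1, i2) multi-index afterwards if it is still needed
# (dropping the single indexes first so they don't clash with the new multi-index).
result = result.drop_indexes(["i1", "i2"]).set_index(index=["i1", "i2"])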

ravwojdyla (Author) commented:

Thanks @benbovy! Are you also aware of the issue with a plain assign being slower on a MultiIndex (comment above: #7376 (comment))? Do you know what the issue could be there, by any chance?


benbovy commented Dec 15, 2022

> Thanks @benbovy! Are you also aware of the issue with a plain assign being slower on a MultiIndex (comment above: #7376 (comment))? Do you know what the issue could be there, by any chance?

I see that in ds.assign(foo=~ds["d3"]), the coordinates of ~ds["d3"] are dropped (#2087), which triggers re-indexing of the multi-index when aligning ds with ~ds["d3"]. This is quite an expensive operation.

It is not clear to me what a clean fix would be (see, e.g., #2180), but we could probably optimize the alignment logic so that no re-indexing is performed when all unindexed dimension sizes match the indexed dimension sizes (as in your example).
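A small illustration of what is described above (hypothetical, reusing the ds from the MVCE; behavior as stated in this comment):

# The unary op drops the indexed coordinates (#2087), so assign() has to align
# the two objects and re-index the multi-index, which is the expensive part.
inverted = ~ds["d3"]
print("index" in ds["d3"].coords)  # True: the multi-index coordinate is present
print("index" in inverted.coords)  # False per #2087: the coordinates were dropped
ds.assign(foo=inverted)  # aligning ds with `inverted` triggers the re-indexing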

ravwojdyla (Author) commented:

@benbovy thanks for the context and the PR #7382, exciting to see the improvement!


benbovy commented Aug 22, 2023

Closed in #7382 and #7830

benbovy closed this as completed on Aug 22, 2023