groupby+map performance regression on MultiIndex dataset #7376
Comments
Also recorded py-spy flamegraphs and exported them in |
I also want to point out that the stack traces and profiles look very different between 2022.3.0 and main/latest. Looks like Line 185 in 021c73e
copy -> copy_indexes path (deep copy?).
|
FYI this might warrant a separate issue(?), but an assign of a new DataArray e.g.: |
Thanks for the report @ravwojdyla. Since #5692, each multi-index level has its own coordinate variable, so copying takes a bit more time as we need to create more variables. Not sure what's happening with The real issue here, however, is the same as in #6836. In your example, |
👋 @benbovy thanks for the update. Looking at #5692, it must have been a huge effort, thank you for your work on that! Coming back to this issue: in the example above, version 2022.6.0 is about 600x slower. In our internal code, the computation would not finish in a reasonable time, which forced us to downgrade to 2022.3.0. Are you aware of any workarounds for this issue with the current code (assuming I would like to preserve the MultiIndex)? |
Unfortunately I don't know about any workaround that would preserve the MultiIndex. Depending on how you use the multi-index, you could instead set two single indexes for "i1" and "i2" respectively (it is supported now, use |
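For illustration, here is a minimal sketch of the single-index alternative mentioned above. It assumes a recent xarray release that provides `Dataset.set_xindex`; the toy dataset, the variable name `x`, and the dimension name `index` are hypothetical and not taken from the original report.

```python
import numpy as np
import xarray as xr

# Hypothetical toy data: a 3 x 2 grid stacked into one "index" dimension,
# which gives "index" a MultiIndex with levels "i1" and "i2".
grid = xr.Dataset(
    {"x": (("i1", "i2"), np.arange(6.0).reshape(3, 2))},
    coords={"i1": [0, 1, 2], "i2": [10, 20]},
)
stacked = grid.stack(index=("i1", "i2"))

# Drop the MultiIndex; "i1" and "i2" stay as plain coordinates along "index".
flat = stacked.reset_index("index")

# Assumption: set_xindex is available (recent xarray versions).
# Each level gets its own independent single index instead of a MultiIndex.
flat = flat.set_xindex("i1").set_xindex("i2")

# Label-based selection still works per coordinate.
subset = flat.sel(i1=1)
```

Note that this gives up MultiIndex semantics (for example, selecting on an `(i1, i2)` pair in a single call), so whether it is acceptable depends on how the index is used downstream.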
Thanks @benbovy! Are you also aware of the issue with plain |
I see that in It is not clear to me what would be a clean fix (see, e.g., #2180), but we could probably optimize the alignment logic so that when all unindexed dimension sizes match with indexed dimension sizes (like your example) no re-indexing is performed. |
What happened?
We upgraded to version 2022.12.0 and noticed a significant performance regression (orders of magnitude) in code that involves a groupby+map. The regression seems to have been present since the 2022.6.0 release, which I understand included a number of changes, including to the groupby code paths (release notes).
What did you expect to happen?
Fix the performance regression.
Minimal Complete Verifiable Example
MVCE confirmation
Relevant log output
Anything else we need to know?
No response
Environment
Environment of the version installed from source (2022.12.1.dev7+g021c73e1):
INSTALLED VERSIONS
commit: None
python: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:25:29) [Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 22.1.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 2022.12.1.dev7+g021c73e1
pandas: 1.5.2
numpy: 1.23.5
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.5.1
pip: 22.3.1
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None