BUG: correctly instantiate subclassed DataFrame/Series in groupby apply #45363
Running the relevant benchmark:
(I ran this several times, and the results are quite consistent.) The cases with a dummy function ("scalar_function") and many small groups ("(4)") show a slowdown. But that also means that the other cases (with a non-dummy function ("copy_function"), or with larger groups) didn't show a significant difference. I think those are the most relevant.
@@ -1243,13 +1245,7 @@ def _chop(self, sdata: Series, slice_obj: slice) -> Series:
# fastpath equivalent to `sdata.iloc[slice_obj]`
mgr = sdata._mgr.get_slice(slice_obj)
# __finalize__ not called here, must be applied by caller if applicable
are we now doing `__finalize__` in all of the places this is called? if so, better to do it here? (and on L1260)?
Yeah, there is a comment there about this (pandas/pandas/core/groupby/ops.py, line 1263 in 83ea173, from #37461):
# __finalize__ not called here, must be applied by caller if applicable
But I don't know why it was originally decided that `__finalize__` must be called by the caller. In any case I don't see a reason not to move it to `_chop` now.
> But I don't know the reason that it was originally decided that finalize must be called by the caller. In any case I don't see a reason to not move it to _chop now.

The #37461 OP says it is for perf, to avoid calling it on each iteration. This PR removes that motivation, so go for it!
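To illustrate what moving the metadata propagation into `_chop` means, here is a toy sketch in plain Python. It is not the pandas implementation: `ToySeries`, its simplified `__finalize__`, and the free function `chop` are hypothetical stand-ins, and a plain list plays the role of the manager.

```python
class ToySeries:
    """Hypothetical stand-in for a Series: values, a name, and user metadata."""

    def __init__(self, values, name=None):
        self.values = list(values)
        self.name = name
        self.attrs = {}

    def __finalize__(self, other):
        # Simplified metadata propagation: copy attrs from the source object.
        self.attrs = dict(other.attrs)
        return self


def chop(sdata, slice_obj):
    """Slice out one group and propagate metadata inside the helper itself,
    so that callers no longer have to remember to do it."""
    piece = ToySeries(sdata.values[slice_obj], name=sdata.name)
    return piece.__finalize__(sdata)


s = ToySeries([1, 2, 3, 4], name="a")
s.attrs["unit"] = "m"
piece = chop(s, slice(1, 3))
```

With the propagation inside `chop`, every caller gets a piece that already carries the parent's `attrs`, at the cost of one extra call per group.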
# fastpath equivalent to:
# `return sdata._constructor(mgr, name=sdata.name, fastpath=True)`
obj = type(sdata)._from_mgr(mgr)
object.__setattr__(obj, "_flags", sdata._flags)
are these still getting pinned?
`__finalize__` should take care of that (there is a `test_finalize.py` test that was changed in the diff, which was testing that this was not yet implemented; it was testing it with `attrs` and not with `flags`, but both should be handled by finalize).
thanks @jorisvandenbossche
…ataFrame/Series in groupby apply
@meeseeksdev backport 1.4.x
Something went wrong ... Please have a look at my logs.
…eries in groupby apply (#45397)
@jorisvandenbossche can we have a brief release note for this change?
Yes, this PR was not really ready for merging, see also Brock's comments
Thanks @jorisvandenbossche. From the issue, it appears that the regression is from 1.3.x but the reproducer worked on 1.3.5? Does the test added here test the regression reported or perhaps some later change?
Yes, so the pandas test that I added here in this PR only covers the regression in main (worked in 1.3.5). So for that we actually don't need a whatsnew notice. It also fixes the geopandas issue which was already broken in 1.3.5, but it's a bit complicated to add an exact test for it (but I added one on the geopandas side (geopandas/geopandas#2298), and we test against pandas main). Both issues are related to the instantiation of the sub-DataFrame (per group), and originally caused by no longer using |
@jorisvandenbossche if not ready pls make draft PRs
It was a review comment of Brock that required work ...
closes #45314
This reverts the "fastpath" construction in groupby.apply from #40236 and restores the code from #37461, and expands the metadata propagation that was added in that PR to more cases.
The problem is that subclasses can have logic in their `_constructor` that is not captured by the `_from_mgr` fastpath. Will check the performance impact now, and if needed we could also do this only for subclasses, and keep `_from_mgr` for base DataFrame.
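The difference between the two construction paths can be shown with a toy model in plain Python. This is only a sketch under stated assumptions, not pandas code: `Frame`, `GeoFrame`, the `crs` attribute, and the two `chop_*` helpers are hypothetical names, and a list stands in for the block manager.

```python
class Frame:
    """Toy stand-in for DataFrame; the 'manager' is just a list."""

    def __init__(self, mgr):
        self._mgr = list(mgr)

    @property
    def _constructor(self):
        return type(self)

    @classmethod
    def _from_mgr(cls, mgr):
        # Fastpath: skips __init__ and never consults _constructor.
        obj = cls.__new__(cls)
        obj._mgr = mgr
        return obj


class GeoFrame(Frame):
    """Hypothetical subclass whose _constructor carries extra logic."""

    def __init__(self, mgr, crs="unset"):
        super().__init__(mgr)
        self.crs = crs

    @property
    def _constructor(self):
        # Extra logic: objects derived from this one inherit its crs.
        crs = self.crs
        return lambda mgr: GeoFrame(mgr, crs=crs)


def chop_fastpath(obj, slice_obj):
    # Per-group slice via the fastpath.
    return type(obj)._from_mgr(obj._mgr[slice_obj])


def chop_constructor(obj, slice_obj):
    # Per-group slice via _constructor, so subclass logic runs.
    return obj._constructor(obj._mgr[slice_obj])


gdf = GeoFrame([1, 2, 3, 4], crs="EPSG:4326")
fast = chop_fastpath(gdf, slice(0, 2))
slow = chop_constructor(gdf, slice(0, 2))
```

In this toy, `fast` is a `GeoFrame` that never received a `crs`, while `slow` carries the parent's value. The constructor path does more work per group, which is the performance trade-off being checked above.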