BUG: correctly instantiate subclassed DataFrame/Series in groupby apply #45363

jorisvandenbossche · 2022-01-14T09:24:16Z

whatsnew entry

This reverts the "fastpath" construction in groupby.apply from #40236, restores the code from #37461, and extends the metadata propagation added in #37461 to more cases.

The problem is that subclasses can have logic in their _constructor that is not captured by the _from_mgr fastpath. I will check the performance impact now; if needed, we could do this only for subclasses and keep _from_mgr for the base DataFrame.
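To illustrate the failure mode, here is a minimal pure-Python sketch. It does not use pandas: `Frame`, `TaggedFrame`, the `crs` attribute, and the dict standing in for a block manager are all hypothetical, and only the names `_constructor` and `_from_mgr` mirror the pandas attributes discussed above.

```python
class Frame:
    """Toy stand-in for DataFrame; `mgr` is a plain dict, not a real block manager."""

    def __init__(self, mgr):
        self._mgr = mgr

    @property
    def _constructor(self):
        # Subclasses override this to control how derived objects are built.
        return type(self)

    @classmethod
    def _from_mgr(cls, mgr):
        # Fastpath: bypass __init__ and _constructor entirely.
        obj = object.__new__(cls)
        obj._mgr = mgr
        return obj


class TaggedFrame(Frame):
    """Hypothetical subclass that needs extra state set up at construction time."""

    def __init__(self, mgr, crs="unknown"):
        super().__init__(mgr)
        self.crs = crs

    @property
    def _constructor(self):
        # Custom logic: carry the parent's `crs` over to every derived object.
        crs = self.crs
        return lambda mgr: TaggedFrame(mgr, crs=crs)


parent = TaggedFrame({"a": [1, 2, 3, 4]}, crs="EPSG:4326")
sub_mgr = {"a": parent._mgr["a"][:2]}

via_constructor = parent._constructor(sub_mgr)
via_fastpath = type(parent)._from_mgr(sub_mgr)

print(via_constructor.crs)            # EPSG:4326
print(hasattr(via_fastpath, "crs"))   # False
```

In this toy model the fastpath still returns the right class, but the object is missing whatever the subclass constructor would have set up.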

jorisvandenbossche · 2022-01-14T11:52:21Z

Running the relevant benchmark:

$ asv continuous -f 1.01 upstream/main HEAD -b groupby.Apply
...
       before           after         ratio
     [5b2f4a53]       [7f9dfdec]
     <main>           <groupby-subclass-attr-regression>
+      9.33±0.5ms       13.6±0.2ms     1.46  groupby.ApplyDictReturn.time_groupby_apply_dict_return
+      13.0±0.7ms       18.5±0.2ms     1.43  groupby.Apply.time_scalar_function_single_col(4)
+        36.2±2ms       49.2±0.3ms     1.36  groupby.Apply.time_scalar_function_multi_col(4)

(I ran this several times, and it's quite consistent)

So the benchmarks that combine a dummy function ("scalar_function") with many small groups ("(4)") show a slowdown. The others, which use a non-dummy function ("copy_function") or larger groups, show no significant difference, and I think those are the more relevant cases.
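A rough way to see why only the many-small-groups benchmarks move: if each group pays a fixed construction cost on top of the user function, the relative slowdown shrinks as the function does more work per group. The numbers below are made-up cost units chosen for illustration, not measurements from this PR.

```python
def relative_slowdown(func_cost, old_construct_cost, new_construct_cost):
    """Ratio of per-group time after vs. before, in arbitrary cost units."""
    before = old_construct_cost + func_cost
    after = new_construct_cost + func_cost
    return after / before


# Hypothetical costs: per-group construction becomes twice as expensive.
OLD, NEW = 1.0, 2.0

cheap_func = relative_slowdown(func_cost=1.0, old_construct_cost=OLD, new_construct_cost=NEW)
heavy_func = relative_slowdown(func_cost=50.0, old_construct_cost=OLD, new_construct_cost=NEW)

print(f"dummy function: {cheap_func:.2f}x")   # dummy function: 1.50x
print(f"heavy function: {heavy_func:.2f}x")   # heavy function: 1.02x
```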

jbrockmendel · 2022-01-14T16:47:59Z

pandas/core/groupby/ops.py

@@ -1243,13 +1245,7 @@ def _chop(self, sdata: Series, slice_obj: slice) -> Series:
        # fastpath equivalent to `sdata.iloc[slice_obj]`
        mgr = sdata._mgr.get_slice(slice_obj)
        # __finalize__ not called here, must be applied by caller if applicable


Are we now calling __finalize__ in all of the places where this is called? If so, would it be better to do it here (and on L1260)?

Yeah, there is a comment there about this:

pandas/pandas/core/groupby/ops.py

Line 1263 in 83ea173

# __finalize__ not called here, must be applied by caller if applicable

(from #37461)
But I don't know why it was originally decided that __finalize__ must be called by the caller. In any case, I don't see a reason not to move it into _chop now.

The #37461 description says it was for performance, to avoid calling it on each iteration. This PR removes that motivation, so go for it!
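The refactor agreed on in this thread can be sketched as follows. This is a simplified model rather than the pandas implementation: `ToySeries`, `ToySplitter`, and the list slicing are hypothetical, and only the `_chop` and `__finalize__` names follow the code under review. The point is that once `_chop` finalizes each piece itself, callers no longer have to remember to do it.

```python
class ToySeries:
    """Hypothetical stand-in for a Series with propagatable metadata."""

    def __init__(self, values):
        self.values = list(values)
        self.attrs = {}

    def __finalize__(self, other):
        # Copy metadata from the object this one was derived from.
        self.attrs = dict(other.attrs)
        return self


class ToySplitter:
    def __init__(self, data, slices):
        self.data = data
        self.slices = slices

    def _chop(self, sdata, slice_obj):
        piece = ToySeries(sdata.values[slice_obj])
        # Finalize here, so every caller of _chop gets the metadata for free.
        return piece.__finalize__(sdata)

    def __iter__(self):
        for slice_obj in self.slices:
            yield self._chop(self.data, slice_obj)


s = ToySeries([1, 2, 3, 4])
s.attrs["unit"] = "m"

pieces = list(ToySplitter(s, [slice(0, 2), slice(2, 4)]))
print([p.values for p in pieces])   # [[1, 2], [3, 4]]
print([p.attrs for p in pieces])    # [{'unit': 'm'}, {'unit': 'm'}]
```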

jbrockmendel · 2022-01-14T16:48:51Z

pandas/core/groupby/ops.py

-        # fastpath equivalent to:
-        # `return sdata._constructor(mgr, name=sdata.name, fastpath=True)`
-        obj = type(sdata)._from_mgr(mgr)
-        object.__setattr__(obj, "_flags", sdata._flags)


are these still getting pinned?

__finalize__ should take care of that. A test in test_finalize.py that was changed in the diff was previously asserting that this was not yet implemented. It tested attrs rather than flags, but both should be handled by __finalize__.

jreback · 2022-01-16T16:58:33Z

thanks @jorisvandenbossche

jreback · 2022-01-16T16:58:43Z

@meeseeksdev backport 1.4.x

lumberbot-app · 2022-01-16T16:58:49Z

Something went wrong ... Please have a look at my logs.

simonjayhawkins · 2022-01-17T09:13:51Z

@jorisvandenbossche can we have a brief release note for this change.

jorisvandenbossche · 2022-01-17T09:15:54Z

Yes. This PR was not really ready for merging; see also Brock's comments.

simonjayhawkins · 2022-01-17T09:18:38Z

Thanks @jorisvandenbossche. From the issue, it appears that the regression is from 1.3.x but the reproducer worked on 1.3.5? Does the test added here test the regression reported or perhaps some later change?

jorisvandenbossche · 2022-01-17T09:30:59Z

Yes, so the pandas test that I added here in this PR only covers the regression in main (worked in 1.3.5). So for that we actually don't need a whatsnew notice.

It also fixes the geopandas issue, which was already broken in 1.3.5, but it's a bit complicated to add an exact test for it here. I did add one on the geopandas side (geopandas/geopandas#2298), and we test against pandas main.

Both issues are related to the instantiation of the sub-DataFrame (per group), and were originally caused by no longer using _constructor and not properly calling __finalize__. For the pandas reproducer, this only turned up in main (and was not yet broken in 1.3) because in main we removed the libreduction fastpath, which in 1.3.x still masked the regression in the Python path for the majority of cases.
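The masking effect described above can be illustrated with a toy dispatcher. Nothing here is pandas code: `fast_path`, `python_path`, and the "supported" flag are hypothetical stand-ins for the removed libreduction fastpath and the pure-Python path. A bug in the fallback only surfaces for inputs the fast path rejects, until the fast path is removed and every input reaches the fallback.

```python
def fast_path(group):
    # Stand-in for a compiled fastpath that handles most inputs correctly
    # and bails out on the rest.
    if not group["supported"]:
        raise NotImplementedError
    return {"kind": "subclass", "rows": group["rows"]}


def python_path(group):
    # Stand-in for the fallback that carries the regression: the subclass is lost.
    return {"kind": "base", "rows": group["rows"]}


def apply_group(group, use_fast_path):
    if use_fast_path:
        try:
            return fast_path(group)
        except NotImplementedError:
            pass
    return python_path(group)


common = {"supported": True, "rows": 3}
rare = {"supported": False, "rows": 3}

# With the fastpath in place, only the rare case shows the bug.
print(apply_group(common, use_fast_path=True)["kind"])    # subclass
print(apply_group(rare, use_fast_path=True)["kind"])      # base
# With the fastpath removed, the common case regresses too.
print(apply_group(common, use_fast_path=False)["kind"])   # base
```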

jreback · 2022-01-17T12:10:13Z

@jorisvandenbossche if not ready pls make draft PRs

jorisvandenbossche · 2022-01-17T12:14:09Z

It was a review comment of Brock that required work ...

BUG: correctly instantiate subclassed DataFrame/Series in groupby apply

365ff86

jorisvandenbossche added the Bug, Groupby, and Regression labels Jan 14, 2022

jorisvandenbossche added this to the 1.4 milestone Jan 14, 2022

jorisvandenbossche requested a review from jbrockmendel January 14, 2022 09:24

jorisvandenbossche mentioned this pull request Jan 14, 2022

TST: test groupby apply with function that requires GeoDataFrame attributes geopandas/geopandas#2298

Merged

update finalize tests

7f9dfde

jbrockmendel reviewed Jan 14, 2022

jreback merged commit 5357f79 into pandas-dev:main Jan 16, 2022

meeseeksmachine mentioned this pull request Jan 16, 2022

Backport PR #45363 on branch 1.4.x (BUG: correctly instantiate subclassed DataFrame/Series in groupby apply) #45397

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jan 16, 2022

Backport PR pandas-dev#45363: BUG: correctly instantiate subclassed DataFrame/Series in groupby apply

4a7b470

jreback pushed a commit that referenced this pull request Jan 16, 2022

Backport PR #45363: BUG: correctly instantiate subclassed DataFrame/Series in groupby apply (#45397)

219811f

jorisvandenbossche deleted the groupby-subclass-attr-regression branch January 17, 2022 09:15

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Jan 17, 2022

CLN: move __finalize__ call into splitter (groupby pandas-devGH-45363 follow-up)

37c406b

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Jan 17, 2022

CLN: move __finalize__ call into splitter (groupby pandas-devGH-45363 follow-up)

eb07f10

jorisvandenbossche mentioned this pull request Jan 17, 2022

CLN: remove _from_mgr (no longer used after GH-45363) #45415

Merged

jreback pushed a commit that referenced this pull request Jan 19, 2022

CLN: move __finalize__ call into splitter (groupby GH-45363 follow-up) (#45415)

14e09c4

aclarry mentioned this pull request Apr 22, 2022

LinkedDataFrame linkage indexes do not update on groupby wsp-sag/cheval#4

Closed
