Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: correctly instantiate subclassed DataFrame/Series in groupby apply #45363

Merged

Conversation

jorisvandenbossche
Copy link
Member

closes #45314

  • whatsnew entry

This reverts the "fastpath" construction in groupby.apply from #40236 and restores the code from #37461, and expands the metadata propagation that was added in that PR to more cases.

The problem is that subclasses can have logic in their _constructor that is not captured by the _from_mgr fastpath. Will check the performance impact now, and if needed we could also do this only for subclasses, and keep _from_mgr for base DataFrame.

@jorisvandenbossche
Copy link
Member Author

Running the relevant benchmark:

$ asv continuous -f 1.01 upstream/main HEAD -b groupby.Apply
...
       before           after         ratio
     [5b2f4a53]       [7f9dfdec]
     <main>           <groupby-subclass-attr-regression>
+      9.33±0.5ms       13.6±0.2ms     1.46  groupby.ApplyDictReturn.time_groupby_apply_dict_return
+      13.0±0.7ms       18.5±0.2ms     1.43  groupby.Apply.time_scalar_function_single_col(4)
+        36.2±2ms       49.2±0.3ms     1.36  groupby.Apply.time_scalar_function_multi_col(4)

(I ran this several times, and it's quite consistent)

So the ones with a dummy function ("scalar_function") and many small groups ("(4)") show a slowdown. But that also means that the other ones (with a non-dummy function ("copy_function"), or with larger groups) didn't actually show a significant difference. I think that's the most relevant.

@@ -1243,13 +1245,7 @@ def _chop(self, sdata: Series, slice_obj: slice) -> Series:
# fastpath equivalent to `sdata.iloc[slice_obj]`
mgr = sdata._mgr.get_slice(slice_obj)
# __finalize__ not called here, must be applied by caller if applicable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we now doing __finalize__ in all of the places this is called? if so, better to do it here? (and on L1260)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there is a comment there about this:

# __finalize__ not called here, must be applied by caller if applicable

(from #37461)
But I don't know the reason that it was originally decided that finalize must be called by the caller. In any case I don't see a reason to not move it to _chop now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I don't know the reason that it was originally decided that finalize must be called by the caller. In any case I don't see a reason to not move it to _chop now.

#37461 OP says it is for perf to not call it on each iteration. This PR removes that motivation, so go for it!

# fastpath equivalent to:
# `return sdata._constructor(mgr, name=sdata.name, fastpath=True)`
obj = type(sdata)._from_mgr(mgr)
object.__setattr__(obj, "_flags", sdata._flags)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these still getting pinned?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__finalize__ should take care of that (there is a test_finalize.py test that was changed in the diff, which was testing that this was not yet implemented. Although it was testing it with attrs and not with flags, but both should be handled by finalize).

@jreback jreback merged commit 5357f79 into pandas-dev:main Jan 16, 2022
@jreback
Copy link
Contributor

jreback commented Jan 16, 2022

thanks @jorisvandenbossche

@jreback
Copy link
Contributor

jreback commented Jan 16, 2022

@meeseeksdev backport 1.4.x

@lumberbot-app
Copy link

lumberbot-app bot commented Jan 16, 2022

Something went wrong ... Please have a look at my logs.

jreback pushed a commit that referenced this pull request Jan 16, 2022
@simonjayhawkins
Copy link
Member

@jorisvandenbossche can we have a brief release note for this change.

@jorisvandenbossche jorisvandenbossche deleted the groupby-subclass-attr-regression branch January 17, 2022 09:15
@jorisvandenbossche
Copy link
Member Author

Yes, this PR was not really ready for merging, see also Brock's comments

@simonjayhawkins
Copy link
Member

Thanks @jorisvandenbossche. From the issue, it appears that the regression is from 1.3.x but the reproducer worked on 1.3.5? Does the test added here test the regression reported or perhaps some later change?

@jorisvandenbossche
Copy link
Member Author

Yes, so the pandas test that I added here in this PR only covers the regression in main (worked in 1.3.5). So for that we actually don't need a whatsnew notice.

It also fixes the geopandas issue which was already broken in 1.3.5, but it's a bit complicated to add an exact test for it (but I added one on the geopandas side (geopandas/geopandas#2298), and we test against pandas main).

Both issues are related to the instantiation of the sub-DataFrame (per group), and originally caused by no longer using _constructor and not properly calling __finalize__. But the reason that this only turned up in main (and was not yet broken in 1.3) for the pandas reproducer is because in main we removed the libreduction fastpath, which basically still masked the regression in the python path for the majority of cases in 1.3.x

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Jan 17, 2022
jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Jan 17, 2022
@jreback
Copy link
Contributor

jreback commented Jan 17, 2022

@jorisvandenbossche if not ready pls make draft PRs

@jorisvandenbossche
Copy link
Member Author

It was a review comment of Brock that required work ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Groupby Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: GeoDataFrame attribute dropped by pandas when using groupby.apply
4 participants