Fix GH-29442 DataFrame.groupby doesn't preserve _metadata #35688
This bug is a regression in v1.1.0, introduced by the fix for GH-34214 in commit [6f065b]. The underlying cause is that the `*Splitter` classes do not use the `._constructor` property and do not call `__finalize__`. Please note that the method name passed in the `__finalize__` calls was my best guess, since documentation for that value has been hard to find.
[6f065b]: pandas-dev@6f065b6
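For readers unfamiliar with the two hooks named above, here is a minimal user-level sketch of what they do. `TaggedFrame` and its `tag` attribute are hypothetical names made up for this illustration; this is not the code inside the `*Splitter` classes, only a demonstration of why skipping `_constructor` and `__finalize__` loses the subclass and its `_metadata` fields.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    # Hypothetical subclass: attribute names listed in _metadata are the
    # ones __finalize__ copies from the source object.
    _metadata = ["tag"]

    @property
    def _constructor(self):
        return TaggedFrame


source = TaggedFrame({"g": [0, 0, 1], "x": [1, 2, 3]})
source.tag = "hello"

# Building a result with the plain DataFrame constructor drops both the
# subclass and the metadata.
plain = pd.DataFrame(source.to_numpy(), columns=source.columns)

# Going through _constructor keeps the subclass, and __finalize__ then
# copies the _metadata fields over from the source.
kept = source._constructor(source.to_numpy(), columns=source.columns)
kept = kept.__finalize__(source)

print(type(plain).__name__, hasattr(plain, "tag"))
print(type(kept).__name__, kept.tag)
```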
Hello @Japanuspus! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-10-09 12:42:14 UTC |
I don't know how to truncate remaining long lines without losing information.
This should hopefully resolve remaining linter issues.
@jbrockmendel This PR is a two-line edit to fix a metadata regression in your recent work on groupby performance. Would you be available to review? I am available to implement this differently if needed. |
cc @TomAugspurger since this is about |
At a glance, this feels like too low of a level to call |
Thank you both for looking into this! I will start with a more concise testcase. Should I also have a go at moving the |
Regarding performance implications of I tried running the benchmark from GH-34214, but this seems to hit Can either of you come up with a benchmark that hits (
|
Try setting |
Just running |
Perhaps run the groupby with a CategoricalIndex? |
Thanks for helping out! Managed to get a benchmark with some degradation: With
Result was a 4% performance hit:
I will not have time to learn enough of the inner workings of pandas to improve this on my own, but am willing to help out if you have some concrete pointers. |
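For anyone wanting to reproduce a measurement of this kind, a rough sketch follows. It is not the benchmark used in this thread (that code is not preserved here); the frame size, the `key` index name, and the `iterate_groups` helper are all assumptions chosen for illustration. It follows the suggestion to group on a `CategoricalIndex` and iterates over the groups, which should be the code path that builds one sub-frame per group.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sizes, picked only to keep the run short.
rng = np.random.default_rng(0)
n_rows, n_groups = 10_000, 100

index = pd.CategoricalIndex(rng.integers(0, n_groups, size=n_rows), name="key")
frame = pd.DataFrame({"x": rng.random(n_rows)}, index=index)


def iterate_groups():
    # Iterating over the groupby materializes one sub-frame per group.
    grouped = frame.groupby(level="key", observed=True)
    return sum(len(group) for _, group in grouped)


passes = 5
seconds = timeit.timeit(iterate_groups, number=passes) / passes
print(f"{seconds * 1e3:.2f} ms per pass")
```

Running this once on each branch and comparing the timings gives a rough relative figure; for anything publishable the asv suite in the pandas repository is the better tool.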
Added testcase `test_groupby_sum_with_custom_metadata` for the functionality exercised in GH-29442. The testcase fails on the current code.
In order to propagate metadata fields, the `__finalize__` method must be called on the resulting DataFrame with a reference to the input. By implementing this in `_GroupBy._agg_general`, this is performed as late as possible for the `.sum()` (and similar) code paths. Fixes GH-29442
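The idea in this commit message can be illustrated at user level. The sketch below is not the patch to `_GroupBy._agg_general`; `TaggedFrame`, `tag`, and `sum_keeping_metadata` are hypothetical names. It only shows the same pattern: compute the aggregate first, then call `__finalize__` on the result with a reference to the input as the last step.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    # Hypothetical subclass carrying one custom metadata field.
    _metadata = ["tag"]

    @property
    def _constructor(self):
        return TaggedFrame


def sum_keeping_metadata(frame, by):
    # Aggregate first; metadata handling happens as late as possible.
    result = frame.groupby(by).sum()
    # Re-wrap in the input's own type, then copy _metadata from the input.
    return frame._constructor(result).__finalize__(frame)


df = TaggedFrame({"g": [0, 0, 1], "x": [1, 2, 3]})
df.tag = "hello"

summed = sum_keeping_metadata(df, "g")
print(summed.tag)
print(summed["x"].tolist())
```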
I guess one way to get a sense for the performance impact is to count how many times |
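One possible way to do such a count, assuming the calls of interest are `__finalize__` calls (the comment above is cut off, so this is a guess), is a throwaway subclass that records every call. `CountingFrame` and `calls` are hypothetical names used only for this sketch.

```python
from collections import Counter

import pandas as pd

# Tally of __finalize__ calls, keyed by the `method` argument pandas passes.
calls = Counter()


class CountingFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return CountingFrame

    def __finalize__(self, other, method=None, **kwargs):
        calls[method] += 1
        return super().__finalize__(other, method=method, **kwargs)


frame = CountingFrame({"g": [0, 0, 1, 1], "x": [1, 2, 3, 4]})

# Exercise the code path of interest, then inspect the tally.
for _, group in frame.groupby("g"):
    pass

print(dict(calls))
```

The resulting counts depend on the pandas version, so they are best compared between two builds rather than read as absolute numbers.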
Good point @TomAugspurger. With respect to my first commit, adding With respect to the Also seems I have a bunch of failing tests, so will take a look at that. Please let me know if you would rather take this in another direction. |
does this need a release note?
Added a release note for 1.1.4 |
@Japanuspus pls limit changes to one thing in a PR, esp as this is a backport. |
Merging from commit 'ec8c1c4ecf1f0d375d8d1287ee5cbd456852faea' which is most recent commit with green tests
I am slightly confused about @jreback mentioning this being a backport PR -- my intention was to target master. Or does this just mean that the patch will both apply to master and release branches? |
this is a regression, so we target master, then it's backported. so pls limit the change to the regression. if you want to do another change that targets master only that's ok, but separate PR |
Thank you for the explanation. |
lgtm. @TomAugspurger if any comments. |
thanks @Japanuspus |
…DataFrame.groupby doesn't preserve _metadata
…esn't preserve _metadata (#37122) Co-authored-by: Janus <janus@insignificancegalore.net>
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff