Fix GH-29442 DataFrame.groupby doesn't preserve _metadata #35688

Japanuspus · 2020-08-12T13:47:15Z

This bug is a regression in v1.1.0 and was introduced by the fix for GH-34214 in commit 6f065b.

Underlying cause is that the *Splitter classes do not use the ._constructor property and do not call __finalize__.

Please note that the method name used for __finalize__ calls was my best guess since documentation for the value has been hard to find.
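
For readers unfamiliar with the two hooks named above, here is a minimal sketch of how a DataFrame subclass declares custom metadata and how `__finalize__` copies it. The class name `TaggedFrame` and the attribute `tag` are made up for illustration and are not part of this PR.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass carrying one custom attribute."""

    # Attribute names listed here are the ones __finalize__ copies over.
    _metadata = ["tag"]

    @property
    def _constructor(self):
        # pandas calls this to build results of the same subclass.
        return TaggedFrame


df = TaggedFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})
df.tag = "experiment-1"

# Building a result with the base class loses the subclass and the metadata.
plain = pd.DataFrame(df)
print(type(plain).__name__, hasattr(plain, "tag"))

# Building it via _constructor keeps the subclass; __finalize__ then copies
# the attributes named in _metadata from the source frame.
fresh = df._constructor({"total": [6]})
had_tag_before = hasattr(fresh, "tag")
fresh = fresh.__finalize__(df)
print(type(fresh).__name__, had_tag_before, fresh.tag)
```

Code that constructs results with a plain `DataFrame(...)` call, or that skips `__finalize__`, behaves like the `plain` case here.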

closes DataFrame.groupby doesn't preserve _metadata #29442
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

pep8speaks · 2020-08-12T13:47:20Z

Hello @Japanuspus! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-09 12:42:14 UTC

Japanuspus · 2020-08-13T12:07:50Z

@jbrockmendel This PR is a two-line edit to fix a metadata regression in your recent work on groupby-performance. Would you be available to review?

I am available to implement this differently if needed.

jbrockmendel · 2020-08-13T15:34:06Z

cc @TomAugspurger since this is about __finalize__ being called. I'm ambivalent, since this is Technically Correct, but also a place where we're really trying to avoid overhead since these methods get called in a loop.

TomAugspurger · 2020-08-13T18:40:10Z

At a glance, this feels like too low of a level to call __finalize__. Why can't we call it on the outer level, in whatever function is defining DataFrameGroupBy.sum()?

Japanuspus · 2020-08-13T20:29:26Z

Thank you both for looking into this! I will start with a more concise testcase.

Should I also have a go at moving the __finalize__ call to an outer layer, or are either of you looking into that in detail?

Japanuspus · 2020-08-14T09:32:08Z

Regarding performance implications of __finalize__ in _chop

I tried running the benchmark from GH-34214, but this seems to hit .fast_apply and not ._chop.

Can either of you come up with a benchmark that hits ._chop in a hot loop where calling __finalize__ on the groups would really be wasted?

(fast_apply benchmark for reference)

import numpy as np
from pandas import DataFrame

N = 10 ** 4
labels = np.random.randint(0, 2000, size=N)
labels2 = np.random.randint(0, 3, size=N)
df = DataFrame(
    {
        "key": labels,
        "key2": labels2,
        "value1": np.random.randn(N),
        "value2": ["foo", "bar", "baz", "qux"] * (N // 4),
    }
)

%timeit df.groupby(["key", "key2"]).apply(lambda x: 1)
%timeit df.groupby("key").apply(lambda x: 1)

jbrockmendel · 2020-08-14T15:57:41Z

> Can either of you come up with a benchmark that hits ._chop in a hot loop where calling __finalize__ on the groups would really be wasted?

Try setting df.index = pd.CategoricalIndex(df.index)

Japanuspus · 2020-08-17T08:59:06Z

Just running df.index = pd.CategoricalIndex(df.index) for the df produced in the fast_apply benchmark does not hit ._chop. But I have a feeling that was maybe not what you meant?

TomAugspurger · 2020-08-17T20:28:52Z

Perhaps run the groupby with a CategoricalIndex?

Japanuspus · 2020-08-18T09:24:00Z

Thanks for helping out! Managed to get a benchmark with some degradation:

With df as above

df.index = pd.CategoricalIndex(df.key)
%timeit df.groupby(level='key').apply(lambda x: 1)

Result was a 4% performance hit:

This patch: 85.3 ms ± 570 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.1.x from source: 81.9 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
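
For anyone who wants to repeat this comparison outside IPython, the following is a rough script equivalent using the standard-library `timeit` module. It is a sketch under assumptions: the frame is reduced to two columns, a seeded generator replaces `np.random.randint`, and `observed=True` is passed explicitly (not in the original benchmark) to avoid version-dependent defaults for categorical groupers. Absolute numbers will differ from those reported above.

```python
import timeit

import numpy as np
import pandas as pd

N = 10 ** 4
rng = np.random.default_rng(0)  # seeded for repeatability (assumption)
df = pd.DataFrame(
    {
        "key": rng.integers(0, 2000, size=N),
        "value1": rng.standard_normal(N),
    }
)
# Grouping on a CategoricalIndex level is what steered the benchmark
# onto the slower code path in this thread.
df.index = pd.CategoricalIndex(df["key"])


def run():
    return df.groupby(level="key", observed=True).apply(lambda x: 1)


# Best of 3 repeats, 3 calls each, reported per call.
best = min(timeit.repeat(run, number=3, repeat=3)) / 3
print(f"{best * 1e3:.1f} ms per call")
```

Run the script once on each branch and compare the printed values.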

I will not have time to learn enough of the inner workings of pandas to improve this on my own, but am willing to help out if you have some concrete pointers.
Alternatively you could accept as-is, and then have a look at performance once tests are in place? Or just drop this (I can copy-paste the testcase to the issue thread) -- for my own use case I have implemented a fallback to handle initialization without __finalize__, so all is good for me.

In order to propagate metadata fields, the `__finalize__` method must be called on the resulting DataFrame with a reference to the input. Implementing this in `_GroupBy._agg_general` performs the call as late as possible for the `.sum()` (and similar) code paths. Fixes GH-29442.
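
The idea of finalizing once, on the aggregated result rather than on each group, can be sketched from user code as follows. This is not the pandas-internal change itself: `grouped_sum_keep_metadata` and `TaggedFrame` are hypothetical names, and `method="groupby"` is an assumed label for the call.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    """Hypothetical subclass with one metadata attribute."""

    _metadata = ["tag"]

    @property
    def _constructor(self):
        return TaggedFrame


def grouped_sum_keep_metadata(frame, key):
    """Hypothetical helper: aggregate first, then finalize the result once."""
    result = frame.groupby(key).sum()
    # Rewrap in case the aggregation returned a plain DataFrame, then copy
    # the _metadata attributes from the input in a single call.
    return frame._constructor(result).__finalize__(frame, method="groupby")


df = TaggedFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})
df.tag = "experiment-1"

out = grouped_sum_keep_metadata(df, "key")
print(type(out).__name__, out.tag)
print(out["value"].tolist())
```

The per-group work is untouched here; only one extra call is made, and its cost does not depend on the number of groups.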

TomAugspurger · 2020-08-27T14:03:47Z

I guess one way to get a sense for the performance impact is to count how many times __finalize__ is called. I think ideally it's called exactly once, rather than once per group.
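
One possible way to do that counting from outside pandas is to override `__finalize__` in a subclass and tally the calls. The sketch below assumes a hypothetical `CountingFrame` class. It only sees calls made on DataFrame-subclass objects, so calls on intermediate `Series` objects are not counted, and the totals will vary between pandas versions.

```python
from collections import Counter

import pandas as pd

# Tally of __finalize__ calls, keyed by the `method` label passed in.
finalize_calls = Counter()


class CountingFrame(pd.DataFrame):
    """Hypothetical subclass that records every __finalize__ call."""

    _metadata = ["tag"]

    @property
    def _constructor(self):
        return CountingFrame

    def __finalize__(self, other, method=None, **kwargs):
        finalize_calls[method] += 1
        return super().__finalize__(other, method=method, **kwargs)


df = CountingFrame({"key": list("aabbcc"), "value": range(6)})
df.tag = "experiment-1"

# Ignore calls made while building the frame; count only the groupby.
finalize_calls.clear()
result = df.groupby("key").sum()
print(dict(finalize_calls))
```

A total that grows with the number of groups would point to a per-group call; a constant total would match the "exactly once" ideal.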

Japanuspus · 2020-08-27T21:58:50Z

Good point @TomAugspurger.

With respect to my first commit, adding __finalize__ in DataSplitter._chop, it seems this could be improved for the code paths tested here by moving __finalize__ out to _GroupBy.__iter__ and _GroupBy.apply. Should I make an attempt to do so, or would you rather take another approach?

With respect to the sum-test, I could not see how to move the call further out than _agg_general without having to duplicate it across all the aggregation methods? 8d3d896

Also seems I have a bunch of failing tests, so will take a look at that. Please let me know if you would rather take this in another direction.
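
As a user-side illustration of the iteration case mentioned above, metadata can be re-attached to each group as it is yielded. This is a sketch only: `iter_groups_keep_metadata` and `TaggedFrame` are hypothetical names, and the helper wraps the public `groupby` iterator rather than changing `_GroupBy.__iter__`.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    """Hypothetical subclass with one metadata attribute."""

    _metadata = ["tag"]

    @property
    def _constructor(self):
        return TaggedFrame


def iter_groups_keep_metadata(frame, key):
    """Hypothetical helper: yield (name, group) pairs with metadata restored."""
    for name, group in frame.groupby(key):
        # One __finalize__ call per yielded group, made at the outer level.
        yield name, frame._constructor(group).__finalize__(frame)


df = TaggedFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})
df.tag = "experiment-1"

groups = dict(iter_groups_keep_metadata(df, "key"))
for name, group in groups.items():
    print(name, type(group).__name__, group.tag, len(group))
```

Here the call count equals the number of groups, which is unavoidable when each group is handed back to the caller as its own frame.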

simonjayhawkins

does this need a release note?

Japanuspus · 2020-10-07T11:22:45Z

> does this need a release note?

Added a release note for 1.1.4

jreback · 2020-10-07T12:17:05Z

@Japanuspus pls limit changes to one thing in a PR, esp as this is a backport.

Japanuspus · 2020-10-09T12:51:55Z

I am slightly confused about @jreback mentioning this being a backport PR -- my intention was to target master. Or does this just mean that the patch will both apply to master and release branches?

jreback · 2020-10-10T22:14:43Z

> I am slightly confused about @jreback mentioning this being a backport PR -- my intention was to target master. Or does this just mean that the patch will both apply to master and release branches?

This is a regression, so we target master, and then it's backported. So please limit the change to the regression. If you want to make another change that targets master only, that's OK, but put it in a separate PR.

Japanuspus · 2020-10-11T05:20:12Z

> > I am slightly confused about @jreback mentioning this being a backport PR -- my intention was to target master. Or does this just mean that the patch will both apply to master and release branches?

> This is a regression, so we target master, and then it's backported. So please limit the change to the regression. If you want to make another change that targets master only, that's OK, but put it in a separate PR.

Thank you for the explanation.
I believe the current PR is cut as tight as possible -- will go ahead with a separate PR for the iter(...groupby()) issue.

jreback · 2020-10-14T12:56:17Z

lgtm. @TomAugspurger if any comments.

jreback · 2020-10-14T18:37:56Z

thanks @Japanuspus

Japanuspus added 4 commits August 12, 2020 16:01

Fix most PEP8 issues

0ab766d

I don't know how to truncate remaining long lines without losing information.

Fix remaining PEP8 issues

d7b42e3

Apply "black" styling and fix typo

fbc602c

This should hopefully resolve remaining linter issues.

Fix import sorting issues reported by isort

e9c64ac

jbrockmendel reviewed Aug 13, 2020

jbrockmendel reviewed Aug 13, 2020

jreback added the Compat and Groupby labels Aug 13, 2020

Make testcase more concise

ec7fd00

simonjayhawkins mentioned this pull request Aug 18, 2020

RLS: 1.1.1 #35489

Closed

jreback requested changes Aug 19, 2020

Japanuspus added 3 commits August 27, 2020 09:28

Include test from original issue

1405667

Added test case `test_groupby_sum_with_custom_metadata` for the functionality exercised in GH-29442. The test case fails on current code.

Apply black format

39e9f33

has2k1 mentioned this pull request Aug 31, 2020

group_by broken has2k1/plydata#23

Closed

simonjayhawkins requested changes Sep 7, 2020

simonjayhawkins mentioned this pull request Sep 8, 2020

DataFrame.groupby doesn't preserve _metadata #29442

Closed

simonjayhawkins added this to the 1.1.4 milestone Oct 7, 2020

jreback requested changes Oct 7, 2020

Japanuspus added 4 commits October 8, 2020 15:34

Revert all changes not strictly related to issue

5548699

Use descriptive names for unit tests

18e8ff5

Remove extraneous imports

57694ae

Merge master into BUG_GH29442_frame_groupby_metadata

d3088a3

Merging from commit 'ec8c1c4ecf1f0d375d8d1287ee5cbd456852faea' which is most recent commit with green tests

jreback requested changes Oct 8, 2020

Japanuspus added 2 commits October 9, 2020 14:40

Delete tests for custom metadata

d86ae78

Remove reference to internal class from whatsnew

6227894

Japanuspus requested a review from jreback October 10, 2020 19:10

jreback approved these changes Oct 14, 2020

jreback merged commit d7a5b83 into pandas-dev:master Oct 14, 2020

lumberbot-app bot added the Still Needs Manual Backport label Oct 14, 2020

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Oct 14, 2020

Backport PR pandas-dev#35688 on branch 1.1.x: Fix pandas-devGH-29442 DataFrame.groupby doesn't preserve _metadata

9ad9221

simonjayhawkins mentioned this pull request Oct 14, 2020

Backport PR #35688 on branch 1.1.x: Fix GH-29442 DataFrame.groupby doesn't preserve _metadata #37122

Merged

simonjayhawkins removed the Still Needs Manual Backport label Oct 14, 2020

simonjayhawkins added a commit that referenced this pull request Oct 15, 2020

Backport PR #35688 on branch 1.1.x: Fix GH-29442 DataFrame.groupby doesn't preserve _metadata (#37122)

c202736

Co-authored-by: Janus <janus@insignificancegalore.net>

Japanuspus mentioned this pull request Oct 20, 2020

Various methods don't call call __finalize__ #28283

Open

38 tasks

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Oct 26, 2020

Fix pandas-devGH-29442 DataFrame.groupby doesn't preserve _metadata (pandas-dev#35688)

c3ed803

jorisvandenbossche mentioned this pull request Oct 27, 2020

BUG: groupby __iter__ on pandas 1.1.x not propagating _metadata on DataFrame subclasses #37343

Closed

3 tasks

Japanuspus mentioned this pull request Oct 28, 2020

BUG: Metadata propagation for groupby iterator #37461

Merged

5 tasks

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

Fix pandas-devGH-29442 DataFrame.groupby doesn't preserve _metadata (pandas-dev#35688)

1895f78
