Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix GH-29442 DataFrame.groupby doesn't preserve _metadata #35688

Merged

Conversation

Japanuspus
Copy link
Contributor

@Japanuspus Japanuspus commented Aug 12, 2020

This bug is a regression in v1.1.0 and was introduced by the fix for GH-34214 in commit 6f065b.

Underlying cause is that the *Splitter classes do not use the ._constructor property and do not call __finalize__.

Please note that the method name used for __finalize__ calls was my best guess since documentation for the value has been hard to find.

This bug is a regression in v1.1.0 and was introduced by the fix for pandas-devGH-34214 in commit [6f065b].

Underlying cause is that the `*Splitter` classes do not use the `._constructor` property and do not call `__finalize__`.

Please note that the method name used for `__finalize__` calls was my best guess since documentation for the value has been hard to find.

[6f065b]: pandas-dev@6f065b6
@pep8speaks
Copy link

pep8speaks commented Aug 12, 2020

Hello @Japanuspus! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-09 12:42:14 UTC

I don't know how to truncate remaining long lines without loosing information.
This should hopefully resolve remaining linter issues.
@Japanuspus
Copy link
Contributor Author

@jbrockmendel This PR is a two-line edit to fix a metadata regression in your recent work on groupby-performance. Would you be available to review?

I am available to implement this is differently if needed.

@jbrockmendel
Copy link
Member

cc @TomAugspurger since this is about __finalize__ being called. I'm ambivalent, since this is Technically Correct, but also a place where we're really trying to avoid overhead since these methods get called in a loop.

@TomAugspurger
Copy link
Contributor

At a glance, this feels like too low of a level to call __finalize__. Why can't we call it on the outer level, in whatever function is defining DataFrameGroupBy.sum()?

@jreback jreback added Compat pandas objects compatability with Numpy or Python functions Groupby labels Aug 13, 2020
@Japanuspus
Copy link
Contributor Author

Thank you both for looking into this! I will start with a more concise testcase.

Should I also have a go at moving the __finalize__ call to an outer layer, or are either of you looking into that in detail?

@Japanuspus
Copy link
Contributor Author

Regarding performance implications of __finalize__ in _chop

I tried running the benchmark from GH-34214 , but this seems to hit .fast_apply and not ._chop.

Can either of you come up with a benchmark that hits ._chop in a hot loop where calling __finalize__ on the groups would really be wasted?

(fast_apply benchmark for reference)

import numpy as np
from pandas import DataFrame

N = 10 ** 4
labels = np.random.randint(0, 2000, size=N)
labels2 = np.random.randint(0, 3, size=N)
df = DataFrame(
    {
        "key": labels,
        "key2": labels2,
        "value1": np.random.randn(N),
        "value2": ["foo", "bar", "baz", "qux"] * (N // 4),
    }
)

%timeit df.groupby(["key", "key2"]).apply(lambda x: 1)
%timeit df.groupby("key").apply(lambda x: 1)

@jbrockmendel
Copy link
Member

Can either of you come up with a benchmark that hits ._chop in a hot loop where calling finalize on the groups would really be wasted?

Try setting df.index = pd.CategoricalIndex(df.index)

@Japanuspus
Copy link
Contributor Author

Just running df.index = pd.CategoricalIndex(df.index) for the df produced in the fast_apply benchmark does not hit _.chop. But I have a feeling that was maybe not what you meant?

@TomAugspurger
Copy link
Contributor

Perhaps run the groupby with a CategoricalIndex?

@Japanuspus
Copy link
Contributor Author

Thanks for helping out! Managed to get a benchmark with some degradation:

With df as above

df.index = pd.CategoricalIndex(df.key)
%timeit df.groupby(level='key').apply(lambda x: 1)

Result was a 4% performance hit:

  • This patch: 85.3 ms ± 570 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • 1.1.x from source: 81.9 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I will not have time to learn enough of the inner workings of pandas to improve this on my own, but am willing to help out if you have some concrete pointers.
Alternatively you could accept as-is, and then have a look at performance once tests are in place? Or just drop this (I can copy-paste the testcase to the issue thread) -- for my own use case I have implemented a fallback to handle initialization without __finalize__, so all is good for me.

@simonjayhawkins simonjayhawkins mentioned this pull request Aug 18, 2020
Added testcase `test_groupby_sum_with_custom_metadata` for functionality exercised in the #pandas-devGH-29442.
Testcase fails on current code.
In order to propagate metadata fields, the `__finalize__` method must be
called for the resulting DataFrame with a reference to input. By
implementing this in `_GroupBy._agg_general`, this is performed as late
as possible for the `.sum()` (and similar) code-paths.

Fixes #pandas-devGH-29442
@TomAugspurger
Copy link
Contributor

I guess one way to get a sense for the performance impact is to count how many times __finalize__ is called. I think ideally it's called exactly once, rather than once per group.

@Japanuspus
Copy link
Contributor Author

Good point @TomAugspurger.

With respect to my first commit, adding __finalize__ in DataSplitter._chop, it seems this could be improved for the code paths tested here by moving __finalize__ out to _GroupBy.__iter__ and _GroupBy.apply. Should I make an attempt to do so, or would you rather take another approach?

With respect to the sum-test, I could not see how to move the call further out than _agg_general without having to duplicate it across all the aggregation methods? 8d3d896

Also seems I have a bunch of failing tests, so will take a look at that. Please let me know if you would rather take this in another direction.

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need a release note?

@Japanuspus
Copy link
Contributor Author

does this need a release note?

Added a release note for 1.1.4

@simonjayhawkins simonjayhawkins added this to the 1.1.4 milestone Oct 7, 2020
pandas/tests/generic/test_finalize.py Outdated Show resolved Hide resolved
pandas/tests/groupby/test_custom_metadata.py Outdated Show resolved Hide resolved
pandas/tests/groupby/test_custom_metadata.py Outdated Show resolved Hide resolved
@jreback
Copy link
Contributor

jreback commented Oct 7, 2020

@Japanuspus pls limit changes to one thing in a PR, esp as this is a backport.

doc/source/whatsnew/v1.1.4.rst Outdated Show resolved Hide resolved
pandas/tests/groupby/test_custom_metadata.py Outdated Show resolved Hide resolved
@Japanuspus
Copy link
Contributor Author

I am slightly confused about @jreback mentioning this being a backport PR -- my intention was to target master. Or does this just mean that the patch will both apply to master and release branches?

@jreback
Copy link
Contributor

jreback commented Oct 10, 2020

I am slightly confused about @jreback mentioning this being a backport PR -- my intention was to target master. Or does this just mean that the patch will both apply to master and release branches?

this si a regresion, so we target master, then its backported. so pls limit the change to the regression. if you want to do another change that targets master only that's ok, but separate PR

@Japanuspus
Copy link
Contributor Author

I am slightly confused about @jreback mentioning this being a backport PR -- my intention was to target master. Or does this just mean that the patch will both apply to master and release branches?

this si a regresion, so we target master, then its backported. so pls limit the change to the regression. if you want to do another change that targets master only that's ok, but separate PR

Thank you for the explanation.
I believe the current PR is cut as tight as possible -- will go ahead with a separate PR for the iter(...groupby()) issue.

@jreback
Copy link
Contributor

jreback commented Oct 14, 2020

lgtm. @TomAugspurger if any comments.

@jreback jreback merged commit d7a5b83 into pandas-dev:master Oct 14, 2020
@jreback
Copy link
Contributor

jreback commented Oct 14, 2020

thanks @Japanuspus

@lumberbot-app

This comment has been minimized.

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Oct 14, 2020
…DataFrame.groupby doesn't preserve _metadata
simonjayhawkins added a commit that referenced this pull request Oct 15, 2020
…esn't preserve _metadata (#37122)

Co-authored-by: Janus <janus@insignificancegalore.net>
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame.groupby doesn't preserve _metadata
6 participants