-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: fix regression in creation of resulting index in RollingGroupby #38057
PERF: fix regression in creation of resulting index in RollingGroupby #38057
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @mroeschke
do we have an asv that covers this case?
We do but the ASV only has 1000 data points while the original example has 10000000 |
does this still show a decent time diff? (and should increase that asv to 10k or so) |
@jorisvandenbossche i something is actually failing here. can you merge master and see if you can fixup. |
I added an additional benchmark, instead of changing the existing one to include larger data. Because for the case of the benchmarks, we actually had a nice performance improvement in 1.0->1.1 (instead of regression). So it is clearly capturing different aspects. |
asv_bench/benchmarks/rolling.py
Outdated
@@ -225,6 +225,19 @@ def time_rolling_offset(self, method): | |||
getattr(self.groupby_roll_offset, method)() | |||
|
|||
|
|||
class Groupby2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you rename to GroupbyLowNumberOfGroups or similar
pandas/core/window/rolling.py
Outdated
codes = [c.take(indexer) for c in codes] | ||
|
||
if grouped_object_index is not None: | ||
if isinstance(grouped_object_index, MultiIndex): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe more clear
idx = grouped_object_index.take(indexer)
if not isinstance(grouped_object_index, MultiIndex):
idx = MultiIndex.from_arrays(idx)
So a summary of the failing tests. First case is on an empty dataframe, where the dtype of the level of the MultiIndex changed:
But for example, when doing a normal groupby on this empty dataframe, you also get back an empty float index:
So at least this is consistent with groupby itself (or, if it's a bug, it's probably one in groupby, and not in the rolling code) The second case is with groupby with
Again, this seems to be caused by groupby, as the same behaviour can be noticed there:
I don't know if we have a general rule about where missing values should be included? (in the levels vs the codes?) |
we likely don't support dropna=False in 1.1.x at all (or its just wrong), wouldn't worry about the 2nd case for now. (but can create an issue on master for this). |
Indeed, that second case was only introduced in #36842, which fixed |
I also added a few tests specifically related to the resulting MultiIndex (it's a bit hard to see to what extent the existing tests already cover this or not, but it are the examples I was using to test while implementing the fix). |
I changed the expected result for the first case as well (rolling operation on groupby). Since the original index of the dataframe is an (empty) RangeIndex, using a numeric index level seems actually more correct than an object type. A bit a corner case, though .. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, very minor comments (ok with merging as is though).
cc @mroeschke if you can have a quick look
pandas/tests/window/test_groupby.py
Outdated
tm.assert_frame_equal(result, expected) | ||
|
||
expected = DataFrame({"s1": [], "s2": []}) | ||
result = expected.groupby(["s1", "s2"]).rolling(window=1).sum() | ||
expected.index = MultiIndex.from_tuples([], names=["s1", "s2", None]) | ||
# expected.index = MultiIndex.from_tuples([], names=["s1", "s2", None]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you remove the commented code :-<
pandas/tests/window/test_groupby.py
Outdated
@@ -418,12 +426,24 @@ def test_groupby_rolling_empty_frame(self): | |||
# GH 36197 | |||
expected = DataFrame({"s1": []}) | |||
result = expected.groupby("s1").rolling(window=1).sum() | |||
expected.index = MultiIndex.from_tuples([], names=["s1", None]) | |||
# GH-38057 from_tupes gives empty object dtype, we now get float/int levels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from_tuples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to add a note that the return index is now 'fixed'?
For the dropna case certainly not, since that was only something in master. For the empty it could be mentioned, but not fully sure it is worth it (I am also not fully convinced it is completely fixed, I think groupby should maybe give an empty int64 index instead of empty float64 index). But the performance fix is certainly worth noting, will add a whatsnew for that. |
lgtm. just wanted to have @mroeschke give a glance. |
pandas/core/window/rolling.py
Outdated
|
||
group_indices = self._groupby.grouper.indices.values() | ||
if group_indices: | ||
indexer = np.concatenate(list(self._groupby.grouper.indices.values())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think this can simply be indexer = np.concatenate(list(group_indices))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor comment otherwise LGTM (can merge without it).
Thanks for tracking this down @jorisvandenbossche!
test failure unrelated and seen on other PRs
|
@meeseeksdev backport 1.1.x |
This comment has been minimized.
This comment has been minimized.
…ulting index in RollingGroupby
…ex in RollingGroupby (#38211) Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Thanks for the quick fix on this one, and I hope my simplified example wasn't misleading. I originally spotted this on a real life example with in the order of five times as much data and about 200 groups. Do you think this still classes as a low number of groups? |
@cpmbailey you're welcome (and thank you for reporting a easy reproducible example). |
This is much improved, I only get twice as long against 1.0.5 now. |
Closes #38038
TODO: fix corner cases, add benchmark, ..