PERF: fix regression in creation of resulting index in RollingGroupby #38057

jorisvandenbossche · 2020-11-25T09:03:16Z

Closes #38038

TODO: fix corner cases, add benchmark, ..

jreback

cc @mroeschke

do we have an asv that covers this case?

mroeschke · 2020-11-26T19:04:02Z

We do but the ASV only has 1000 data points while the original example has 10000000

jreback · 2020-11-28T18:06:38Z

We do but the ASV only has 1000 data points while the original example has 10000000

does this still show a decent time diff? (and should increase that asv to 10k or so)

jreback · 2020-11-29T19:08:57Z

@jorisvandenbossche i something is actually failing here. can you merge master and see if you can fixup.

jorisvandenbossche · 2020-11-30T13:31:45Z

We do but the ASV only has 1000 data points while the original example has 10000000

I added an additional benchmark, instead of changing the existing one to include larger data. Because for the case of the benchmarks, we actually had a nice performance improvement in 1.0->1.1 (instead of regression). So it is clearly capturing different aspects.
The newly added (and the report in #38038) is basically one with a limited number of groups, where the rolling apply calculation itself was quite cheap, and so the construction of the MulitIndex of the result dominated the profile.

asv_bench/benchmarks/rolling.py

jreback · 2020-11-30T13:43:22Z

asv_bench/benchmarks/rolling.py

@@ -225,6 +225,19 @@ def time_rolling_offset(self, method):
        getattr(self.groupby_roll_offset, method)()


+class Groupby2:


can you rename to GroupbyLowNumberOfGroups or similar

pandas/core/window/rolling.py

jreback · 2020-11-30T13:45:06Z

pandas/core/window/rolling.py

+        codes = [c.take(indexer) for c in codes]
+
+        if grouped_object_index is not None:
+            if isinstance(grouped_object_index, MultiIndex):


maybe more clear

idx = grouped_object_index.take(indexer) if not isinstance(grouped_object_index, MultiIndex): idx = MultiIndex.from_arrays(idx)

jorisvandenbossche · 2020-11-30T13:51:59Z

So a summary of the failing tests.

First case is on an empty dataframe, where the dtype of the level of the MultiIndex changed:

df = pd.DataFrame({"s1": []})
result = df.groupby("s1").rolling(window=1).sum()
expected = df.copy()
expected.index = pd.MultiIndex.from_tuples([], names=["s1", None])

In [5]: df.index
Out[5]: RangeIndex(start=0, stop=0, step=1)

In [6]: result.index.levels[0]
Out[6]: Float64Index([], dtype='float64', name='s1')

In [7]: expected.index.levels[0]
Out[7]: Index([], dtype='object', name='s1')

In [9]: result.index.levels[1]
Out[9]: Int64Index([], dtype='int64')

In [10]: expected.index.levels[1]
Out[10]: Index([], dtype='object')

But for example, when doing a normal groupby on this empty dataframe, you also get back an empty float index:

In [12]: df.groupby("s1").sum().index
Out[12]: Float64Index([], dtype='float64', name='s1')

So at least this is consistent with groupby itself (or, if it's a bug, it's probably one in groupby, and not in the rolling code)

The second case is with groupby with dropna=False, and the difference is about the NaN being included in the levels vs in the codes of the resulting MultiIndex:

df = pd.DataFrame(
    {"group": ["g1", np.nan, "g1", "g2", np.nan], "B": [0, 1, 2, 3, 4]}
)
result = df.groupby("group", dropna=False).rolling(1, min_periods=1).mean()
expected = pd.DataFrame(
    {"B": [0.0, 2.0, 3.0, 1.0, 4.0]},
    index=pd.MultiIndex.from_tuples(
        [("g1", 0), ("g1", 2), ("g2", 3), (np.nan, 1), (np.nan, 4)],
        names=["group", None],
    )
)

In [16]: result.index.levels[0]
Out[16]: Index(['g1', 'g2', nan], dtype='object', name='group')

In [17]: result.index.codes[0]
Out[17]: array([0, 0, 1, 2, 2], dtype=int8)

In [18]: expected.index.levels[0]
Out[18]: Index(['g1', 'g2'], dtype='object', name='group')

In [19]: expected.index.codes[0]
Out[19]: array([ 0,  0,  1, -1, -1], dtype=int8)

Again, this seems to be caused by groupby, as the same behaviour can be noticed there:

# adding a group column so we can get a MultiIndex as result
In [20]: df["group2"] = 1

In [21]: result2 = df.groupby(["group", "group2"], dropna=False).mean()

In [22]: result2
Out[22]: 
                B
group group2     
g1    1       1.0
g2    1       3.0
NaN   1       2.5

In [23]: result2.index.levels[0]
Out[23]: Index(['g1', 'g2', nan], dtype='object', name='group')

In [24]: result2.index.codes[0]
Out[24]: array([0, 1, 2], dtype=int8)

I don't know if we have a general rule about where missing values should be included? (in the levels vs the codes?)
I would expect that we generally prefer to have them in the codes, though (but again, this is then more a bug in groupby, and not rolling)

jreback · 2020-11-30T13:54:55Z

we likely don't support dropna=False in 1.1.x at all (or its just wrong), wouldn't worry about the 2nd case for now. (but can create an issue on master for this).

jorisvandenbossche · 2020-11-30T13:58:53Z

Indeed, that second case was only introduced in #36842, which fixed dropna=False case (so on master, not yet in a released version).
And I can confirm that on pandas 1.1.4, the NaNs were incorrectly dropped for the example above. So changing the test won't change behaviour compared to previous releases, so will update the test.

jorisvandenbossche · 2020-11-30T14:30:42Z

I also added a few tests specifically related to the resulting MultiIndex (it's a bit hard to see to what extent the existing tests already cover this or not, but it are the examples I was using to test while implementing the fix).
I wrote the examples and expected result using pandas 1.0.5, and also tested that they are passing on master (without this change) as well.

…ted MI

jorisvandenbossche · 2020-11-30T14:49:56Z

I changed the expected result for the first case as well (rolling operation on groupby). Since the original index of the dataframe is an (empty) RangeIndex, using a numeric index level seems actually more correct than an object type. A bit a corner case, though ..

jreback

lgtm, very minor comments (ok with merging as is though).

cc @mroeschke if you can have a quick look

jreback · 2020-11-30T18:45:57Z

pandas/tests/window/test_groupby.py

        tm.assert_frame_equal(result, expected)

        expected = DataFrame({"s1": [], "s2": []})
        result = expected.groupby(["s1", "s2"]).rolling(window=1).sum()
-        expected.index = MultiIndex.from_tuples([], names=["s1", "s2", None])
+        # expected.index = MultiIndex.from_tuples([], names=["s1", "s2", None])


can you remove the commented code :-<

jreback · 2020-11-30T18:46:01Z

pandas/tests/window/test_groupby.py

@@ -418,12 +426,24 @@ def test_groupby_rolling_empty_frame(self):
        # GH 36197
        expected = DataFrame({"s1": []})
        result = expected.groupby("s1").rolling(window=1).sum()
-        expected.index = MultiIndex.from_tuples([], names=["s1", None])
+        # GH-38057 from_tupes gives empty object dtype, we now get float/int levels


from_tuples

jreback

do we need to add a note that the return index is now 'fixed'?

jorisvandenbossche · 2020-11-30T19:27:47Z

do we need to add a note that the return index is now 'fixed'?

For the dropna case certainly not, since that was only something in master. For the empty it could be mentioned, but not fully sure it is worth it (I am also not fully convinced it is completely fixed, I think groupby should maybe give an empty int64 index instead of empty float64 index).

But the performance fix is certainly worth noting, will add a whatsnew for that.

jreback · 2020-12-01T00:12:38Z

lgtm. just wanted to have @mroeschke give a glance.

mroeschke · 2020-12-01T02:29:35Z

pandas/core/window/rolling.py

+
+        group_indices = self._groupby.grouper.indices.values()
+        if group_indices:
+            indexer = np.concatenate(list(self._groupby.grouper.indices.values()))


Nit: I think this can simply be indexer = np.concatenate(list(group_indices))

mroeschke

One minor comment otherwise LGTM (can merge without it).

Thanks for tracking this down @jorisvandenbossche!

simonjayhawkins · 2020-12-01T09:12:01Z

test failure unrelated and seen on other PRs

=========================== short test summary info ============================
FAILED pandas/tests/series/indexing/test_indexing.py::test_loc_setitem_2d_to_1d_raises
FAILED pandas/tests/util/test_show_versions.py::test_show_versions - Assertio...
= 2 failed, 133017 passed, 13911 skipped, 1390 xfailed, 41 xpassed in 1053.70s (0:17:33) =

simonjayhawkins · 2020-12-01T09:13:32Z

@meeseeksdev backport 1.1.x

…ulting index in RollingGroupby

…ex in RollingGroupby (#38211) Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

cpmbailey · 2020-12-05T08:34:04Z

Thanks for the quick fix on this one, and I hope my simplified example wasn't misleading. I originally spotted this on a real life example with in the order of five times as much data and about 200 groups. Do you think this still classes as a low number of groups?

jorisvandenbossche · 2020-12-05T19:50:10Z

@cpmbailey you're welcome (and thank you for reporting a easy reproducible example).
It's not the exact number of groups, but mostly the relative number compared to the total size that triggers the issue. So for large data, 200 groups can be relatively small.

cpmbailey · 2020-12-08T14:36:41Z

This is much improved, I only get twice as long against 1.0.5 now.

PERF: fix regression in creation of resulting index in RollingGroupby

1679318

jorisvandenbossche mentioned this pull request Nov 25, 2020

REGR: Performance regression on RollingGroupby #38038

Closed

simonjayhawkins added this to the 1.1.5 milestone Nov 25, 2020

jreback added Groupby Performance Memory or execution speed performance Window rolling, ewma, expanding labels Nov 25, 2020

jreback requested changes Nov 25, 2020

View reviewed changes

Merge remote-tracking branch 'upstream/master' into perf-rolling-groupby

30fb295

add specific benchmark

fd9991c

jreback requested changes Nov 30, 2020

View reviewed changes

jorisvandenbossche added 3 commits November 30, 2020 15:00

rename benchmark

413d701

update test_groupby_rolling_nan_included

f785ca4

add more tests specifically about resulting MI

19c9d3e

jorisvandenbossche marked this pull request as ready for review November 30, 2020 14:28

jorisvandenbossche added 2 commits November 30, 2020 15:36

add comment + simplify

062e298

update test_groupby_rolling_empty_frame test - change dtypes of expec…

4546fd7

…ted MI

simonjayhawkins mentioned this pull request Nov 30, 2020

RLS: 1.2 #37784

Closed

jreback requested changes Nov 30, 2020

View reviewed changes

add whatsnew

9e9c8de

cleanup test

cf64f20

jreback approved these changes Dec 1, 2020

View reviewed changes

mroeschke reviewed Dec 1, 2020

View reviewed changes

mroeschke approved these changes Dec 1, 2020

View reviewed changes

simplify

485f399

simonjayhawkins merged commit 6d0dab4 into pandas-dev:master Dec 1, 2020

This comment has been minimized.

Sign in to view

lumberbot-app bot added the Still Needs Manual Backport label Dec 1, 2020

jorisvandenbossche deleted the perf-rolling-groupby branch December 1, 2020 09:30

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Dec 1, 2020

Backport PR pandas-dev#38057: PERF: fix regression in creation of res…

83cb681

…ulting index in RollingGroupby

simonjayhawkins mentioned this pull request Dec 1, 2020

Backport PR #38057: PERF: fix regression in creation of resulting index in RollingGroupby #38211

Merged

simonjayhawkins removed the Still Needs Manual Backport label Dec 1, 2020

simonjayhawkins added a commit that referenced this pull request Dec 1, 2020

Backport PR #38057: PERF: fix regression in creation of resulting ind…

993557b

…ex in RollingGroupby (#38211) Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

dependabot bot mentioned this pull request Mar 16, 2021

Bump pandas from 1.1.4 to 1.1.5 kotamya/pandas-estat#82

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: fix regression in creation of resulting index in RollingGroupby #38057

PERF: fix regression in creation of resulting index in RollingGroupby #38057

jorisvandenbossche commented Nov 25, 2020 •

edited

Loading

jreback left a comment

mroeschke commented Nov 26, 2020

jreback commented Nov 28, 2020

jreback commented Nov 29, 2020

jorisvandenbossche commented Nov 30, 2020

jreback Nov 30, 2020

jreback Nov 30, 2020

jorisvandenbossche commented Nov 30, 2020

jreback commented Nov 30, 2020 •

edited

Loading

jorisvandenbossche commented Nov 30, 2020

jorisvandenbossche commented Nov 30, 2020

jorisvandenbossche commented Nov 30, 2020

jreback left a comment

jreback Nov 30, 2020

jreback Nov 30, 2020

jreback left a comment

jorisvandenbossche commented Nov 30, 2020

jreback commented Dec 1, 2020

mroeschke Dec 1, 2020 •

edited

Loading

mroeschke left a comment

simonjayhawkins commented Dec 1, 2020

simonjayhawkins commented Dec 1, 2020

This comment has been minimized.

cpmbailey commented Dec 5, 2020

jorisvandenbossche commented Dec 5, 2020

cpmbailey commented Dec 8, 2020

		@@ -225,6 +225,19 @@ def time_rolling_offset(self, method):
		getattr(self.groupby_roll_offset, method)()


		class Groupby2:

PERF: fix regression in creation of resulting index in RollingGroupby #38057

PERF: fix regression in creation of resulting index in RollingGroupby #38057

Conversation

jorisvandenbossche commented Nov 25, 2020 • edited Loading

jreback left a comment

Choose a reason for hiding this comment

mroeschke commented Nov 26, 2020

jreback commented Nov 28, 2020

jreback commented Nov 29, 2020

jorisvandenbossche commented Nov 30, 2020

jreback Nov 30, 2020

Choose a reason for hiding this comment

jreback Nov 30, 2020

Choose a reason for hiding this comment

jorisvandenbossche commented Nov 30, 2020

jreback commented Nov 30, 2020 • edited Loading

jorisvandenbossche commented Nov 30, 2020

jorisvandenbossche commented Nov 30, 2020

jorisvandenbossche commented Nov 30, 2020

jreback left a comment

Choose a reason for hiding this comment

jreback Nov 30, 2020

Choose a reason for hiding this comment

jreback Nov 30, 2020

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

jorisvandenbossche commented Nov 30, 2020

jreback commented Dec 1, 2020

mroeschke Dec 1, 2020 • edited Loading

Choose a reason for hiding this comment

mroeschke left a comment

Choose a reason for hiding this comment

simonjayhawkins commented Dec 1, 2020

simonjayhawkins commented Dec 1, 2020

This comment has been minimized.

cpmbailey commented Dec 5, 2020

jorisvandenbossche commented Dec 5, 2020

cpmbailey commented Dec 8, 2020

jorisvandenbossche commented Nov 25, 2020 •

edited

Loading

jreback commented Nov 30, 2020 •

edited

Loading

mroeschke Dec 1, 2020 •

edited

Loading