PERF-#5369: GroupBy.skew implementation via MapReduce pattern #5318

Merged 9 commits on Dec 14, 2022

Conversation

@dchigarev (Collaborator) commented on Dec 2, 2022

What do these changes do?

Here are some performance comparisons of .groupby().skew() between the current master and the new MapReduce implementation. The comparison was run via ASV with our existing groupby scenarios. CPU: 2x 28 Cores (56 threads) Xeon Platinum 8276L @ 2.20 GHz.

NCPUS=112
       before           after         ratio
     [a77a6464]       [05027610]
     <master>         <skew_impl>
+      1.46±0.04s        7.03±0.3s     4.83  benchmarks.TimeGroupByDefaultAggregations.time_groupby_skew([5000, 5000], ngroups=100)
+      1.85±0.06s       7.59±0.03s     4.09  benchmarks.TimeGroupByMultiColumn.time_groupby_agg_skew([5000, 5000], ngroups=100, by_ncols=6)
+       3.53±0.1s        7.11±0.1s     2.02  benchmarks.TimeGroupByDefaultAggregations.time_groupby_skew([5000, 5000], 'huge_amount_groups')
+      4.18±0.05s        8.30±0.4s     1.98  benchmarks.TimeGroupByMultiColumn.time_groupby_agg_skew([5000, 5000], 'huge_amount_groups', by_ncols=6)
-       4.38±0.1s       3.04±0.07s     0.69  benchmarks.TimeGroupByMultiColumn.time_groupby_agg_skew([1_000_000, 256], 'huge_amount_groups', by_ncols=6)
-       3.86±0.2s       2.60±0.02s     0.67  benchmarks.TimeGroupByDefaultAggregations.time_groupby_skew([1_000_000, 256], 'huge_amount_groups')
-      1.91±0.01s       1.24±0.04s     0.65  benchmarks.TimeGroupByMultiColumn.time_groupby_agg_skew([1_000_000, 256], ngroups=100, by_ncols=6)
-       1.66±0.1s       1.04±0.02s     0.63  benchmarks.TimeGroupByDefaultAggregations.time_groupby_skew([1_000_000, 256], ngroups=100)
-      3.33±0.07s       1.20±0.03s     0.36  benchmarks.TimeGroupByMultiColumn.time_groupby_agg_skew([1_000_000, 32], 'huge_amount_groups', by_ncols=6)
-      1.10±0.03s          388±7ms     0.35  benchmarks.TimeGroupByMultiColumn.time_groupby_agg_skew([1_000_000, 32], ngroups=100, by_ncols=6)
-        939±20ms         308±10ms     0.33  benchmarks.TimeGroupByDefaultAggregations.time_groupby_skew([1_000_000, 32], ngroups=100)
-      3.06±0.05s         982±70ms     0.32  benchmarks.TimeGroupByDefaultAggregations.time_groupby_skew([1_000_000, 32], 'huge_amount_groups')
-       10.0±0.2s       1.93±0.07s     0.19  benchmarks.TimeGroupByDefaultAggregations.time_groupby_skew([10_000_000, 32], 'huge_amount_groups')
-       12.2±0.3s        2.02±0.1s     0.17  benchmarks.TimeGroupByMultiColumn.time_groupby_agg_skew([10_000_000, 32], 'huge_amount_groups', by_ncols=6)
-       12.7±0.2s       1.14±0.06s     0.09  benchmarks.TimeGroupByMultiColumn.time_groupby_agg_skew([10_000_000, 32], ngroups=100, by_ncols=6)
-       12.0±0.8s       1.07±0.04s     0.09  benchmarks.TimeGroupByDefaultAggregations.time_groupby_skew([10_000_000, 32], ngroups=100)
NCPUS=16
       before           after         ratio
     [a77a6464]       [bb932f4d]
       <master>       <skew_impl>
+        202±20ms         504±20ms     2.49  benchmarks.TimeGroupByMultiColumn.time_groupby_agg_skew([5000, 5000], ngroups=100, by_ncols=6)
+        217±20ms         424±20ms     1.95  benchmarks.TimeGroupByDefaultAggregations.time_groupby_skew([5000, 5000], ngroups=100)
        1.61±0.1s       1.48±0.01s     0.92  benchmarks.TimeGroupByMultiColumn.time_groupby_agg_skew([1_000_000, 256], ngroups=100, by_ncols=6)
-      1.42±0.04s       1.28±0.01s     0.91  benchmarks.TimeGroupByDefaultAggregations.time_groupby_skew([1_000_000, 256], ngroups=100)
-      1.66±0.04s         906±50ms     0.55  benchmarks.TimeGroupByMultiColumn.time_groupby_agg_skew([5000, 5000], 'huge_amount_groups', by_ncols=6)
-      1.67±0.03s         832±30ms     0.50  benchmarks.TimeGroupByDefaultAggregations.time_groupby_skew([5000, 5000], 'huge_amount_groups')
-      3.47±0.06s       1.68±0.04s     0.48  benchmarks.TimeGroupByDefaultAggregations.time_groupby_skew([1_000_000, 256], 'huge_amount_groups')
-       3.93±0.1s       1.79±0.01s     0.46  benchmarks.TimeGroupByMultiColumn.time_groupby_agg_skew([1_000_000, 256], 'huge_amount_groups', by_ncols=6)
-        782±10ms         219±30ms     0.28  benchmarks.TimeGroupByDefaultAggregations.time_groupby_skew([1_000_000, 32], ngroups=100)
-        888±40ms         226±20ms     0.26  benchmarks.TimeGroupByMultiColumn.time_groupby_agg_skew([1_000_000, 32], ngroups=100, by_ncols=6)
-       9.56±0.3s        1.76±0.1s     0.18  benchmarks.TimeGroupByDefaultAggregations.time_groupby_skew([10_000_000, 32], 'huge_amount_groups')
-       10.5±0.3s        1.79±0.2s     0.17  benchmarks.TimeGroupByMultiColumn.time_groupby_agg_skew([10_000_000, 32], ngroups=100, by_ncols=6)
-       11.1±0.1s       1.86±0.07s     0.17  benchmarks.TimeGroupByMultiColumn.time_groupby_agg_skew([10_000_000, 32], 'huge_amount_groups', by_ncols=6)
-       11.8±0.2s        1.68±0.1s     0.14  benchmarks.TimeGroupByDefaultAggregations.time_groupby_skew([10_000_000, 32], ngroups=100)
-      2.60±0.08s         345±20ms     0.13  benchmarks.TimeGroupByDefaultAggregations.time_groupby_skew([1_000_000, 32], 'huge_amount_groups')
-      2.91±0.06s         361±10ms     0.12  benchmarks.TimeGroupByMultiColumn.time_groupby_agg_skew([1_000_000, 32], 'huge_amount_groups', by_ncols=6)
How to run this?

1. Add an ASV benchmark for a skew function by applying the following patch:
From bb932f4d3eb58f2c83155aaddcd1d538a5eb879e Mon Sep 17 00:00:00 2001
From: Dmitry Chigarev <dmitry.chigarev@intel.com>
Date: Tue, 6 Dec 2022 15:23:17 -0600
Subject: [PATCH] Skew benchmarks

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
---
 asv_bench/benchmarks/benchmarks.py | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/asv_bench/benchmarks/benchmarks.py b/asv_bench/benchmarks/benchmarks.py
index 3ec5d197..ff8b11d8 100644
--- a/asv_bench/benchmarks/benchmarks.py
+++ b/asv_bench/benchmarks/benchmarks.py
@@ -54,6 +54,7 @@ class BaseTimeGroupBy:
 
 
 class TimeGroupByMultiColumn(BaseTimeGroupBy):
+    timeout = 720
     param_names = ["shape", "ngroups", "groupby_ncols"]
     params = [
         get_benchmark_shapes("TimeGroupByMultiColumn"),
@@ -66,9 +67,13 @@ class TimeGroupByMultiColumn(BaseTimeGroupBy):
 
     def time_groupby_agg_mean(self, *args, **kwargs):
         execute(self.df.groupby(by=self.groupby_columns).apply(lambda df: df.mean()))
+    
+    def time_groupby_agg_skew(self, *args, **kwargs):
+        execute(self.df.groupby(by=self.groupby_columns).skew())
 
 
 class TimeGroupByDefaultAggregations(BaseTimeGroupBy):
+    timeout = 720
     param_names = ["shape", "ngroups"]
     params = [
         get_benchmark_shapes("TimeGroupByDefaultAggregations"),
@@ -86,6 +91,9 @@ class TimeGroupByDefaultAggregations(BaseTimeGroupBy):
 
     def time_groupby_mean(self, *args, **kwargs):
         execute(self.df.groupby(by=self.groupby_columns).mean())
+    
+    def time_groupby_skew(self, *args, **kwargs):
+        execute(self.df.groupby(by=self.groupby_columns).skew())
 
 
 class TimeGroupByDictionaryAggregation(BaseTimeGroupBy):
-- 
2.25.1
2. Specify custom data shapes for groupby benchmarks by creating the following JSON file:
{
    "TimeGroupByDefaultAggregations": [[1000000, 32], [10000000, 32], [5000, 5000], [1000000, 256]],
    "TimeGroupByMultiColumn": [[1000000, 32], [10000000, 32], [5000, 5000], [1000000, 256]]
}
3. Run ASV with the following command, substituting $ASV_CONFIG_PATH with the path to the JSON file created in the previous step:
MODIN_TEST_DATASET_SIZE="Big" MODIN_ASV_DATASIZE_CONFIG=$ASV_CONFIG_PATH asv continuous origin/master skew_impl --launch-method=spawn -b TimeGroupByDefaultAggregations.time_groupby_skew -b TimeGroupByMultiColumn.time_groupby_agg_skew --no-only-changed -a repeat=5

Square-like frames are something of an anti-pattern for the Map and MapReduce implementations in Modin (see the optimization notes), so the new implementation is slower than the previous one in these cases. Filed an issue to resolve this problem in general (#5394).
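For intuition, here is a minimal sketch of what a MapReduce decomposition of groupby skew can look like, written in plain pandas/NumPy over manually split row partitions (the function names and intermediate labels are illustrative only, not Modin's actual kernels). The map phase reduces each partition to per-group counts and the first three power sums; the reduce phase sums those partials across partitions and evaluates the bias-corrected Fisher-Pearson skewness that pandas computes:

import numpy as np
import pandas as pd

def skew_map(partition, by):
    # Per-partition pre-aggregation: per-group count and the first three
    # power sums are enough to reconstruct skewness later.
    grouped = partition.groupby(by)
    return pd.concat(
        {
            "count": grouped.count(),
            "pow1": grouped.sum(),
            "pow2": grouped.agg(lambda col: (col**2).sum()),
            "pow3": grouped.agg(lambda col: (col**3).sum()),
        },
        axis=1,
    )

def skew_reduce(mapped_parts):
    # Combine the partial sums across partitions, then compute the
    # bias-corrected (Fisher-Pearson) skewness per group.
    total = pd.concat(mapped_parts).groupby(level=0).sum()
    n = total["count"]
    mean = total["pow1"] / n
    m2 = total["pow2"] / n - mean**2  # 2nd central moment
    m3 = total["pow3"] / n - 3 * mean * total["pow2"] / n + 2 * mean**3  # 3rd central moment
    return np.sqrt(n * (n - 1)) / (n - 2) * m3 / m2**1.5

# Toy check against pandas' own groupby skew, with the frame split into two row partitions.
df = pd.DataFrame({"key": list("ababab"), "x": [1.0, 5.0, 2.0, 7.0, 9.0, 3.0]})
parts = [df.iloc[:3], df.iloc[3:]]
result = skew_reduce([skew_map(p, "key") for p in parts])
assert np.allclose(result["x"], df.groupby("key").skew()["x"])

The heavy work runs independently per partition, and the reduce step only sees a tiny pre-aggregated frame per group, which is why the long, narrow shapes in the tables above benefit the most.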

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves "Implement groupby.skew() via GroupbyReduce pattern" #5369
  • tests added: existing tests for .skew() are passing
  • module layout described at docs/development/architecture.rst is up-to-date

@dchigarev changed the title from "PERF-#0000 GroupBy.skew implementation via MapReduce pattern" to "PERF-#0000: GroupBy.skew implementation via MapReduce pattern" on Dec 2, 2022
@dchigarev changed the title from "PERF-#0000: GroupBy.skew implementation via MapReduce pattern" to "PERF-#5369: GroupBy.skew implementation via MapReduce pattern" on Dec 6, 2022
Comment on lines 130 to 133
# Other is a broadcasted partition that represents 'by' data to group on.
# If 'drop' then the 'by' data came from the 'self' frame, thus
# inserting missed columns to the partition to group on them.
if drop or isinstance(other := other.squeeze(axis=1), pandas.DataFrame):
dchigarev (Collaborator, Author):

This was previously handled incorrectly, which caused the 'by' columns to be aggregated (they were dropped afterward, so there was no effect on the result). The 'skew' aggregation doesn't tolerate aggregating non-numeric ('by') columns, so this correction is required for proper behavior.

Collaborator:

Why was squeeze(axis=axis ^ 1) changed to squeeze(axis=1)?

dchigarev (Collaborator, Author):

My bad, reverted to axis ^ 1.

Tests did not catch this because we only support groupby along axis 0 for now.
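For context, axis ^ 1 simply flips between the two pandas axes, so the broadcasted 'by' partition is squeezed along the axis orthogonal to the groupby axis (a quick illustration in plain Python):

# XOR with 1 toggles between the two pandas axes: 0 ^ 1 == 1 and 1 ^ 1 == 0.
for axis in (0, 1):
    print(axis, "->", axis ^ 1)  # prints "0 -> 1" and "1 -> 0"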

- numeric_only=True,
+ numeric_only=NumericOnly.TRUE_EXCL_NUMERIC_CATEGORIES,
dchigarev (Collaborator, Author):

Previously, the native pandas .skew() implementation called by qc.groupby_agg was actually dropping unsuitable categorical columns; now we need to drop them manually.
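A rough, hypothetical illustration of what dropping such columns manually can look like in plain pandas (not the PR's actual helper): keep genuinely numeric columns and filter out categorical ones before handing the frame to the skew kernel.

import pandas as pd

df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0],
        "cat_num": pd.Categorical([1, 2, 3]),        # categorical with numeric categories
        "cat_str": pd.Categorical(["u", "v", "w"]),  # categorical with non-numeric categories
    }
)
# Categorical dtypes are not considered numeric, so both categorical columns are dropped.
numeric_cols = [col for col, dtype in df.dtypes.items() if pd.api.types.is_numeric_dtype(dtype)]
print(df[numeric_cols].columns.tolist())  # ['x']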

@@ -380,5 +381,48 @@ def _doc_binary_op(operation, bin_op, left="Series", right="right", returns="Ser
return doc_op


class NumericOnly(IntEnum): # noqa: PR01
dchigarev (Collaborator, Author):

There's possibly more than one method that doesn't tolerate numeric categories, so I decided to add this enum.
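The enum body itself is not quoted in this thread; a hypothetical sketch of what such an IntEnum could look like and how its members compare as plain ints (the member values below are assumptions for illustration):

from enum import IntEnum

class NumericOnly(IntEnum):
    # Hypothetical member values; the PR's actual definitions are not shown here.
    FALSE = 0
    TRUE = 1
    TRUE_EXCL_NUMERIC_CATEGORIES = 2

# IntEnum members behave like plain ints, which is what the '>' check discussed
# further down relies on.
numeric_only = NumericOnly.TRUE_EXCL_NUMERIC_CATEGORIES
assert numeric_only > NumericOnly.FALSE
assert int(NumericOnly.TRUE) == 1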

@dchigarev dchigarev marked this pull request as ready for review December 6, 2022 23:29
@dchigarev dchigarev requested a review from a team as a code owner December 6, 2022 23:29
@vnlitvinov (Collaborator):

Square-like frames are something of an anti-pattern for the Map and MapReduce implementations in Modin (see the optimization notes), so the new implementation is slower than the previous one in these cases.

I wonder if it's possible to use the old implementation for mostly square dataframes so as not to introduce such a speed loss...
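A hypothetical sketch of the kind of shape-based dispatch being suggested (the threshold and names are made up, and nothing like this is part of the PR):

def should_use_map_reduce(shape, square_ratio_threshold=0.01):
    # Hypothetical heuristic: treat frames whose column count is a noticeable
    # fraction of the row count as "square-like" and route them to the previous
    # full-column implementation instead of the MapReduce kernel.
    nrows, ncols = shape
    return ncols / max(nrows, 1) <= square_ratio_threshold

print(should_use_map_reduce((10_000_000, 32)))  # True: long frame, MapReduce path
print(should_use_map_reduce((5000, 5000)))      # False: square-like, old path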

)

- if numeric_only and self.ndim == 2:
+ if numeric_only > NumericOnly.FALSE and self.ndim == 2:
Collaborator:

What?.. I don't think one should really be comparing enums on anything but (in)equality.

If you want to check on classes of enum values, I'd suggest using IntFlag instead.

dchigarev (Collaborator, Author):

I thought the whole point of IntEnum is to work with its members like integers, e.g. if day > DayEnum.TUESDAY: ... or, as in our case with 'numeric only', if tolerance_level > ToleranceEnum.LEVEL2: ....

I've changed the IntEnum to IntFlag as you suggested, though I don't see much difference between them for our use case (even their docstrings in the Python docs are almost identical).

Collaborator:

This is not what I meant... an IntFlag enum should look like:

class NumericOnly(IntFlag):
    AUTO = 0b000
    FALSE = 0b001
    TRUE = 0b010
    TRUE_EXCL_NUMERIC_CATEGORIES = 0b011

(I've stated the fields in bit notation for readability), and then you check stuff like

if numeric_only & NumericOnly.TRUE:
    # do things

dchigarev (Collaborator, Author):

Do we really want this? It seems much more complicated than a simple > check. I don't see why this is a bad approach, since the IntEnum API allows it.

dchigarev (Collaborator, Author):

Reverted the changes introducing the NumericOnly enum, as it turned out to be something of a blocker.

Collaborator:

Not everything which is technically allowed should be done. 🙃

@dchigarev (Collaborator, Author) commented on Dec 8, 2022:

@vnlitvinov

I wonder if it's possible to use the old implementation for mostly square dataframes so as not to introduce such a speed loss...

I've created a separate tracker (#5394), since this performance issue affects every Map/MapReduce function and there should be a general solution for it.

vnlitvinov previously approved these changes on Dec 12, 2022

@vnlitvinov (Collaborator) left a comment:

LGTM!

@anmyachev (Collaborator) left a comment:

LGTM!
