PERF-#5369: GroupBy.skew
implementation via MapReduce pattern
#5318
Conversation
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
# Other is a broadcasted partition that represents 'by' data to group on.
# If 'drop' then the 'by' data came from the 'self' frame, thus
# inserting missed columns to the partition to group on them.
if drop or isinstance(other := other.squeeze(axis=1), pandas.DataFrame):
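For context (a minimal plain-pandas illustration, not the Modin code itself): DataFrame.squeeze(axis=1) collapses a single-column frame into a Series but returns a multi-column frame unchanged, which is why the isinstance check above still has to handle the DataFrame case.

```python
import pandas as pd

# A single-column frame squeezes down to a Series along axis=1.
single = pd.DataFrame({"by": ["a", "b", "a"]}).squeeze(axis=1)

# A multi-column frame is returned unchanged, still a DataFrame.
multi = pd.DataFrame({"a": [1, 2], "b": [3, 4]}).squeeze(axis=1)

print(type(single).__name__)  # Series
print(type(multi).__name__)   # DataFrame
```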
This was previously handled incorrectly, causing the 'by' columns to be aggregated (they were dropped afterward, so the result was unaffected). The 'skew' aggregation doesn't tolerate aggregating non-numeric ('by') columns, so this correction is required for proper behavior.
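To illustrate the behavior being described (a hypothetical plain-pandas sketch, not the Modin code): forcing skew over a non-numeric column fails, so such columns must be excluded before the aggregation.

```python
import pandas as pd

df = pd.DataFrame({
    "by": ["a", "a", "a", "b", "b", "b"],
    "label": ["x", "y", "z", "x", "y", "z"],  # non-numeric column
    "val": [1.0, 2.0, 4.0, 3.0, 5.0, 6.0],
})

# Forcing skew over the non-numeric 'label' column raises (the exact
# error type varies between pandas versions, hence the broad except).
try:
    df.groupby("by").skew(numeric_only=False)
    raised = False
except Exception:
    raised = True

# Excluding non-numeric columns up front works fine.
ok = df.groupby("by")[["val"]].skew()
```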
Why was squeeze(axis=axis ^ 1) changed to squeeze(axis=1)?
My bad, reverted to axis ^ 1. This was not caught by tests because we only support groupby along axis 0 for now.
modin/pandas/groupby.py
Outdated
- numeric_only=True,
+ numeric_only=NumericOnly.TRUE_EXCL_NUMERIC_CATEGORIES,
Previously, the native pandas .skew() implementation called by qc.groupby_agg was actually dropping unsuitable categorical columns; now we need to drop them manually.
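A hypothetical sketch (plain pandas, not the actual Modin code) of the manual filtering described: categorical columns are dropped even when their categories happen to be numeric, since skew cannot aggregate them.

```python
import pandas as pd

df = pd.DataFrame({
    "grp": ["a", "a", "a", "b", "b", "b"],
    "num": [1.0, 2.0, 4.0, 3.0, 5.0, 6.0],
    # Categorical whose categories happen to be numbers.
    "num_cat": pd.Categorical([1, 2, 1, 2, 1, 2]),
})

# Keep only genuinely numeric columns; categorical dtypes are excluded
# even when their categories are numeric.
suitable = [
    col
    for col in ("num", "num_cat")
    if pd.api.types.is_numeric_dtype(df[col].dtype)
    and not isinstance(df[col].dtype, pd.CategoricalDtype)
]

result = df.groupby("grp")[suitable].skew()
```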
modin/pandas/utils.py
Outdated
@@ -380,5 +381,48 @@ def _doc_binary_op(operation, bin_op, left="Series", right="right", returns="Ser
     return doc_op


class NumericOnly(IntEnum):  # noqa: PR01
There's possibly more than one method that doesn't tolerate numeric categories, so I decided to add this enum.
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
I wonder if it's possible to use the old implementation for mostly-square dataframes so as not to introduce such a speed loss...
modin/pandas/groupby.py
Outdated
  )

- if numeric_only and self.ndim == 2:
+ if numeric_only > NumericOnly.FALSE and self.ndim == 2:
What?.. I don't think one should really be comparing enums on anything but (in)equality. If you want to check on classes of enum values, I'd suggest using IntFlag instead.
I thought that the whole point of IntEnum is to work with its members like integers, e.g. if day > DayEnum.TUESDAY: ..., or, as in our case with 'numeric only', if tolerance_level > ToleranceEnum.LEVEL2: .... I've changed the IntEnum to IntFlag as you suggested, though I don't see much difference between them for our use case (even their docstrings in the Python docs are almost identical).
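For reference, IntEnum members do support ordered comparisons like plain integers, which is the behavior this comment relies on (a standalone sketch, not project code):

```python
from enum import IntEnum

class Day(IntEnum):
    MONDAY = 1
    TUESDAY = 2
    WEDNESDAY = 3

# IntEnum members behave like ints, so ordering checks are legal.
print(Day.WEDNESDAY > Day.TUESDAY)    # True
print(Day.MONDAY + 1 == Day.TUESDAY)  # True
```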
This is not what I meant... an IntFlag enum should be like:

class NumericOnly(IntFlag):
    AUTO = 0b000
    FALSE = 0b001
    TRUE = 0b010
    TRUE_EXCL_NUMERIC_CATEGORIES = 0b011

(I've stated the fields in bit notation for readability), and then you check stuff like:

if numeric_only & NumericOnly.TRUE:
    # do things
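The reviewer's enum can be exercised in a self-contained way; a minimal runnable version of the bitwise membership check (values as proposed in the comment):

```python
from enum import IntFlag

class NumericOnly(IntFlag):
    AUTO = 0b000
    FALSE = 0b001
    TRUE = 0b010
    TRUE_EXCL_NUMERIC_CATEGORIES = 0b011  # shares the TRUE bit

numeric_only = NumericOnly.TRUE_EXCL_NUMERIC_CATEGORIES

# Bitwise test: any value with the TRUE bit set passes the check.
if numeric_only & NumericOnly.TRUE:
    print("numeric_only behavior enabled")
```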
Do we really want this? It seems much more complicated than a simple > check. I don't get why this is a bad approach, since the IntEnum API allows it.
Reverted the changes introducing the NumericOnly enum, as it appeared to be something of a blocker.
Not everything which is technically allowed should be done. 🙃
Co-authored-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
I've created a separate tracker (#5394), as this perf issue affects every Map/MapReduce function, so there should be a general solution for it.
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
LGTM!
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
LGTM!
What do these changes do?
Here are some performance comparisons of .groupby().skew() between the current master and the MapReduce implementation. The comparison was run via ASV with our existing groupby scenarios.

CPU: 2x 28 cores (56 threads) Xeon Platinum-8276L @ 2.20GHz. Results were collected with NCPUS=112 and with NCPUS=16.

How to run this?
1. Add an ASV benchmark for a skew function by applying the following patch:
2. Run the benchmarks, replacing $ASV_CONFIG_PATH with the path to the JSON file created in the previous step:

Square-like frames are some kind of anti-pattern for the Map and MapReduce implementations in Modin (see the optimization notes), so the new implementation is slower than the previous one in these cases. Filed an issue (#5394) to resolve this problem generally.
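The ASV patch itself is collapsed out of the excerpt above; as a rough, hypothetical stand-in (plain pandas, made-up shapes, not the actual benchmark), a quick timing of the scenario could look like:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data shape chosen only for illustration.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.standard_normal((50_000, 10)))
df["key"] = rng.integers(0, 100, size=len(df))

# Time a few repetitions of the grouped skew aggregation.
elapsed = timeit.timeit(lambda: df.groupby("key").skew(), number=3)
print(f"3 runs of groupby().skew(): {elapsed:.3f}s")
```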
- passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- signed commit with git commit -s
- resolves groupby.skew() via GroupbyReduce pattern #5369
- test added
- existing tests for .skew() are passing
- docs/development/architecture.rst is up-to-date