FEAT-#5394: Reduce amount of remote calls for TreeReduce and GroupByReduce operators #7245
Conversation
… and GroupByReduce operators

Signed-off-by: Kirill Suvorov <kirill.suvorov@intel.com>
@Retribution98 do you have any performance numbers?
It's also a good idea to add tests for the new operators, which now work a little differently.
@@ -2205,46 +2205,12 @@ def map(
        PandasDataframe
            A new dataframe.
        """
        if self.num_parts <= 1.5 * CpuCount.get():
Why was the implementation moved to a lower level?
It was moved here to avoid duplicating logic; map_partitions in the partition manager is only used in these cases.
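For illustration, here is a minimal, self-contained sketch of the heuristic being discussed. Only the "1.5 * CPU count" threshold comes from the diff above; the function name and the use of os.cpu_count() are my own, not the actual Modin code.

```python
# A minimal sketch of the strategy choice, not the actual Modin implementation.
import os


def choose_map_strategy(num_block_partitions, cpu_count=None):
    """Pick how a map-like operator should be executed (illustrative only)."""
    cpu_count = cpu_count or os.cpu_count() or 1
    if num_block_partitions <= 1.5 * cpu_count:
        # Few partitions: one remote call per block partition is cheap enough.
        return "map_partitions"
    # Many partitions: mapping full-axis partitions issues fewer remote calls.
    return "map_axis_partitions"


print(choose_map_strategy(8, cpu_count=16))    # map_partitions
print(choose_map_strategy(512, cpu_count=16))  # map_axis_partitions
```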
Previously, your implementation was also used in place of the self._partition_mgr_cls.lazy_map_partitions function under some conditions, but now it is not. Is that how it was intended?
tree_reduce and groupby_reduce call map_partitions on the partition manager; I guess that's why @Retribution98 moved the logic there.
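Roughly, the call chain being described could be pictured like this. This is a deliberately simplified sketch with made-up signatures, not the real Modin methods; the point is only that the reduce operators funnel their map step through map_partitions, so a heuristic placed there benefits all of them.

```python
# Simplified sketch of the call chain (made-up signatures, not Modin's API).

def map_partitions(partitions, map_func):
    # The partition-count heuristic would live here (see the sketch above).
    return [map_func(part) for part in partitions]


def tree_reduce(partitions, map_func, reduce_func):
    mapped = map_partitions(partitions, map_func)  # shared map step
    return reduce_func(mapped)                     # reduce over the mapped results


def groupby_reduce(partitions, map_func, reduce_func):
    mapped = map_partitions(partitions, map_func)  # same shared map step
    return reduce_func(mapped)


print(tree_reduce([[1, 2], [3, 4]], sum, sum))  # 10
```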
Yes, I see it. I think the lazy map will work better in a lazy pipeline, because some partitions may never be computed further, so this issue is not relevant in that case.
@anmyachev Since the logic is now at a lower level, I modified the test to check this, and it covers all cases where map_partitions is used.
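A hypothetical test along those lines might parameterize the partition count so that both code paths of map_partitions are exercised and then compare against pandas. The test name, the chosen partition counts, and the use of count() as a TreeReduce-style operator are my assumptions, not the actual Modin test.

```python
# Hypothetical test sketch, not the actual Modin test suite.
import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize("num_partitions", [2, 64])  # few / many block partitions
def test_map_partitions_both_paths(num_partitions):
    import modin.config as cfg
    import modin.pandas as mpd
    from modin.utils import to_pandas

    cfg.NPartitions.put(num_partitions)
    data = np.arange(1000).reshape(100, 10)
    md_df, pd_df = mpd.DataFrame(data), pd.DataFrame(data)
    # count() is (if I'm not mistaken) a TreeReduce-based operator,
    # so it should go through map_partitions on the partition manager.
    pd.testing.assert_series_equal(to_pandas(md_df.count()), pd_df.count())
```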
            )
        else:
            # splitting by full axis partitions
            new_partitions = cls.map_axis_partitions(
Using the map_axis_partitions function inside the map_partitions function does not seem obvious and defeats the purpose of map_partitions, which declares that it applies the function to every partition. The dataframe level seems like a more appropriate place for choosing a suitable strategy.
Since the partition manager, rather than the core dataframe, is designed to play around with partitions, maybe we should just update the docstring?
OK, I added base_map_partitions to keep the simplest implementation, but by default we will use the new approaches. Do you agree?
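If I read the split correctly, it could be sketched like this, building on the earlier sketch. The class name and method bodies are placeholders, not the actual Modin partition manager: base_map_partitions always maps every block partition, while map_partitions keeps that as the simple path but may switch to full-axis mapping when there are many block partitions.

```python
# Sketch of the proposed split, with placeholder bodies (not the actual
# Modin partition manager).
import os


class PartitionManagerSketch:
    @classmethod
    def base_map_partitions(cls, partitions, map_func):
        # Simplest behavior: one call per block partition.
        return [[map_func(block) for block in row] for row in partitions]

    @classmethod
    def map_axis_partitions(cls, partitions, map_func):
        # Placeholder: one call per full row of block partitions.
        return [[map_func(row)] for row in partitions]

    @classmethod
    def map_partitions(cls, partitions, map_func):
        num_blocks = sum(len(row) for row in partitions)
        if num_blocks <= 1.5 * (os.cpu_count() or 1):
            return cls.base_map_partitions(partitions, map_func)
        return cls.map_axis_partitions(partitions, map_func)
```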
@Retribution98, could you also check performance for dtypes, which is part of #2751?
What do these changes do?
Apply approaches from PR-7136 for TreeReduce and GroupByReduce operators.

- passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- signed commit with git commit -s
- docs/development/architecture.rst is up-to-date