FEAT-#7117: Support building range-partitioning from an index level #7120

dchigarev · 2024-03-25T18:06:45Z

What do these changes do?

Adds support for building range-partitioning from one or more index levels and enables range-partitioning for df.groupby(level=...) cases.
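For reference, a minimal sketch of the kind of call this change targets, written against plain pandas so that it runs without Modin. The frame, the level names outer and inner, and the values are made up for illustration; how Modin routes the call internally is not shown here.

```python
import pandas as pd

# Hypothetical two-level index; the names "outer" and "inner" are illustrative.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2)],
    names=["outer", "inner"],
)
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=index)

# Group by a single index level instead of by a column.
by_outer = df.groupby(level="outer").sum()

# Group by several index levels at once.
by_both = df.groupby(level=["outer", "inner"]).sum()

print(by_outer)
print(by_both)
```

Grouping on outer collapses the four rows into two groups, while grouping on both levels keeps one row per unique (outer, inner) pair.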

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves Support building range-partitioning from an index level #7117
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

dchigarev · 2024-03-28T10:28:16Z

modin/pandas/test/test_groupby.py

    if RangePartitioningGroupby.get():
-        return tuple(df.sort_index() for df in dfs)


Previously, the function sorted only by index. When the index has few unique values, this may not work well, because rows that share an index value can end up in a different order in each frame. The function now sorts on column values instead, which appears to be more stable for dataframes.
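A small illustration of the problem described above, using plain pandas. The replacement sorting code is not shown in this excerpt, so the sort_values call below is only an assumed stand-in for "sorting on column values"; the two frames and their contents are hypothetical.

```python
import pandas as pd

# Two frames holding the same rows in a different order, as might happen
# when results are assembled from differently ordered partitions.
a = pd.DataFrame({"x": [1, 2, 3, 4]}, index=[0, 0, 1, 1])
b = pd.DataFrame({"x": [2, 1, 4, 3]}, index=[0, 0, 1, 1])

# Sorting by index alone cannot order rows that share an index value:
# with a stable sort, ties simply keep their incoming order.
a_by_index = a.sort_index(kind="stable")
b_by_index = b.sort_index(kind="stable")
index_sort_equal = a_by_index.equals(b_by_index)

# Sorting by all column values breaks the ties deterministically.
a_by_values = a.sort_values(by=list(a.columns))
b_by_values = b.sort_values(by=list(b.columns))
value_sort_equal = a_by_values.equals(b_by_values)

print(index_sort_equal, value_sort_equal)
```

With only two unique index values, the index sort leaves the frames different, while the value sort brings them into the same row order.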

dchigarev · 2024-03-28T10:30:10Z

modin/core/dataframe/pandas/dataframe/dataframe.py

@@ -2453,7 +2454,7 @@ def _apply_func_to_range_partitioning(
        Parameters
        ----------
        key_columns : list of hashables
-            Columns to build the range partitioning for.
+            Columns to build the range partitioning for. Can't be specified along with `level`.


The pandas groupby documentation for the level parameter says: "Do not specify both by and level.", so we apply the same restriction here.
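One way such a restriction could be enforced is sketched below. The helper name check_partitioning_keys, its return value, and the error messages are hypothetical and are not Modin's actual code; rejecting the case where neither argument is given is also an assumption.

```python
def check_partitioning_keys(key_columns=None, level=None):
    """Hypothetical validator: accept either key columns or an index level."""
    has_columns = key_columns is not None and len(key_columns) > 0

    # Mirror the pandas rule: "Do not specify both by and level."
    if has_columns and level is not None:
        raise ValueError("Specify either `key_columns` or `level`, not both.")

    # Assumption: at least one source of partitioning keys is required.
    if not has_columns and level is None:
        raise ValueError("Either `key_columns` or `level` must be specified.")

    if level is not None:
        return ("level", level)
    return ("columns", list(key_columns))


print(check_partitioning_keys(key_columns=["a", "b"]))
print(check_partitioning_keys(level=0))
```

Passing both arguments, for example check_partitioning_keys(key_columns=["a"], level=0), raises ValueError.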

dchigarev · 2024-03-28T10:31:22Z

modin/pandas/test/test_groupby.py

@@ -1726,7 +1746,7 @@ def test_groupby_getitem_preserves_key_order_issue_6154():
    ],
 )
 def test_groupby_multiindex(groupby_kwargs):
-    frame_data = np.random.randint(0, 100, size=(2**6, 2**4))
+    frame_data = np.random.randint(0, 100, size=(2**6, 2**6))


Increased the number of columns from 16 (2**4) to 64 (2**6) so that the test frame spans multiple partitions.
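The arithmetic behind this change can be sketched under a simplified model that is an assumption and not taken from this PR: an axis is split into chunks, and no chunk is smaller than a minimum block size. The function names and the values 4 (requested splits) and 32 (minimum block size) are hypothetical.

```python
import math


def chunk_size(axis_len, num_splits, min_block_size):
    """Elements per chunk: an even split, but never below the minimum block size."""
    return max(math.ceil(axis_len / num_splits), min_block_size)


def partition_count(axis_len, num_splits, min_block_size):
    """Number of chunks needed to cover the whole axis."""
    return math.ceil(axis_len / chunk_size(axis_len, num_splits, min_block_size))


# Assumed settings, for illustration only.
NUM_SPLITS = 4
MIN_BLOCK_SIZE = 32

old_partitions = partition_count(2**4, NUM_SPLITS, MIN_BLOCK_SIZE)
new_partitions = partition_count(2**6, NUM_SPLITS, MIN_BLOCK_SIZE)
print(old_partitions, new_partitions)
```

Under these assumed settings, 16 columns fit in a single partition, while 64 columns are split across two, which is what a multi-partition test needs.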

dchigarev added 4 commits March 25, 2024 19:00

fbe8998  FEAT-modin-project#7117: Support building range-partitioning from an index level
7c6c7ff  add comments
5f5f179  fix tests
419b0ff  simplify the sorting function

Each commit: Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev marked this pull request as ready for review March 28, 2024 11:07

dchigarev requested review from devin-petersohn, mvashishtha, RehanSD, YarShev, vnlitvinov, anmyachev and a team as code owners March 28, 2024 11:07

dchigarev mentioned this pull request Apr 2, 2024

FEAT-#7118: Add range-partitioning impl for 'df.resample()' #7140

Merged

7 tasks

YarShev approved these changes Apr 2, 2024

YarShev merged commit 9afc049 into modin-project:master Apr 2, 2024
37 checks passed
