groupby.count fails with virtual partitions #4464

Closed
jeffreykennethli opened this issue May 16, 2022 · 0 comments · Fixed by #4490
Labels: bug 🦗 (Something isn't working), Internals (Internal modin functionality), pandas.groupby


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 12.0.1
  • Modin version (modin.__version__): 0.14
  • Python version: 3.9.9
  • Code we can use to reproduce:
import modin.pandas as pd
import numpy as np
import ray

ray.init()
data = np.random.randint(0, 100, size=(2**10, 2**4))
df = pd.DataFrame(data)
# Appending .add_prefix("col") to the line above changes the partition types, and this script then succeeds.
# Calling df.reset_index(inplace=True) here also makes the script succeed by changing the partition type.
big_df = pd.concat([df for _ in range(5)])
big_df.groupby(1).count()

Describe the problem

When a DataFrame is created by concatenating multiple DataFrames with pd.concat (default axis=0, as in the script above), the resulting DataFrame is composed of partitions of type PandasOnRayDataframeColumnPartition, and groupby.count fails. When the DataFrame's partitions are of type PandasOnRayDataframePartition, groupby.count succeeds. You can inspect this by running the two versions of the script above and checking big_df._query_compiler._modin_frame._partitions.
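
To make the partition check above easier to repeat, here is a minimal sketch of a helper that tallies the class names found in the partition grid. The function name partition_type_counts is hypothetical (it is not part of Modin), and the attribute path is the private one quoted above, so it may differ in other Modin versions.

```python
from collections import Counter


def partition_type_counts(modin_df):
    """Tally the class names of the partitions backing a Modin DataFrame.

    Hypothetical helper, not a Modin API. It relies on the private
    attribute path quoted in this issue, which may change between versions.
    """
    # The partition grid is a 2D structure: rows of partition objects.
    grid = modin_df._query_compiler._modin_frame._partitions
    return Counter(type(part).__name__ for row in grid for part in row)
```

Calling partition_type_counts(big_df) in both versions of the script should show which partition class each version ends up with.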

Even more confusingly, methods like groupby.mean and groupby.sum do not fail. groupby.count also succeeds when its internal method (_wrap_aggregation) is called with numeric_only=True instead of False. See https://github.com/modin-project/modin/blob/master/modin/pandas/groupby.py#L798-L802.
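
For context on the numeric_only difference, the sketch below uses plain pandas (not Modin) to show that count reports non-null values for every column, including non-numeric ones, while mean with numeric_only=True only keeps numeric columns. The column names and data are made up for illustration, and this shows pandas semantics only; it is not a diagnosis of the Modin failure.

```python
import pandas as pd

# Illustrative data: one grouping key, one string column, one numeric column.
frame = pd.DataFrame(
    {
        "key": [1, 1, 2],
        "label": ["a", "b", "c"],
        "value": [10.0, 20.0, 30.0],
    }
)

# count tallies non-null entries in every non-key column, strings included.
counted = frame.groupby("key").count()

# mean restricted to numeric columns drops the string column.
averaged = frame.groupby("key").mean(numeric_only=True)

print(list(counted.columns))
print(list(averaged.columns))
```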

Opening the issue because I don't understand groupby behavior well enough to debug this.

Source code / logs

jeffreykennethli added bug 🦗 Something isn't working pandas.groupby Internals Internal modin functionality labels May 16, 2022
jeffreykennethli self-assigned this May 17, 2022
jeffreykennethli pushed a commit to jeffreykennethli/modin that referenced this issue May 19, 2022
…y args

Signed-off-by: jeffreykennethli <jkli@ponder.io>
jeffreykennethli pushed a commit to jeffreykennethli/modin that referenced this issue Jun 2, 2022
…t failing on virtual partitions

Signed-off-by: jeffreykennethli <jkli@ponder.io>
devin-petersohn added a commit that referenced this issue Jun 7, 2022
…virtual partitions (#4490)

Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com>
Signed-off-by: jeffreykennethli <jkli@ponder.io>