OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 12.0.1
Modin version (modin.__version__): 0.14
Python version: 3.9.9
Code we can use to reproduce:
import modin.pandas as pd
import numpy as np
import ray
ray.init()
data = np.random.randint(0, 100, size=(2**10, 2**4))
df = pd.DataFrame(data) # if we add add_prefix("col") here, the partition types change and this script succeeds
# df.reset_index(inplace=True) also makes this script succeed by changing the partition type.
big_df = pd.concat([df for _ in range(5)])
big_df.groupby(1).count()
Describe the problem
When a DataFrame is created by concatenating multiple DataFrames column-wise, the resulting DataFrame is composed of partitions of type PandasOnRayDataframeColumnPartition, and groupby.count fails. When the DataFrame's partitions are of type PandasOnRayDataframePartition, groupby.count succeeds. You can inspect this by running the two versions of the above script and checking big_df._query_compiler._modin_frame._partitions.
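To make the partition check above less manual, the class names in the partition grid can be tallied. This is only a minimal sketch, not part of Modin: `partition_type_counts` is a hypothetical helper, and it assumes that `_partitions` (a private attribute that may change between versions) is a 2D grid of partition objects.

```python
from collections import Counter


def partition_type_counts(partitions):
    """Count partition objects by class name in a 2D grid (rows of partitions).

    Hypothetical helper, not a Modin API. It only iterates row by row, so it
    works on nested lists as well as on a 2D array of objects.
    """
    return Counter(type(part).__name__ for row in partitions for part in row)


# Intended use with the script above (requires Modin; relies on private attributes):
# print(partition_type_counts(big_df._query_compiler._modin_frame._partitions))
```

Running this once with and once without the add_prefix("col") or reset_index change should show which partition classes each version of the script ends up with.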
Even more confusingly, this doesn't fail with methods like groupby.mean and groupby.sum, and when the internal method for groupby.count (_wrap_aggregation) is called with numeric_only=True instead of False, it succeeds. See https://github.com/modin-project/modin/blob/master/modin/pandas/groupby.py#L798-L802.
I'm opening this issue because I don't understand groupby behavior well enough to debug this.
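To compare the aggregations side by side, a small triage loop can record which ones raise. This is a sketch under assumptions: `try_aggregations` is a hypothetical helper, not a Modin or pandas API, and calling `repr()` on the result is only assumed to trigger enough computation for a deferred error to surface.

```python
def try_aggregations(grouped, names=("count", "mean", "sum")):
    """Call each named aggregation on a groupby-like object and record the outcome.

    Hypothetical triage helper: maps each method name to "ok" or to the name
    of the exception type that was raised.
    """
    outcomes = {}
    for name in names:
        try:
            result = getattr(grouped, name)()
            # Assumption: repr() touches the result so a deferred failure shows up here.
            repr(result)
            outcomes[name] = "ok"
        except Exception as exc:  # broad on purpose: we only want to see what fails
            outcomes[name] = type(exc).__name__
    return outcomes


# Intended use with the script above (requires Modin and Ray):
# print(try_aggregations(big_df.groupby(1)))
```

If the engine defers execution further than `repr()` reaches, this may under-report failures, so treat an "ok" as a hint and not as proof.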
Source code / logs