Describe the issue:

As part of geopandas/dask-geopandas#285, we found that dask-expr will lose the type of a pandas DataFrame subclass in `groupby.agg` if (and only if?) the `split_out` parameter is used.
Minimal Complete Verifiable Example:
Given this file:
```python
# file: test.py
import dask
import dask.dataframe.backends
import pandas as pd
import dask_expr as dx
import dask.dataframe as dd
from dask.dataframe.dispatch import make_meta_dispatch, meta_nonempty
from dask.dataframe.core import get_parallel_type

dask.config.set(scheduler="single-threaded")


class MySeries(pd.Series):
    @property
    def _constructor(self):
        return MySeries

    @property
    def _constructor_expanddim(self):
        return MyDataFrame


class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame

    @property
    def _constructor_sliced(self):
        return MySeries


class MyIndex(pd.Index): ...


class MyDaskSeries(dx.Series):
    _partition_type = MySeries


class MyDaskDataFrame(dx.DataFrame):
    _partition_type = MyDataFrame


class MyDaskIndex(dx.Index):
    _partition_type = MyIndex


# Unclear if any of get_parallel_type and make_meta_dispatch are needed.
# Reproduces with or without them.
@get_parallel_type.register(MyDataFrame)
def get_parallel_type_dataframe(df):
    return MyDaskDataFrame


@get_parallel_type.register(MySeries)
def get_parallel_type_series(s):
    return MyDaskSeries


@get_parallel_type.register(MyIndex)
def get_parallel_type_index(ind):
    return MyDaskIndex


@make_meta_dispatch.register(MyDataFrame)
def make_meta_dataframe(df, index=None):
    return df.head(0)


@make_meta_dispatch.register(MySeries)
def make_meta_series(s, index=None):
    return s.head(0)


@make_meta_dispatch.register(MyIndex)
def make_meta_index(ind, index=None):
    return ind[:0]


@meta_nonempty.register(MyDataFrame)
def make_meta_nonempty_dataframe(x):
    return MyDataFrame(dask.dataframe.backends.meta_nonempty_dataframe(x))


df = dx.from_dict(
    {"a": [1, 1, 2, 2], "b": [1, 2, 3, 4]}, npartitions=4, constructor=MyDataFrame
)

a = df.groupby("a").agg("first")
b = df.groupby("a").agg("first", split_out=2)

print("split-out=None", type(a.compute()))
print("split-out=2   ", type(b.compute()))
```
Running that produces `MyDataFrame` for the default case, but a plain pandas DataFrame when `split_out=2`. I would expect the type to be `__main__.MyDataFrame` regardless of `split_out`.
Anything else we need to know?:
Environment:
- dask 2024.4.1
- dask-expr 1.0.11
Edit: I made one addition to the script: adding a `@meta_nonempty.register(MyDataFrame)`. I noticed that in `DecomposableGroupbyAggregation.combine` and `DecomposableGroupbyAggregation.aggregate` the types were regular pandas DataFrames instead of the subclass.

Registering that `meta_nonempty` does keep it as `MyDataFrame` initially. I put some print statements in those methods to print `type(inputs[0])` and `type(_concat(inputs))`: initially we're OK, but by the time we do the final `aggregate` we've lost the subclass.
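For reference, `_concat` itself (dask's internal helper, the same one used in those methods) preserves the subclass, which points at the shuffle step rather than concatenation. A quick check, assuming the definitions from `test.py` above are in scope:

```python
# Sanity check: dask's _concat wraps pd.concat for pandas-backed data, and
# pd.concat respects _constructor, so the subclass survives concatenation.
# The subclass must therefore be dropped elsewhere in the split_out path.
from dask.dataframe.core import _concat

parts = [MyDataFrame({"a": [1], "b": [1]}), MyDataFrame({"a": [2], "b": [2]})]
print(type(parts[0]))        # <class '__main__.MyDataFrame'>
print(type(_concat(parts)))  # also MyDataFrame
```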
*TomAugspurger changed the title from "DataFrame subclass in groupby.agg with split_out set." to "DataFrame subclass lost in groupby.agg with split_out set." on Apr 14, 2024.*
This is a shuffle issue (and also present in the current implementation, if I am not mistaken?).

`df.shuffle("a")` will lose your type; that's what we do under the hood if `split_out != 1`. `shuffle_method="tasks"` keeps it, `disk` and `p2p` lose it.

I can patch that so that your resulting DataFrame will have the correct type, but I don't know if we can guarantee keeping whatever you might add to the subclass through shuffles without you overriding the shuffle-specific methods.
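To see it in isolation, a sketch reusing `MyDataFrame` and `dx` from the script above (the `shuffle_method` values are the ones named in this comment):

```python
# Isolates the shuffle behavior: with split_out != 1, groupby.agg shuffles
# under the hood, and the result type depends on the shuffle method.
df = dx.from_dict(
    {"a": [1, 1, 2, 2], "b": [1, 2, 3, 4]}, npartitions=4, constructor=MyDataFrame
)
print(type(df.shuffle("a", shuffle_method="tasks").compute()))  # subclass kept
print(type(df.shuffle("a", shuffle_method="disk").compute()))   # subclass lost
```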