-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Make grouped and hierarchical dataframe-agnostic #667
Conversation
X_ = ( | ||
pd.DataFrame(X, columns=[f"x_{i}" for i in range(X.shape[1])]) | ||
.assign( | ||
g_0=1, | ||
g_1=["A"] * (n_samples // 2) + ["B"] * (n_samples // 2), | ||
g_2=["X"] * (n_samples // 4) + ["Y"] * (n_samples // 2) + ["Z"] * (n_samples // 4), | ||
) | ||
.pipe(frame_func) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarcoGorelli here is where I encountered the with_columns
issue. My first intention was to do it as:
(
frame_func(dict(zip([f"x_{i}" for i in range(X.shape[1])]), X.T)
.pipe(nw.from_native)
.with_columns(
g_0=1,
g_1=["A"] * (n_samples // 2) + ["B"] * (n_samples // 2),
g_2=["X"] * (n_samples // 4) + ["Y"] * (n_samples // 2) + ["Z"] * (n_samples // 4),
)
.pipe(nw.to_native)
)
and tried with a bunch of variations (lists, tuples, numpy arrays)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Polars doesn't do what you're expecting it to here 😉
In [34]: frame = pl.DataFrame({"a": [1,2,3]})
In [35]: frame.with_columns(b=['a', 'b', 'c'])
Out[35]:
shape: (3, 2)
┌─────┬─────────────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞═════╪═════════════════╡
│ 1 ┆ ["a", "b", "c"] │
│ 2 ┆ ["a", "b", "c"] │
│ 3 ┆ ["a", "b", "c"] │
└─────┴─────────────────┘
I think what we need is a Series constructor. Something like
df.with_columns(
g_1=nw.Series(
["A"] * (n_samples // 2) + ["B"] * (n_samples // 2),
implementation=nw.get_implementation(X),
)
)
?
We need to pass something to nw.Series
, otherwise it doesn't know which library's Series to create
An alternative API could be
plx = nw.get_namespace(X)
df.with_columns(
g_1=plx.Series(["A"] * (n_samples // 2) + ["B"] * (n_samples // 2))
)
which...might be a bit cleaner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should move this conversation to the issue 😇
I think Polars doesn't do what you're expecting it to here 😉
You are right, but it does with numpy.
We need to pass something to nw.Series, otherwise it doesn't know which library's Series to create
Wouldn't with_columns
has some context (from self)? (As mentioned on Discord I need to catch up a lot with the internals, currently this is black magic to me)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it does with numpy.
Just pushed out a release which can handle numpy 😎 It may take a few minutes to appear
Wouldn't with_columns has some context (from self)?
Yeah, I'm just trying to figure out how to stay compatible with the Polars API. Because if you want to create a new column with [1, 2, 3]
in Polars, you'd do
df.with_columns(d=pl.Series([1,2,3]))
So I think we need to instruct the user to construct a nw.Series
, in such a way that the series can exist in a self-standing way without the df.with_columns
context. I'm not really sure here...will sleep on it
But the numpy array case is easy, so let's start with that 😇
🤔 are you sure about adding pyarrow as required? I think we should be able to solve this one
Narwhals' |
@MarcoGorelli oh I see what you mean! Regarding hierarchical I am 100% sure it is doable, I can do it tomorrow myself (Edit: ok maybe let's say 99 as I need some missing features) For Grouped, internals are a bit messy, but I will give it a try. |
I am really pushing the boundaries for Current status is:
|
_X = grp_frame.drop(columns=self.groups_ + [self._TARGET_NAME]) | ||
_y = grp_frame[self._TARGET_NAME] | ||
_X = nw.to_native(grp_frame.drop([*self.groups_, self._TARGET_NAME])) | ||
_y = nw.to_native(grp_frame.select(self._TARGET_NAME)) | ||
return clone(self.estimator).fit(_X, _y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I am ending up with the following issue in case of polars.
If _X is empty (e.g. the dummy case in test_expected_output
test), while for pandas dropping all the columns, still maintains the original shape (I guess because index is preserved?), for polars we end up with _X.shape
being (0, 0).
Which then leads to the estimator raising:
ValueError: Found input variables with inconsistent numbers of samples
We could say this is kind of an edge case, but users use this as I remember a related issue (and Vincent tutorial on it).
@MarcoGorelli do you have any ideas on how to deal with this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to have more context, do you happen to remember which issue it was? or the tutorial? I searched "hierarchical empty" here and came out with nothing
numpy does allow for things like np.zeros((6, 0))
, and I presume things are getting converted to numpy at some point here anyway? So maybe, making an empty numpy array with a given shape somewhere can be a solution?
(I haven't yet looked into this code so sorry if this doesn't make sense)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may also need a reminder 😅 what tutorial are you recalling here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In issue #573 someone reported that specific case (ok maybe not directly from a tutorial 😂)
Fixed all the issues I had in the conversion and marking this as ready to review. I am quite happy with the results 🙌🏼🙌🏼 The only breaking change from previous version/implementation is the unfeasibility of providing negative indices in |
Wow, amazing!
This is only when starting from non-dataframe input (e.g. numpy) right? If so, then would a patch like diff --git a/sklego/meta/_grouped_utils.py b/sklego/meta/_grouped_utils.py
index 97180ab..e744f5a 100644
--- a/sklego/meta/_grouped_utils.py
+++ b/sklego/meta/_grouped_utils.py
@@ -20,7 +20,6 @@ def parse_X_y(X, y, groups, check_X=True, **kwargs) -> nw.DataFrame:
_data_format_checks(X)
# Convert X to Narwhals frame
- X = nw.from_native(X, strict=False, eager_only=True)
if not isinstance(X, nw.DataFrame):
X = nw.from_native(pd.DataFrame(X))
diff --git a/sklego/meta/grouped_transformer.py b/sklego/meta/grouped_transformer.py
index 9f6b0d9..9c2eac3 100644
--- a/sklego/meta/grouped_transformer.py
+++ b/sklego/meta/grouped_transformer.py
@@ -109,6 +109,9 @@ class GroupedTransformer(BaseEstimator, TransformerMixin):
self.fallback_ = None
self.groups_ = as_list(self.groups) if self.groups is not None else None
+ X = nw.from_native(X, strict=False, eager_only=True)
+ if not isinstance(X, nw.DataFrame) and self.groups_ is not None:
+ self.groups_ = [X.shape[1]+group if isinstance(group, int) and group<0 else group for group in self.groups_]
frame = parse_X_y(X, y, self.groups_, check_X=self.check_X, **self._check_kwargs)
if self.groups is None:
@@ -181,6 +184,7 @@ class GroupedTransformer(BaseEstimator, TransformerMixin):
"""
check_is_fitted(self, ["fallback_", "transformers_"])
+ X = nw.from_native(X, strict=False, eager_only=True)
frame = parse_X_y(X, y=None, groups=self.groups_, check_X=self.check_X, **self._check_kwargs).drop(
"__sklego_target__"
)
diff --git a/tests/test_meta/test_grouped_transformer.py b/tests/test_meta/test_grouped_transformer.py
index b9aae2c..18b28d5 100644
--- a/tests/test_meta/test_grouped_transformer.py
+++ b/tests/test_meta/test_grouped_transformer.py
@@ -291,7 +291,7 @@ def test_array_with_multiple_string_cols(penguins):
X = penguins
# BROKEN: Failing due to negative indexing... kind of an edge case
- meta = GroupedTransformer(StandardScaler(), groups=[0, X.shape[1] - 1])
+ meta = GroupedTransformer(StandardScaler(), groups=[0, -1])
transformed = meta.fit_transform(X) allow you to preserve current behaviour? |
That's a nice trick! Thanks Marco! Pushing it now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had been glancing these changes over the last two weeks but have not seen anything that stands out, or at least ... I recall making some comments but these have all since been adressed.
I say we push!
* placeholder to develop narwhals features * feat: make `ColumnDropper` dataframe-agnostic (#655) * feat: make ColumnDropped dataframe-agnostic * use narwhals[polars] in pyproject.toml, link to list of supported libraries * note that narwhals is used for cross-dataframe support * test refactor * docstrings --------- Co-authored-by: FBruzzesi <francesco.bruzzesi.93@gmail.com> * feat: make ColumnSelector dataframe-agnostic (#659) * columnselector with test rufformatted * adding whitespace * fixed the fit and transform * removed intendation in examples * font:false * feat: make `add_lags` dataframe-agnostic (#661) * make add_lags dataframe-agnostic * try getting tests to run? * patch: cvxpy 1.5.0 support (#663) --------- Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com> * Make `RegressionOutlier` dataframe-agnostic (#665) * make regression outlier df-agnostic * need to use eager-only for this one * pass native to check_array * remove cudf, link to check_X_y * feat: Make InformationFilter dataframe-agnostic * Make Timegapsplit dataframe-agnostic (#668) * make timegapsplit dataframe-agnostic * actually, include cuDF * feat: make FairClassifier data-agnostic (#669) * start all over * fixture working * wip * passing tests - again * pre-commit complaining * changed fixture on test_demographic_parity * feat: Make PandasTypeSelector selector dataframe-agnostic (#670) * make pandas dtype selector df-agnostic * bump version * 3.8 compat * Update sklego/preprocessing/pandastransformers.py Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com> * fixup pyproject.toml * unify (and test!) error message * deprecate * update readme * undo contribution.md change --------- Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com> * format typeselector and bump version * feat: Make grouped and hierarchical dataframe-agnostic (#667) * feat: make grouped and hierarchical dataframe-agnostic * add pyarrow * narwhals grouped_transformer * grouped transformer eureka * hierarchical narwhalified * so close but so far * return series instead of DataFrame for y * grouped WIP * merge branch and fix grouped * future annotations * format * handling negative indices * solve conflicts * hacking C * fairness: change C values in tests --------- Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com> Co-authored-by: Magdalena Anopsy <74981211+anopsy@users.noreply.github.com> Co-authored-by: Dea María Léon <deamarialeon@gmail.com>
Description
Addresses Grouped* and Hierarchical, following up to the comment I added in the thread. I will paste it for easier visibility: