feat: Make grouped and hierarchical dataframe-agnostic #667

FBruzzesi · 2024-05-11T20:12:50Z

Description

Addresses Grouped* and Hierarchical, following up on the comment I added in the thread. I will paste it here for easier visibility:

I am having a hard time thinking of another way of implementing meta estimators that use pandas just for its grouping functionality. Namely, HierarchicalPredictors and Grouped* do not necessarily take a dataframe as input, but use pandas groupby to iterate over different batches of data.

Would this be possible using pure numpy so that we don't need the pandas dependency? yes

Would this be significantly more inconvenient to maintain? absolutely yes

My opinion/suggestion is to leave them as they are in their logic.

If anything, we can allow arbitrary dataframe types as input, and use Narwhals to convert them to pandas and let the rest flow as is.
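To make the "pandas only for grouping" point concrete, here is a minimal sketch of the iteration pattern described above. It assumes the input has already been converted to a pandas DataFrame; the column names `g` and `x` and the per-group mean (standing in for fitting a real estimator) are illustrative assumptions, not the actual sklego internals.

```python
import pandas as pd

# Hypothetical toy data: one grouping column "g" and one feature "x".
df = pd.DataFrame({"g": ["A", "A", "B", "B"], "x": [1.0, 3.0, 10.0, 20.0]})

# Iterate over the batches produced by groupby and "fit" one model per group.
# The per-group mean is only a stand-in for fitting a real estimator.
fitted = {}
for key, batch in df.groupby("g"):
    fitted[key] = float(batch["x"].mean())

print(fitted)
```

Each `batch` is an ordinary DataFrame holding only the rows of one group, which is why the rest of the logic can stay unchanged once the input is a pandas frame.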

pyproject.toml

sklego/meta/hierarchical_predictor.py

FBruzzesi · 2024-05-11T20:17:21Z

tests/test_meta/test_hierarchical_predictor.py

+    X_ = (
+        pd.DataFrame(X, columns=[f"x_{i}" for i in range(X.shape[1])])
+        .assign(
+            g_0=1,
+            g_1=["A"] * (n_samples // 2) + ["B"] * (n_samples // 2),
+            g_2=["X"] * (n_samples // 4) + ["Y"] * (n_samples // 2) + ["Z"] * (n_samples // 4),
+        )
+        .pipe(frame_func)


@MarcoGorelli here is where I encountered the with_columns issue. My first intention was to do it as:

    (
        frame_func(dict(zip([f"x_{i}" for i in range(X.shape[1])], X.T)))
        .pipe(nw.from_native)
        .with_columns(
            g_0=1,
            g_1=["A"] * (n_samples // 2) + ["B"] * (n_samples // 2),
            g_2=["X"] * (n_samples // 4) + ["Y"] * (n_samples // 2) + ["Z"] * (n_samples // 4),
        )
        .pipe(nw.to_native)
    )

and tried with a bunch of variations (lists, tuples, numpy arrays)

I think Polars doesn't do what you're expecting it to here 😉

    In [34]: frame = pl.DataFrame({"a": [1,2,3]})

    In [35]: frame.with_columns(b=['a', 'b', 'c'])
    Out[35]:
    shape: (3, 2)
    ┌─────┬─────────────────┐
    │ a   ┆ b               │
    │ --- ┆ ---             │
    │ i64 ┆ list[str]       │
    ╞═════╪═════════════════╡
    │ 1   ┆ ["a", "b", "c"] │
    │ 2   ┆ ["a", "b", "c"] │
    │ 3   ┆ ["a", "b", "c"] │
    └─────┴─────────────────┘

I think what we need is a Series constructor. Something like

    df.with_columns(
        g_1=nw.Series(
            ["A"] * (n_samples // 2) + ["B"] * (n_samples // 2),
            implementation=nw.get_implementation(X),
        )
    )

?

We need to pass something to nw.Series, otherwise it doesn't know which library's Series to create

An alternative API could be

    plx = nw.get_namespace(X)
    df.with_columns(
        g_1=plx.Series(["A"] * (n_samples // 2) + ["B"] * (n_samples // 2))
    )

which...might be a bit cleaner?

Maybe we should move this conversation to the issue 😇

I think Polars doesn't do what you're expecting it to here 😉

You are right, but it does with numpy.

We need to pass something to nw.Series, otherwise it doesn't know which library's Series to create

Wouldn't with_columns have some context (from self)? (As mentioned on Discord, I need to catch up a lot on the internals; currently this is black magic to me.)

it does with numpy.

Just pushed out a release which can handle numpy 😎 It may take a few minutes to appear

Wouldn't with_columns have some context (from self)?

Yeah, I'm just trying to figure out how to stay compatible with the Polars API. Because if you want to create a new column with [1, 2, 3] in Polars, you'd do

df.with_columns(d=pl.Series([1,2,3]))

So I think we need to instruct the user to construct a nw.Series, in such a way that the series can exist in a self-standing way without the df.with_columns context. I'm not really sure here...will sleep on it

But the numpy array case is easy, so let's start with that 😇

MarcoGorelli · 2024-05-11T20:22:54Z

🤔 are you sure about adding pyarrow as required? I think we should be able to solve this one

but use pandas groupby to iterate over different batches of data.

Narwhals' DataFrame.GroupBy also lets you iterate over groups - I'll take a closer look at what's required, but this seems feasible

FBruzzesi · 2024-05-11T20:34:15Z

@MarcoGorelli oh I see what you mean! Regarding hierarchical, I am 100% sure it is doable, and I can do it tomorrow myself (Edit: ok, maybe let's say 99%, as I need some missing features)

For Grouped, internals are a bit messy, but I will give it a try.

sklego/meta/grouped_transformer.py

FBruzzesi · 2024-05-12T21:44:30Z

I am really pushing the boundaries for GroupedPredictor; I think I can finish this with one more pass.

Current status is:

GroupedTransformer passes all the tests, except one in which the group index is provided as negative (honestly, who would do that 😂)
HierarchicalPredictor is failing only one test as well (tests/test_meta/test_hierarchical_predictor.py::test_expected_output), but only for the polars case, which could be due to the fact that the estimator cannot deal with polars?
GroupedPredictor needs some more patches to close the deal.

FBruzzesi · 2024-05-13T06:57:46Z

sklego/meta/hierarchical_predictor.py

-        _X = grp_frame.drop(columns=self.groups_ + [self._TARGET_NAME])
-        _y = grp_frame[self._TARGET_NAME]
+        _X = nw.to_native(grp_frame.drop([*self.groups_, self._TARGET_NAME]))
+        _y = nw.to_native(grp_frame.select(self._TARGET_NAME))
        return clone(self.estimator).fit(_X, _y)


Here I end up with the following issue in the polars case.

If _X has no feature columns left (e.g. the dummy case in the test_expected_output test), pandas keeps the original number of rows after dropping all the columns (I guess because the index is preserved?), whereas for polars we end up with _X.shape being (0, 0).

Which then leads to the estimator raising:

ValueError: Found input variables with inconsistent numbers of samples

We could say this is kind of an edge case, but users do rely on it: I remember a related issue (and a tutorial by Vincent on it).

@MarcoGorelli do you have any ideas on how to deal with this?

Just to have more context, do you happen to remember which issue it was? Or the tutorial? I searched "hierarchical empty" here and came up with nothing.

numpy does allow for things like np.zeros((6, 0)), and I presume things are getting converted to numpy at some point here anyway? So maybe, making an empty numpy array with a given shape somewhere can be a solution?

(I haven't yet looked into this code so sorry if this doesn't make sense)
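A small sketch of the shape behaviour discussed above, assuming only pandas and NumPy. The frame, its column names and `n_samples` are made up for illustration; this only shows the shapes involved and is not the fix that was eventually applied in the PR.

```python
import numpy as np
import pandas as pd

n_samples = 6
# Hypothetical frame with only a group column and a target column.
frame = pd.DataFrame({"g": ["A"] * n_samples, "target": np.arange(n_samples)})

# pandas keeps the row index when every column is dropped,
# so the number of rows survives even though no columns are left.
X_pandas = frame.drop(columns=["g", "target"])

# A zero-column NumPy array with an explicit row count, as suggested above.
X_empty = np.zeros((n_samples, 0))

print(X_pandas.shape, X_empty.shape)
```

Both objects report `n_samples` rows and zero columns, so a sample-count consistency check against `y` would see matching lengths.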

I may also need a reminder 😅 what tutorial are you recalling here?

In issue #573 someone reported that specific case (ok maybe not directly from a tutorial 😂)

FBruzzesi · 2024-05-19T13:09:14Z

Fixed all the issues I had in the conversion and marking this as ready to review. I am quite happy with the results 🙌🏼🙌🏼

The only breaking change from the previous implementation is that negative indices can no longer be provided in groups.

cc: @koaning @MarcoGorelli

MarcoGorelli · 2024-05-19T15:04:47Z

Wow, amazing!

The only breaking change from the previous implementation is that negative indices can no longer be provided in groups.

This is only when starting from non-dataframe input (e.g. numpy) right? If so, then would a patch like

diff --git a/sklego/meta/_grouped_utils.py b/sklego/meta/_grouped_utils.py
index 97180ab..e744f5a 100644
--- a/sklego/meta/_grouped_utils.py
+++ b/sklego/meta/_grouped_utils.py
@@ -20,7 +20,6 @@ def parse_X_y(X, y, groups, check_X=True, **kwargs) -> nw.DataFrame:
     _data_format_checks(X)
 
     # Convert X to Narwhals frame
-    X = nw.from_native(X, strict=False, eager_only=True)
     if not isinstance(X, nw.DataFrame):
         X = nw.from_native(pd.DataFrame(X))
 
diff --git a/sklego/meta/grouped_transformer.py b/sklego/meta/grouped_transformer.py
index 9f6b0d9..9c2eac3 100644
--- a/sklego/meta/grouped_transformer.py
+++ b/sklego/meta/grouped_transformer.py
@@ -109,6 +109,9 @@ class GroupedTransformer(BaseEstimator, TransformerMixin):
         self.fallback_ = None
         self.groups_ = as_list(self.groups) if self.groups is not None else None
 
+        X = nw.from_native(X, strict=False, eager_only=True)
+        if not isinstance(X, nw.DataFrame) and self.groups_ is not None:
+            self.groups_ = [X.shape[1]+group if isinstance(group, int) and group<0 else group for group in self.groups_]
         frame = parse_X_y(X, y, self.groups_, check_X=self.check_X, **self._check_kwargs)
 
         if self.groups is None:
@@ -181,6 +184,7 @@ class GroupedTransformer(BaseEstimator, TransformerMixin):
         """
         check_is_fitted(self, ["fallback_", "transformers_"])
 
+        X = nw.from_native(X, strict=False, eager_only=True)
         frame = parse_X_y(X, y=None, groups=self.groups_, check_X=self.check_X, **self._check_kwargs).drop(
             "__sklego_target__"
         )
diff --git a/tests/test_meta/test_grouped_transformer.py b/tests/test_meta/test_grouped_transformer.py
index b9aae2c..18b28d5 100644
--- a/tests/test_meta/test_grouped_transformer.py
+++ b/tests/test_meta/test_grouped_transformer.py
@@ -291,7 +291,7 @@ def test_array_with_multiple_string_cols(penguins):
     X = penguins
 
     # BROKEN: Failing due to negative indexing... kind of an edge case
-    meta = GroupedTransformer(StandardScaler(), groups=[0, X.shape[1] - 1])
+    meta = GroupedTransformer(StandardScaler(), groups=[0, -1])
 
     transformed = meta.fit_transform(X)

allow you to preserve current behaviour?
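The core of the patch above is the list comprehension that rewrites negative integer indices as positive ones before the array is turned into a frame. As a standalone sketch (the helper name `normalize_groups` and its `n_columns` parameter are hypothetical, not part of sklego):

```python
def normalize_groups(groups, n_columns):
    """Map negative integer column indices to their positive equivalents.

    Non-integer entries (e.g. column names) are returned unchanged.
    """
    return [
        n_columns + group if isinstance(group, int) and group < 0 else group
        for group in groups
    ]


# With 5 columns, -1 refers to the last column (index 4).
print(normalize_groups([0, -1], n_columns=5))
# String column names pass through untouched.
print(normalize_groups(["species", -2], n_columns=5))
```

Doing this only when the input is not already a dataframe matches the condition in the patch, since integer positions are meaningful for array input while dataframes are addressed by column name.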

FBruzzesi · 2024-05-19T17:08:38Z

That's a nice trick! Thanks Marco! Pushing it now

FBruzzesi · 2024-05-23T13:35:52Z

This should be ready for review. Once we merge into narwhals-development, I would say that we will have covered the large majority of #658, and it would be time for a (quite large) release. @koaning WDYT?

koaning

I have been glancing over these changes for the last two weeks and have not seen anything that stands out, or at least ... I recall making some comments, but these have all since been addressed.

I say we push!

* placeholder to develop narwhals features
* feat: make `ColumnDropper` dataframe-agnostic (#655)
* feat: make ColumnDropped dataframe-agnostic
* use narwhals[polars] in pyproject.toml, link to list of supported libraries
* note that narwhals is used for cross-dataframe support
* test refactor
* docstrings
---------
Co-authored-by: FBruzzesi <francesco.bruzzesi.93@gmail.com>
* feat: make ColumnSelector dataframe-agnostic (#659)
* columnselector with test rufformatted
* adding whitespace
* fixed the fit and transform
* removed intendation in examples
* font:false
* feat: make `add_lags` dataframe-agnostic (#661)
* make add_lags dataframe-agnostic
* try getting tests to run?
* patch: cvxpy 1.5.0 support (#663)
---------
Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>
* Make `RegressionOutlier` dataframe-agnostic (#665)
* make regression outlier df-agnostic
* need to use eager-only for this one
* pass native to check_array
* remove cudf, link to check_X_y
* feat: Make InformationFilter dataframe-agnostic
* Make Timegapsplit dataframe-agnostic (#668)
* make timegapsplit dataframe-agnostic
* actually, include cuDF
* feat: make FairClassifier data-agnostic (#669)
* start all over
* fixture working
* wip
* passing tests - again
* pre-commit complaining
* changed fixture on test_demographic_parity
* feat: Make PandasTypeSelector selector dataframe-agnostic (#670)
* make pandas dtype selector df-agnostic
* bump version
* 3.8 compat
* Update sklego/preprocessing/pandastransformers.py
Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>
* fixup pyproject.toml
* unify (and test!) error message
* deprecate
* update readme
* undo contribution.md change
---------
Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>
* format typeselector and bump version
* feat: Make grouped and hierarchical dataframe-agnostic (#667)
* feat: make grouped and hierarchical dataframe-agnostic
* add pyarrow
* narwhals grouped_transformer
* grouped transformer eureka
* hierarchical narwhalified
* so close but so far
* return series instead of DataFrame for y
* grouped WIP
* merge branch and fix grouped
* future annotations
* format
* handling negative indices
* solve conflicts
* hacking C
* fairness: change C values in tests
---------
Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com>
Co-authored-by: Magdalena Anopsy <74981211+anopsy@users.noreply.github.com>
Co-authored-by: Dea María Léon <deamarialeon@gmail.com>

FBruzzesi added 2 commits May 11, 2024 20:03

feat: make grouped and hierarchical dataframe-agnostic

91482e1

add pyarrow

f623c71

FBruzzesi mentioned this pull request May 11, 2024

[FEATURE] Narwhals migration for dataframe-agnostic codebase #658

Closed

FBruzzesi changed the title ~~feat: make grouped and hierarchical dataframe-agnostic~~ WIP: make grouped and hierarchical dataframe-agnostic May 12, 2024

narwhals grouped_transformer

7bee656

FBruzzesi added 3 commits May 12, 2024 19:48

grouped transformer eureka

014cc86

hierarchical narwhalified

d1cafd6

so close but so far

cfec03e

return series instead of DataFrame for y

c1417a7

FBruzzesi and others added 4 commits May 15, 2024 06:23

grouped WIP

b4d9dec

Merge branch 'narwhals-development' into feat/hierarchical-agnostic

3151d6d

Merge branch 'feat/hierarchical-agnostic' of https://github.com/koani…

c91a47c

…ng/scikit-lego into feat/hierarchical-agnostic

merge branch and fix grouped

d3e7deb

FBruzzesi changed the title ~~WIP: make grouped and hierarchical dataframe-agnostic~~ feat: Make grouped and hierarchical dataframe-agnostic May 19, 2024

FBruzzesi marked this pull request as ready for review May 19, 2024 13:06

FBruzzesi added 2 commits May 19, 2024 15:11

future annotations

3b5235c

format

7f94522

handling negative indices

7196e2a

Merge branch 'narwhals-development' into feat/hierarchical-agnostic

774187b

solve conflicts

0847d07

koaning approved these changes May 23, 2024

koaning merged commit 3d1e996 into narwhals-development May 23, 2024
17 checks passed

koaning deleted the feat/hierarchical-agnostic branch May 23, 2024 14:28
