Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Make grouped and hierarchical dataframe-agnostic #667

Merged
merged 16 commits into from
May 23, 2024

Conversation

FBruzzesi
Copy link
Collaborator

@FBruzzesi FBruzzesi commented May 11, 2024

Description

Addresses Grouped* and Hierarchical, following up to the comment I added in the thread. I will paste it for easier visibility:

I am having a hard time to think of another way of implementing meta estimators that use pandas just for it's grouping functionalities. Namely HierarchicalPredictors and Grouped* are not necessarily taking a dataframe input, but use pandas groupby's to iterate over different batch of data.

  • Would this be possible using pure numpy so that we don't need the pandas dependency? yes
  • Would this be significantly more inconvenient to maintain? absolutely yes

My opinion/suggestion is to leave them as they are in their logic.

If anything, we can allow arbitrary dataframe types as input, and use Narwhals to convert them to pandas and let the rest flow as is.

Comment on lines +45 to +52
X_ = (
pd.DataFrame(X, columns=[f"x_{i}" for i in range(X.shape[1])])
.assign(
g_0=1,
g_1=["A"] * (n_samples // 2) + ["B"] * (n_samples // 2),
g_2=["X"] * (n_samples // 4) + ["Y"] * (n_samples // 2) + ["Z"] * (n_samples // 4),
)
.pipe(frame_func)
Copy link
Collaborator Author

@FBruzzesi FBruzzesi May 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli here is where I encountered the with_columns issue. My first intention was to do it as:

(
frame_func(dict(zip([f"x_{i}" for i in range(X.shape[1])]), X.T)
.pipe(nw.from_native)
.with_columns(
    g_0=1,
    g_1=["A"] * (n_samples // 2) + ["B"] * (n_samples // 2),
    g_2=["X"] * (n_samples // 4) + ["Y"] * (n_samples // 2) + ["Z"] * (n_samples // 4),
)
.pipe(nw.to_native)
)

and tried with a bunch of variations (lists, tuples, numpy arrays)

Copy link
Contributor

@MarcoGorelli MarcoGorelli May 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Polars doesn't do what you're expecting it to here 😉

In [34]: frame = pl.DataFrame({"a": [1,2,3]})

In [35]: frame.with_columns(b=['a', 'b', 'c'])
Out[35]:
shape: (3, 2)
┌─────┬─────────────────┐
│ ab               │
│ ------             │
│ i64list[str]       │
╞═════╪═════════════════╡
│ 1   ┆ ["a", "b", "c"] │
│ 2   ┆ ["a", "b", "c"] │
│ 3   ┆ ["a", "b", "c"] │
└─────┴─────────────────┘

I think what we need is a Series constructor. Something like

df.with_columns(
    g_1=nw.Series(
        ["A"] * (n_samples // 2) + ["B"] * (n_samples // 2),
        implementation=nw.get_implementation(X),
    )
)

?

We need to pass something to nw.Series, otherwise it doesn't know which library's Series to create

An alternative API could be

plx = nw.get_namespace(X)
df.with_columns(
    g_1=plx.Series(["A"] * (n_samples // 2) + ["B"] * (n_samples // 2))
)

which...might be a bit cleaner?

Copy link
Collaborator Author

@FBruzzesi FBruzzesi May 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should move this conversation to the issue 😇

I think Polars doesn't do what you're expecting it to here 😉

You are right, but it does with numpy.

We need to pass something to nw.Series, otherwise it doesn't know which library's Series to create

Wouldn't with_columns has some context (from self)? (As mentioned on Discord I need to catch up a lot with the internals, currently this is black magic to me)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does with numpy.

Just pushed out a release which can handle numpy 😎 It may take a few minutes to appear

Wouldn't with_columns has some context (from self)?

Yeah, I'm just trying to figure out how to stay compatible with the Polars API. Because if you want to create a new column with [1, 2, 3] in Polars, you'd do

df.with_columns(d=pl.Series([1,2,3]))

So I think we need to instruct the user to construct a nw.Series, in such a way that the series can exist in a self-standing way without the df.with_columns context. I'm not really sure here...will sleep on it

But the numpy array case is easy, so let's start with that 😇

@MarcoGorelli
Copy link
Contributor

🤔 are you sure about adding pyarrow as required? I think we should be able to solve this one

but use pandas groupby's to iterate over different batch of data.

Narwhals' DataFrame.GroupBy also lets you iterate over groups - I'll take a closer look at what's required, but this seems feasible

@FBruzzesi
Copy link
Collaborator Author

FBruzzesi commented May 11, 2024

@MarcoGorelli oh I see what you mean! Regarding hierarchical I am 100% sure it is doable, I can do it tomorrow myself (Edit: ok maybe let's say 99 as I need some missing features)

For Grouped, internals are a bit messy, but I will give it a try.

@FBruzzesi FBruzzesi changed the title feat: make grouped and hierarchical dataframe-agnostic WIP: make grouped and hierarchical dataframe-agnostic May 12, 2024
@FBruzzesi
Copy link
Collaborator Author

I am really pushing the boundaries for GroupedPredictor, I think within another swipe I can finish this.

Current status is:

  • GroupedTransformer passes all the tests, except one in which the group index is provided as negative (honestly, who would do that 😂)
  • HierarchicalPredictor is failing only one test as well (tests/test_meta/test_hierarchical_predictor.py::test_expected_output), but only for the polars case, which could be due to the fact that the estimator cannot deal with polars?
  • GroupedPredictor needs some more patches to close the deal.

_X = grp_frame.drop(columns=self.groups_ + [self._TARGET_NAME])
_y = grp_frame[self._TARGET_NAME]
_X = nw.to_native(grp_frame.drop([*self.groups_, self._TARGET_NAME]))
_y = nw.to_native(grp_frame.select(self._TARGET_NAME))
return clone(self.estimator).fit(_X, _y)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I am ending up with the following issue in case of polars.

If _X is empty (e.g. the dummy case in test_expected_output test), while for pandas dropping all the columns, still maintains the original shape (I guess because index is preserved?), for polars we end up with _X.shape being (0, 0).

Which then leads to the estimator raising:

ValueError: Found input variables with inconsistent numbers of samples

We could say this is kind of an edge case, but users use this as I remember a related issue (and Vincent tutorial on it).

@MarcoGorelli do you have any ideas on how to deal with this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to have more context, do you happen to remember which issue it was? or the tutorial? I searched "hierarchical empty" here and came out with nothing

numpy does allow for things like np.zeros((6, 0)), and I presume things are getting converted to numpy at some point here anyway? So maybe, making an empty numpy array with a given shape somewhere can be a solution?

(I haven't yet looked into this code so sorry if this doesn't make sense)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may also need a reminder 😅 what tutorial are you recalling here?

Copy link
Collaborator Author

@FBruzzesi FBruzzesi May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In issue #573 someone reported that specific case (ok maybe not directly from a tutorial 😂)

@FBruzzesi FBruzzesi changed the title WIP: make grouped and hierarchical dataframe-agnostic feat: Make grouped and hierarchical dataframe-agnostic May 19, 2024
@FBruzzesi FBruzzesi marked this pull request as ready for review May 19, 2024 13:06
@FBruzzesi
Copy link
Collaborator Author

Fixed all the issues I had in the conversion and marking this as ready to review. I am quite happy with the results 🙌🏼🙌🏼

The only breaking change from previous version/implementation is the unfeasibility of providing negative indices in groups.

cc: @koaning @MarcoGorelli

@MarcoGorelli
Copy link
Contributor

Wow, amazing!

The only breaking change from previous version/implementation is the unfeasibility of providing negative indices in groups.

This is only when starting from non-dataframe input (e.g. numpy) right? If so, then would a patch like

diff --git a/sklego/meta/_grouped_utils.py b/sklego/meta/_grouped_utils.py
index 97180ab..e744f5a 100644
--- a/sklego/meta/_grouped_utils.py
+++ b/sklego/meta/_grouped_utils.py
@@ -20,7 +20,6 @@ def parse_X_y(X, y, groups, check_X=True, **kwargs) -> nw.DataFrame:
     _data_format_checks(X)
 
     # Convert X to Narwhals frame
-    X = nw.from_native(X, strict=False, eager_only=True)
     if not isinstance(X, nw.DataFrame):
         X = nw.from_native(pd.DataFrame(X))
 
diff --git a/sklego/meta/grouped_transformer.py b/sklego/meta/grouped_transformer.py
index 9f6b0d9..9c2eac3 100644
--- a/sklego/meta/grouped_transformer.py
+++ b/sklego/meta/grouped_transformer.py
@@ -109,6 +109,9 @@ class GroupedTransformer(BaseEstimator, TransformerMixin):
         self.fallback_ = None
         self.groups_ = as_list(self.groups) if self.groups is not None else None
 
+        X = nw.from_native(X, strict=False, eager_only=True)
+        if not isinstance(X, nw.DataFrame) and self.groups_ is not None:
+            self.groups_ = [X.shape[1]+group if isinstance(group, int) and group<0 else group for group in self.groups_]
         frame = parse_X_y(X, y, self.groups_, check_X=self.check_X, **self._check_kwargs)
 
         if self.groups is None:
@@ -181,6 +184,7 @@ class GroupedTransformer(BaseEstimator, TransformerMixin):
         """
         check_is_fitted(self, ["fallback_", "transformers_"])
 
+        X = nw.from_native(X, strict=False, eager_only=True)
         frame = parse_X_y(X, y=None, groups=self.groups_, check_X=self.check_X, **self._check_kwargs).drop(
             "__sklego_target__"
         )
diff --git a/tests/test_meta/test_grouped_transformer.py b/tests/test_meta/test_grouped_transformer.py
index b9aae2c..18b28d5 100644
--- a/tests/test_meta/test_grouped_transformer.py
+++ b/tests/test_meta/test_grouped_transformer.py
@@ -291,7 +291,7 @@ def test_array_with_multiple_string_cols(penguins):
     X = penguins
 
     # BROKEN: Failing due to negative indexing... kind of an edge case
-    meta = GroupedTransformer(StandardScaler(), groups=[0, X.shape[1] - 1])
+    meta = GroupedTransformer(StandardScaler(), groups=[0, -1])
 
     transformed = meta.fit_transform(X)

allow you to preserve current behaviour?

@FBruzzesi
Copy link
Collaborator Author

That's a nice trick! Thanks Marco! Pushing it now

@FBruzzesi
Copy link
Collaborator Author

FBruzzesi commented May 23, 2024

This should be ready for review. Once we merge into narwhals-development then I would say that we covered large majority of #658 and it would be time for a (quite large) release. @koaning WDYT?

Copy link
Owner

@koaning koaning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had been glancing these changes over the last two weeks but have not seen anything that stands out, or at least ... I recall making some comments but these have all since been adressed.

I say we push!

@koaning koaning merged commit 3d1e996 into narwhals-development May 23, 2024
17 checks passed
@koaning koaning deleted the feat/hierarchical-agnostic branch May 23, 2024 14:28
koaning pushed a commit that referenced this pull request May 24, 2024
* placeholder to develop narwhals features

* feat: make `ColumnDropper` dataframe-agnostic (#655)

* feat: make ColumnDropped dataframe-agnostic

* use narwhals[polars] in pyproject.toml, link to list of supported libraries

* note that narwhals is used for cross-dataframe support

* test refactor

* docstrings

---------

Co-authored-by: FBruzzesi <francesco.bruzzesi.93@gmail.com>

* feat: make ColumnSelector dataframe-agnostic (#659)

* columnselector with test rufformatted

* adding whitespace

* fixed the fit and transform

* removed intendation in examples

* font:false

* feat: make `add_lags` dataframe-agnostic (#661)

* make add_lags dataframe-agnostic

* try getting tests to run?

* patch: cvxpy 1.5.0 support (#663)

---------

Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>

* Make `RegressionOutlier` dataframe-agnostic (#665)

* make regression outlier df-agnostic

* need to use eager-only for this one

* pass native to check_array

* remove cudf, link to check_X_y

* feat: Make InformationFilter dataframe-agnostic

* Make Timegapsplit dataframe-agnostic (#668)

* make timegapsplit dataframe-agnostic

* actually, include cuDF

* feat: make FairClassifier data-agnostic (#669)

* start all over

* fixture working

* wip

* passing tests - again

* pre-commit complaining

* changed fixture on test_demographic_parity

* feat: Make PandasTypeSelector selector dataframe-agnostic (#670)

* make pandas dtype selector df-agnostic

* bump version

* 3.8 compat

* Update sklego/preprocessing/pandastransformers.py

Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>

* fixup pyproject.toml

* unify (and test!) error message

* deprecate

* update readme

* undo contribution.md change

---------

Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>

* format typeselector and bump version

* feat: Make grouped and hierarchical dataframe-agnostic (#667)

* feat: make grouped and hierarchical dataframe-agnostic

* add pyarrow

* narwhals grouped_transformer

* grouped transformer eureka

* hierarchical narwhalified

* so close but so far

* return series instead of DataFrame for y

* grouped WIP

* merge branch and fix grouped

* future annotations

* format

* handling negative indices

* solve conflicts

* hacking C

* fairness: change C values in tests

---------

Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com>
Co-authored-by: Magdalena Anopsy <74981211+anopsy@users.noreply.github.com>
Co-authored-by: Dea María Léon <deamarialeon@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants