Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: make ColumnDropper dataframe-agnostic #655

Conversation

MarcoGorelli
Copy link
Contributor

@MarcoGorelli MarcoGorelli commented May 5, 2024

Hey 👋 I discussed a bit with Francesco, who discussed it with Vincent. So, I was keen to get the ball rolling

Description

Making a start towards dataframe-agnosticism. For now, this only makes ColumnDropped dataframe-agnostic. It requires adding an extremely lightweight dependecy (Narwhals), but the result is that computation can happen natively for pandas/Polars/modin/cuDF (and any other library which may want to become Narwhals-compliant), without any data conversion.

pandas is still a required dependency, as it's required for the rest of scikit-lego. But, it doesn't seem that far-fetched to be able to make all of scikit-lego dataframe-agnostic. If you're open to this, we could do it in stages?

There is no impact on current users, other than that Narwhals would be a required dependency (though it's really lightweight, and without any dependencies itself, so there's no risk for conflicts)

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the style guidelines (ruff)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (also to the readme.md)
  • I have added tests that prove my fix is effective or that my feature works
  • I have added tests to check whether the new feature adheres to the sklearn convention
  • New and existing unit tests pass locally with my changes

Not sure about some of these items so I haven't checked them.


Demo

pandas users, they can keep using scikit-lego, without needing Polars to be installed

$ pip uninstall polars -y
Found existing installation: polars 0.20.23
Uninstalling polars-0.20.23:
  Successfully uninstalled polars-0.20.23
$ cat t.py 
import pandas as pd
from sklego.preprocessing import ColumnDropper

df = pd.DataFrame({
    "name": ["Swen", "Victor", "Alex"],
    "length": [1.82, 1.85, 1.80],
    "shoesize": [42, 44, 45]
})
print(ColumnDropper(["name"]).fit_transform(df))
$ python t.py 
   length  shoesize
0    1.82        42
1    1.85        44
2    1.80        45

But, Polars users can use this, without it doing any conversion to pandas. In particular, Polars LazyFrame is supported (and stays lazy, but here's I'm calling .collect to show you the output):

$ cat t.py 
import polars as pl
from sklego.preprocessing import ColumnDropper

df = pl.LazyFrame({
    "name": ["Swen", "Victor", "Alex"],
    "length": [1.82, 1.85, 1.80],
    "shoesize": [42, 44, 45]
})
result = ColumnDropper(["name"]).fit_transform(df)
print(result.explain())
print(result.collect())
$ python t.py 
DF ["name", "length", "shoesize"]; PROJECT 2/3 COLUMNS; SELECTION: "None"
shape: (3, 2)
┌────────┬──────────┐
│ length ┆ shoesize │
│ ---    ┆ ---      │
│ f64    ┆ i64      │
╞════════╪══════════╡
│ 1.82   ┆ 42       │
│ 1.85   ┆ 44       │
│ 1.8    ┆ 45       │
└────────┴──────────┘

@MarcoGorelli MarcoGorelli force-pushed the dataframe-agnostic-columndropper branch from 97225a1 to 8f22154 Compare May 5, 2024 13:50
Comment on lines -154 to -158
@staticmethod
def _check_X_for_type(X):
"""Checks if input of the Selector is of the required dtype"""
if not isinstance(X, pd.DataFrame):
raise TypeError("Provided variable X is not of type pandas.DataFrame")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm removing this here, as nw.from_native already raises a similar error. For example, if you tried passing in a numpy array here, you'd get:

TypeError: Expected pandas-like dataframe, Polars dataframe, or Polars lazyframe, got: <class 'numpy.ndarray'>

@MarcoGorelli MarcoGorelli marked this pull request as ready for review May 5, 2024 13:57
@MarcoGorelli MarcoGorelli changed the title feat: make ColumnDropped dataframe-agnostic feat: make ColumnDropper dataframe-agnostic May 5, 2024
@MarcoGorelli MarcoGorelli force-pushed the dataframe-agnostic-columndropper branch from 8f22154 to f626eaf Compare May 5, 2024 18:52
Copy link
Collaborator

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly it looks quite neat! I left some considerations for better sync with narwhals in the long term.

Edit: Regarding macos tests, we are getting a lot of fails since latest upgraded to macos-14. We definitely need to get a closer look at those, but completely unrelated to the PR

pyproject.toml Outdated Show resolved Hide resolved
sklego/preprocessing/pandastransformers.py Outdated Show resolved Hide resolved
@koaning
Copy link
Owner

koaning commented May 6, 2024

@FBruzzesi notice those failing test runs, I have a creeping suspicion that these may be caused by uv and they feel similar to the failing cronjobs. Just to check, did you have a look?

@MarcoGorelli MarcoGorelli force-pushed the dataframe-agnostic-columndropper branch from bf23559 to a311696 Compare May 6, 2024 20:02
@FBruzzesi
Copy link
Collaborator

@koaning I finally took a look at the CI...and you are right 😁

As a side note: even with pip, the --pre flag break as of scikit-learn==1.5.0rc1 (proof on my fork).

@FBruzzesi
Copy link
Collaborator

FBruzzesi commented May 7, 2024

@koaning another food for thought is that... all these transformers are in a ../pandastransformers.py module, however this should be transparent to the user as they are import-able from sklego.preprocessing directly. I would feel fairly safe to change the filename once all of them are dataframe agnostic

@koaning
Copy link
Owner

koaning commented May 7, 2024

@FBruzzesi any objections to moving back to pip for now? I prefer to wait until uv is a bit more reliable before relying our CI on it.

I made an issue for that v1.5rc01 test failure on the sklearn side, feels like it might be breaking so upstream should know about it.

I would feel fairly safe to change the filename once all of them are dataframe agnostic.

Totally!

@FBruzzesi FBruzzesi mentioned this pull request May 7, 2024
3 tasks
Copy link
Collaborator

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am biased as I added a few commits myself 😂

@koaning
Copy link
Owner

koaning commented May 7, 2024

Do we want to release early with a micro release by the way or would we prefer to make a bigger splash? There's something to be said for both, but given that we are introducing a dependency ... might be good to make a larger release?

@anopsy
Copy link
Contributor

anopsy commented May 7, 2024

If everybody is fine with it, I could finish pandastransformers.py, or at least have a shot at ColumnSelector and see how it goes

@FBruzzesi
Copy link
Collaborator

@koaning I would personally opt for one somewhat big release. We can wait to merge the PRs related to Narwhals migration to have some space for bug fixes if needed.

@anopsy Of course feel free to develop that. I will create an issue to track progress 😇

@MarcoGorelli
Copy link
Contributor Author

Just got a question about

We can wait to merge the PRs related to Narwhals migration to have some space for bug fixes if needed.

No objections to waiting for a bigger release to include this, I'm just worried that it's going to become quite hard to review if everything is done in a single PR, or if PRs build on top of other open PRs

May I suggest that you (i.e. a scikit-lego committer) make a narwhals branch in scikit-lego, we make PRs to that branch, and then if/when you're ready, you merge the narwhals branch into main and include it in the release?

@FBruzzesi
Copy link
Collaborator

@MarcoGorelli that's very reasonable! Here you have it: narwhals-development branch.

@MarcoGorelli MarcoGorelli changed the base branch from main to narwhals-development May 7, 2024 18:19
@koaning
Copy link
Owner

koaning commented May 7, 2024

FWIW I think pushing to main is fine too. It's more of a "oh, we have one new feature, usually that means deploy". But we can also collect a bunch of things to main ... this is the main thing we're going to work on short-term.

@FBruzzesi FBruzzesi merged commit aac1d9d into koaning:narwhals-development May 7, 2024
17 checks passed
koaning pushed a commit that referenced this pull request May 24, 2024
* placeholder to develop narwhals features

* feat: make `ColumnDropper` dataframe-agnostic (#655)

* feat: make ColumnDropped dataframe-agnostic

* use narwhals[polars] in pyproject.toml, link to list of supported libraries

* note that narwhals is used for cross-dataframe support

* test refactor

* docstrings

---------

Co-authored-by: FBruzzesi <francesco.bruzzesi.93@gmail.com>

* feat: make ColumnSelector dataframe-agnostic (#659)

* columnselector with test rufformatted

* adding whitespace

* fixed the fit and transform

* removed intendation in examples

* font:false

* feat: make `add_lags` dataframe-agnostic (#661)

* make add_lags dataframe-agnostic

* try getting tests to run?

* patch: cvxpy 1.5.0 support (#663)

---------

Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>

* Make `RegressionOutlier` dataframe-agnostic (#665)

* make regression outlier df-agnostic

* need to use eager-only for this one

* pass native to check_array

* remove cudf, link to check_X_y

* feat: Make InformationFilter dataframe-agnostic

* Make Timegapsplit dataframe-agnostic (#668)

* make timegapsplit dataframe-agnostic

* actually, include cuDF

* feat: make FairClassifier data-agnostic (#669)

* start all over

* fixture working

* wip

* passing tests - again

* pre-commit complaining

* changed fixture on test_demographic_parity

* feat: Make PandasTypeSelector selector dataframe-agnostic (#670)

* make pandas dtype selector df-agnostic

* bump version

* 3.8 compat

* Update sklego/preprocessing/pandastransformers.py

Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>

* fixup pyproject.toml

* unify (and test!) error message

* deprecate

* update readme

* undo contribution.md change

---------

Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>

* format typeselector and bump version

* feat: Make grouped and hierarchical dataframe-agnostic (#667)

* feat: make grouped and hierarchical dataframe-agnostic

* add pyarrow

* narwhals grouped_transformer

* grouped transformer eureka

* hierarchical narwhalified

* so close but so far

* return series instead of DataFrame for y

* grouped WIP

* merge branch and fix grouped

* future annotations

* format

* handling negative indices

* solve conflicts

* hacking C

* fairness: change C values in tests

---------

Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com>
Co-authored-by: Magdalena Anopsy <74981211+anopsy@users.noreply.github.com>
Co-authored-by: Dea María Léon <deamarialeon@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants