perf: make sure X.schema is only calculated once per dataframe in TypeSelector #676

MarcoGorelli · 2024-06-02T14:15:21Z

Description

Refactors TypeSelector to use Narwhals selectors, as opposed to hard-coding types

This is preferable because, for Polars LazyFrames, getting the schema isn't a free operation (especially if the original dataframe is on the cloud - LazyFrame.schema may get renamed to LazyFrame.collect_schema for this reason)

I've also expanded the tests

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

My code follows the style guidelines (ruff)
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation (also to the readme.md)
I have added tests that prove my fix is effective or that my feature works
I have added tests to check whether the new feature adheres to the sklearn convention
New and existing unit tests pass locally with my changes

MarcoGorelli · 2024-06-02T14:16:40Z

sklego/preprocessing/pandastransformers.py

+    if exclude:
+        return df.select(functools.reduce(lambda x, y: x | y, include) - functools.reduce(lambda x, y: x | y, exclude))
+    return df.select(functools.reduce(lambda x, y: x | y, include))


unfortunately Polars doesn't have an empty selector, else this could've been simpler
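For illustration only, here is a minimal sketch of the reduce pattern in the hunk above. Plain Python sets of column names stand in for selector objects (an assumption made so the example runs without Polars); `combine` and the column names are hypothetical. Sets support the same `|` and `-` operators, and calling `functools.reduce` without an initial value shows why an "empty selector" would have helped: there is no neutral element to start from, so an empty `include` fails.

```python
import functools
import operator

# Sets of column names stand in for selectors (illustrative assumption).
numeric = {"x", "y"}
string = {"a"}
boolean = {"z"}


def combine(include, exclude=None):
    """Union everything in `include`, then subtract the union of `exclude`."""
    # Without an initial value, reduce() raises TypeError on an empty sequence.
    selected = functools.reduce(operator.or_, include)
    if exclude:
        selected = selected - functools.reduce(operator.or_, exclude)
    return selected


print(sorted(combine([numeric, string, boolean], exclude=[boolean])))  # ['a', 'x', 'y']
```

With an empty selector available as the initial value, the `if exclude` branch could collapse into a single expression.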

FBruzzesi

LGTM!

I have one concern though: I realized that Polars selectors select columns in the order the selectors are given and do not maintain the original column order.

Consider the following example:

import polars as pl
import polars.selectors as cs

df = pl.DataFrame({
    "a": list("abc"),
    "x": [1,2,3],
    "z": [True, False, True],
})

df.select(cs.numeric() | cs.string())

shape: (3, 2)
┌─────┬─────┐
│ x   ┆ a   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ a   │
│ 2   ┆ b   │
│ 3   ┆ c   │
└─────┴─────┘

Is this something we should warn the user about or take into consideration?
cc: @koaning
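One way to keep the original column order, sketched under assumptions: walk the schema (which is ordered like the frame's columns) and keep the names whose dtype matches. The schema below is a hypothetical name-to-label mapping and `select_in_schema_order` is an illustrative helper, not the code in this PR.

```python
# Hypothetical schema: column name -> dtype label, in the frame's column order.
schema = {"a": "string", "x": "number", "z": "bool"}


def select_in_schema_order(schema, wanted):
    """Return column names whose dtype is wanted, in schema order."""
    wanted = set(wanted)
    return [name for name, dtype in schema.items() if dtype in wanted]


# The order of `wanted` does not affect the order of the result.
print(select_in_schema_order(schema, ["number", "string"]))  # ['a', 'x']
```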

FBruzzesi · 2024-06-02T16:39:26Z

sklego/preprocessing/pandastransformers.py

@@ -10,26 +11,15 @@
 from sklego.common import as_list


-def _nw_match_dtype(dtype, selection):
+def _nw_match_dtype(selection):


Big +1 here 👌

FBruzzesi · 2024-06-02T16:41:32Z

sklego/preprocessing/pandastransformers.py

+    if exclude:
+        return df.select(functools.reduce(lambda x, y: x | y, include) - functools.reduce(lambda x, y: x | y, exclude))
+    return df.select(functools.reduce(lambda x, y: x | y, include))


That would have been sooooo much cleaner!

MarcoGorelli · 2024-06-05T06:40:24Z

I have one concern though: I realized that Polars selectors select columns in the order the selectors are given and do not maintain the original column order.

That's a good point, thanks - it should be possible to preserve the order. df.schema needs calling here anyway - I'll update, making sure that df.schema is called no more than once per dataframe

MarcoGorelli

thanks for your review!

I don't think it's worth using selectors here, especially if you want to preserve the order of the columns and if X.schema needs calculating anyway

So instead, I've refactored the code to make sure that X.schema is only ever calculated once per dataframe, and is passed to _nw_select_dtypes. This is better for LazyFrames

I've also expanded test coverage

MarcoGorelli · 2024-06-09T20:14:23Z

sklego/preprocessing/pandastransformers.py

@@ -43,16 +44,22 @@ def _nw_select_dtypes(df, include: str | list[str], exclude: str | list[str]):
    if isinstance(exclude, str):
        exclude = [exclude]

-    include = include or ["string", "number", "bool", "category"]


I don't think this logic was right to begin with. If include isn't passed, it should keep everything except for what's in exclude - not just the string/number/bool/category cols not in exclude...
I've added a test which covers this
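A minimal sketch of the intended semantics, assuming a schema given as a plain name-to-label dict. `select_columns` and the dtype labels are hypothetical and only illustrate the rule described above: with no `include`, every column is kept unless its dtype is in `exclude`.

```python
def select_columns(schema, include=None, exclude=None):
    """Keep columns matching `include` (all columns if unset) and not in `exclude`."""
    include = [include] if isinstance(include, str) else list(include or [])
    exclude = [exclude] if isinstance(exclude, str) else list(exclude or [])
    return [
        name
        for name, dtype in schema.items()
        if (not include or dtype in include) and dtype not in exclude
    ]


schema = {"a": "string", "x": "number", "d": "datetime", "z": "bool"}

# "d" survives: a fixed default include list would have dropped it.
print(select_columns(schema, exclude="bool"))  # ['a', 'x', 'd']
```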

MarcoGorelli · 2024-06-09T20:15:56Z

sklego/preprocessing/pandastransformers.py

-def _nw_select_dtypes(df, include: str | list[str], exclude: str | list[str]):
+def _nw_select_dtypes(include: str | list[str], exclude: str | list[str], schema: dict[str, Any]):


rather than passing df and repeatedly calling df.schema, we can just calculate schema once upfront and pass that
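To make the cost argument concrete, here is a small sketch with a hypothetical `CountingFrame` that counts schema lookups. It stands in for a lazy frame whose schema is expensive to resolve; neither class nor helper is part of sklego, Narwhals, or Polars.

```python
class CountingFrame:
    """Hypothetical stand-in for a frame whose schema is costly to resolve."""

    def __init__(self, schema):
        self._schema = schema
        self.schema_calls = 0

    @property
    def schema(self):
        # Each access simulates one (potentially expensive) schema resolution.
        self.schema_calls += 1
        return dict(self._schema)


def columns_of(schema, dtype):
    """Select column names from an already-resolved schema."""
    return [name for name, dt in schema.items() if dt == dtype]


frame = CountingFrame({"a": "string", "x": "number", "z": "bool"})

schema = frame.schema  # resolved once, up front
numeric = columns_of(schema, "number")
strings = columns_of(schema, "string")

print(frame.schema_calls)  # 1
```

Had `columns_of` taken the frame and read `frame.schema` itself, the counter would grow with every call.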

MarcoGorelli · 2024-06-09T20:16:36Z

sklego/preprocessing/pandastransformers.py

-            transformed_df = _nw_select_dtypes(X, include=self.include, exclude=self.exclude)
+            transformed_df = X.select(
+                _nw_select_dtypes(include=self.include, exclude=self.exclude, schema=X_schema)
+            ).pipe(nw.to_native)


looks like we (well, probably I 😳 ) forgot the to_native part at the end

I've added a test which asserts that the output is actually in the native class 👍

FBruzzesi

I probably lost the "ready for review" notification! Well... I finally made it 😁
It looks amazing! Thanks @MarcoGorelli 👌

use selectors in TypeSelector

6563839

MarcoGorelli commented Jun 2, 2024

FBruzzesi reviewed Jun 2, 2024

MarcoGorelli marked this pull request as draft June 5, 2024 06:40

you only calculate schema once

50d7f1e

MarcoGorelli changed the title ~~chore: use selectors in TypeSelector~~ perf: make sure X.schema is only calculated once per dataframe in TypeSelector Jun 9, 2024

simplify

906de55

MarcoGorelli commented Jun 9, 2024

MarcoGorelli marked this pull request as ready for review June 10, 2024 20:30

FBruzzesi approved these changes Jun 14, 2024

FBruzzesi merged commit 2b32533 into koaning:main Jun 14, 2024
17 checks passed

FBruzzesi mentioned this pull request Jun 15, 2024

patch: .drop(columns=...) test hotfix #678

Merged
