Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: make sure X.schema is only calculated once per dataframe in TypeSelector #676

Merged
merged 3 commits into from
Jun 14, 2024

Conversation

MarcoGorelli
Copy link
Contributor

Description

Refactors TypeSelector to use Narwhals selectors, as opposed to hard-coding types

This is preferable because, for Polars LazyFrames, getting the schema isn't a free operation (especially if the original dataframe is on the cloud - LazyFrame.schema may get renamed to LazyFrame.collect_schema for this reason)

I've also expanded the tests

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the style guidelines (ruff)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (also to the readme.md)
  • I have added tests that prove my fix is effective or that my feature works
  • I have added tests to check whether the new feature adheres to the sklearn convention
  • New and existing unit tests pass locally with my changes

Comment on lines 43 to 45
if exclude:
return df.select(functools.reduce(lambda x, y: x | y, include) - functools.reduce(lambda x, y: x | y, exclude))
return df.select(functools.reduce(lambda x, y: x | y, include))
Copy link
Contributor Author

@MarcoGorelli MarcoGorelli Jun 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately Polars doesn't have an empty selector, else this could've been simpler

Copy link
Collaborator

@FBruzzesi FBruzzesi Jun 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would have been sooooo much cleaner!

Copy link
Collaborator

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I have one concern thought: I realized that polars selectors select in the order they are given and do not maintain the original order.

Consider the following example:

import polars
import polars.selectors as cs

df = pl.DataFrame({
    "a": list("abc"),
    "x": [1,2,3],
    "z": [True, False, True],
})

df.select(cs.numeric() | cs.string())
shape: (3, 2)
┌─────┬─────┐
│ x   ┆ a   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ a   │
│ 2   ┆ b   │
│ 3   ┆ c   │
└─────┴─────┘

Is this something we should warn the user or take into consideration?
cc: @koaning

@@ -10,26 +11,15 @@
from sklego.common import as_list


def _nw_match_dtype(dtype, selection):
def _nw_match_dtype(selection):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big +1 here 👌

Comment on lines 43 to 45
if exclude:
return df.select(functools.reduce(lambda x, y: x | y, include) - functools.reduce(lambda x, y: x | y, exclude))
return df.select(functools.reduce(lambda x, y: x | y, include))
Copy link
Collaborator

@FBruzzesi FBruzzesi Jun 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would have been sooooo much cleaner!

@MarcoGorelli
Copy link
Contributor Author

I have one concern thought: I realized that polars selectors select in the order they are given and do not maintain the original order.

that's a good point, thanks - it should be possible to preserve the order. df.schema needs calling here anyway - i'll update, making sure that df.schema is called no more than once per dataframe

@MarcoGorelli MarcoGorelli marked this pull request as draft June 5, 2024 06:40
@MarcoGorelli MarcoGorelli changed the title chore: use selectors in TypeSelector perf: make sure X.schema is only calculated once per dataframe in TypeSelector Jun 9, 2024
Copy link
Contributor Author

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for your review!

I don't think it's worth using selectors here, especially if you want to preserve the order of the columns and if X.schema needs calculating anyway

So instead, I've refactored the code to make sure that X.schema is only ever calculated once per dataframe, and is passed to _nw_select_dtypes. This is better for LazyFrames

I've also expanded test coverage

@@ -43,16 +44,22 @@ def _nw_select_dtypes(df, include: str | list[str], exclude: str | list[str]):
if isinstance(exclude, str):
exclude = [exclude]

include = include or ["string", "number", "bool", "category"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this logic was right to begin with. If include isn't passed, it should keep everything except for what's in exclude - not just the string/number/bool/category cols not in exclude...
I've added a test which covers this

Comment on lines -37 to +38
def _nw_select_dtypes(df, include: str | list[str], exclude: str | list[str]):
def _nw_select_dtypes(include: str | list[str], exclude: str | list[str], schema: dict[str, Any]):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than passing df and repeatedly calling df.schema, we can just calculate schema once upfront and pass that

Comment on lines -387 to +395
transformed_df = _nw_select_dtypes(X, include=self.include, exclude=self.exclude)
transformed_df = X.select(
_nw_select_dtypes(include=self.include, exclude=self.exclude, schema=X_schema)
).pipe(nw.to_native)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like we (well, probably I 😳 ) forgot the to_native part at the end

I've added a test which asserts that the output is actually in the native class 👍

@MarcoGorelli MarcoGorelli marked this pull request as ready for review June 10, 2024 20:30
Copy link
Collaborator

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably lost the "ready for review" notification! Well... I finally made it 😁
It looks amazing! Thanks @MarcoGorelli 👌

@FBruzzesi FBruzzesi merged commit 2b32533 into koaning:main Jun 14, 2024
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants