perf: make sure X.schema is only calculated once per dataframe in TypeSelector #676
if exclude:
    return df.select(functools.reduce(lambda x, y: x | y, include) - functools.reduce(lambda x, y: x | y, exclude))
return df.select(functools.reduce(lambda x, y: x | y, include))
unfortunately Polars doesn't have an empty selector, else this could've been simpler
That would have been sooooo much cleaner!
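For readers following along, here is a rough sketch of why an empty selector would help. It uses plain Python sets as stand-ins for selectors (an assumption for illustration only, not Polars or sklego code): with an identity element for `|`, the `if exclude:` branch is no longer needed.

```python
import functools
import operator

# Stand-ins for selectors (hypothetical): each one is the set of column names it matches.
numeric = {"x"}
string = {"a"}
boolean = {"z"}


def select(include, exclude):
    """Return union(include) - union(exclude) with no special case for an empty list."""
    # set() plays the role of the missing "empty selector": it is the identity for |,
    # so reduce() also works when the list is empty.
    included = functools.reduce(operator.or_, include, set())
    excluded = functools.reduce(operator.or_, exclude, set())
    return included - excluded


print(sorted(select([numeric, string, boolean], [boolean])))
print(sorted(select([numeric, string], [])))

# Without an initial value, reduce() cannot handle an empty list, which is why the
# code under review has to branch on `exclude`.
try:
    functools.reduce(operator.or_, [])
except TypeError as exc:
    print(f"TypeError: {exc}")
```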
LGTM!
I have one concern though: I realized that Polars selectors select columns in the order the selectors are given and do not maintain the original column order.
Consider the following example:
import polars as pl
import polars.selectors as cs
df = pl.DataFrame({
    "a": list("abc"),
    "x": [1, 2, 3],
    "z": [True, False, True],
})
df.select(cs.numeric() | cs.string())
shape: (3, 2)
┌─────┬─────┐
│ x   ┆ a   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ a   │
│ 2   ┆ b   │
│ 3   ┆ c   │
└─────┴─────┘
Is this something we should warn the user about, or otherwise take into consideration?
cc: @koaning
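One way to picture the ordering concern, and a possible way around it, is the sketch below. It is a simplification that works on a plain dict mapping column names to type labels (the labels and both helper names are made up for illustration, not taken from sklego): selecting type by type follows the order of the requested types, while a single pass over the schema keeps the original column order.

```python
# Hypothetical schema: column name -> coarse type label, in the dataframe's column order.
schema = {"a": "string", "x": "number", "z": "bool"}


def select_by_type_order(schema, wanted):
    """One pass per requested type: the output follows the order of `wanted`."""
    return [name for kind in wanted for name, k in schema.items() if k == kind]


def select_by_schema_order(schema, wanted):
    """One pass over the schema: the output follows the original column order."""
    wanted = set(wanted)
    return [name for name, k in schema.items() if k in wanted]


print(select_by_type_order(schema, ["number", "string"]))    # ['x', 'a']
print(select_by_schema_order(schema, ["number", "string"]))  # ['a', 'x']
```

Iterating over the schema rather than over the requested types is what keeps `a` ahead of `x` in the second call.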
@@ -10,26 +11,15 @@
from sklego.common import as_list


def _nw_match_dtype(dtype, selection):
def _nw_match_dtype(selection):
Big +1 here 👌
that's a good point, thanks - it should be possible to preserve the order.
thanks for your review!
I don't think it's worth using selectors here, especially if you want to preserve the order of the columns and if X.schema needs calculating anyway.
So instead, I've refactored the code to make sure that X.schema is only ever calculated once per dataframe, and is passed to _nw_select_dtypes. This is better for LazyFrames.
I've also expanded test coverage.
@@ -43,16 +44,22 @@ def _nw_select_dtypes(df, include: str | list[str], exclude: str | list[str]):
    if isinstance(exclude, str):
        exclude = [exclude]

    include = include or ["string", "number", "bool", "category"]
I don't think this logic was right to begin with. If include isn't passed, it should keep everything except for what's in exclude - not just the string/number/bool/category cols not in exclude...
I've added a test which covers this
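To make the behavioural difference concrete, here is a minimal sketch over a plain dict schema. The `select_columns` helper, the type labels, and the `datetime` column are assumptions for illustration and do not mirror the actual `_nw_select_dtypes` implementation.

```python
# The default that the removed line fell back to when `include` was not passed.
OLD_DEFAULT = ["string", "number", "bool", "category"]

# Hypothetical schema: column name -> coarse type label.
schema = {"a": "string", "x": "number", "d": "datetime", "z": "bool"}


def select_columns(schema, include=None, exclude=None):
    """Keep columns whose type is in `include` (all types if unset) and not in `exclude`."""
    include = [include] if isinstance(include, str) else include
    exclude = [exclude] if isinstance(exclude, str) else (exclude or [])
    return [
        name
        for name, kind in schema.items()
        if (not include or kind in include) and kind not in exclude
    ]


# Behaviour argued for in the comment: everything except the excluded types is kept.
print(select_columns(schema, exclude="bool"))                       # ['a', 'x', 'd']
# Behaviour with the old default: the datetime column is silently dropped as well.
print(select_columns(schema, include=OLD_DEFAULT, exclude="bool"))  # ['a', 'x']
```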
def _nw_select_dtypes(df, include: str | list[str], exclude: str | list[str]):
def _nw_select_dtypes(include: str | list[str], exclude: str | list[str], schema: dict[str, Any]):
rather than passing df and repeatedly calling df.schema, we can just calculate schema once upfront and pass that
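A small sketch of the idea, using a stand-in class whose `schema` property counts how often it is accessed. `FakeLazyFrame` and the helper functions are hypothetical and only model the cost; they are not Polars or Narwhals objects.

```python
class FakeLazyFrame:
    """Stand-in for a lazy frame whose schema is costly to resolve (hypothetical)."""

    def __init__(self, schema):
        self._schema = schema
        self.schema_calls = 0

    @property
    def schema(self):
        # Each access stands for a potentially expensive schema resolution.
        self.schema_calls += 1
        return dict(self._schema)


def columns_of_kind(schema, kind):
    return [name for name, k in schema.items() if k == kind]


def select_repeated(df, kinds):
    # Touches df.schema once per requested type.
    return [name for kind in kinds for name in columns_of_kind(df.schema, kind)]


def select_once(df, kinds):
    # Resolves the schema a single time and passes the plain dict around.
    schema = df.schema
    return [name for kind in kinds for name in columns_of_kind(schema, kind)]


columns = {"a": "string", "x": "number", "z": "bool"}
df_repeated, df_once = FakeLazyFrame(columns), FakeLazyFrame(columns)
select_repeated(df_repeated, ["number", "string"])
select_once(df_once, ["number", "string"])
print(df_repeated.schema_calls, df_once.schema_calls)  # 2 1
```

Both functions return the same columns; only the number of schema lookups differs.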
transformed_df = _nw_select_dtypes(X, include=self.include, exclude=self.exclude)
transformed_df = X.select(
    _nw_select_dtypes(include=self.include, exclude=self.exclude, schema=X_schema)
).pipe(nw.to_native)
looks like we (well, probably I 😳 ) forgot the to_native part at the end
I've added a test which asserts that the output is actually in the native class 👍
I probably lost the "ready for review" notification! Well... I finally made it 😁
It looks amazing! Thanks @MarcoGorelli 👌
Description
Refactors TypeSelector to use Narwhals selectors, as opposed to hard-coding types
This is preferable because, for Polars LazyFrames, getting the schema isn't a free operation (especially if the original dataframe is on the cloud - LazyFrame.schema may get renamed to LazyFrame.collect_schema for this reason)
I've also expanded the tests