Add selectors #895
Thanks @jeromedockes !
I mostly have small nitpicks about markdown, but it looks very good to me !
@is_integer.specialize("pandas")
def _is_integer_pandas(column):
Small nitpick here: your function has a parameter `column`, whereas some other funcs use `col`, like `_to_list_pandas` for instance.
I'd harmonize this for consistency.
I 100% agree, however looking at the module there seems to be an almost balanced mix of `col` and `columns` (my bad). So let's harmonize it in another PR: this one isn't really related to the dataframe module, so doing it here would obfuscate the diff with a ton of unrelated changes.
return column.dtype.is_integer()

@dispatch
Same comment as for `_is_integer`
Co-authored-by: Théo Jolivet <57430673+TheooJ@users.noreply.github.com>
thanks a lot for the thorough review @TheooJ !
LGTM :) thanks
This adds the (private) `skrub._selectors` module.

It is kind of a weird PR because we decided to keep it private for now, and it is not yet used anywhere in skrub 😅
However it is a step towards column-wise transformations (#877), and in the longer term towards the skrub Pipe/Recipe/..., so selectors will have a reason to exist at some point.
Moreover, I think in the future we will probably want to expose them to advanced users and accept them in skrub interfaces that currently accept a list of column names such as SelectCols (in addition to the list of columns, not instead).
The selectors achieve 2 things:
1: Delayed selection: passing a selection rule that describes what columns should be selected, when the dataframe on which the rule is evaluated and expanded into a concrete list of column names is not available yet. For example, instantiating a `SelectCols` that selects "all numeric columns except the column named 'ID'", even though the actual dataframe (and concrete list of column names) is only known at `fit` time, not instantiation time.

A consequence is 1b: have a good value to represent "all columns" as a default argument for interfaces that expect a column list (`None` to mean "all" is not great, `_selectors.all()` is better).

2: Easily expressing complex selection rules because (i) the `_selectors` module offers a range of useful pre-defined rules such as `numeric()`, `cardinality_below(30)` or `glob('User_*')`, and (ii) because selectors can be combined with the operators `~`, `|`, `&`, `-`, `^`.

The docstring of `skrub/_selectors/__init__.py` provides more details but here is a summary:

- Selectors have an `.expand(dataframe)` method that returns the list of column names that should be selected.
- `make_selector` turns a column name, list of column names, or selector into a selector.
- `select` returns the part of a dataframe matched by a selector (i.e. indexes the dataframe with the result of `expand`).
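
To make the summary above concrete, here is a toy, self-contained sketch of the idea. It is not the code in `skrub/_selectors`: the `Selector` class, the use of a plain `dict` of lists as a stand-in for a dataframe, and the name `all_columns` (standing in for `_selectors.all()`) are assumptions made for illustration only, and the rules are simplified.

```python
from fnmatch import fnmatchcase


class Selector:
    """Toy selector wrapping a predicate (column_name, values) -> bool."""

    def __init__(self, predicate, name):
        self._predicate = predicate
        self._name = name

    def matches(self, col, values):
        return bool(self._predicate(col, values))

    def expand(self, df):
        # `df` is a dict {column_name: list_of_values}, a stand-in dataframe.
        return [col for col, values in df.items() if self.matches(col, values)]

    def _combine(self, other, op, symbol):
        other = make_selector(other)  # allow names and lists as operands
        return Selector(
            lambda c, v: op(self.matches(c, v), other.matches(c, v)),
            f"({self._name} {symbol} {other._name})",
        )

    def __invert__(self):
        return Selector(lambda c, v: not self.matches(c, v), f"(~{self._name})")

    def __and__(self, other):
        return self._combine(other, lambda a, b: a and b, "&")

    def __or__(self, other):
        return self._combine(other, lambda a, b: a or b, "|")

    def __sub__(self, other):
        return self._combine(other, lambda a, b: a and not b, "-")

    def __xor__(self, other):
        return self._combine(other, lambda a, b: a != b, "^")

    def __repr__(self):
        return self._name


def make_selector(obj):
    """Turn a column name, list of column names, or selector into a selector."""
    if isinstance(obj, Selector):
        return obj
    names = [obj] if isinstance(obj, str) else list(obj)
    return Selector(lambda c, v: c in names, f"cols({names!r})")


def select(df, selector):
    """Return the part of `df` matched by `selector`."""
    return {col: df[col] for col in make_selector(selector).expand(df)}


def all_columns():
    return Selector(lambda c, v: True, "all_columns()")


def numeric():
    # Simplified rule: every value is an int or float (bools excluded).
    def is_numeric(c, v):
        return all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in v)

    return Selector(is_numeric, "numeric()")


def glob(pattern):
    return Selector(lambda c, v: fnmatchcase(c, pattern), f"glob({pattern!r})")


def cardinality_below(threshold):
    # Matches columns with strictly fewer than `threshold` distinct values.
    return Selector(lambda c, v: len(set(v)) < threshold, f"cardinality_below({threshold})")


# The rule is built before any dataframe exists (delayed selection) ...
rule = numeric() - "ID"

# ... and only expanded once the data is available.
df = {
    "ID": [1, 2, 3],
    "User_name": ["a", "b", "a"],
    "User_age": [30, 41, 25],
    "score": [0.5, 0.7, 0.9],
}
print(rule, "->", rule.expand(df))
print(select(df, glob("User_*") | "score"))
```

In this sketch every selector is just a predicate over one column, which is why the operators compose so easily: `~`, `|`, `&`, `-` and `^` only combine booleans, and nothing touches the data until `expand` is called.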