Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add selectors #895

Merged
merged 17 commits into from
Apr 29, 2024
Merged

Add selectors #895

merged 17 commits into from
Apr 29, 2024

Conversation

jeromedockes
Copy link
Member

This adds the (private) skrub._selectors module.
It is kind of a weird PR because we decided to keep it private for now, and it is not yet used anywhere in skrub 😅

However it is a step towards column-wise transformations (#877), and in the longer term towards the skrub Pipe/Recipe/..., so selectors will have a reason to exist at some point.

Moreover, I think in the future we will probably want to expose them to advanced users and accept them in skrub interfaces that currently accept a list of column names such as SelectCols (in addition to the list of columns, not instead).

The selectors achieve 2 things:

1: Delayed selection: passing a selection rule that describes what columns should be selected, when the dataframe on which the rule is evaluated and expanded into a concrete list of column names is not available yet. For example, instantiating a SelectCols that selects "all numeric columns except the column named 'ID'", even though the actual dataframe (and concrete list of column names) is only known at fit time, not instantiation time

A consequence is 1b: have a good value to represent "all columns" as a default argument for interfaces that expect a column list (None to mean "all" is not great, _selectors.all() is better).

2: Easily expressing complex selection rules because (i) the _selectors module offers a range of useful pre-defined rules such as numeric(), cardinality_below(30) or glob('User_*'), and (ii) because selectors can be combined with the operators ~, |, &, -, ^.

The docstring of skrub/_selectors/__init__.py provides more details but here is a summary:

  • Selectors have a .expand(dataframe) method that returns the list of column names that should be selected.
  • make_selector turns a column name, list of column names, or selector into a selector.
  • select returns the part of a dataframe matched by a selector (ie indexes the dataframe with the result of expand).
>>> from skrub import _selectors as s

>>> import pandas as pd
>>> df = pd.DataFrame(
...     {
...         "height_mm": [297.0, 420.0],
...         "width_mm": [210.0, 297.0],
...         "kind": ["A4", "A3"],
...         "ID": [4, 3],
...     }
... )

>>> s.all().expand(df)
['height_mm', 'width_mm', 'kind', 'ID']
>>> (s.all() - ['ID']).expand(df)
['height_mm', 'width_mm', 'kind']
>>> s.numeric().expand(df)
['height_mm', 'width_mm', 'ID']
>>> (s.glob('*_mm') | s.string()).expand(df)
['height_mm', 'width_mm', 'kind']
>>> s.cols('kind', 'ID').expand(df)
['kind', 'ID']
>>> s.make_selector(['kind', 'ID'])
cols('kind', 'ID')
>>> s.select(df, s.glob('*_mm'))
   height_mm  width_mm
0      297.0     210.0
1      420.0     297.0

@jeromedockes
Copy link
Member Author

I'm not sure why codecov reports some lines as missing. They are covered by the tests, for example the reported line here is covered by this test; when running the tests locally I see 100% coverage for the _selectors module

Copy link
Contributor

@TheooJ TheooJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jeromedockes !
I mostly have small nitpicks about markdown, but it looks very good to me !



@is_integer.specialize("pandas")
def _is_integer_pandas(column):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nitpick here, your function has a parameter column whereas some other funcs use col, like _to_list_pandas for instance

I'd harmonize this for consistency

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I 100% agree, however looking at the module there seems to be an almost balanced mix of col and columns (my bad). so let's harmonize it in another PR as this one isn't really related to the dataframe module so doing it here would obfuscate the diff with a ton of unrelated changes

skrub/_dataframe/_common.py Outdated Show resolved Hide resolved
skrub/_dataframe/_common.py Outdated Show resolved Hide resolved
skrub/_dataframe/_common.py Outdated Show resolved Hide resolved
return column.dtype.is_integer()


@dispatch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as for _is_integer

skrub/_selectors/_base.py Show resolved Hide resolved
skrub/_selectors/_selectors.py Outdated Show resolved Hide resolved
skrub/_selectors/_selectors.py Show resolved Hide resolved
skrub/_selectors/_selectors.py Outdated Show resolved Hide resolved
skrub/_selectors/tests/test_selectors.py Show resolved Hide resolved
jeromedockes and others added 2 commits April 22, 2024 15:21
Co-authored-by: Théo Jolivet <57430673+TheooJ@users.noreply.github.com>
Copy link
Member Author

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks a lot for the thorough review @TheooJ !

skrub/_selectors/__init__.py Show resolved Hide resolved
skrub/_selectors/_base.py Show resolved Hide resolved
skrub/_selectors/tests/test_selectors.py Show resolved Hide resolved
skrub/_selectors/__init__.py Outdated Show resolved Hide resolved
Copy link
Contributor

@TheooJ TheooJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :) thanks

@jeromedockes jeromedockes removed the request for review from glemaitre April 23, 2024 12:43
@TheooJ TheooJ merged commit f819ea8 into skrub-data:main Apr 29, 2024
25 of 27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants