Add selectors #895

jeromedockes · 2024-03-20T16:43:51Z

This adds the (private) skrub._selectors module.
It is kind of a weird PR because we decided to keep it private for now, and it is not yet used anywhere in skrub 😅

However it is a step towards column-wise transformations (#877), and in the longer term towards the skrub Pipe/Recipe/..., so selectors will have a reason to exist at some point.

Moreover, I think that in the future we will probably want to expose them to advanced users and accept them in skrub interfaces that currently accept a list of column names, such as SelectCols (in addition to the list of columns, not instead of it).

The selectors achieve 2 things:

1: Delayed selection: passing a selection rule that describes which columns should be selected, before the dataframe on which the rule will be evaluated (and expanded into a concrete list of column names) is available. For example, instantiating a SelectCols that selects "all numeric columns except the column named 'ID'", even though the actual dataframe (and concrete list of column names) is only known at fit time, not at instantiation time.

A consequence is 1b: having a good value to represent "all columns" as a default argument for interfaces that expect a column list (None to mean "all" is not great, _selectors.all() is better).

2: Easily expressing complex selection rules, because (i) the _selectors module offers a range of useful pre-defined rules such as numeric(), cardinality_below(30) or glob('User_*'), and (ii) selectors can be combined with the operators ~, |, &, -, ^.
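To make the two points above concrete, here is a minimal toy sketch of how a combinable, delayed selection rule can work. It is not the code in skrub._selectors: the `Selector` class, the `all_`, `cols`, `numeric` and `glob` helpers and the dict-of-lists "table" are all simplified stand-ins written for illustration, and only the operator semantics (~, |, &, -, ^) follow the description above.

```python
from fnmatch import fnmatchcase


class Selector:
    """Toy delayed rule: a predicate (column name, values) -> bool."""

    def __init__(self, predicate):
        self.predicate = predicate

    def expand(self, table):
        # Evaluate the rule only now, keeping the table's column order.
        return [n for n, v in table.items() if self.predicate(n, v)]

    def __invert__(self):
        return Selector(lambda n, v: not self.predicate(n, v))

    def __or__(self, other):
        other = make_selector(other)
        return Selector(lambda n, v: self.predicate(n, v) or other.predicate(n, v))

    def __and__(self, other):
        other = make_selector(other)
        return Selector(lambda n, v: self.predicate(n, v) and other.predicate(n, v))

    def __sub__(self, other):
        # Difference: selected by self but not by other.
        return self & ~make_selector(other)

    def __xor__(self, other):
        other = make_selector(other)
        return Selector(lambda n, v: self.predicate(n, v) != other.predicate(n, v))


def all_():
    return Selector(lambda n, v: True)


def cols(*names):
    return Selector(lambda n, v: n in names)


def glob(pattern):
    return Selector(lambda n, v: fnmatchcase(n, pattern))


def numeric():
    # Toy dtype check on plain Python values (bool is excluded on purpose).
    return Selector(
        lambda n, v: all(
            isinstance(x, (int, float)) and not isinstance(x, bool) for x in v
        )
    )


def make_selector(obj):
    """Accept a selector, a single column name, or a list of column names."""
    if isinstance(obj, Selector):
        return obj
    if isinstance(obj, str):
        return cols(obj)
    return cols(*obj)


# The rules are built first; the table only shows up when expand() is called.
table = {
    "height_mm": [297.0, 420.0],
    "width_mm": [210.0, 297.0],
    "kind": ["A4", "A3"],
    "ID": [4, 3],
}
print((all_() - ["ID"]).expand(table))  # ['height_mm', 'width_mm', 'kind']
print((numeric() & ~cols("ID")).expand(table))  # ['height_mm', 'width_mm']
print((glob("*_mm") ^ numeric()).expand(table))  # ['ID']
```

Storing a predicate rather than a list of names is what lets the rule be written before any dataframe exists, and coercing the right-hand operand with `make_selector` is what allows mixing selectors with plain column names.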

The docstring of skrub/_selectors/__init__.py provides more details but here is a summary:

Selectors have a .expand(dataframe) method that returns the list of column names that should be selected.
make_selector turns a column name, list of column names, or selector into a selector.
select returns the part of a dataframe matched by a selector (i.e. it indexes the dataframe with the result of expand).

>>> from skrub import _selectors as s

>>> import pandas as pd
>>> df = pd.DataFrame(
...     {
...         "height_mm": [297.0, 420.0],
...         "width_mm": [210.0, 297.0],
...         "kind": ["A4", "A3"],
...         "ID": [4, 3],
...     }
... )

>>> s.all().expand(df)
['height_mm', 'width_mm', 'kind', 'ID']
>>> (s.all() - ['ID']).expand(df)
['height_mm', 'width_mm', 'kind']
>>> s.numeric().expand(df)
['height_mm', 'width_mm', 'ID']
>>> (s.glob('*_mm') | s.string()).expand(df)
['height_mm', 'width_mm', 'kind']
>>> s.cols('kind', 'ID').expand(df)
['kind', 'ID']
>>> s.make_selector(['kind', 'ID'])
cols('kind', 'ID')
>>> s.select(df, s.glob('*_mm'))
   height_mm  width_mm
0      297.0     210.0
1      420.0     297.0
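As an illustration of points 1 and 1b, the sketch below shows how a transformer could keep a rule at instantiation time and only resolve it into column names at fit time, with an explicit "all columns" rule as the default instead of None. Everything here is hypothetical: `ToySelectCols`, `AllColumns` and `NumericExcept` are made-up names, tables are plain dicts of lists, and this is not how skrub's SelectCols is implemented.

```python
class AllColumns:
    """Hypothetical rule meaning "every column"; more explicit than None."""

    def expand(self, table):
        return list(table)


class NumericExcept:
    """Hypothetical rule: numeric columns whose name is not excluded."""

    def __init__(self, *excluded):
        self.excluded = set(excluded)

    def expand(self, table):
        return [
            name
            for name, values in table.items()
            if name not in self.excluded
            and all(
                isinstance(x, (int, float)) and not isinstance(x, bool)
                for x in values
            )
        ]


class ToySelectCols:
    """Toy transformer: the rule is stored in __init__, resolved in fit."""

    def __init__(self, cols=AllColumns()):
        self.cols = cols  # no table is needed yet

    def fit(self, table):
        # The concrete list of column names is only computed here.
        self.columns_ = self.cols.expand(table)
        return self

    def transform(self, table):
        # Reuse the names resolved at fit time, even on a different table.
        return {name: table[name] for name in self.columns_}


train = {
    "height_mm": [297.0, 420.0],
    "width_mm": [210.0, 297.0],
    "kind": ["A4", "A3"],
    "ID": [4, 3],
}
selector = ToySelectCols(NumericExcept("ID"))  # built before seeing any data
selector.fit(train)
print(selector.columns_)  # ['height_mm', 'width_mm']
print(ToySelectCols().fit(train).columns_)  # ['height_mm', 'width_mm', 'kind', 'ID']
```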

jeromedockes · 2024-03-21T09:48:55Z

I'm not sure why codecov reports some lines as missing. They are covered by the tests: for example, the reported line here is covered by this test. When running the tests locally I see 100% coverage for the _selectors module.

TheooJ

Thanks @jeromedockes !
I mostly have small nitpicks about markdown, but it looks very good to me !

TheooJ · 2024-03-26T15:18:02Z

skrub/_dataframe/_common.py

+
+
+@is_integer.specialize("pandas")
+def _is_integer_pandas(column):


Small nitpick here: your function has a parameter named column, whereas some other functions use col, like _to_list_pandas for instance.

I'd harmonize this for consistency.

jeromedockes · 2024-04-22

I 100% agree. However, looking at the module, there seems to be an almost balanced mix of col and column (my bad), so let's harmonize it in another PR: this one isn't really related to the dataframe module, and doing it here would obfuscate the diff with a ton of unrelated changes.

TheooJ · 2024-03-26T15:19:34Z

skrub/_dataframe/_common.py

+    return column.dtype.is_integer()
+
+
+@dispatch


Same comment as for _is_integer

jeromedockes

thanks a lot for the thorough review @TheooJ !

TheooJ

LGTM :) thanks

jeromedockes added 4 commits March 20, 2024 17:16

add selectors

c466458

fix to_numpy for old pandas

90c8d10

add integer() and float() selectors

7c6e0df

add test for select

ca77fad

jeromedockes added 4 commits March 21, 2024 11:46

start adding docstrings

d97cb40

more docstrings + unpacking in filter

8d5b698

more docstrings

ce491ad

change in pandas output display

f15c07c

jeromedockes requested a review from glemaitre March 21, 2024 14:55

TheooJ mentioned this pull request Mar 25, 2024

Use dispatch and module fixture in tests #896

Merged

jeromedockes added 2 commits March 26, 2024 09:38

add selector_repr param to Filter & add to __all__

c361687

add test

a64fa43

TheooJ requested changes Mar 26, 2024

jeromedockes and others added 2 commits April 22, 2024 15:21

code review on docstring

3b2d342

Apply suggestions from code review

29a8432

Co-authored-by: Théo Jolivet <57430673+TheooJ@users.noreply.github.com>

jeromedockes commented Apr 22, 2024

jeromedockes added 2 commits April 22, 2024 15:28

Merge branch 'selectors-pr' of github.com:jeromedockes/skrub into sel…

2d59925

…ectors-pr

improve test

976f1cf

TheooJ approved these changes Apr 23, 2024

jeromedockes removed the request for review from glemaitre April 23, 2024 12:43

jeromedockes added 3 commits April 24, 2024 14:04

RST details

ba5f0a1

add all, any, has_nulls

af14661

Merge remote-tracking branch 'upstream/main' into selectors-pr

8c8eaf7

jeromedockes added the no changelog needed label Apr 29, 2024

TheooJ merged commit f819ea8 into skrub-data:main Apr 29, 2024
25 of 27 checks passed

jeromedockes mentioned this pull request May 2, 2024

Add column-wise transforms & refactor TableVectorizer #902

Merged

jeromedockes mentioned this pull request May 28, 2024

[WIP] Column-wise transformations on part of a dataframe #877

Closed
