SLEP006 on Sample Properties #16
Merged
Commits
2763686 Starting to draft SLEP006 on Sample Properties (jnothman)
9c95886 iter (jnothman)
00b3067 WIP (jnothman)
d4c3374 WIP (jnothman)
373aa23 a fourth solution and a little more fleshing... still no code examples. (jnothman)
9060a30 Code examples using Solution 4 (jnothman)
d62a3d0 A couple of cross-references (jnothman)
99213a4 WIP (jnothman)
d4c7a47 Merge branch 'master' into props06 (jnothman)
c1d6d1e Filling out example code (jnothman)
f5f7f03 Note handling of misspelled keys (jnothman)
dfe7d66 Note the status quo hacks (jnothman)
8bbcb72 new code (jnothman)
4dd69fd Small additions including section on nomenclature (jnothman)
fa06f4a Some more thoughts on backwards compatibility (jnothman)
ac09c64 Note on potential for mixed keys (jnothman)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,125 @@ | ||
.. _slep_006: | ||
|
||
================================ | ||
Routing sample-aligned meta-data | ||
================================ | ||
|
||
Scikit-learn has limited support for information pertaining to each sample | ||
(henceforth "sample properties") to be passed through an estimation pipeline. | ||
The user can, for instance, pass fit parameters to all members of a | ||
FeatureUnion, or to a specified member of a Pipeline using dunder (``__``) | ||
prefixing:: | ||
|
||
>>> from sklearn.pipeline import Pipeline | ||
>>> from sklearn.linear_model import LogisticRegression | ||
>>> pipe = Pipeline([('clf', LogisticRegression())]) | ||
>>> pipe.fit([[1, 2], [3, 4]], [5, 6], | ||
... clf__sample_weight=[.5, .7]) # doctest: +SKIP | ||
|
||
Several other meta-estimators, such as GridSearchCV, support forwarding these | ||
fit parameters to their base estimator when fitting. | ||
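To make the dunder convention concrete, here is a minimal pure-Python sketch of how a meta-estimator might group ``step__param`` keys by step name. The helper ``split_fit_params`` is hypothetical and written only for illustration; it is not scikit-learn's implementation.

```python
from collections import defaultdict


def split_fit_params(step_names, fit_params):
    """Group ``step__param`` keyword arguments by pipeline step name.

    Hypothetical helper for illustration only; not scikit-learn's code.
    """
    routed = defaultdict(dict)
    for key, value in fit_params.items():
        if "__" not in key:
            raise ValueError(f"Expected a 'step__param' key, got {key!r}")
        # Split on the first dunder only, so a nested key such as
        # 'union__pca__sample_weight' is forwarded one level at a time.
        step, param = key.split("__", 1)
        if step not in step_names:
            raise ValueError(f"Unknown step {step!r} in key {key!r}")
        routed[step][param] = value
    return dict(routed)


routed = split_fit_params(["scale", "clf"], {"clf__sample_weight": [0.5, 0.7]})
print(routed)  # {'clf': {'sample_weight': [0.5, 0.7]}}
```

Note that the caller has to know the step name (``clf``) in advance, which is the kind of coupling the "Brittleness" desideratum is concerned with.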
|
||
Desirable features we do not currently support include: | ||
|
||
* passing sample properties (e.g. `sample_weight`) to a scorer used in | ||
cross-validation | ||
* passing sample properties (e.g. `groups`) to a CV splitter in nested | ||
cross-validation | ||
* (maybe in scope) passing sample properties (e.g. `sample_weight`) to some | ||
scorers and not others in a multi-metric cross-validation setup | ||
* (likely out of scope) passing sample properties to non-fit methods, for | ||
instance to index grouped samples that are to be treated as a single sequence | ||
in prediction. | ||
|
||
History | ||
------- | ||
|
||
This version was drafted after a discussion of the issue and potential | ||
solutions at the February 2019 development sprint in Paris. | ||
|
||
Supersedes `SLEP004 | ||
<https://github.com/scikit-learn/enhancement_proposals/tree/master/slep004>`_ | ||
with greater depth of desiderata and options. | ||
|
||
TODO: more | ||
|
||
Desiderata | ||
---------- | ||
|
||
We will consider the following attributes to develop and compare solutions: | ||
|
||
Usability | ||
Can the use cases be achieved in succinct, readable code? Can common use | ||
cases be achieved with a simple recipe copy-pasted from a Q&A forum? | ||
Brittleness | ||
If a property is being routed through a Pipeline, does changing the | ||
structure of the pipeline (e.g. adding a layer of nesting) require rewriting | ||
other code? | ||
Error handling | ||
If the user mistypes the name of a sample property, will an appropriate | ||
exception be raised? | ||
Impact on meta-estimator design | ||
How much meta-estimator code needs to change? How hard will it be to | ||
maintain? | ||
Impact on estimator design | ||
How much will the proposal affect estimator developers? | ||
Backwards compatibility | ||
Can existing behavior be maintained? | ||
Forwards compatibility | ||
Is the solution going to make users' code more | ||
brittle with future changes? (For example, will a user's pipeline change | ||
behaviour radically when sample_weight is implemented on some estimator?) | ||
Introspection | ||
If sensible to do so (e.g. for improved efficiency), can a | ||
meta-estimator identify whether its base estimator (recursively) would | ||
handle some particular sample property (e.g. so a meta-estimator can choose | ||
between weighting and resampling)? | ||
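As one possible reading of the "Error handling" desideratum, the sketch below rejects property names that no consumer has asked for and suggests a near match. The function ``check_prop_keys`` and the list of known keys are assumptions made for this example, not part of any existing API.

```python
import difflib


def check_prop_keys(props, known_keys):
    """Raise if ``props`` contains a key that no consumer has requested.

    Hypothetical helper; ``known_keys`` stands in for whatever set of
    property names the routing mechanism knows about.
    """
    for key in props:
        if key in known_keys:
            continue
        # Offer the closest known name, if any, to help spot typos.
        hint = difflib.get_close_matches(key, known_keys, n=1)
        suggestion = f" Did you mean {hint[0]!r}?" if hint else ""
        raise ValueError(f"Unexpected sample property {key!r}.{suggestion}")


known = ["sample_weight", "groups"]
check_prop_keys({"groups": [0, 0, 1]}, known)  # passes silently

try:
    check_prop_keys({"sample_wieght": [0.5, 0.7, 1.0]}, known)
except ValueError as exc:
    message = str(exc)
    print(message)
```

Whether such a check is feasible depends on the routing design: it requires that the set of valid keys can be known at the point where the user passes them.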
|
||
Keyword arguments vs. a single argument | ||
--------------------------------------- | ||
|
||
Currently, sample properties are provided as keyword arguments to a `fit` | ||
method. In redeveloping sample properties, we can instead accept a single | ||
parameter (named `props` or `sample_props` or `etc`, for example) which maps | ||
string keys to arrays of the same length (a "DataFrame-like"). | ||
|
||
Keyword arguments:: | ||
|
||
>>> gs.fit(X, y, groups=groups, sample_weight=sample_weight) | ||
|
||
Single argument:: | ||
|
||
>>> gs.fit(X, y, props={'groups': groups, 'sample_weight': sample_weight}) | ||
|
||
While drafting this document, we will assume the latter notation for clarity. | ||
|
||
Advantages of multiple keyword arguments: | ||
|
||
* succinct | ||
* possible to maintain backwards compatible support for sample_weight, etc. | ||
* we do not need to consider supporting estimators that don't expect a new | ||
`props` argument. | ||
|
||
Advantages of a single argument: | ||
|
||
* we are able to remove the constraint that all kwargs to `fit` are | ||
sample-aligned, so that we can add further functionality (`with_warm_start` | ||
has been proposed, for instance). | ||
* we are able to redefine the handling of weights etc. without being concerned | ||
by backwards compatibility. | ||
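The first advantage of the single-argument form can be sketched as follows: only the entries of ``props`` are checked for sample alignment, leaving other keyword arguments free for non-aligned options. ``ToyEstimator`` and ``validate_props`` are hypothetical, and ``with_warm_start`` is used here only because it is the example named above.

```python
def validate_props(n_samples, props=None):
    """Check that every entry of ``props`` has one value per sample."""
    props = {} if props is None else dict(props)
    for key, values in props.items():
        if len(values) != n_samples:
            raise ValueError(
                f"Property {key!r} has {len(values)} entries, "
                f"expected {n_samples}"
            )
    return props


class ToyEstimator:
    """Hypothetical estimator used only to illustrate the signature."""

    def fit(self, X, y, props=None, with_warm_start=False):
        # Only ``props`` must be sample-aligned; ``with_warm_start`` is a
        # plain flag and is not validated against the number of samples.
        self.props_ = validate_props(len(X), props)
        self.with_warm_start_ = with_warm_start
        return self


X, y = [[1, 2], [3, 4]], [5, 6]
est = ToyEstimator().fit(
    X, y, props={"sample_weight": [0.5, 0.7]}, with_warm_start=True
)
print(sorted(est.props_))  # ['sample_weight']
```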
|
||
Use case setup | ||
-------------- | ||
|
||
|
||
|
||
Solution 1 | ||
---------- | ||
|
||
Pass everything | ||
|
||
|
||
Solution 2 | ||
---------- | ||
|
||
Discuss capability-based routing (i.e. pass sample_weight only where it is supported), with a check in each meta-estimator that each property gets passed somewhere; or pass-everything |
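A rough sketch of what capability-based routing could look like, assuming that "supported" means "declared as a named parameter of ``fit``". The helpers ``accepts`` and ``route_props`` and the two toy classes are hypothetical; a signature check like this cannot see parameters hidden behind ``**kwargs``.

```python
import inspect


def accepts(estimator, name):
    """Return True if ``estimator.fit`` declares a parameter called ``name``."""
    return name in inspect.signature(estimator.fit).parameters


def route_props(estimators, props):
    """Give each estimator the props its ``fit`` accepts.

    Raises if a property would not be consumed by any estimator.
    """
    routed = [
        {k: v for k, v in props.items() if accepts(est, k)}
        for est in estimators
    ]
    # Check that every property is passed somewhere.
    used = set().union(*routed)
    unused = sorted(set(props) - used)
    if unused:
        raise ValueError(f"Properties not consumed by any estimator: {unused}")
    return routed


class Weighted:
    def fit(self, X, y, sample_weight=None):
        return self


class Unweighted:
    def fit(self, X, y):
        return self


routed = route_props([Weighted(), Unweighted()], {"sample_weight": [0.5, 0.7]})
print(routed)  # [{'sample_weight': [0.5, 0.7]}, {}]
```

Under this scheme a pipeline's behaviour would change silently if ``Unweighted.fit`` later gained a ``sample_weight`` parameter, which is the forwards-compatibility concern listed in the desiderata.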
is this particularly "harder" to implement than the others?
Well, firstly the use cases for it will need further definition; we don't currently pass around anything like weights to predict or transform methods. But yes, it is hard, in part because we have fused methods like fit_transform.
yeah, I guess the `sample_weight` use cases may be less frequent than things which take `gender` or `race` into account. There, a `predict` may be some postprocessing on the output of another estimator based on these sample properties.