SLEP006 on Sample Properties #16

Merged
merged 16 commits into from
Jun 29, 2020
125 changes: 125 additions & 0 deletions slep006/proposal.rst
@@ -0,0 +1,125 @@
.. _slep_006:

================================
Routing sample-aligned meta-data
================================

Scikit-learn has limited support for information pertaining to each sample
(henceforth "sample properties") to be passed through an estimation pipeline.
The user can, for instance, pass fit parameters to all members of a
FeatureUnion, or to a specified member of a Pipeline using dunder (``__``)
prefixing::

    >>> from sklearn.pipeline import Pipeline
    >>> from sklearn.linear_model import LogisticRegression
    >>> pipe = Pipeline([('clf', LogisticRegression())])
    >>> pipe.fit([[1, 2], [3, 4]], [5, 6],
    ...          clf__sample_weight=[.5, .7])  # doctest: +SKIP

Several other meta-estimators, such as GridSearchCV, support forwarding these
fit parameters to their base estimator when fitting.
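
The dunder-prefix convention described above can be illustrated with a small, library-free sketch. The helper ``route_fit_params`` is hypothetical (it is not scikit-learn code); it only assumes that a key is split on its first ``__`` into a step name and a parameter name, so that nested prefixes are passed down intact.

```python
def route_fit_params(step_names, fit_params):
    """Group ``step__param`` keyword arguments by step name (illustrative only)."""
    routed = {name: {} for name in step_names}
    for key, value in fit_params.items():
        if "__" not in key:
            raise ValueError(
                f"Parameter {key!r} has no 'step__' prefix, so it cannot be routed."
            )
        # Split on the first dunder only, so 'outer__inner__x' keeps 'inner__x'.
        step, param = key.split("__", 1)
        if step not in routed:
            raise ValueError(f"Unknown step {step!r} in parameter {key!r}.")
        routed[step][param] = value
    return routed


routed = route_fit_params(
    ["scale", "clf"],
    {"clf__sample_weight": [0.5, 0.7]},
)
print(routed)  # {'scale': {}, 'clf': {'sample_weight': [0.5, 0.7]}}
```

Note that the caller must know the exact path to the consuming step, which is why adding a layer of nesting forces the keyword to be renamed.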

Desirable features we do not currently support include:

* passing sample properties (e.g. `sample_weight`) to a scorer used in
  cross-validation
* passing sample properties (e.g. `groups`) to a CV splitter in nested cross
  validation
* (maybe in scope) passing sample properties (e.g. `sample_weight`) to some
  scorers and not others in a multi-metric cross-validation setup
* (likely out of scope) passing sample properties to non-fit methods, for

Member

is this particularly "harder" to implement than the others?

Member Author

Well, firstly, the use cases for it will need further definition; we don't currently pass anything like weights to predict or transform methods. But yes, it is hard, in part because we have fused methods like fit_transform.

Member

Yeah, I guess the sample_weight use cases may be less frequent than ones that take gender or race into account. There, a predict may be some postprocessing on the output of another estimator based on these sample properties.

instance to index grouped samples that are to be treated as a single sequence
in prediction.

History
-------

This version was drafted after a discussion of the issue and potential
solutions at the February 2019 development sprint in Paris.

Supersedes `SLEP004
<https://github.com/scikit-learn/enhancement_proposals/tree/master/slep004>`_
with greater depth of desiderata and options.

TODO: more

Desiderata
----------

We will consider the following attributes to develop and compare solutions:

Usability
    Can the use cases be achieved in succinct, readable code? Can common use
    cases be achieved with a simple recipe copy-pasted from a Q&A forum?
Brittleness
    If a property is being routed through a Pipeline, does changing the
    structure of the pipeline (e.g. adding a layer of nesting) require rewriting
    other code?
Error handling
    If the user mistypes the name of a sample property, will an appropriate
    exception be raised?
Impact on meta-estimator design
    How much meta-estimator code needs to change? How hard will it be to
    maintain?
Impact on estimator design
    How much will the proposal affect estimator developers?
Backwards compatibility
    Can existing behavior be maintained?
Forwards compatibility
    Is the solution going to make users' code more
    brittle with future changes? (For example, will a user's pipeline change
    behaviour radically when sample_weight is implemented on some estimator?)
Introspection
    If sensible to do so (e.g. for improved efficiency), can a
    meta-estimator identify whether its base estimator (recursively) would
    handle some particular sample property (e.g. so a meta-estimator can choose
    between weighting and resampling)?
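
The introspection question can be made concrete with a minimal sketch. It assumes, purely for illustration, that support for a property is visible as a named parameter of ``fit``; the names ``fit_accepts``, ``choose_strategy`` and the two toy estimators are hypothetical. scikit-learn's ``has_fit_parameter`` utility performs a similar, non-recursive signature check.

```python
import inspect


def fit_accepts(estimator, prop_name):
    """Return True if ``estimator.fit`` explicitly names ``prop_name``."""
    return prop_name in inspect.signature(estimator.fit).parameters


class WeightedEstimator:
    """Toy estimator whose fit declares sample_weight."""

    def fit(self, X, y, sample_weight=None):
        return self


class UnweightedEstimator:
    """Toy estimator with no sample_weight support."""

    def fit(self, X, y):
        return self


def choose_strategy(estimator):
    # A meta-estimator could weight samples directly when supported,
    # and fall back to resampling otherwise.
    if fit_accepts(estimator, "sample_weight"):
        return "weighting"
    return "resampling"


print(choose_strategy(WeightedEstimator()))    # weighting
print(choose_strategy(UnweightedEstimator()))  # resampling
```

A signature check of this kind does not see through meta-estimators that accept arbitrary keyword arguments, which is why the desideratum asks about recursive introspection.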

Keyword arguments vs. a single argument
---------------------------------------

Currently, sample properties are provided as keyword arguments to a `fit`
method. In redeveloping sample properties, we can instead accept a single
parameter (named `props` or `sample_props` or `etc`, for example) which maps
string keys to arrays of the same length (a "DataFrame-like").

Keyword arguments::

    >>> gs.fit(X, y, groups=groups, sample_weight=sample_weight)

Single argument::

    >>> gs.fit(X, y, props={'groups': groups, 'sample_weight': sample_weight})

While drafting this document, we will assume the latter notation for clarity.
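
As a rough sketch of what the single-argument form implies for an estimator, the toy class below accepts a mapping of string keys to per-sample sequences and checks that each has one entry per sample. The parameter name ``props``, the helper ``check_props`` and the class ``ToyEstimator`` are assumptions made for illustration, not a proposed API.

```python
def check_props(n_samples, props):
    """Validate that every entry of ``props`` is aligned with the samples."""
    if props is None:
        return {}
    for key, values in props.items():
        if not isinstance(key, str):
            raise TypeError(f"Property names must be strings, got {key!r}.")
        if len(values) != n_samples:
            raise ValueError(
                f"Property {key!r} has {len(values)} entries, "
                f"expected {n_samples}."
            )
    return dict(props)


class ToyEstimator:
    """Hypothetical estimator that takes all sample properties in one dict."""

    def fit(self, X, y, props=None):
        props = check_props(len(X), props)
        # Pick out only the properties this estimator knows how to use.
        self.sample_weight_ = props.get("sample_weight")
        self.groups_ = props.get("groups")
        return self


est = ToyEstimator().fit(
    [[1, 2], [3, 4]],
    [5, 6],
    props={"groups": [0, 0], "sample_weight": [0.5, 0.7]},
)
print(est.sample_weight_)  # [0.5, 0.7]
```

Because the sample-aligned data lives in one argument, any other keyword argument to ``fit`` is free to carry non-aligned options.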

Advantages of multiple keyword arguments:

* succinct
* possible to maintain backwards compatible support for sample_weight, etc.
* we do not need to consider supporting estimators that don't expect a new
  `props` argument.

Advantages of a single argument:

* we are able to remove the constraint that all kwargs to `fit` are
  sample-aligned, so that we can add further functionality (`with_warm_start`
  has been proposed, for instance).
* we are able to redefine the handling of weights etc. without being concerned
  by backwards compatibility.

Use case setup
--------------



Solution 1
----------

Pass everything


Solution 2
----------

Discuss capability-based routing (i.e. pass `sample_weight` if supported), with a check in each meta-estimator that each property gets passed somewhere; or pass-everything.
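
One possible reading of capability-based routing is sketched below. It assumes that "supported" means the property is a named parameter of the sub-estimator's ``fit``, and that a property consumed by no sub-estimator is an error. The function ``route_by_capability`` and the toy classes are hypothetical and are not part of the proposal.

```python
import inspect


class ToyScaler:
    def fit(self, X, y=None):
        return self


class ToyClassifier:
    def fit(self, X, y, sample_weight=None):
        return self


def route_by_capability(estimators, props):
    """Send each property to every estimator whose ``fit`` names it.

    Raises ValueError if some property is accepted by no estimator,
    which catches mistyped property names.
    """
    routed = {name: {} for name in estimators}
    unused = set(props)
    for name, est in estimators.items():
        accepted = inspect.signature(est.fit).parameters
        for key, value in props.items():
            if key in accepted:
                routed[name][key] = value
                unused.discard(key)
    if unused:
        raise ValueError(f"No estimator accepts: {sorted(unused)}")
    return routed


steps = {"scale": ToyScaler(), "clf": ToyClassifier()}
routed = route_by_capability(steps, {"sample_weight": [0.5, 0.7]})
print(routed)  # {'scale': {}, 'clf': {'sample_weight': [0.5, 0.7]}}
```

Under this reading, a misspelt key such as ``sample_wieght`` raises an error instead of being silently dropped. The trade-off is that routing changes whenever an estimator gains support for a property, which is the forwards-compatibility concern raised in the desiderata.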