SLEP006 on Sample Properties #16

Merged Jun 29, 2020 · 16 commits
slep006/proposal.rst (386 additions, 0 deletions)

.. _slep_006:

================================
Routing sample-aligned meta-data
================================

Scikit-learn has limited support for passing information pertaining to each
sample (henceforth "sample properties") through an estimation pipeline.
The user can, for instance, pass fit parameters to all members of a
FeatureUnion, or to a specified member of a Pipeline using dunder (``__``)
prefixing::

    >>> from sklearn.pipeline import Pipeline
    >>> from sklearn.linear_model import LogisticRegression
    >>> pipe = Pipeline([('clf', LogisticRegression())])
    >>> pipe.fit([[1, 2], [3, 4]], [5, 6],
    ...          clf__sample_weight=[.5, .7])  # doctest: +SKIP

Several other meta-estimators, such as GridSearchCV, support forwarding these
fit parameters to their base estimator when fitting.
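
For example, `GridSearchCV` forwards fit parameters it receives to its base
estimator's `fit` on each training split; a brief illustration of the current
behaviour (``X``, ``y`` and ``my_weights`` are assumed to be defined)::

    >>> from sklearn.model_selection import GridSearchCV
    >>> gs = GridSearchCV(LogisticRegression(), {'C': [1, 10]})
    >>> gs.fit(X, y, sample_weight=my_weights)  # doctest: +SKIP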

Desirable features we do not currently support include:

* passing sample properties (e.g. `sample_weight`) to a scorer used in
  cross-validation
* passing sample properties (e.g. `groups`) to a CV splitter in nested cross
  validation
* (maybe in scope) passing sample properties (e.g. `sample_weight`) to some
  scorers and not others in a multi-metric cross-validation setup
* (likely out of scope) passing sample properties to non-fit methods, for
  instance to index grouped samples that are to be treated as a single
  sequence in prediction.

[Member, on the last point] is this particularly "harder" to implement than
the others?

[Member Author] Well, firstly, the use cases for it will need further
definition; we don't currently pass around anything like weights to predict
or transform methods. But yes, it is hard in part because we have fused
methods like fit_transform.

[Member] Yeah, I guess the sample_weight use cases may be less frequent than
things which take gender or race into account. There, a predict may be some
postprocessing on the output of another estimator based on these sample
properties.

Definitions
-----------

consumer
    An estimator, scorer, splitter, etc., that receives and can make use of
    one or more passed props.
key
    A label passed along with sample prop data to indicate how it should be
    interpreted (e.g. "weight").
router
    An estimator or function that passes props on to some other router or
    consumer, while either consuming props itself or delivering props to
    more than one immediate destination.

History
-------

This version was drafted after a discussion of the issue and potential
solutions at the February 2019 development sprint in Paris.

Supersedes `SLEP004
<https://github.com/scikit-learn/enhancement_proposals/tree/master/slep004>`_
with greater depth of desiderata and options.

Related issues and pull requests include:

- `Consistent API for attaching properties to samples
  #4497 <https://github.com/scikit-learn/scikit-learn/issues/4497>`__
- `Acceptance of sample\_weights in pipeline.score
  #7723 <https://github.com/scikit-learn/scikit-learn/pull/7723>`__
- `Establish global error state like np.seterr
  #4660 <https://github.com/scikit-learn/scikit-learn/issues/4660>`__
- `Should cross-validation scoring take sample-weights into account?
  #4632 <https://github.com/scikit-learn/scikit-learn/issues/4632>`__
- `Sample properties
  #4696 <https://github.com/scikit-learn/scikit-learn/issues/4696>`__

TODO: more. ndawe's work.

Desiderata
----------

We will consider the following attributes to develop and compare solutions:

Usability
    Can the use cases be achieved in succinct, readable code? Can common use
    cases be achieved with a simple recipe copy-pasted from a QA forum?
Brittleness
    If a property is being routed through a Pipeline, does changing the
    structure of the pipeline (e.g. adding a layer of nesting) require
    rewriting other code?
Error handling
    If the user mistypes the name of a sample property, will an appropriate
    exception be raised?
Impact on meta-estimator design
    How much meta-estimator code needs to change? How hard will it be to
    maintain?
Impact on estimator design
    How much will the proposal affect estimator developers?
Backwards compatibility
    Can existing behavior be maintained?
Forwards compatibility
    Is the solution going to make users' code more brittle with future
    changes? (For example, will a user's pipeline change behaviour radically
    when sample_weight is implemented on some estimator?)
Introspection
    If sensible to do so (e.g. for improved efficiency), can a
    meta-estimator identify whether its base estimator (recursively) would
    handle some particular sample property (e.g. so a meta-estimator can
    choose between weighting and resampling)?

Keyword arguments vs. a single argument
---------------------------------------

Currently, sample properties are provided as keyword arguments to a `fit`
method. In redeveloping sample properties, we can instead accept a single
parameter (named `props` or `sample_props`, for example) which maps string
keys to arrays of the same length (a "DataFrame-like").

Keyword arguments::

    >>> gs.fit(X, y, groups=groups, sample_weight=sample_weight)

Single argument::

    >>> gs.fit(X, y, props={'groups': groups, 'weight': weight})

While drafting this document, we will assume the latter notation for clarity.

Advantages of multiple keyword arguments:

* succinct
* possible to maintain backwards-compatible support for `sample_weight`, etc.
* we do not need to handle cases for whether or not some estimator expects a
  `props` argument.

Advantages of a single argument:

* we are able to consider kwargs to `fit` that are not sample-aligned, so that
  we can add further functionality (some that have been proposed:
  `with_warm_start`, `feature_names_in`, `feature_meta`).
* we are able to redefine the default routing of weights etc. without being
  concerned by backwards compatibility.
* we can consider the use of keys that are not limited to strings or valid
  identifiers (and hence are not limited to using ``_`` as a delimiter).

[Member, on the first point] seems like feature_names_in_ would be through
NamedArray?

Test case setup
---------------

Case A
~~~~~~

Cross-validate a ``LogisticRegressionCV(cv=GroupKFold(), scoring='accuracy')``
with weighted scoring and weighted fitting.

Case B
~~~~~~

Cross-validate a ``LogisticRegressionCV(cv=GroupKFold(), scoring='accuracy')``
with weighted scoring and unweighted fitting.

Case C
~~~~~~

Extend Case A to apply an unweighted univariate feature selector in a
``Pipeline``.

Case D
~~~~~~

Extend Case A to apply an arbitrary SLEP005 resampler in a pipeline, which
rewrites ``sample_weight`` and ``groups``.

Case E
~~~~~~

Different weights for scoring and for fitting in Case A.

Solution 1: Pass everything
---------------------------

This proposal passes all props to all consumers (estimators, splitters,
scorers, etc). The consumer would optionally use props it is familiar with by
name and disregard other props.

We may consider providing syntax for the user to control the interpretation of
incoming props:

* to require that some prop is provided (for an estimator where that prop is
  otherwise optional)
* to disregard some provided prop
* to treat a particular prop key as having a certain meaning (e.g. locally
  interpreting 'scoring_sample_weight' as 'sample_weight').

These constraints would be checked by calling a helper at the consumer.
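
A minimal sketch of what such a helper might look like; the name, signature
and behaviour here are purely illustrative::

    def check_props(props, required=(), ignored=(), aliases=None):
        """Validate incoming props for one consumer and apply local renames."""
        aliases = aliases or {}
        out = {}
        for key, value in props.items():
            if key in ignored:
                continue  # explicitly disregard this provided prop
            out[aliases.get(key, key)] = value  # apply local reinterpretation
        missing = set(required) - set(out)
        if missing:
            raise ValueError("required props not provided: %r" % sorted(missing))
        return out

    # e.g. require weights, locally treating 'scoring_sample_weight'
    # as 'sample_weight':
    fit_props = check_props(props, required=['sample_weight'],
                            aliases={'scoring_sample_weight': 'sample_weight'})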

Issues (TODO: clean)

* Error handling: if a prop is optional in a consumer, no error will be
  raised for misspelling. An introspection API might change this, allowing a
  meta-estimator to check if all props are to be used.
* Forwards compatibility: newly supporting a prop in a consumer will change
  behaviour. Other than a ChangedBehaviorWarning, I don't see any way around
  this.
* Introspection: not inherently supported. Would need an API like
  ``get_prop_support(names: List[str]) -> Dict[str, Literal["supported", "required", "ignored"]]``.
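
For instance, a meta-estimator might probe its base estimator as follows
(hypothetical)::

    >>> est.get_prop_support(['sample_weight', 'groups'])  # doctest: +SKIP
    {'sample_weight': 'supported', 'groups': 'ignored'}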

Solution 2: Specify routes at call
----------------------------------

Similar to the legacy behavior of fit parameters in
:class:`sklearn.pipeline.Pipeline`, this requires the user to specify the
path for each "prop" to follow when calling `fit`. For example, to pass
a prop named 'weights' to a step named 'spam' in a Pipeline, you might use
`my_pipe.fit(X, y, props={'spam__weights': my_weights})`.

SLEP004's syntax to override the common routing scheme falls under this
solution.
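
For illustration, with the single-argument syntax above, each route names its
destination explicitly, so restructuring the pipeline means rewriting every
affected route (names hypothetical)::

    pipe = Pipeline([('spam', LogisticRegression())])
    pipe.fit(X, y, props={'spam__weights': my_weights})

    # one additional layer of nesting invalidates the old route:
    outer = Pipeline([('inner', pipe)])
    outer.fit(X, y, props={'inner__spam__weights': my_weights})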

Issues:

* This gets tricky or impossible where the available routes change
  mid-fit, such as where a grid search considers estimators with different
  structures.
* This may not work if a meta-estimator were to have the role of changing a
  prop, e.g. a meta-estimator that passes `sample_weights` corresponding to
  balanced classes onto its base estimator. The meta-estimator would need a
  list of destinations to pass props to.

Solution 3: Specify routes on metaestimators
--------------------------------------------

Each meta-estimator is given a routing specification which it must follow in
passing only the required parameters to each of its children. In this context,
a GridSearchCV has children including `estimator`, `cv` and (each element of)
`scoring`.

Pull request `#9566 <https://github.com/scikit-learn/scikit-learn/pull/9566>`__
is a partial implementation of this approach.
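
One possible shape, sketched here with a hypothetical ``prop_routing``
constructor parameter (the actual syntax is an open design question, as noted
below)::

    gs = GridSearchCV(
        estimator=LogisticRegression(),
        param_grid={'C': [1, 10]},
        cv=GroupKFold(),
        scoring='accuracy',
        prop_routing={
            'estimator': ['sample_weight'],  # routed to estimator.fit
            'cv': ['groups'],                # routed to the splitter
            'scoring': ['sample_weight'],    # routed to each scorer
        },
    )
    gs.fit(X, y, props={'sample_weight': my_weights, 'groups': my_groups})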

Disadvantages:

* Routing may be hard to get one's head around, especially since the prop
  support belongs to the child estimator but the parent is responsible for
  the routing.
* Need to design an API for specifying routings.

Solution 4: Each child requests
-------------------------------

Here the meta-estimator provides only what each of its children requests.
The meta-estimator would also need to request, on behalf of its children,
any prop that its descendants require.

Each object in a situation that could receive props would have a method like
`_get_prop_requests()` which would return a list of prop names (or perhaps a
mapping for more sophisticated use-cases). Group* CV splitters would default
to returning `['groups']`, for example. Estimators supporting weighted
fitting may return `[]` by default, but may have a parameter `request_props`
which may be set to `['weight']` if weight is sought, or perhaps just a
boolean parameter `request_weight`. `make_scorer` would have a similar
mechanism for enabling weighted scoring.
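
A rough sketch of the request side, using the names above (all details are
illustrative only)::

    class GroupKFold:
        def _get_prop_requests(self):
            # Group* splitters would request 'groups' by default
            return ['groups']

    class MyEstimator:
        def __init__(self, request_props=()):
            self.request_props = list(request_props)

        def _get_prop_requests(self):
            # [] by default; ['weight'] when weighted fitting is sought
            return list(self.request_props)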

Benefits:

* This will not need to affect legacy estimators, since no props will be
  passed.
* This does not require defining a new syntax for routing.
* The implementation changes in meta-estimators may be easy to provide via a
  helper or two (perhaps even `call_with_props(method, target, props)`; see
  the sketch below).
* Easy to reconfigure what props an estimator gets in a grid search.
* Could make use of the existing ``**fit_params`` syntax rather than
  introducing a new `props` argument to `fit`.

[Member, on the first benefit] we can also think about the cases where
sample_weights is not passed now, but that's because we've forgotten to pass
it along (not sure how many of those we have left, though).

[Member, on the last benefit] this is kinda orthogonal to this solution,
isn't it? I still like the idea of passing things that are not sample
aligned.

[Member Author] Yes, it's orthogonal, but the benefit of reusing
``**fit_params`` syntax is that legacy estimators inheriting from
`BaseEstimator` will automatically work; no need to require that they accept
props. Also, because Solution 4 requires estimators to explicitly declare
that a fit parameter is a sample prop, it's possible (at least in theory)
for other fit parameters to not be sample-aligned, and to be passed by some
other mechanism...
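
A minimal sketch of such a helper, assuming the hypothetical
`_get_prop_requests` method sketched above::

    def call_with_props(method, target, X, y, props):
        """Call ``target.<method>(X, y, ...)``, passing only requested props."""
        requested = set(target._get_prop_requests())
        kwargs = {name: value for name, value in props.items()
                  if name in requested}
        return getattr(target, method)(X, y, **kwargs)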

Disadvantages:

* This will require modifying every estimator that may want props, as well as
  all meta-estimators. We could provide a mixin or similar to add
  prop-request support to a legacy estimator; or `BaseEstimator` could have a
  `set_props_request` method (instead of the `request_props` constructor
  parameter approach) such that all legacy base estimators are
  automatically equipped.
* If the parameter request does not correspond to a `get_params` parameter,
  `clone` will have to ensure that this request is cloned as well as
  parameters.

Possible public syntax:

* `BaseEstimator` will have methods `set_props_request` and
  `get_props_request`.
* `make_scorer` will have a `request_props` parameter to set props required
  by the scorer.
* `get_props_request` will return a dict. It maps the prop name that the user
  passes to the prop name that the estimator expects.
* `set_props_request` will accept either such a dict or a sequence `s` to be
  interpreted as the identity mapping for all elements in `s`
  (`{x: x for x in s}`). It will return `self` to enable chaining.
* `Group*` CV splitters will by default request the 'groups' prop, but their
  mapping can be changed with their `set_props_request` method.

(TODO: this might look more persuasive using ``**fit_params`` rather than a
prop param.)
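
To illustrate the mapping semantics (hypothetical)::

    >>> est = LogisticRegressionCV().set_props_request(['sample_weight'])
    >>> est.get_props_request()  # identity mapping  # doctest: +SKIP
    {'sample_weight': 'sample_weight'}
    >>> est = est.set_props_request({'fitting_weight': 'sample_weight'})
    >>> est.get_props_request()  # doctest: +SKIP
    {'fitting_weight': 'sample_weight'}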

Test cases::

    from sklearn.model_selection import GroupKFold, cross_validate
    from sklearn.feature_selection import SelectKBest
    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.metrics import accuracy_score, make_scorer

    # Case A: weighted scoring and weighted fitting
    weighted_acc = make_scorer(accuracy_score, request_props=['sample_weight'])
    lr = LogisticRegressionCV(
        cv=GroupKFold(),
        scoring=weighted_acc,
    ).set_props_request(['sample_weight'])
    cross_validate(lr, X, y,
                   props={'sample_weight': my_weights, 'groups': my_groups},
                   scoring=weighted_acc)

    # Case B variant 1: weighted scoring and unweighted fitting
    weighted_acc = make_scorer(accuracy_score, request_props=['sample_weight'])
    lr = LogisticRegressionCV(
        cv=GroupKFold(),
        scoring=weighted_acc,
    )
    cross_validate(lr, X, y,
                   props={'sample_weight': my_weights, 'groups': my_groups},
                   scoring=weighted_acc)

    # Case B variant 2: distinct keys for scoring and fitting weights
    weighted_acc = make_scorer(accuracy_score, request_props=['scoring_weight'])
    lr = LogisticRegressionCV(
        cv=GroupKFold(),
        scoring=weighted_acc,
    ).set_props_request(['fitting_weight'])
    cross_validate(lr, X, y,
                   props={
                       'scoring_weight': my_weights,
                       'fitting_weight': None,
                       'groups': my_groups,
                   },
                   scoring=weighted_acc)

    # Case C: unweighted univariate feature selection in a Pipeline
    weighted_acc = make_scorer(accuracy_score, request_props=['sample_weight'])
    lr = LogisticRegressionCV(
        cv=GroupKFold(),
        scoring=weighted_acc,
    ).set_props_request(['sample_weight'])
    sel = SelectKBest()
    pipe = make_pipeline(sel, lr)
    cross_validate(pipe, X, y,
                   props={'sample_weight': my_weights, 'groups': my_groups},
                   scoring=weighted_acc)

    # Case D: a SLEP005 resampler that rewrites sample_weight and groups
    weighted_acc = make_scorer(accuracy_score, request_props=['sample_weight'])
    lr = LogisticRegressionCV(
        cv=GroupKFold(),
        scoring=weighted_acc,
    ).set_props_request(['sample_weight'])
    pipe = ResamplingTrainer(Resampler(), lr)
    cross_validate(pipe, X, y,
                   props={'sample_weight': my_weights, 'groups': my_groups},
                   scoring=weighted_acc)

[Member, on ``request_props``] I still don't think having the same argument
in the ``__init__`` of all estimators would be a bad idea. We can also
probably call it sample_props instead. It'd probably be more intuitive for
the users, or even required_sample_props.


Proposal
--------

Having considered the above solutions, we propose:


Backward compatibility
----------------------

Discussion
----------

References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
domain (see this SLEP as an example) or licensed under the `Open
Publication License`_.
.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
---------

This document has been placed in the public domain. [1]_