Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH add an option to centered ICE and PD #18310

Merged
merged 35 commits into from
Apr 7, 2022
Merged

Conversation

JoElfner
Copy link
Contributor

@JoElfner JoElfner commented Aug 31, 2020

This adds the centered argument to plot_partial_dependence and to PartialDependenceDisplay.plot to allow for plotting cICE-plots, centered PDP plots as well es centered plots for kind='both'

Reference Issue

Fixes #18195

What does this implement/fix? Explain your changes.

By setting centered=True, cICE or centered PDP are plotted, see:

import matplotlib.pyplot as plt
from sklearn.inspection import partial_dependence
from sklearn.inspection import plot_partial_dependence
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer
from sklearn.neural_network import MLPRegressor

import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

cal_housing = fetch_california_housing()
X = pd.DataFrame(cal_housing.data, columns=cal_housing.feature_names)
y = cal_housing.target

y -= y.mean()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

print("Training MLPRegressor...")
est = make_pipeline(QuantileTransformer(),
                    MLPRegressor(hidden_layer_sizes=(50, 50),
                                 learning_rate_init=0.01,
                                 early_stopping=True))
est.fit(X_train, y_train)

print('Computing partial dependence plots...')
features = ['MedInc', 'AveOccup', 'HouseAge', 'AveRooms']
display = plot_partial_dependence(
       est, X_train, features, kind="both", subsample=50,
       n_jobs=3, grid_resolution=20
)
display.figure_.suptitle(
    'Partial dependence of house value on non-location features\n'
    'for the California housing dataset, with MLPRegressor'
)
display.figure_.subplots_adjust(hspace=0.3)

comments:

flake8 and pytest pass on my machine (somehow the centered arg in parametrize got lost, even though it was already there before the first commit...)

documentation will follow once this is reviewed

@glemaitre glemaitre self-requested a review September 2, 2020 09:22
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to document the parameter in the user guide as well.
We need a test in the test_plot_partial_dependence.py to check that when centered=True all lines start y=0.

sklearn/inspection/_plot/partial_dependence.py Outdated Show resolved Hide resolved
sklearn/inspection/_plot/partial_dependence.py Outdated Show resolved Hide resolved
sklearn/inspection/_plot/partial_dependence.py Outdated Show resolved Hide resolved
sklearn/inspection/_plot/partial_dependence.py Outdated Show resolved Hide resolved
sklearn/inspection/_plot/partial_dependence.py Outdated Show resolved Hide resolved
sklearn/inspection/_plot/partial_dependence.py Outdated Show resolved Hide resolved
@glemaitre
Copy link
Member

Also, please add an entry to the change log at doc/whats_new/v*.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

glemaitre review

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@JoElfner
Copy link
Contributor Author

JoElfner commented Sep 2, 2020

Thanks for the review. I'll update the docs and changelog in the next few days.

edit: Ok, done. The whatsnew commit now shows all lines added since I forked the repo. Is that correct?

Copy link
Contributor Author

@JoElfner JoElfner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the if ln is not None clause: This results from faulty indexing in the current plot_partial_dependency implementation. I'd suggest that I open a separate Issue/PR to deal with this, since it is not exactly connected to this PR.
See my reviews for a proposed solution.

sklearn/inspection/_plot/partial_dependence.py Outdated Show resolved Hide resolved
sklearn/inspection/_plot/partial_dependence.py Outdated Show resolved Hide resolved
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need to have documentation in the user guide and see if it is meaningful to change an example

doc/whats_new/v0.24.rst Outdated Show resolved Hide resolved
sklearn/inspection/_plot/partial_dependence.py Outdated Show resolved Hide resolved
@glemaitre
Copy link
Member

As for the if ln is not None clause: This results from faulty indexing in the current plot_partial_dependency implementation. I'd suggest that I open a separate Issue/PR to deal with this, since it is not exactly connected to this PR.
See my reviews for a proposed solution.

Yes we need to solve this issue first and open another PR.

@JoElfner
Copy link
Contributor Author

JoElfner commented Sep 4, 2020

We still need to have documentation in the user guide and see if it is meaningful to change an example

done.

As for the if ln is not None clause: This results from faulty indexing in the current plot_partial_dependency implementation. I'd suggest that I open a separate Issue/PR to deal with this, since it is not exactly connected to this PR.
See my reviews for a proposed solution.

Yes we need to solve this issue first and open another PR.

Ok, going to open an Issue once this is merged.

@glemaitre
Copy link
Member

Ok, going to open an Issue once this is merged.

I would solve the bug first.

@JoElfner
Copy link
Contributor Author

JoElfner commented Sep 8, 2020

Ok, going to open an Issue once this is merged.

I would solve the bug first.

I disagree, since it is imho not productive/effective to postpone a feature introduction due to a bug, which has been there before and which is not directly connected to the new feature. But since you've got a much better overview of all processes in this project, I assume that you are right. :)
So I'll wait until the bug has been fixed.

@glemaitre
Copy link
Member

I made a PR there: #18359

@ogrisel
Copy link
Member

ogrisel commented Sep 9, 2020

Using the first sample as reference to center everything seems very arbitrary. I have really no idea how to make sense of the resulting plot.

  • Why not shift by removing the mean value of each ICE line instead?

  • Do you have any reference in the literature that discusses the centering with the value of the first point?

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@JoElfner
Copy link
Contributor Author

JoElfner commented Sep 9, 2020

Hi @ogrisel,
thanks for your review. Removing the mean has several downsides, f.i. outlier sensitivity and having an undefined overlapping location (if any). Thus it is hard to compare each sample's trend.
Another approach is to use a specific quantile of the data as centering anchor, as mentioned in the issue considering this PR. You can also find literature in this post. It may not be "citeable" literature, but at least the online-book provdies references.

@ogrisel
Copy link
Member

ogrisel commented Sep 9, 2020

Ok so this centering method is indeed introduced in Goldstein, A., Kapelner, A., Bleich, J., and Pitkin, E., Peeking Inside the Black Box: VisualizingStatistical Learning With Plots of Individual Conditional Expectation. (2014) Journal of Computa-tional and Graphical Statistics https://arxiv.org/abs/1309.6392 which meets the scikit-learn inclusion criteria (with more than 200 citations).

So +1 for leaving this at it is.

@ogrisel
Copy link
Member

ogrisel commented Sep 9, 2020

The above reference is already cited in the user guide. Could you please add another ref to in the new paragraph where you introduce centered ICE?

@glemaitre glemaitre added this to the 1.1 milestone Aug 31, 2021
ogrisel added a commit to scikit-learn-inria-fondation/follow-up that referenced this pull request Sep 7, 2021
## August 31th, 2021

### Gael

* TODO: Jeremy's renewal, Chiara's replacement, Mathis's consulting gig

### Olivier

- input feature names: main PR [#18010](scikit-learn/scikit-learn#18010) that links into sub PRs
  - remaining (need review): [#20853](scikit-learn/scikit-learn#20853) (found a bug in `OvOClassifier.n_features_in_`)
- reviewing `get_feature_names_out`: [#18444](scikit-learn/scikit-learn#18444)
- next: give feedback to Chiara on ARM wheel building [#20711](scikit-learn/scikit-learn#20711) (needed for the release)
- next: assist Adrin for the release process
- next: investigate regression in loky that blocks the cloudpickle release [#432](cloudpipe/cloudpickle#432)
- next: come back to intel to write a technical roadmap for a possible collaboration

### Julien

 - Was on holidays
 - Planned week @ Nexedi, Lille, from September 13th to 17th
 - Reviewed PRs
     - [`#20567`](scikit-learn/scikit-learn#20567) Common Private Loss module
     - [`#18310`](scikit-learn/scikit-learn#18310) ENH Add option to centered ICE plots (cICE)
     - Others PRs prior to holidays
 - [`#20254`](scikit-learn/scikit-learn#20254)
     - Adapted benchmarks on `pdist_aggregation` to test #20254 against sklearnex
     - Adapting PR for `fast_euclidean` and `fast_sqeuclidean` on user-facing APIs
     - Next: comparing against scipy's 
     - Next: Having feedback on [#20254](scikit-learn/scikit-learn#20254) would also help
- Next: I need to block time to study Cython code.

### Mathis
- `sklearn_benchmarks`
  - Adapting benchmark script to run on Margaret
  - Fix issue with profiling files too big to be deployed on Github Pages
  - Ensure deterministic benchmark results
  - Working on declarative pipeline specification
  - Next: run long HPO benchmarks on Margaret

### Arturo

- Finished MOOC!
- Finished filling [Loïc's notes](https://notes.inria.fr/rgSzYtubR6uSOQIfY9Fpvw#) to find questions with score under 60% (Issue [#432](INRIA/scikit-learn-mooc#432))
    - started addressing easy-to-fix questions, resulting in gitlab MRs [#21](https://gitlab.inria.fr/learninglab/mooc-scikit-learn/mooc-scikit-learn-coordination/-/merge_requests/21) and [#22](https://gitlab.inria.fr/learninglab/mooc-scikit-learn/mooc-scikit-learn-coordination/-/merge_requests/22)
    - currently working on expanding the notes up to 70%
- Continued cross-linking forum posts with issues in GitHub, resulting in [#444](INRIA/scikit-learn-mooc#444), [#445](INRIA/scikit-learn-mooc#445), [#446](INRIA/scikit-learn-mooc#446), [#447](INRIA/scikit-learn-mooc#447) and [#448](INRIA/scikit-learn-mooc#448)

### Jérémie
- back from holidays, catching up
- Mathis' benchmarks
- trying to find what's going on with ASV benchmarks
  (asv should display the versions of all build and runtime depndencies for each run)

### Guillaume

- back from holidays
- Next:
    - release with Adrin
    - check the PR and issue trackers

### TODO / Next

- Expand Loïc’s notes up to 70% (Arturo)
- Create presentation to discuss my experience doing the MOOC (Arturo)
- Help with the scikit-learn release (Olivier, Guillaume)
- HR: Jeremy's renewal, Chiara's replacement (Gael)
- Mathis's consulting gig (Olivier, Gael, Mathis)
@glemaitre glemaitre removed their assignment Dec 17, 2021
@jeremiedbb
Copy link
Member

@glemaitre is it still relevant ? do you think you'll find time to refresh it before 1.1 (targeted end of april) ?

@glemaitre glemaitre self-assigned this Apr 6, 2022
doc/whats_new/v1.1.rst Outdated Show resolved Hide resolved
examples/inspection/plot_partial_dependence.py Outdated Show resolved Hide resolved
sklearn/inspection/_plot/partial_dependence.py Outdated Show resolved Hide resolved
sklearn/inspection/_plot/partial_dependence.py Outdated Show resolved Hide resolved
sklearn/inspection/_plot/partial_dependence.py Outdated Show resolved Hide resolved
sklearn/inspection/_plot/partial_dependence.py Outdated Show resolved Hide resolved
glemaitre and others added 3 commits April 7, 2022 14:32
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
@glemaitre glemaitre changed the title ENH Add option to centered ICE plots (cICE) ENH add an option to centered ICE and PD Apr 7, 2022
Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

doc/modules/partial_dependence.rst Outdated Show resolved Hide resolved
doc/whats_new/v1.1.rst Outdated Show resolved Hide resolved
sklearn/inspection/_plot/partial_dependence.py Outdated Show resolved Hide resolved
old_min_pd, old_max_pd = pdp_lim.get(n_fx, (min_pd, max_pd))
min_pd = min(min_pd, old_min_pd)
max_pd = max(max_pd, old_max_pd)
pdp_lim[n_fx] = (min_pd, max_pd)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we center before computing pdp_lim?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the limits are just a bit wider than if they were computed after the centering, but never too narrow. Can we merge as is and do a follow up PR to better adjust the limits ? @glemaitre @thomasjpfan ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would merge it now and make a small subsequent PR.

Copy link
Member

@thomasjpfan thomasjpfan Apr 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay to merge first if @jeremiedbb and @jjerphan are okay with it.

I think that fixing pdp_lim to take into account of centering is a blocker for release tho. Imagine a case where the first point is [0, -10] and the last point is [0, 10], the range is (-10, 10), which will cut off the plot after centering.

Edit: I think the bounds are okay.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking this over, I think the current implementation is correct. preds_offset correctly offsets the prediction based on centering.

glemaitre and others added 2 commits April 7, 2022 17:06
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@jeremiedbb jeremiedbb merged commit adb47e7 into scikit-learn:main Apr 7, 2022
@jeremiedbb
Copy link
Member

Thanks @JoElfner and @glemaitre !

jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Apr 29, 2022
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Add centered ICE plots to plot_partial_dependence
8 participants