Add init_score attr to python booster class. #4235

Closed
JoshuaC3 wants to merge 1 commit

Conversation

JoshuaC3

Save init_score to the Python Booster class so that it is accessible after training and after saving, and can be used for model interpretability: #4065

@jameslamb (Collaborator) left a comment


Thanks @JoshuaC3! I've read the attached issue and the other material it links to, so I think I understand what this is trying to accomplish, but I'm concerned.

init_score can be an array with shape (n_observations,) or even (n_observations, n_classes) for multiclass classification (docs), so this could be a fairly large amount of data. This might substantially increase the memory footprint of the Booster object if you used init_score during training, which would have at least two negative side effects.
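For a sense of scale, a rough sketch (the shape below is an illustrative assumption, not something from this PR):

    import numpy as np

    n_obs, n_classes = 1_000_000, 10

    # multiclass init_score holds one value per observation per class;
    # keeping it on the Booster would pin all of this in memory
    init_score = np.zeros((n_obs, n_classes), dtype=np.float64)
    print(init_score.nbytes / 1024 ** 2)  # ~76 MB for this shape alone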

Is it absolutely necessary for #3905?

I'm just leaving a "Comment" review to bring this to the attention of other reviewers, for their consideration.

@JoshuaC3 (Author) commented Apr 27, 2021

Thanks for reviewing @jameslamb.

Yes, I was concerned about this as well. It certainly isn't essential and is somewhat of a workaround for #4234 (or at least for some variant of it that would create an "intercept"). That said, I do think init_score is useful in its own right, but it might carry too much of an overhead in some cases.

Would a parameter such as keep_init_score be an option?

@jameslamb (Collaborator)

Would a parameter such as keep_init_score be an option?

I'm personally opposed to adding such a flag to the Booster object for this purpose. I think anyone choosing to set that flag could just as easily choose to separately save init_score (or the entire Dataset object). I worry that each parameter added to LightGBM's public API increases the complexity of the API, which can make it harder for users to learn and harder for maintainers to anticipate and test all combinations of parameters.
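For illustration, a minimal sketch of that separate-save workflow (synthetic data, arbitrary file names):

    import lightgbm as lgb
    import numpy as np

    X = np.random.rand(100, 3)
    y = np.random.rand(100)
    init_score = np.full(len(y), y.mean())  # whatever base scores the user chose

    bst = lgb.train({"objective": "regression"},
                    lgb.Dataset(X, label=y, init_score=init_score))

    # persist the model and init_score side by side instead of on the Booster
    bst.save_model("model.txt")
    np.save("init_score.npy", init_score)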

However, that's just one opinion! We should hear what @shiyu1994 and @StrikerRUS think.

@shiyu1994 (Collaborator) commented Apr 28, 2021

Thanks @JoshuaC3 for your work. Keeping init_score available can be useful in some cases, but I have the following concerns:

  1. Like @jameslamb, I don't think adding one more parameter keep_init_score would be a good idea, since it increases the complexity of the package.
  2. We need to make it clear that the init_score provided by users through Dataset and the "init_score" created by boost_from_average=True are two different concepts, which do not interfere with each other in the core of LightGBM. The former can be get and set through the Python API; the latter is currently not stored anywhere in the Booster or Dataset object, except in the first tree model (or in the first K trees for K-class classification). In the first tree, the boost_from_average value is simply added to all leaves, so it is mixed with the original leaf prediction values and cannot be recovered. So even with the Booster saving the init_score from the Dataset, as is done in this PR, the boost_from_average value still cannot be obtained. Without setting init_score in the Dataset constructor, get_init_score simply returns None, regardless of boost_from_average (see the Python sketch after the snippet below).
  3. For the interpretability mentioned in Investigate possibility of borrowing some features/ideas from Explainable Boosted Machines #3905, I think what we want is the boost_from_average value. We could have an API that provides access to this value. That requires recording the value in the gbdt.cpp code and returning it through new C and Python API functions:
    std::vector<double> init_scores(num_tree_per_iteration_, 0.0);
    // boosting first
    if (gradients == nullptr || hessians == nullptr) {
      for (int cur_tree_id = 0; cur_tree_id < num_tree_per_iteration_; ++cur_tree_id) {
        init_scores[cur_tree_id] = BoostFromAverage(cur_tree_id, true);
      }
    }
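To illustrate the distinction in point 2, a rough sketch against the current Python API (synthetic data, regression defaults):

    import lightgbm as lgb
    import numpy as np

    X = np.random.rand(200, 4)
    y = np.random.rand(200)

    # user-provided init_score: stored on the Dataset and retrievable
    ds = lgb.Dataset(X, label=y, init_score=np.full(len(y), 0.5))
    lgb.train({"objective": "regression"}, ds)
    print(ds.get_init_score())  # the array that was passed in

    # no init_score: boost_from_average is folded into the first tree's
    # leaves, and the Dataset has nothing to return
    ds2 = lgb.Dataset(X, label=y)
    lgb.train({"objective": "regression"}, ds2)
    print(ds2.get_init_score())  # None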

So to bring us closer to the interpretability goal in #3905, I think we should focus on the boost_from_average score instead of the user-specified init_score for the Dataset.
I'm not sure I have fully understood your plan for #3905, @JoshuaC3. If I made any mistake, please feel free to respond.

@JoshuaC3 (Author) commented Apr 28, 2021

@shiyu1994 I think you do understand my plan, and maybe a little better than I did. Your comment has made it clear now!

Your point 3 is spot on and highlights one of the reasons why I tried to access init_score instead of boost_from_average:
init_score is kept separate from the model and is not forced into the leaves of the first tree, unlike boost_from_average. That behaviour is highly problematic when working out feature importance in the EBM style, as it massively biases the feature with the first split. Fixing this would involve something like removing the average from the first tree's leaves, saving it on the booster model, and adding it back to the scores at prediction time.

The other reason was that we often don't wish to boost from the "average", but from some other value instead: min, max, median, 20th quantile, etc. This is because EBMs mean-centre the feature scores, so the intercept rarely ends up being the mean of the target. Potentially superseding boost_from_average=True with a boost_from_value='mean' parameter would be a nice end goal. We could then have boost_from_value='median' or boost_from_value='min' as pre-defined values, or simply boost_from_value=53.4, for example. I feel this could be appealing in other applications, possibly for optimising median-centred metrics like MAE (as an intuition). That said, it is not essential in comparison to the previous point. A sketch of emulating this with init_score today is below.
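For what it's worth, boosting from an arbitrary value can already be emulated with a constant init_score; a minimal sketch (the proposed boost_from_value itself is hypothetical, everything else is the existing API):

    import lightgbm as lgb
    import numpy as np

    X = np.random.rand(500, 5)
    y = np.random.rand(500)

    base = np.quantile(y, 0.2)  # e.g. the 20th quantile instead of the mean
    ds = lgb.Dataset(X, label=y, init_score=np.full(len(y), base))

    # turn boost_from_average off so the constant base is the only offset
    bst = lgb.train({"objective": "regression", "boost_from_average": False}, ds)

    # with init_score, predictions are relative to the base, so add it back
    preds = bst.predict(X) + base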

Is it worth copying this to an issue/feature request for further discussion and closing this PR?

@shiyu1994 (Collaborator)

@JoshuaC3 OK. So the final goal is to provide the EBM style interpretability.

Fixing this would involve something like removing the average from the first tree's leaves, saving it on the booster model, and adding it back to the scores at prediction time.

That would require modifying the prediction logic. When prediction starts from the first tree, we should add back the average manually.

A simpler workaround would be to make the average score available to the Python API, for cases where the user wants to know its value, without removing the average from the first tree. The EBM-style interpretability could then be calculated internally, by doing the subtraction in the C++ or Python part. With the average stored, it would not be difficult to calculate. In other words, we would subtract the average from the model only when calculating the EBM-style interpretability, and leave the prediction logic and model file unchanged.
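A rough sketch of the Python side of that workaround (get_average_score is hypothetical, standing in for the proposed accessor; the stand-in below approximates it with the mean raw prediction):

    import lightgbm as lgb
    import numpy as np

    X = np.random.rand(300, 4)
    y = np.random.rand(300)
    bst = lgb.train({"objective": "regression"}, lgb.Dataset(X, label=y))

    # hypothetical accessor proposed above; not in the current API
    # avg = bst.get_average_score()
    # crude stand-in: the mean raw score over the training data
    avg = bst.predict(X, raw_score=True).mean()

    # EBM-style view: subtract the average only for interpretability,
    # leaving the model file and the normal prediction path untouched
    centred_scores = bst.predict(X, raw_score=True) - avg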

Which strategy do you think better suits the goal?

@JoshuaC3 (Author) commented Apr 29, 2021

My gut feeling is that the first proposal - modifying the prediction logic - seems more robust and more easily used by other languages, not just Python.

That said, option two means fewer changes to the prediction logic. I have created some examples in this notebook on my gist. The EBM-FI calculation and plots are currently written in Python and a little slow - I am sure they could be sped up a lot though! However, it doesn't seem to make much difference whether we subtract the mean before training, at the start of training, or after training, so this option seems OK. Pay special attention to x5, as it is the first feature that is split on and has the intercept/mean included in it. One of the graphs is not zeroed on the y-axis, so this is easy to miss!

Compared to this init_score PR, I think it is a MUCH better solution!

Should I close this PR and open an issue detailing this?

@shiyu1994 (Collaborator)

@JoshuaC3 Sure! That would be great.

@JoshuaC3 (Author) commented May 21, 2021

Closing in favour of #4313.


This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2023