Add init_score attr to python booster class. #4235

Closed
JoshuaC3 wants to merge 1 commit

Conversation

JoshuaC3

Save init_score to the Python Booster class so that it is accessible after training and after saving, and can be used for model interpretability: #4065

@jameslamb (Collaborator) left a comment


Thanks @JoshuaC3! I've read the attached issue and the other material it links to, so I think I understand what this is trying to accomplish, but I'm concerned.

init_score can be an array with shape (n_observations,) or even (n_observations, n_classes) for multiclass classification (docs), so this could be a fairly large amount of data. This might substantially increase the memory footprint of the Booster object if you used init_score during training, which would have at least two negative side effects.
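For a sense of scale, a rough sketch (the shape below is an illustrative assumption, not something from this PR):

    import numpy as np

    n_obs, n_classes = 1_000_000, 10

    # multiclass init_score holds one value per observation per class;
    # keeping it on the Booster would pin all of this in memory
    init_score = np.zeros((n_obs, n_classes), dtype=np.float64)
    print(init_score.nbytes / 1024 ** 2)  # ~76 MB for this shape alone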

Is it absolutely necessary for #3905?

I'm just leaving a "Comment" review to bring this to the attention of other reviewers, for their consideration.

@JoshuaC3 (Author) commented Apr 27, 2021

Thanks for reviewing @jameslamb.

Yes, I was concerned about this as well. It certainly isn't essential and is somewhat of a workaround for #4234 (or at least for some variant of it that would create an "intercept"). That said, I do think init_score is useful in its own right, but it might carry too much of an overhead in some cases.

Would a parameter such as keep_init_score be an option?

@jameslamb (Collaborator)

Would a parameter such as keep_init_score be an option?

I'm personally opposed to adding such a flag to the Booster object for this purpose. I think anyone choosing to set that flag could just as easily choose to separately save init_score (or the entire Dataset object). I worry that each parameter added to LightGBM's public API increases the complexity of the API, which can make it harder for users to learn and harder for maintainers to anticipate and test all combinations of parameters.
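For illustration, a minimal sketch of that separate-save workflow (synthetic data, arbitrary file names):

    import lightgbm as lgb
    import numpy as np

    X = np.random.rand(100, 3)
    y = np.random.rand(100)
    init_score = np.full(len(y), y.mean())  # whatever base scores the user chose

    bst = lgb.train({"objective": "regression"},
                    lgb.Dataset(X, label=y, init_score=init_score))

    # persist the model and init_score side by side instead of on the Booster
    bst.save_model("model.txt")
    np.save("init_score.npy", init_score)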

However, that's just one opinion! We should hear what @shiyu1994 and @StrikerRUS think.

@shiyu1994 (Collaborator) commented Apr 28, 2021

Thanks @JoshuaC3 for your work. Keeping init_score available can be useful in some cases, but I have the following concerns:

  1. Like @jameslamb, I don't think adding one more parameter keep_init_score would be a good idea, since it increases the complexity of the package.
  2. We need to make it clear that the init_score provided by users through Dataset and the "init_score" created by boost_from_average=True are two different concepts, which do not interfere with each other in the core of LightGBM. The former can be get and set through the Python API; the latter is currently not stored anywhere in the Booster or Dataset object, except in the first tree model (or in the first K trees for K-class classification). In the first tree, the boost_from_average value is simply added to all leaves, so it is mixed with the original leaf prediction values and cannot be recovered. So even with the Booster saving the init_score from the Dataset, as is done in this PR, the boost_from_average value still cannot be obtained. Without setting init_score in the Dataset constructor, get_init_score simply returns None, regardless of boost_from_average (see the Python sketch after the snippet below).
  3. For the interpretability mentioned in Investigate possibility of borrowing some features/ideas from Explainable Boosted Machines #3905, I think what we want is the boost_from_average value. We could have an API that provides access to this value. That requires recording the value in the gbdt.cpp code and returning it through new C and Python API functions:
    std::vector<double> init_scores(num_tree_per_iteration_, 0.0);
    // boosting first
    if (gradients == nullptr || hessians == nullptr) {
      for (int cur_tree_id = 0; cur_tree_id < num_tree_per_iteration_; ++cur_tree_id) {
        init_scores[cur_tree_id] = BoostFromAverage(cur_tree_id, true);
      }
    }
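To illustrate the distinction in point 2, a rough sketch against the current Python API (synthetic data, regression defaults):

    import lightgbm as lgb
    import numpy as np

    X = np.random.rand(200, 4)
    y = np.random.rand(200)

    # user-provided init_score: stored on the Dataset and retrievable
    ds = lgb.Dataset(X, label=y, init_score=np.full(len(y), 0.5))
    lgb.train({"objective": "regression"}, ds)
    print(ds.get_init_score())  # the array that was passed in

    # no init_score: boost_from_average is folded into the first tree's
    # leaves, and the Dataset has nothing to return
    ds2 = lgb.Dataset(X, label=y)
    lgb.train({"objective": "regression"}, ds2)
    print(ds2.get_init_score())  # None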

So to bring us closer to the interpretability goal in #3905, I think we should focus on the boost_from_average score instead of the user-specified init_score for the Dataset.
I'm not sure I have fully understood your plan for #3905, @JoshuaC3. If I made any mistake, please feel free to respond.

@JoshuaC3 (Author) commented Apr 28, 2021

@shiyu1994 I think you do understand my plan, and maybe a little better than I did. Your comment has made it clear now!

Your point 3 is spot on and highlights one of the reasons why I tried to access init_score instead of boost_from_average:
init_score is kept separate from the model and is not forced into the leaves of the first tree, unlike boost_from_average. That behaviour is highly problematic when working out feature importance in the EBM style, as it massively biases the feature with the first split. Fixing this would involve something like removing the average from the first tree's leaves, saving it on the booster model, and adding it back to the scores at prediction time.

The other reason was that we often don't wish to boost from the "average", but from some other value instead: min, max, median, 20th quantile, etc. This is because EBMs mean-centre the feature scores, so the intercept rarely ends up being the mean of the target. Potentially superseding boost_from_average=True with a boost_from_value='mean' parameter would be a nice end goal. We could then have boost_from_value='median' or boost_from_value='min' as pre-defined values, or simply boost_from_value=53.4, for example. I feel this could be appealing in other applications, possibly for optimising median-centred metrics like MAE (as an intuition). That said, it is not essential in comparison to the previous point. A sketch of emulating this with init_score today is below.
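For what it's worth, boosting from an arbitrary value can already be emulated with a constant init_score; a minimal sketch (the proposed boost_from_value itself is hypothetical, everything else is the existing API):

    import lightgbm as lgb
    import numpy as np

    X = np.random.rand(500, 5)
    y = np.random.rand(500)

    base = np.quantile(y, 0.2)  # e.g. the 20th quantile instead of the mean
    ds = lgb.Dataset(X, label=y, init_score=np.full(len(y), base))

    # turn boost_from_average off so the constant base is the only offset
    bst = lgb.train({"objective": "regression", "boost_from_average": False}, ds)

    # with init_score, predictions are relative to the base, so add it back
    preds = bst.predict(X) + base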

Is it worth copying this to an issue/feature request for further discussion and closing this PR?

@shiyu1994 (Collaborator)

@JoshuaC3 OK. So the final goal is to provide the EBM style interpretability.

Fixing this would involve something like removing the average from the first tree's leaves, saving it on the booster model, and adding it back to the scores at prediction time.

That would require modifying the prediction logic. When prediction starts from the first tree, we should add back the average manually.

A simpler workaround would be to make the average score available to the Python API, for cases where the user wants to know its value, without removing the average from the first tree. The EBM-style interpretability could then be calculated internally, by doing the subtraction in the C++ or Python part. With the average stored, it would not be difficult to calculate. In other words, we would subtract the average from the model only when calculating the EBM-style interpretability, and leave the prediction logic and model file unchanged.
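A rough sketch of the Python side of that workaround (get_average_score is hypothetical, standing in for the proposed accessor; the stand-in below approximates it with the mean raw prediction):

    import lightgbm as lgb
    import numpy as np

    X = np.random.rand(300, 4)
    y = np.random.rand(300)
    bst = lgb.train({"objective": "regression"}, lgb.Dataset(X, label=y))

    # hypothetical accessor proposed above; not in the current API
    # avg = bst.get_average_score()
    # crude stand-in: the mean raw score over the training data
    avg = bst.predict(X, raw_score=True).mean()

    # EBM-style view: subtract the average only for interpretability,
    # leaving the model file and the normal prediction path untouched
    centred_scores = bst.predict(X, raw_score=True) - avg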

Which strategy do you think better suits the goal?

@JoshuaC3 (Author) commented Apr 29, 2021

My gut feeling is that the first proposal - modifying the prediction logic - seems more robust and more easily used by other languages, not just Python.

That said, option two means fewer changes to the prediction logic. I have created some examples in this notebook on my gist. The EBM-FI calculation and plots are currently written in Python and a little slow - I am sure they could be sped up a lot though! However, it doesn't seem to make much difference whether we subtract the mean before training, at the start of training, or after training, so this option seems OK. Pay special attention to x5, as it is the first feature that is split on and has the intercept/mean included in it. One of the graphs is not zeroed on the y-axis, so this is easy to miss!

Compared to this init_score PR, I think it is a MUCH better solution!

Should I close this PR and open an issue detailing this?

@shiyu1994 (Collaborator)

@JoshuaC3 Sure! That would be great.

@JoshuaC3 (Author) commented May 21, 2021

Closing in favour of #4313.


This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2023