trees with one leaf can have strong bias in predictions, far outside observed range #4708

Closed
arnocandel opened this issue Oct 22, 2021 · 5 comments
arnocandel commented Oct 22, 2021

Description

Regression model predicts far outside the observed y range.

3.3.0
ymin=996.5
ymax=1034.3

n_estimators=1
pred_min=1019.4970124657474
pred_max=1019.4970124657474

n_estimators=10
pred_min=437.80879810966866
pred_max=440.58117669151034

n_estimators=100
pred_min=-3.610830259193446
pred_max=4.928009330393595

Reproducible example

import lightgbm as lgb
import numpy as np
from lightgbm.sklearn import LGBMRegressor

X = np.array(
    [
        [1021.0589, 1018.9578],
        [1023.85754, 1018.7854],
        [1024.5468, 1018.88513],
        [1019.02954, 1018.88513],
        [1016.79926, 1018.88513],
        [1007.6, 1018.88513],
        [1014.86957, 1018.88513],
        [1016.6986, 1018.9578],
        [1011.2, 1018.88513],
        [1007.6, 1018.9578],
        [1016.8388, 1018.9578],
        [1021.7486, 1018.9578],
        [1016.8388, 1018.9578],
        [1016.6986, 1018.9578],
        [1016.8388, 1018.9578],
        [1017.6715, 1018.7854],
        [1007.7, 1018.7854],
        [1021.17773, 1018.88513],
        [1016.8388, 1018.9578],
        [1017.6715, 1018.7854],
        [1016.8388, 1018.9578],
        [1016.8388, 1018.9578],
        [1017.3, 1018.7854],
        [1021.17773, 1018.88513],
        [1021.17773, 1018.88513],
        [996.5, 1018.88513],
        [1017.5635, 1018.7854],
        [1016.6986, 1018.9578],
        [1021.3, 1018.88513],
        [1024.6957, 1018.88513],
        [1021.3, 1018.88513],
        [1017.3, 1018.7854],
        [999.4, 1018.7854],
        [1017.5635, 1018.7854],
        [1016.8388, 1018.9578],
        [1021.17773, 1018.88513],
        [1007.9, 1018.88513],
        [1016.8, 1018.9578],
        [1010.2, 1018.7854],
        [1030.2, 1018.88513],
        [1021.7486, 1018.9578],
        [1016.6986, 1018.9578],
        [1013.09265, 1018.9578],
        [1016.8388, 1018.9578],
        [1026.4463, 1018.88513],
        [1024.5468, 1018.88513],
        [1016.4, 1018.88513],
        [1021.17773, 1018.88513],
        [1020.8382, 1018.7854],
        [1021.17773, 1018.88513],
        [1021.17773, 1018.88513],
        [1019.02954, 1018.88513],
        [1023.7932, 1018.9578],
        [1013.09265, 1018.9578],
        [1014.86957, 1018.88513],
        [999.4, 1018.7854],
        [1016.8388, 1018.9578],
        [1019.02954, 1018.88513],
        [1012.3, 1018.7854],
        [1007.6, 1018.88513],
        [1021.6, 1018.88513],
        [1014.86957, 1018.88513],
        [1007.9, 1018.7854],
        [1023.50085, 1018.9578],
        [1026.4463, 1018.88513],
        [1015.8, 1018.88513],
        [1017.6715, 1018.7854],
        [1007.6, 1018.9578],
        [1012.7, 1018.7854],
        [1012.7, 1018.7854],
        [1017.491, 1018.9578],
        [1026.4941, 1018.7854],
        [1008.58325, 1018.9578],
        [1020.8382, 1018.7854],
        [1028.2369, 1018.7854],
        [1021.4, 1018.7854],
        [1024.5468, 1018.88513],
        [1016.8388, 1018.9578],
        [1023.85754, 1018.7854],
        [1024.3479, 1018.9578],
        [1016.8, 1018.9578],
        [1006.9, 1018.9578],
        [1026.4463, 1018.88513],
        [1026.3362, 1018.9578],
        [1014.7, 1018.88513],
        [1019.92944, 1018.7854],
        [1012.7, 1018.7854],
        [1024.5468, 1018.88513],
        [1028.0812, 1018.9578],
        [1024.2329, 1018.7854],
        [1021.17773, 1018.88513],
        [1029.921, 1018.7854],
        [1026.3362, 1018.9578],
        [1032.2, 1018.7854],
        [1029.921, 1018.7854],
        [1026.3362, 1018.9578],
        [1026.3362, 1018.9578],
        [1024.5468, 1018.88513],
        [1024.5468, 1018.88513],
        [1019.92944, 1018.7854],
        [1019.92944, 1018.7854],
        [1010.459, 1018.88513],
        [1029.921, 1018.7854],
        [1026.4463, 1018.88513],
        [1023.72144, 1018.7854],
        [1014.8, 1018.88513],
        [1021.17773, 1018.88513],
        [1021.17773, 1018.88513],
        [1026.3362, 1018.9578],
        [1014.7, 1018.7854],
        [1021.17773, 1018.88513],
        [1027.2, 1018.7854],
        [1029.921, 1018.7854],
        [1027.2, 1018.7854],
        [1026.3, 1018.9578],
        [1023.72144, 1018.7854],
        [1024.6957, 1018.88513],
        [1026.3362, 1018.9578],
        [1026.3362, 1018.9578],
        [1020.6999, 1018.9578],
        [1025.5, 1018.9578],
        [1014.4, 1018.7854],
        [1021.17773, 1018.88513],
        [1024.3479, 1018.9578],
        [1023.85754, 1018.7854],
        [1018.4, 1018.7854],
        [1024.3479, 1018.9578],
        [1006.3, 1018.7854],
        [1017.54236, 1018.9578],
        [1020.9, 1018.88513],
        [1024.5468, 1018.88513],
        [1011.7, 1018.88513],
        [1012.3, 1018.88513],
        [1014.1, 1018.7854],
    ],
    dtype=np.float32,
)
y = np.array(
    [
        1023.8,
        1024.6,
        1024.4,
        1023.8,
        1022.0,
        1014.4,
        1017.1,
        1018.5,
        1012.4,
        1010.7,
        1020.7,
        1022.4,
        1021.0,
        1019.2,
        1018.8,
        1011.7,
        1018.0,
        1021.1,
        1012.6,
        1010.1,
        1017.6,
        1015.8,
        996.5,
        1009.5,
        1012.8,
        1017.3,
        1016.8,
        1017.4,
        1019.2,
        1015.7,
        1017.8,
        1004.0,
        1012.3,
        1014.2,
        1008.5,
        1016.8,
        1017.5,
        1004.9,
        1011.4,
        1018.7,
        1021.3,
        1016.5,
        1013.1,
        1014.6,
        1007.7,
        1017.5,
        1020.2,
        1025.2,
        1012.2,
        1011.8,
        1008.3,
        1007.4,
        1017.6,
        1013.6,
        1023.9,
        1021.6,
        1024.7,
        1024.8,
        1022.8,
        1020.9,
        1022.6,
        1022.7,
        1013.8,
        1017.4,
        1023.6,
        1023.2,
        1018.7,
        1010.2,
        1014.8,
        1017.4,
        1028.0,
        1028.2,
        1025.7,
        1023.1,
        1018.4,
        1027.2,
        1027.2,
        1027.3,
        1025.5,
        1011.9,
        1003.2,
        1006.3,
        1020.6,
        1024.2,
        1018.0,
        1024.4,
        1024.7,
        1024.4,
        1023.5,
        1017.7,
        1008.4,
        1021.9,
        1024.5,
        1030.4,
        1028.5,
        1026.7,
        1026.6,
        1027.2,
        1032.1,
        1034.3,
        1031.4,
        1022.9,
        1026.7,
        1026.4,
        1025.1,
        1028.2,
        1026.3,
        1027.8,
        1024.6,
        1017.2,
        1030.0,
        1023.0,
        1024.1,
        1019.7,
        1010.8,
        1016.7,
        1006.9,
        1033.5,
        1032.2,
        1025.7,
        1020.0,
        1013.8,
        1014.1,
        1017.3,
        1018.2,
        1022.4,
        1014.8,
        1021.3,
        1014.7,
        1023.3,
        1029.6,
        1027.3,
        1018.5,
        1025.9,
    ]
)


print(lgb.__version__)
print(f"ymin={y.min()}")
print(f"ymax={y.max()}")
for n in [1, 10, 100]:
    model = LGBMRegressor()
    model.set_params(
        **{
            "n_estimators": n,
            "extra_trees": True,
            "min_data_in_bin": 1,
            "extra_seed": 43,
        }
    )

    model.fit(X, y)
    preds = model.predict(X)
    print(f"\nn_estimators={n}")
    print(f"pred_min={preds.min()}")
    print(f"pred_max={preds.max()}")

Environment info

x86 Ubuntu 20.04

LightGBM version or commit hash:
https://github.com/microsoft/LightGBM/releases/tag/v3.3.0

Command(s) you used to install LightGBM

pip install lightgbm-3.3.0-py3-none-manylinux1_x86_64.whl

Discovered by H2O Driverless AI testing

Additional Comments

shiyu1994 (Collaborator) commented:
@arnocandel Thanks for using LightGBM!
After investigation, I found that the problem is not related to extra_trees, but lies in the handling of trees with only one leaf. When using random splits, we can easily get a tree with only one leaf, since the first randomly chosen split for the root node may have a gain smaller than min_gain_to_split.
When we have a single-leaf tree, according to

    if (new_tree->num_leaves() > 1) {
      should_continue = true;
      auto score_ptr = train_score_updater_->score() + offset;
      auto residual_getter = [score_ptr](const label_t* label, int i) { return static_cast<double>(label[i]) - score_ptr[i]; };
      tree_learner_->RenewTreeOutput(new_tree.get(), objective_function_, residual_getter,
                                     num_data_, bag_data_indices_.data(), bag_data_cnt_);
      // shrinkage by learning rate
      new_tree->Shrinkage(shrinkage_rate_);
      // update score
      UpdateScore(new_tree.get(), cur_tree_id);
      if (std::fabs(init_scores[cur_tree_id]) > kEpsilon) {
        new_tree->AddBias(init_scores[cur_tree_id]);
      }
    } else {
      // only add default score one-time
      if (models_.size() < static_cast<size_t>(num_tree_per_iteration_)) {
        double output = 0.0;
        if (!class_need_train_[cur_tree_id]) {
          if (objective_function_ != nullptr) {
            output = objective_function_->BoostFromScore(cur_tree_id);
          }
        } else {
          output = init_scores[cur_tree_id];
        }
        new_tree->AsConstantTree(output);
        // updates scores
        train_score_updater_->AddScore(output, cur_tree_id);
        for (auto& score_updater : valid_score_updater_) {
          score_updater->AddScore(output, cur_tree_id);
        }
      }
    }
    // add model
    models_.push_back(std::move(new_tree));

the average score will be added.
I have had doubts about the code from line 417 to 434 before, since the average score should already have been added in line 375:

init_scores[cur_tree_id] = BoostFromAverage(cur_tree_id, true);

With your example, I can now confirm that the average score is added incorrectly.
I'll open a PR to fix this.
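To see why adding the base score a second time produces predictions far outside the target range, here is a deliberately simplified toy in the style of the repro script. This is not LightGBM's actual code path: it assumes squared error, a constant (single-leaf) learner every round, and a hypothetical "double add" of the base score; it only illustrates the kind of bias such a bookkeeping error causes.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(996.5, 1034.3, size=134)  # narrow target range, as in the report

def boost_constant_trees(y, n_rounds, lr=0.1, double_add_base=False):
    # Toy squared-error boosting in which every "tree" is a single leaf.
    # For squared error, the optimal single-leaf value is the mean residual,
    # shrunk by the learning rate.  Adding the base score twice leaves a
    # bias that later rounds only correct slowly.
    base = y.mean()
    pred = np.full_like(y, base)   # init score, as with boost_from_average
    if double_add_base:
        pred += base               # hypothetical erroneous second add
    for _ in range(n_rounds):
        residual = y - pred
        pred += lr * residual.mean()  # shrunken single-leaf output
    return pred

ok = boost_constant_trees(y, n_rounds=10)
bad = boost_constant_trees(y, n_rounds=10, double_add_base=True)
print(y.min() <= ok.min() and ok.max() <= y.max())  # True: stays in range
print(bad.min() > y.max())                          # True: biased far above
```

With the single add, predictions stay at the target mean; with the double add, ten rounds of shrunken updates still leave predictions hundreds of units above the observed maximum.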

@arnocandel arnocandel changed the title extra_trees regression can have strong bias in predictions, far outside observed range trees with one leaf can have strong bias in predictions, far outside observed range Oct 25, 2021
@jameslamb jameslamb added the bug label Oct 26, 2021
guolinke (Collaborator) commented Mar 2, 2022

@shiyu1994 any updates about the fix?

shiyu1994 (Collaborator) commented:
#5050 has been opened to fix this. The output for the example on the fixed branch:

ymin=996.5
ymax=1034.3

n_estimators=1
pred_min=1019.4970124657474
pred_max=1019.4970124657474

n_estimators=10
pred_min=1018.4457739030327
pred_max=1021.2181511499891

n_estimators=100
pred_min=1015.7546598320027
pred_max=1024.293499000457

And the outputs are now consistent between boost_from_average=True and boost_from_average=False. When a single-leaf tree is encountered, the average score should be added automatically, regardless of boost_from_average.
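The reason a single-leaf regression tree should contribute the target average is that, under the squared-error objective, the loss-minimizing constant prediction is the mean. A quick numerical sanity check on toy data (the values below are made up for illustration, not taken from the issue):

```python
import numpy as np

# Toy targets; for squared error, the best constant prediction is mean(y).
y = np.array([996.5, 1004.0, 1019.5, 1028.2, 1034.3])
candidates = np.linspace(y.min() - 10.0, y.max() + 10.0, 100001)
# Mean squared error of each candidate constant against all targets.
mse = ((y[:, None] - candidates[None, :]) ** 2).mean(axis=0)
best = candidates[mse.argmin()]
print(abs(best - y.mean()) < 1e-2)  # True: the minimizer is the mean
```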

shiyu1994 (Collaborator) commented:
BTW, I found a new problem while dealing with this issue. Currently, in the CLI version of LightGBM, when a single-leaf tree is trained (or all trees in an iteration of multi-class boosting are single-leaf), the training stops. See

LightGBM/src/c_api.cpp, lines 1672 to 1681 at commit 01568cf:

int LGBM_BoosterUpdateOneIter(BoosterHandle handle, int* is_finished) {
  API_BEGIN();
  Booster* ref_booster = reinterpret_cast<Booster*>(handle);
  if (ref_booster->TrainOneIter()) {
    *is_finished = 1;
  } else {
    *is_finished = 0;
  }
  API_END();
}

Here, is_finished means that training is finished because no more splits can be found and the last tree is just a single leaf.
However, in the Python API, the return value of TrainOneIter seems to be interpreted differently; see the docstring:
Returns
-------
is_finished : bool
    Whether the update was successfully finished.
"""

I believe the CLI version interprets the return value of TrainOneIter correctly. It should mean that the boosting process as a whole has finished, not that this iteration finished successfully. That is why, with the example in this issue, the Python API continues to train even after the warning that there are no more leaves that meet the split requirements, which indicates a single-leaf tree.
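Under that reading, a wrapper's training loop should break out as soon as the update call reports completion. A hypothetical sketch of the intended control flow (MockBooster, update_one_iter, and train are made up for illustration; this is not the lightgbm Python API):

```python
class MockBooster:
    """Stand-in for the C API: grows at most max_useful_trees real trees,
    then reports that boosting is finished (only single-leaf trees remain)."""

    def __init__(self, max_useful_trees):
        self.max_useful_trees = max_useful_trees
        self.num_trees = 0

    def update_one_iter(self):
        if self.num_trees >= self.max_useful_trees:
            return True                # is_finished: no more splits possible
        self.num_trees += 1
        return False

def train(booster, n_estimators):
    for _ in range(n_estimators):
        if booster.update_one_iter():  # stop early, like the CLI version
            break
    return booster.num_trees

print(train(MockBooster(max_useful_trees=3), n_estimators=100))  # -> 3, not 100
```

Interpreting the flag as "this iteration succeeded" instead would keep the loop running for all n_estimators iterations, producing the stream of single-leaf warnings shown in the log below.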

ymin=996.5
ymax=1034.3
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.045732 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 67
[LightGBM] [Info] Number of data points in the train set: 134, number of used features: 2
[LightGBM] [Info] Start training from score 1019.497012
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements

n_estimators=1
pred_min=1019.4970124657474
pred_max=1019.4970124657474
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.031245 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 67
[LightGBM] [Info] Number of data points in the train set: 134, number of used features: 2
[LightGBM] [Info] Start training from score 1019.497012
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf

n_estimators=10
pred_min=1018.4457739030327
pred_max=1021.2181511499891

github-actions bot commented:
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023