trees with one leaf can have strong bias in predictions, far outside observed range #4708

Closed
arnocandel opened this issue Oct 22, 2021 · 5 comments
arnocandel commented Oct 22, 2021

Description

Regression model predicts far outside the observed y range.

3.3.0
ymin=996.5
ymax=1034.3

n_estimators=1
pred_min=1019.4970124657474
pred_max=1019.4970124657474

n_estimators=10
pred_min=437.80879810966866
pred_max=440.58117669151034

n_estimators=100
pred_min=-3.610830259193446
pred_max=4.928009330393595

Reproducible example

import lightgbm as lgb
import numpy as np
from lightgbm.sklearn import LGBMRegressor

X = np.array(
    [
        [1021.0589, 1018.9578],
        [1023.85754, 1018.7854],
        [1024.5468, 1018.88513],
        [1019.02954, 1018.88513],
        [1016.79926, 1018.88513],
        [1007.6, 1018.88513],
        [1014.86957, 1018.88513],
        [1016.6986, 1018.9578],
        [1011.2, 1018.88513],
        [1007.6, 1018.9578],
        [1016.8388, 1018.9578],
        [1021.7486, 1018.9578],
        [1016.8388, 1018.9578],
        [1016.6986, 1018.9578],
        [1016.8388, 1018.9578],
        [1017.6715, 1018.7854],
        [1007.7, 1018.7854],
        [1021.17773, 1018.88513],
        [1016.8388, 1018.9578],
        [1017.6715, 1018.7854],
        [1016.8388, 1018.9578],
        [1016.8388, 1018.9578],
        [1017.3, 1018.7854],
        [1021.17773, 1018.88513],
        [1021.17773, 1018.88513],
        [996.5, 1018.88513],
        [1017.5635, 1018.7854],
        [1016.6986, 1018.9578],
        [1021.3, 1018.88513],
        [1024.6957, 1018.88513],
        [1021.3, 1018.88513],
        [1017.3, 1018.7854],
        [999.4, 1018.7854],
        [1017.5635, 1018.7854],
        [1016.8388, 1018.9578],
        [1021.17773, 1018.88513],
        [1007.9, 1018.88513],
        [1016.8, 1018.9578],
        [1010.2, 1018.7854],
        [1030.2, 1018.88513],
        [1021.7486, 1018.9578],
        [1016.6986, 1018.9578],
        [1013.09265, 1018.9578],
        [1016.8388, 1018.9578],
        [1026.4463, 1018.88513],
        [1024.5468, 1018.88513],
        [1016.4, 1018.88513],
        [1021.17773, 1018.88513],
        [1020.8382, 1018.7854],
        [1021.17773, 1018.88513],
        [1021.17773, 1018.88513],
        [1019.02954, 1018.88513],
        [1023.7932, 1018.9578],
        [1013.09265, 1018.9578],
        [1014.86957, 1018.88513],
        [999.4, 1018.7854],
        [1016.8388, 1018.9578],
        [1019.02954, 1018.88513],
        [1012.3, 1018.7854],
        [1007.6, 1018.88513],
        [1021.6, 1018.88513],
        [1014.86957, 1018.88513],
        [1007.9, 1018.7854],
        [1023.50085, 1018.9578],
        [1026.4463, 1018.88513],
        [1015.8, 1018.88513],
        [1017.6715, 1018.7854],
        [1007.6, 1018.9578],
        [1012.7, 1018.7854],
        [1012.7, 1018.7854],
        [1017.491, 1018.9578],
        [1026.4941, 1018.7854],
        [1008.58325, 1018.9578],
        [1020.8382, 1018.7854],
        [1028.2369, 1018.7854],
        [1021.4, 1018.7854],
        [1024.5468, 1018.88513],
        [1016.8388, 1018.9578],
        [1023.85754, 1018.7854],
        [1024.3479, 1018.9578],
        [1016.8, 1018.9578],
        [1006.9, 1018.9578],
        [1026.4463, 1018.88513],
        [1026.3362, 1018.9578],
        [1014.7, 1018.88513],
        [1019.92944, 1018.7854],
        [1012.7, 1018.7854],
        [1024.5468, 1018.88513],
        [1028.0812, 1018.9578],
        [1024.2329, 1018.7854],
        [1021.17773, 1018.88513],
        [1029.921, 1018.7854],
        [1026.3362, 1018.9578],
        [1032.2, 1018.7854],
        [1029.921, 1018.7854],
        [1026.3362, 1018.9578],
        [1026.3362, 1018.9578],
        [1024.5468, 1018.88513],
        [1024.5468, 1018.88513],
        [1019.92944, 1018.7854],
        [1019.92944, 1018.7854],
        [1010.459, 1018.88513],
        [1029.921, 1018.7854],
        [1026.4463, 1018.88513],
        [1023.72144, 1018.7854],
        [1014.8, 1018.88513],
        [1021.17773, 1018.88513],
        [1021.17773, 1018.88513],
        [1026.3362, 1018.9578],
        [1014.7, 1018.7854],
        [1021.17773, 1018.88513],
        [1027.2, 1018.7854],
        [1029.921, 1018.7854],
        [1027.2, 1018.7854],
        [1026.3, 1018.9578],
        [1023.72144, 1018.7854],
        [1024.6957, 1018.88513],
        [1026.3362, 1018.9578],
        [1026.3362, 1018.9578],
        [1020.6999, 1018.9578],
        [1025.5, 1018.9578],
        [1014.4, 1018.7854],
        [1021.17773, 1018.88513],
        [1024.3479, 1018.9578],
        [1023.85754, 1018.7854],
        [1018.4, 1018.7854],
        [1024.3479, 1018.9578],
        [1006.3, 1018.7854],
        [1017.54236, 1018.9578],
        [1020.9, 1018.88513],
        [1024.5468, 1018.88513],
        [1011.7, 1018.88513],
        [1012.3, 1018.88513],
        [1014.1, 1018.7854],
    ],
    dtype=np.float32,
)
y = np.array(
    [
        1023.8,
        1024.6,
        1024.4,
        1023.8,
        1022.0,
        1014.4,
        1017.1,
        1018.5,
        1012.4,
        1010.7,
        1020.7,
        1022.4,
        1021.0,
        1019.2,
        1018.8,
        1011.7,
        1018.0,
        1021.1,
        1012.6,
        1010.1,
        1017.6,
        1015.8,
        996.5,
        1009.5,
        1012.8,
        1017.3,
        1016.8,
        1017.4,
        1019.2,
        1015.7,
        1017.8,
        1004.0,
        1012.3,
        1014.2,
        1008.5,
        1016.8,
        1017.5,
        1004.9,
        1011.4,
        1018.7,
        1021.3,
        1016.5,
        1013.1,
        1014.6,
        1007.7,
        1017.5,
        1020.2,
        1025.2,
        1012.2,
        1011.8,
        1008.3,
        1007.4,
        1017.6,
        1013.6,
        1023.9,
        1021.6,
        1024.7,
        1024.8,
        1022.8,
        1020.9,
        1022.6,
        1022.7,
        1013.8,
        1017.4,
        1023.6,
        1023.2,
        1018.7,
        1010.2,
        1014.8,
        1017.4,
        1028.0,
        1028.2,
        1025.7,
        1023.1,
        1018.4,
        1027.2,
        1027.2,
        1027.3,
        1025.5,
        1011.9,
        1003.2,
        1006.3,
        1020.6,
        1024.2,
        1018.0,
        1024.4,
        1024.7,
        1024.4,
        1023.5,
        1017.7,
        1008.4,
        1021.9,
        1024.5,
        1030.4,
        1028.5,
        1026.7,
        1026.6,
        1027.2,
        1032.1,
        1034.3,
        1031.4,
        1022.9,
        1026.7,
        1026.4,
        1025.1,
        1028.2,
        1026.3,
        1027.8,
        1024.6,
        1017.2,
        1030.0,
        1023.0,
        1024.1,
        1019.7,
        1010.8,
        1016.7,
        1006.9,
        1033.5,
        1032.2,
        1025.7,
        1020.0,
        1013.8,
        1014.1,
        1017.3,
        1018.2,
        1022.4,
        1014.8,
        1021.3,
        1014.7,
        1023.3,
        1029.6,
        1027.3,
        1018.5,
        1025.9,
    ]
)


print(lgb.__version__)
print(f"ymin={y.min()}")
print(f"ymax={y.max()}")
for n in [1, 10, 100]:
    model = LGBMRegressor()
    model.set_params(
        **{
            "n_estimators": n,
            "extra_trees": True,
            "min_data_in_bin": 1,
            "extra_seed": 43,
        }
    )

    model.fit(X, y)
    preds = model.predict(X)
    print(f"\nn_estimators={n}")
    print(f"pred_min={preds.min()}")
    print(f"pred_max={preds.max()}")

Environment info

x86 Ubuntu 20.04

LightGBM version or commit hash:
https://github.com/microsoft/LightGBM/releases/tag/v3.3.0

Command(s) you used to install LightGBM

pip install lightgbm-3.3.0-py3-none-manylinux1_x86_64.whl

Discovered by H2O Driverless AI testing

Additional Comments

shiyu1994 (Collaborator) commented:
@arnocandel Thanks for using LightGBM!
After investigation, I found that the problem is not related to extra_trees, but lies in the handling of trees with only one leaf. When using random splits, we can easily get a tree with only one leaf, since the first randomly chosen split for the root node may have a gain smaller than min_gain_to_split.
When we have a single-leaf tree, according to

    if (new_tree->num_leaves() > 1) {
      should_continue = true;
      auto score_ptr = train_score_updater_->score() + offset;
      auto residual_getter = [score_ptr](const label_t* label, int i) { return static_cast<double>(label[i]) - score_ptr[i]; };
      tree_learner_->RenewTreeOutput(new_tree.get(), objective_function_, residual_getter,
                                     num_data_, bag_data_indices_.data(), bag_data_cnt_);
      // shrinkage by learning rate
      new_tree->Shrinkage(shrinkage_rate_);
      // update score
      UpdateScore(new_tree.get(), cur_tree_id);
      if (std::fabs(init_scores[cur_tree_id]) > kEpsilon) {
        new_tree->AddBias(init_scores[cur_tree_id]);
      }
    } else {
      // only add default score one-time
      if (models_.size() < static_cast<size_t>(num_tree_per_iteration_)) {
        double output = 0.0;
        if (!class_need_train_[cur_tree_id]) {
          if (objective_function_ != nullptr) {
            output = objective_function_->BoostFromScore(cur_tree_id);
          }
        } else {
          output = init_scores[cur_tree_id];
        }
        new_tree->AsConstantTree(output);
        // updates scores
        train_score_updater_->AddScore(output, cur_tree_id);
        for (auto& score_updater : valid_score_updater_) {
          score_updater->AddScore(output, cur_tree_id);
        }
      }
    }
    // add model
    models_.push_back(std::move(new_tree));

the average score will be added.
I have had doubts about the code from line 417 to 434 before, since the average score should already have been added in line 375:

init_scores[cur_tree_id] = BoostFromAverage(cur_tree_id, true);

With your example, I can now confirm that the average score is added incorrectly.
I'll open a PR to fix this.
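To see why adding the base score a second time produces predictions far outside the target range, here is a deliberately simplified toy in the style of the repro script. This is not LightGBM's actual code path: it assumes squared error, a constant (single-leaf) learner every round, and a hypothetical "double add" of the base score; it only illustrates the kind of bias such a bookkeeping error causes.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(996.5, 1034.3, size=134)  # narrow target range, as in the report

def boost_constant_trees(y, n_rounds, lr=0.1, double_add_base=False):
    # Toy squared-error boosting in which every "tree" is a single leaf.
    # For squared error, the optimal single-leaf value is the mean residual,
    # shrunk by the learning rate.  Adding the base score twice leaves a
    # bias that later rounds only correct slowly.
    base = y.mean()
    pred = np.full_like(y, base)   # init score, as with boost_from_average
    if double_add_base:
        pred += base               # hypothetical erroneous second add
    for _ in range(n_rounds):
        residual = y - pred
        pred += lr * residual.mean()  # shrunken single-leaf output
    return pred

ok = boost_constant_trees(y, n_rounds=10)
bad = boost_constant_trees(y, n_rounds=10, double_add_base=True)
print(y.min() <= ok.min() and ok.max() <= y.max())  # True: stays in range
print(bad.min() > y.max())                          # True: biased far above
```

With the single add, predictions stay at the target mean; with the double add, ten rounds of shrunken updates still leave predictions hundreds of units above the observed maximum.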

@arnocandel arnocandel changed the title extra_trees regression can have strong bias in predictions, far outside observed range trees with one leaf can have strong bias in predictions, far outside observed range Oct 25, 2021
@jameslamb jameslamb added the bug label Oct 26, 2021
guolinke (Collaborator) commented Mar 2, 2022

@shiyu1994 any updates about the fix?

shiyu1994 (Collaborator) commented:
#5050 has been opened to fix this. The output for the example on the fixed branch:

ymin=996.5
ymax=1034.3

n_estimators=1
pred_min=1019.4970124657474
pred_max=1019.4970124657474

n_estimators=10
pred_min=1018.4457739030327
pred_max=1021.2181511499891

n_estimators=100
pred_min=1015.7546598320027
pred_max=1024.293499000457

And the outputs are now consistent between boost_from_average=True and boost_from_average=False. When a single-leaf tree is encountered, the average score should be added automatically, regardless of boost_from_average.
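The reason a single-leaf regression tree should contribute the target average is that, under the squared-error objective, the loss-minimizing constant prediction is the mean. A quick numerical sanity check on toy data (the values below are made up for illustration, not taken from the issue):

```python
import numpy as np

# Toy targets; for squared error, the best constant prediction is mean(y).
y = np.array([996.5, 1004.0, 1019.5, 1028.2, 1034.3])
candidates = np.linspace(y.min() - 10.0, y.max() + 10.0, 100001)
# Mean squared error of each candidate constant against all targets.
mse = ((y[:, None] - candidates[None, :]) ** 2).mean(axis=0)
best = candidates[mse.argmin()]
print(abs(best - y.mean()) < 1e-2)  # True: the minimizer is the mean
```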

shiyu1994 (Collaborator) commented:
BTW, I found a new problem while dealing with this issue. Currently, in the CLI version of LightGBM, when a single-leaf tree is trained (or all trees in an iteration of multi-class boosting are single-leaf), the training stops. See

LightGBM/src/c_api.cpp, lines 1672 to 1681 at commit 01568cf:

int LGBM_BoosterUpdateOneIter(BoosterHandle handle, int* is_finished) {
  API_BEGIN();
  Booster* ref_booster = reinterpret_cast<Booster*>(handle);
  if (ref_booster->TrainOneIter()) {
    *is_finished = 1;
  } else {
    *is_finished = 0;
  }
  API_END();
}

Here, is_finished means that training is finished because no more splits can be found and the last tree is just a single leaf.
However, in the Python API, the return value of TrainOneIter seems to be interpreted differently; see the docstring:
Returns
-------
is_finished : bool
    Whether the update was successfully finished.
"""

I believe the CLI version interprets the return value of TrainOneIter correctly. It should mean that the boosting process as a whole has finished, not that this iteration finished successfully. That is why, with the example in this issue, the Python API continues to train even after the warning that there are no more leaves that meet the split requirements, which indicates a single-leaf tree.
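Under that reading, a wrapper's training loop should break out as soon as the update call reports completion. A hypothetical sketch of the intended control flow (MockBooster, update_one_iter, and train are made up for illustration; this is not the lightgbm Python API):

```python
class MockBooster:
    """Stand-in for the C API: grows at most max_useful_trees real trees,
    then reports that boosting is finished (only single-leaf trees remain)."""

    def __init__(self, max_useful_trees):
        self.max_useful_trees = max_useful_trees
        self.num_trees = 0

    def update_one_iter(self):
        if self.num_trees >= self.max_useful_trees:
            return True                # is_finished: no more splits possible
        self.num_trees += 1
        return False

def train(booster, n_estimators):
    for _ in range(n_estimators):
        if booster.update_one_iter():  # stop early, like the CLI version
            break
    return booster.num_trees

print(train(MockBooster(max_useful_trees=3), n_estimators=100))  # -> 3, not 100
```

Interpreting the flag as "this iteration succeeded" instead would keep the loop running for all n_estimators iterations, producing the stream of single-leaf warnings shown in the log below.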

ymin=996.5
ymax=1034.3
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.045732 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 67
[LightGBM] [Info] Number of data points in the train set: 134, number of used features: 2
[LightGBM] [Info] Start training from score 1019.497012
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements

n_estimators=1
pred_min=1019.4970124657474
pred_max=1019.4970124657474
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.031245 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 67
[LightGBM] [Info] Number of data points in the train set: 134, number of used features: 2
[LightGBM] [Info] Start training from score 1019.497012
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf

n_estimators=10
pred_min=1018.4457739030327
pred_max=1021.2181511499891

github-actions bot commented:
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023