
Trees with linear models at leaves #3299

Merged 159 commits on Dec 24, 2020

Conversation

@btrotta (Collaborator) commented Aug 12, 2020

Implements boosting for trees with linear models at the leaves (sometimes called M5 trees). This is a hybrid between traditional tree boosting and the model proposed in the paper Gradient Boosting with Piece-Wise Linear Regression Trees by Shi, Li, and Li (https://arxiv.org/pdf/1802.05640.pdf), which is mentioned in #1315. In this PR, the tree structure is created by finding the best split in the normal way, but then we calculate a linear model on each leaf. In contrast, in the paper, the splits are chosen by calculating the linear models for each potential split point, which is much more computationally intensive and would require more significant code changes in LightGBM. (The paper above actually mentions M5 trees, in Appendix D, but only tests existing slow implementations, which give poor results compared to their code. I think with the better implementation from this PR, M5 trees would come close to the performance of the fully-linear approach.)

The running time of the linear-leaf model is around 10-20% more than traditional tree boosting (depending on the dataset, etc.), but it converges faster, so overall it gives a small improvement in training time (and can also achieve slightly better accuracy). Memory use is higher since we need to store the full feature data.

Regularisation can be controlled with the parameter linear_lambda; this is important because the linear-leaf model is more prone to overfitting than the traditional tree-boosting model. It is also important to scale the data before training so that all features have similar mean and standard deviation.
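[Editor's note] To make the idea concrete, here is a conceptual NumPy sketch of the per-leaf fit described above: after the tree structure is grown as usual, each leaf solves a hessian-weighted ridge regression on its own samples, with linear_lambda as the ridge penalty. This is an illustrative sketch, not the actual C++ implementation; the function and variable names are made up, and details such as whether the intercept is penalised may differ from the real code.

import numpy as np

def fit_leaf_models(X, gradients, hessians, leaf_index, linear_lambda=0.01):
    # For each leaf, solve a hessian-weighted ridge regression on that leaf's samples.
    # leaf_index maps each row of X to the id of the leaf it falls into.
    # Returns {leaf_id: (coefficients, intercept)}.
    models = {}
    for leaf in np.unique(leaf_index):
        rows = leaf_index == leaf
        X_leaf = X[rows]
        # Second-order boosting target for each sample, weighted by its hessian
        # (the same quantities that give the usual constant leaf value -sum(g)/sum(h)).
        target = -gradients[rows] / hessians[rows]
        weight = hessians[rows]
        # Append a constant column so the intercept is fitted jointly with the coefficients.
        A = np.hstack([X_leaf, np.ones((X_leaf.shape[0], 1))])
        penalty = linear_lambda * np.eye(A.shape[1])
        penalty[-1, -1] = 0.0  # leave the intercept unpenalised here (the real code may differ)
        lhs = A.T @ (A * weight[:, None]) + penalty
        rhs = A.T @ (weight * target)
        theta = np.linalg.solve(lhs, rhs)
        models[leaf] = (theta[:-1], theta[-1])
    return models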

The code uses parts of the Eigen library licensed under MPL2.

I have only implemented this for Python. I think getting it working for R would require some changes to the data-loading interface, but I'm not very familiar with R, so maybe someone else would like to take that on.

Here is a test script to measure performance on the SUSY physics data (https://archive.ics.uci.edu/ml/datasets/SUSY).

import pandas as pd
import numpy as np
import lightgbm as lgb
import time
from sklearn.metrics import log_loss, roc_auc_score
import matplotlib.pyplot as plt

# read data
train = pd.read_csv('susy.csv', header=None)
np.random.seed(0)

# normalise
for i in range(len(train.columns) - 1):
    m, s = train.iloc[:, i + 1].agg(['mean', 'std'])
    train.iloc[:, i + 1] = (train.iloc[:, i + 1] - m) / s

train_bool = train.index <= train.index.max() - 500000  # last 500000 is the hold-out test set
valid_bool = train.index <= 500000  # validation set for early stopping

# linear model
t = time.time()
lgb_train = lgb.Dataset(train.loc[train_bool & ~valid_bool, list(range(1, 19))], label=train.loc[train_bool & ~valid_bool, 0])
lgb_test = lgb.Dataset(train.loc[valid_bool, list(range(1, 19))], label=train.loc[valid_bool, 0])
valid_sets = [lgb_train, lgb_test]
valid_names = ['train', 'valid']
# parameters for the linear-tree model
params = {'objective': 'binary', 'seed': 0, 'num_leaves': 16, 'learning_rate': 0.05,
          'metric': 'binary_logloss', 'verbose': 2, 'linear_lambda': 0.01, 'linear_tree': True}
res = {}
t0 = time.time()
time_arr_linear = []
timer_callback = lambda env: time_arr_linear.append(time.time() - t0)
est_linear = lgb.train(params, lgb_train, num_boost_round=500, valid_sets=valid_sets, valid_names=valid_names,
                       categorical_feature=[], evals_result=res, callbacks=[timer_callback])
linear_time = time.time() - t
print("Linear model")
print("training time ", linear_time)

# normal model
params = {'objective': 'binary', 'seed': 0, 'num_leaves': 16, 'learning_rate': 0.05,
          'metric': 'binary_logloss', 'verbose': 2, 'linear_tree': False}
t = time.time()
res2 = {}
lgb_train = lgb.Dataset(train.loc[train_bool & ~valid_bool, list(range(1, 19))], label=train.loc[train_bool & ~valid_bool, 0])
lgb_test = lgb.Dataset(train.loc[valid_bool, list(range(1, 19))], label=train.loc[valid_bool, 0])
valid_sets = [lgb_train, lgb_test]
valid_names = ['train', 'valid']
t0 = time.time()
time_arr_nonlinear = []
timer_callback = lambda env: time_arr_nonlinear.append(time.time() - t0)
est = lgb.train(params, lgb_train, valid_sets=valid_sets, valid_names=valid_names, num_boost_round=500,
                categorical_feature=[], evals_result=res2, callbacks=[timer_callback])
nonlinear_time = time.time() - t

# plot loss vs iterations
plt.figure()
plt.plot(res['train']['binary_logloss'])
plt.plot(res['valid']['binary_logloss'])
plt.plot(res2['train']['binary_logloss'])
plt.plot(res2['valid']['binary_logloss'])
plt.legend(['train_linear', 'test_linear', 'train', 'test'])

# plot loss vs time
plt.figure()
plt.plot(time_arr_linear, res['train']['binary_logloss'])
plt.plot(time_arr_linear, res['valid']['binary_logloss'])
plt.plot(time_arr_nonlinear, res2['train']['binary_logloss'])
plt.plot(time_arr_nonlinear, res2['valid']['binary_logloss'])
plt.legend(['train_linear', 'test_linear', 'train', 'test'])

# evaluate performance on hold-out set
t = time.time()
p_linear = est_linear.predict(train.loc[~train_bool, list(range(1, 19))], num_iteration=est_linear.best_iteration)
linear_pred_time = time.time() - t
print("Linear model")
print("training time ", linear_time)
print("prediction time ", linear_pred_time)
print("log loss ", log_loss(train.loc[~train_bool, 0], p_linear))
print("auc ", roc_auc_score(train.loc[~train_bool, 0], p_linear))

t = time.time()
p = est.predict(train.loc[~train_bool, list(range(1, 19))], num_iteration=est.best_iteration)
nonlinear_pred_time = time.time() - t
print("\n\nConstant model")
print("training time ", nonlinear_time)
print("prediction time ", nonlinear_pred_time)
print("log loss ", log_loss(train.loc[~train_bool, 0], p))
print("auc ", roc_auc_score(train.loc[~train_bool, 0], p))

# label the loss-vs-time plot (the figure created above is still the active one)
plt.title('Accuracy vs time on SUSY dataset (500 iterations)', fontsize=12)
plt.xlabel('Time (seconds)', fontsize=12)
plt.ylabel('Log loss', fontsize=12)

# scatter plot comparing the two models' predictions on the hold-out set
plt.figure()
plt.scatter(p, p_linear)
plt.xlim([0, 0.2])
plt.ylim([0, 0.2])

Output:

Linear model
training time  101.74181818962097
prediction time  3.5460383892059326
log loss  0.4245472152015672
auc  0.8775945941602206

Constant model
training time  89.05728363990784
prediction time  2.811807155609131
log loss  0.4258133037378132
auc  0.8769452618339838

Full training graph: [figure: susy_test]

Training graph zoomed in on the y-axis to show different convergence: [figure: susy_test_zoom]

@ChipKerchner (Contributor) commented Jan 11, 2021

@StrikerRUS @guolinke IMO this pull introduced a bug or unexpected behavior in the class method SizesInByte. It no longer returns data_.size() but AlignedSize(data_.size()). This is BAD if you are using it for allocation or for copying from a desired location in the data. In the past, SizesInByte was NEVER bigger than num_data_, but NOW in some cases it is. There should be some way of getting the original data_.size(). SizesInByte is used in FeatureGroupSizesInByte and should be the same as FeatureGroupData's get_data().size, not the AlignedSize.

@shiyu1994 (Collaborator)

@ChipKerchner It seems that the AlignedSize(data_.size()) in SizesInByte was not introduced by this PR, but by #3415.

@StrikerRUS (Collaborator)

@shiyu1994 @guolinke
Related to the AlignedSize problem: #3450 (comment).

@JoshuaC3 commented Feb 3, 2021

When will this go to pip please? Really interested to give this a try!

@JoshuaC3 commented Feb 3, 2021

> Can this linear model support linear regression?
>
> From the code, it supports regression, except regression with L1 loss.

@shiyu1994 - what was the reason for this? Does it support custom functions?

@jameslamb (Collaborator)

The next formal release is being tracked in #3872; you can subscribe to that issue for updates. We cannot give an exact date at this time, but you can see a list there of the work that still needs to be done.

@StrikerRUS (Collaborator)

@JoshuaC3

> When will this go to pip please? Really interested to give this a try!

For now please feel free to install from nightly wheels.

https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html


@JoshuaC3 commented Feb 3, 2021

@StrikerRUS I had only looked at the git-based pip install on the pip website itself, which broke because of "Windows build issues". I will give the wheels a spin!!

Also, @jameslamb, I recognised your name but wasn't sure where from. I've just realised!! recent-developments-in-lightgbm. Excellent video, so thanks. Hopefully we get another one for 4.0 😉🤞

Thanks to both/all 👏

@spiralulam

Hi @btrotta, thanks for your work! Is it possible to access the coefficients and offsets of the linear models at each leaf? I did not find this information in e.g. the dump_model() method.

@shiyu1994 (Collaborator)

Hi @spiralulam, thanks for using LightGBM. If you dump the tree model with linear_tree enabled, you should see an entry named leaf_coeff in the section for each tree. Those are the coefficients of the linear leaves.

@spiralulam

> Hi @spiralulam, thanks for using LightGBM. If you dump the tree model with linear_tree enabled, you should see an entry named leaf_coeff in the section for each tree. Those are the coefficients of the linear leaves.

Thanks for the answer. Does that mean I basically have to parse the string obtained from save_model() to get this information? The dump_model() method, which returns a dictionary, contains neither leaf_coeff nor leaf_const.

@shiyu1994 (Collaborator) commented Apr 16, 2021

I just checked the method that dumps a tree to JSON. Unfortunately, the information about linear leaves is not handled there, so currently parsing the model text file seems to be the only solution. Sorry for the inconvenience.

This would be a very useful capability when using linear trees, and I believe we should provide direct access to the linear model coefficients through the C++, Python and R APIs.
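[Editor's note] Until that direct access exists, one rough workaround is to parse the text model returned by Booster.model_to_string() (or saved with save_model()). The sketch below is only a starting point: it assumes the text format written with linear_tree enabled contains Tree=, leaf_const= and leaf_coeff= lines (the fields mentioned in this thread), and the helper name is made up, not a supported API.

def linear_leaf_coefficients(booster):
    # Collect the per-leaf linear-model entries for every tree in the text model.
    # Returns a list with one dict per tree,
    # e.g. [{'leaf_const': [...], 'leaf_coeff': [...]}, ...].
    trees = []
    current = None
    for line in booster.model_to_string().splitlines():
        if line.startswith('Tree='):
            # start of a new tree section
            current = {}
            trees.append(current)
        elif current is not None and line.startswith(('leaf_const=', 'leaf_coeff=')):
            key, _, values = line.partition('=')
            current[key] = [float(v) for v in values.split()]
    return trees

For example, linear_leaf_coefficients(est_linear) applied to the booster from the script in the PR description would return one dict per tree.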

@spiralulam

That would be awesome, indeed.

@cc22226 commented Jun 11, 2021

Hi @btrotta, thanks for your work! Does the code allow using one subset of the features for the tree splits and a completely different subset to estimate the linear model at each leaf?

@shiyu1994 (Collaborator)

@cc22226 Thanks for using LightGBM. Currently the linear models at the leaves consider all numerical (i.e., non-categorical) features, and there is no parameter to control which features are used in the linear models and which are used in the splits. I think it would sometimes be nice to have these two sets of features separated; maybe we can leave that as a feature request.

@cc22226 commented Jun 21, 2021 via email

@github-actions (bot)

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023