
Trees with linear models at leaves #3299

Merged 159 commits on Dec 24, 2020

Conversation

@btrotta (Collaborator) commented Aug 12, 2020

Implements boosting for trees with linear models at the leaves (sometimes called M5 trees). This is a hybrid between traditional tree boosting and the model proposed in the paper Gradient Boosting with Piece-Wise Linear Regression Trees by Shi, Li, and Li (https://arxiv.org/pdf/1802.05640.pdf), which is mentioned in #1315. In this PR, the tree structure is created by finding the best split in the normal way, but then we calculate a linear model on each leaf. In contrast, in the paper, the splits are chosen by calculating the linear models for each potential split point, which is much more computationally intensive and would require more significant code changes in LightGBM. (The paper above actually mentions M5 trees, in Appendix D, but only tests existing slow implementations, which give poor results compared to their code. I think with the better implementation from this PR, M5 trees would come close to the performance of the fully-linear approach.)

The running time of the linear-leaf model is around 10-20% more than traditional tree boosting (depending on the dataset, etc.), but it converges faster, so overall it gives a small improvement in training time (and can also achieve slightly better accuracy). Memory use is higher since we need to store the full feature data.

Regularisation can be controlled with the parameter linear_lambda; this is important because the linear-leaf model is more prone to overfitting than the traditional tree-boosting model. It is also important to scale the data before training so that all features have similar mean and standard deviation.
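[Editor's note] To make the idea concrete, here is a conceptual NumPy sketch of the per-leaf fit described above: after the tree structure is grown as usual, each leaf solves a hessian-weighted ridge regression on its own samples, with linear_lambda as the ridge penalty. This is an illustrative sketch, not the actual C++ implementation; the function and variable names are made up, and details such as whether the intercept is penalised may differ from the real code.

import numpy as np

def fit_leaf_models(X, gradients, hessians, leaf_index, linear_lambda=0.01):
    # For each leaf, solve a hessian-weighted ridge regression on that leaf's samples.
    # leaf_index maps each row of X to the id of the leaf it falls into.
    # Returns {leaf_id: (coefficients, intercept)}.
    models = {}
    for leaf in np.unique(leaf_index):
        rows = leaf_index == leaf
        X_leaf = X[rows]
        # Second-order boosting target for each sample, weighted by its hessian
        # (the same quantities that give the usual constant leaf value -sum(g)/sum(h)).
        target = -gradients[rows] / hessians[rows]
        weight = hessians[rows]
        # Append a constant column so the intercept is fitted jointly with the coefficients.
        A = np.hstack([X_leaf, np.ones((X_leaf.shape[0], 1))])
        penalty = linear_lambda * np.eye(A.shape[1])
        penalty[-1, -1] = 0.0  # leave the intercept unpenalised here (the real code may differ)
        lhs = A.T @ (A * weight[:, None]) + penalty
        rhs = A.T @ (weight * target)
        theta = np.linalg.solve(lhs, rhs)
        models[leaf] = (theta[:-1], theta[-1])
    return models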

The code uses parts of the Eigen library licensed under MPL2.

I have only implemented this for Python. I think getting it working for R would require some changes to the data-loading interface, but I'm not very familiar with R, so maybe someone else would like to take that on.

Here is a test script to measure performance on the SUSY physics data (https://archive.ics.uci.edu/ml/datasets/SUSY).

import pandas as pd
import numpy as np
import lightgbm as lgb
import time
from sklearn.metrics import log_loss, roc_auc_score
import matplotlib.pyplot as plt

# read data
train = pd.read_csv('susy.csv', header=None)
np.random.seed(0)

# normalise
for i in range(len(train.columns) - 1):
    m, s = train.iloc[:, i + 1].agg(['mean', 'std'])
    train.iloc[:, i + 1] = (train.iloc[:, i + 1] - m) / s

train_bool = train.index <= train.index.max() - 500000  # last 500000 is the hold-out test set
valid_bool = train.index <= 500000  # validation set for early stopping

# linear model
t = time.time()
lgb_train = lgb.Dataset(train.loc[train_bool & ~valid_bool, list(range(1, 19))], label=train.loc[train_bool & ~valid_bool, 0])
lgb_test = lgb.Dataset(train.loc[valid_bool, list(range(1, 19))], label=train.loc[valid_bool, 0])
valid_sets = [lgb_train, lgb_test]
valid_names = ['train', 'valid']
# parameters for the linear-tree model
params = {'objective': 'binary', 'seed': 0, 'num_leaves': 16, 'learning_rate': 0.05,
          'metric': 'binary_logloss', 'verbose': 2, 'linear_lambda': 0.01, 'linear_tree': True}
res = {}
t0 = time.time()
time_arr_linear = []
timer_callback = lambda env: time_arr_linear.append(time.time() - t0)
est_linear = lgb.train(params, lgb_train, num_boost_round=500, valid_sets=valid_sets, valid_names=valid_names,
                       categorical_feature=[], evals_result=res, callbacks=[timer_callback])
linear_time = time.time() - t
print("Linear model")
print("training time ", linear_time)

# normal model
params = {'objective': 'binary', 'seed': 0, 'num_leaves': 16, 'learning_rate': 0.05,
          'metric': 'binary_logloss', 'verbose': 2, 'linear_tree': False}
t = time.time()
res2 = {}
lgb_train = lgb.Dataset(train.loc[train_bool & ~valid_bool, list(range(1, 19))], label=train.loc[train_bool & ~valid_bool, 0])
lgb_test = lgb.Dataset(train.loc[valid_bool, list(range(1, 19))], label=train.loc[valid_bool, 0])
valid_sets = [lgb_train, lgb_test]
valid_names = ['train', 'valid']
t0 = time.time()
time_arr_nonlinear = []
timer_callback = lambda env: time_arr_nonlinear.append(time.time() - t0)
est = lgb.train(params, lgb_train, valid_sets=valid_sets, valid_names=valid_names, num_boost_round=500,
                categorical_feature=[], evals_result=res2, callbacks=[timer_callback])
nonlinear_time = time.time() - t

# plot loss vs iterations
plt.figure()
plt.plot(res['train']['binary_logloss'])
plt.plot(res['valid']['binary_logloss'])
plt.plot(res2['train']['binary_logloss'])
plt.plot(res2['valid']['binary_logloss'])
plt.legend(['train_linear', 'test_linear', 'train', 'test'])

# plot loss vs time
plt.figure()
plt.plot(time_arr_linear, res['train']['binary_logloss'])
plt.plot(time_arr_linear, res['valid']['binary_logloss'])
plt.plot(time_arr_nonlinear, res2['train']['binary_logloss'])
plt.plot(time_arr_nonlinear, res2['valid']['binary_logloss'])
plt.legend(['train_linear', 'test_linear', 'train', 'test'])

# evaluate performance on hold-out set
t = time.time()
p_linear = est_linear.predict(train.loc[~train_bool, list(range(1, 19))], num_iteration=est_linear.best_iteration)
linear_pred_time = time.time() - t
print("Linear model")
print("training time ", linear_time)
print("prediction time ", linear_pred_time)
print("log loss ", log_loss(train.loc[~train_bool, 0], p_linear))
print("auc ", roc_auc_score(train.loc[~train_bool, 0], p_linear))

t = time.time()
p = est.predict(train.loc[~train_bool, list(range(1, 19))], num_iteration=est.best_iteration)
nonlinear_pred_time = time.time() - t
print("\n\nConstant model")
print("training time ", nonlinear_time)
print("prediction time ", nonlinear_pred_time)
print("log loss ", log_loss(train.loc[~train_bool, 0], p))
print("auc ", roc_auc_score(train.loc[~train_bool, 0], p))

# label the loss-vs-time plot (the figure created above is still the active one)
plt.title('Accuracy vs time on SUSY dataset (500 iterations)', fontsize=12)
plt.xlabel('Time (seconds)', fontsize=12)
plt.ylabel('Log loss', fontsize=12)

# scatter plot comparing the two models' predictions on the hold-out set
plt.figure()
plt.scatter(p, p_linear)
plt.xlim([0, 0.2])
plt.ylim([0, 0.2])

Output:

Linear model
training time  101.74181818962097
prediction time  3.5460383892059326
log loss  0.4245472152015672
auc  0.8775945941602206

Constant model
training time  89.05728363990784
prediction time  2.811807155609131
log loss  0.4258133037378132
auc  0.8769452618339838

Full training graph: [figure: susy_test]

Training graph zoomed in on the y-axis to show different convergence: [figure: susy_test_zoom]

@ChipKerchner (Contributor) commented Jan 11, 2021

@StrikerRUS @guolinke IMO this pull introduced a bug or unexpected behavior in the class method SizesInByte. It no longer returns data_.size() but AlignedSize(data_.size()). This is BAD if you are using it for allocation or for copying from a desired location in the data. In the past, SizesInByte was NEVER bigger than num_data_, but NOW in some cases it is. There should be some way of getting the original data_.size(). SizesInByte is used in FeatureGroupSizesInByte and should be the same as FeatureGroupData's get_data().size, not the AlignedSize.

@shiyu1994 (Collaborator)

@ChipKerchner It seems that the AlignedSize(data_.size()) in SizesInByte was not introduced by this PR, but by #3415.

@StrikerRUS (Collaborator)

@shiyu1994 @guolinke
Related to the AlignedSize problem: #3450 (comment).

@JoshuaC3 commented Feb 3, 2021

When will this go to pip please? Really interested to give this a try!

@JoshuaC3 commented Feb 3, 2021

> Can this linear model support linear regression?
>
> From the code, it supports regression, except regression with L1 loss.

@shiyu1994 - what was the reason for this? Does it support custom functions?

@jameslamb (Collaborator)

The next formal release is being tracked in #3872; you can subscribe to that issue for updates. We cannot give an exact date at this time, but you can see a list there of the work that still needs to be done.

@StrikerRUS (Collaborator)

@JoshuaC3

> When will this go to pip please? Really interested to give this a try!

For now please feel free to install from nightly wheels.

https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html


@JoshuaC3 commented Feb 3, 2021

@StrikerRUS I had only looked at the git-based pip install on the pip website itself, which broke because of "Windows build issues". I will give the wheels a spin!!

Also, @jameslamb, I recognised your name but wasn't sure where from. I've just realised!! recent-developments-in-lightgbm. Excellent video, so thanks. Hopefully we get another one for 4.0 😉🤞

Thanks to both/all 👏

@spiralulam

Hi @btrotta, thanks for your work! Is it possible to access the coefficients and offsets of the linear models at each leaf? I did not find this information in e.g. the dump_model() method.

@shiyu1994 (Collaborator)

Hi @spiralulam, thanks for using LightGBM. If you dump the tree model with linear_tree enabled, you should see an entry named leaf_coeff in the section for each tree. Those are the coefficients of the linear leaves.

@spiralulam

> Hi @spiralulam, thanks for using LightGBM. If you dump the tree model with linear_tree enabled, you should see an entry named leaf_coeff in the section for each tree. Those are the coefficients of the linear leaves.

Thanks for the answer. Does that mean I basically have to parse the string obtained from save_model() to get this information? The dump_model() method, which returns a dictionary, contains neither leaf_coeff nor leaf_const.

@shiyu1994 (Collaborator) commented Apr 16, 2021

I just checked the method that dumps a tree to JSON. Unfortunately, the information about linear leaves is not handled there, so currently parsing the model text file seems to be the only solution. Sorry for the inconvenience.

This would be a very useful capability when using linear trees, and I believe we should provide direct access to the linear model coefficients through the C++, Python and R APIs.
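[Editor's note] Until that direct access exists, one rough workaround is to parse the text model returned by Booster.model_to_string() (or saved with save_model()). The sketch below is only a starting point: it assumes the text format written with linear_tree enabled contains Tree=, leaf_const= and leaf_coeff= lines (the fields mentioned in this thread), and the helper name is made up, not a supported API.

def linear_leaf_coefficients(booster):
    # Collect the per-leaf linear-model entries for every tree in the text model.
    # Returns a list with one dict per tree,
    # e.g. [{'leaf_const': [...], 'leaf_coeff': [...]}, ...].
    trees = []
    current = None
    for line in booster.model_to_string().splitlines():
        if line.startswith('Tree='):
            # start of a new tree section
            current = {}
            trees.append(current)
        elif current is not None and line.startswith(('leaf_const=', 'leaf_coeff=')):
            key, _, values = line.partition('=')
            current[key] = [float(v) for v in values.split()]
    return trees

For example, linear_leaf_coefficients(est_linear) applied to the booster from the script in the PR description would return one dict per tree.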

@spiralulam

That would be awesome, indeed.

@cc22226 commented Jun 11, 2021

Hi @btrotta, thanks for your work! Does the code allow using one subset of the features for the tree splits and a completely different subset to estimate the linear model at each leaf?

@shiyu1994 (Collaborator)

@cc22226 Thanks for using LightGBM. Currently the linear models at the leaves consider all numerical (i.e., non-categorical) features, and there is no parameter to control which features are used in the linear models and which are used in the splits. I think it would sometimes be nice to have these two sets of features separated; maybe we can leave that as a feature request.

@cc22226 commented Jun 21, 2021 via email

@github-actions (bot)

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023