
[R-package] Very large l2 when training model #4305

Closed
jfouyang opened this issue May 20, 2021 · 5 comments

@jfouyang

Description

Hi, I am using LightGBM to determine feature importances for an in-house dataset that is very sparse in nature. When training the model on this sparse dataset, I noticed that the training l2 error is very large, on the order of 10^73, and the feature importance results do not agree with my domain knowledge.

I also ran the same dataset through xgboost: its training RMSE is much smaller, in the range of 0.4-0.6, and its feature importance results make much more sense to me. Finally, I compared the Gain computed by LightGBM and xgboost (see the scatter plot below) and they do not agree well with each other. I wonder if LightGBM does any manipulation/preprocessing of the dataset that results in this spuriously large training l2 error?

As an additional note, I previously ran the same feature importance code on an older version of LightGBM (v2.3.4) and got results similar to xgboost's. I only started seeing this strange behaviour after upgrading to LightGBM version 3+.

Reproducible example

The in-house dataset testData.rds can be downloaded from here

And here is the R code:

library(Matrix)
library(ggplot2)
library(xgboost)
library(lightgbm)

# LGB portion
testData = readRDS("testData.rds")
lgbParams = list(boosting_type = "gbdt", objective = "regression",
                 learning_rate = 0.01)
inp = lgb.Dataset(data = testData[, -1], 
                  label = testData[, 1]) 
set.seed(42)
model = lightgbm(data = inp, params = lgbParams, nrounds = 1000, 
                 eval_freq = 100, verbose = 1, num_threads = 20)
oup1 = lgb.importance(model)

# XGB portion
xgbParams = list(booster = "gbtree", objective = "reg:squarederror",
                 eta = 0.01, tree_method = "hist")
inp = xgb.DMatrix(data = testData[, -1], 
                  label = testData[, 1]) 
set.seed(42)
model = xgboost(data = inp, params = xgbParams, print_every_n = 100,
                nrounds = 1000, verbose = 1, nthread = 20)
oup2 = xgb.importance(model = model)

# Compare LGB and XGB Gain and plot 
oupCompare = oup1[oup2, on = "Feature"]
ggplot(oupCompare, aes(Gain, i.Gain)) +
  geom_point() + xlab("LGB Gain") + ylab("XGB Gain") + 
  theme_classic(base_size = 24) + scale_x_log10() + scale_y_log10()
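
For readers unfamiliar with data.table joins: `oup1[oup2, on = "Feature"]` keeps one row per feature in `oup2` (xgboost's importance table) and looks up the matching row in `oup1` (LightGBM's), so in the plot `Gain` comes from LightGBM and `i.Gain` from xgboost. A rough sketch of those join semantics in plain Python (toy feature names, not from the actual dataset):

```python
# Toy stand-ins for lgb.importance / xgb.importance output: Feature -> Gain.
lgb_gain = {"featA": 0.50, "featB": 0.30}   # plays the role of oup1
xgb_gain = {"featA": 0.40, "featC": 0.20}   # plays the role of oup2

# data.table's X[Y, on = "Feature"]: one row per feature in Y, with the
# matching Gain looked up from X; features absent from X come back as NA
# (None here). Columns correspond to Feature, Gain (oup1), i.Gain (oup2).
joined = [(feat, lgb_gain.get(feat), gain) for feat, gain in xgb_gain.items()]
print(joined)  # [('featA', 0.5, 0.4), ('featC', None, 0.2)]
```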

Output from lightGBM:

[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.081095 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 104040
[LightGBM] [Info] Number of data points in the train set: 43791, number of used features: 408
[LightGBM] [Info] Start training from score 114245549891896526252176302135050240.000000
[1] "[1]:  train's l2:1.36747e+73"
[1] "[101]:  train's l2:1.2623e+73"
[1] "[201]:  train's l2:1.18071e+73"
[1] "[301]:  train's l2:1.10743e+73"
[1] "[401]:  train's l2:1.04192e+73"
[1] "[501]:  train's l2:9.82944e+72"
[1] "[601]:  train's l2:9.29686e+72"
[1] "[701]:  train's l2:8.79174e+72"
[1] "[801]:  train's l2:8.3274e+72"
[1] "[901]:  train's l2:7.92462e+72"
[1] "[1000]:  train's l2:7.55389e+72"

Output from xgboost:

[1]	train-rmse:0.663716 
[101]	train-rmse:0.504467 
[201]	train-rmse:0.462098 
[301]	train-rmse:0.442670 
[401]	train-rmse:0.429328 
[501]	train-rmse:0.419056 
[601]	train-rmse:0.410899 
[701]	train-rmse:0.403912 
[801]	train-rmse:0.397756 
[901]	train-rmse:0.391985 
[1000]	train-rmse:0.386239 
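
When comparing the two logs, note that LightGBM's `l2` metric is mean squared error while xgboost prints RMSE, so the square root of the l2 is the comparable figure. Even after that conversion the gap is astronomical (a quick plain-Python check, values copied from the logs above):

```python
import math

# LightGBM's "l2" is mean squared error; xgboost's "train-rmse" is the
# square root of its MSE, so compare sqrt(l2) against train-rmse.
lgb_l2_final = 7.55389e72     # train's l2 at iteration [1000]
xgb_rmse_final = 0.386239     # train-rmse at iteration 1000

lgb_rmse = math.sqrt(lgb_l2_final)
print(f"{lgb_rmse:.3e}")      # ~2.748e+36, vs xgboost's 0.386
```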

Comparison of Gain feature importance from xgboost vs lightGBM:

Environment info

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS/LAPACK: /home/john/miniconda3/lib/libopenblasp-r0.3.15.so

locale:
 [1] LC_CTYPE=en_SG.UTF-8       LC_NUMERIC=C               LC_TIME=en_SG.UTF-8        LC_COLLATE=en_SG.UTF-8    
 [5] LC_MONETARY=en_SG.UTF-8    LC_MESSAGES=en_SG.UTF-8    LC_PAPER=en_SG.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_SG.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.3.3   lightgbm_3.2.1  R6_2.5.0        xgboost_1.4.1.1 Matrix_1.3-2   

loaded via a namespace (and not attached):
 [1] magrittr_2.0.1    tidyselect_1.1.0  munsell_0.5.0     colorspace_2.0-0  lattice_0.20-41   rlang_0.4.10     
 [7] fansi_0.4.2       dplyr_1.0.5       tools_4.0.3       grid_4.0.3        data.table_1.14.0 gtable_0.3.0     
[13] utf8_1.2.1        DBI_1.1.1         withr_2.4.1       ellipsis_0.3.1    digest_0.6.27     assertthat_0.2.1 
[19] tibble_3.1.0      lifecycle_1.0.0   crayon_1.4.1      farver_2.1.0      purrr_0.3.4       vctrs_0.3.6      
[25] glue_1.4.2        compiler_4.0.3    pillar_1.5.1      generics_0.1.0    scales_1.1.1      jsonlite_1.7.2   
[31] pkgconfig_2.0.3  
@jameslamb
Collaborator

Thanks very much for using {lightgbm} and for the detailed write-up with a reproducible example! If no other maintainers get to it sooner, I will take a look in the next day or two.

@jameslamb
Collaborator

Ok, I took a look.

I was able to reproduce this behavior on my system ({lightgbm} 3.2.1 installed from CRAN, R 4.0.5, macOS).

[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.212046 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 104040
[LightGBM] [Info] Number of data points in the train set: 43791, number of used features: 408
[LightGBM] [Info] Start training from score 5786324437318345225632202477835124736.000000
[1] "[1]:  train's l2:7.61155e+74"
[1] "[101]:  train's l2:7.44365e+74"
[1] "[201]:  train's l2:7.2954e+74"
[1] "[301]:  train's l2:7.15143e+74"
[1] "[401]:  train's l2:7.0124e+74"
[1] "[501]:  train's l2:6.87951e+74"
[1] "[601]:  train's l2:6.75179e+74"
[1] "[701]:  train's l2:6.62947e+74"
[1] "[801]:  train's l2:6.51303e+74"
[1] "[901]:  train's l2:6.39489e+74"
[1] "[1000]:  train's l2:6.28377e+74"

I then tried building {lightgbm} from latest master, and found that the problem seems to have been fixed.

[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.378029 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 104040
[LightGBM] [Info] Number of data points in the train set: 43791, number of used features: 408
[LightGBM] [Info] Start training from score 0.408969
[1] "[1]:  train's l2:0.432521"
[1] "[101]:  train's l2:0.262002"
[1] "[201]:  train's l2:0.226265"
[1] "[301]:  train's l2:0.211681"
[1] "[401]:  train's l2:0.202313"
[1] "[501]:  train's l2:0.194707"
[1] "[601]:  train's l2:0.188175"
[1] "[701]:  train's l2:0.18244"
[1] "[801]:  train's l2:0.177234"
[1] "[901]:  train's l2:0.172524"
[1] "[1000]:  train's l2:0.168078"

So I'm not sure what the root cause is, but I suspect that one of the stability fixes we've recently made to the R package fixed this. Maybe one or all of these:


I'm very sorry for the inconvenience, but could you try building {lightgbm} from source on latest master and see if that solves the problem for you as well?

git clone --recursive git@github.com:microsoft/LightGBM.git
cd LightGBM
sh build-cran-package.sh
R CMD INSTALL lightgbm_3.2.1.99.tar.gz

I'll start a separate conversation with other maintainers about doing a new release to CRAN soon.

@jfouyang
Author

Hi @jameslamb, I followed your instructions to install the latest version of LightGBM and I am getting exactly the same l2 training error as you posted. Thanks so much for the help, and looking forward to LightGBM v3.3.0 on CRAN soon!

@jameslamb
Collaborator

Ok great! Very sorry for the inconvenience.

Thanks again for the excellent bug report with a detailed reproducible example. It made it easy for me to test fixes.

You can subscribe to #4310 to be notified when the next release is out.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023