[R-Package] Extremely long column names cause error "[LightGBM] [Fatal] Check failed: (reserved_string_size) >= (required_string_size) at lightgbm_R.cpp, line 177" #4556

Closed
mikemahoney218 opened this issue Aug 25, 2021 · 4 comments

Comments

@mikemahoney218
Contributor

Description

Fitting a model with the R package fails with the following error when the data has very long column names (possibly only when there are many such columns):
[LightGBM] [Fatal] Check failed: (reserved_string_size) >= (required_string_size) at lightgbm_R.cpp, line 177

Truncating the column names works around the issue, and the model then fits successfully.

Reproducible example

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(lightgbm)
#> Loading required package: R6
#> 
#> Attaching package: 'lightgbm'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice
library(AmesHousing)

set.seed(123)
ames <- make_ames()

# Pivot on every factor column at once.
# With several names_from columns, pivot_wider() joins their values with "_",
# so each generated column name concatenates one level from every factor.
ames <- ames |> 
  mutate(dummy_value = 1) |> 
  pivot_wider(names_from = where(is.factor), 
              values_from = dummy_value,
              values_fill = 0)

# The longest column name here is 429 characters:
vapply(names(ames), \(x) nchar(x), numeric(1)) |> 
  max()
#> [1] 429

xtrain <- as.matrix(ames[setdiff(names(ames), "Sale_Price")])
ytrain <- ames[["Sale_Price"]]

# This causes an error
try(
  lightgbm(
    data = xtrain,
    label = ytrain,
    obj = "regression",
    verbose = -1L
  )
)
#> Error in lgb.call(fun_name = fun_name, ret = buf, ..., buf_len, act_len) : 
#>   [LightGBM] [Fatal] Check failed: (reserved_string_size) >= (required_string_size) at lightgbm_R.cpp, line 177 .

# Truncating column names to 30 characters solves the issue:
ames_trunc <- ames
names(ames_trunc) <- substr(names(ames), 1, 30)
xtrain <- as.matrix(ames_trunc[setdiff(names(ames_trunc), "Sale_Price")])
ytrain <- ames_trunc[["Sale_Price"]]

model <- lightgbm(
    data = xtrain,
    label = ytrain,
    obj = "regression",
    verbose = -1L
  )

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] AmesHousing_0.0.4 lightgbm_3.2.1    R6_2.5.0          tidyr_1.1.3      
#> [5] dplyr_1.0.7      
#> 
#> loaded via a namespace (and not attached):
#>  [1] pillar_1.6.1      compiler_4.1.1    highr_0.9         tools_4.1.1      
#>  [5] digest_0.6.27     jsonlite_1.7.2    evaluate_0.14     lifecycle_1.0.0  
#>  [9] tibble_3.1.3      lattice_0.20-44   pkgconfig_2.0.3   rlang_0.4.11     
#> [13] reprex_2.0.0      Matrix_1.3-4      DBI_1.1.1         yaml_2.2.1       
#> [17] xfun_0.24         withr_2.4.2       styler_1.4.1      stringr_1.4.0    
#> [21] knitr_1.33        generics_0.1.0    fs_1.5.0          vctrs_0.3.8      
#> [25] grid_4.1.1        tidyselect_1.1.1  glue_1.4.2        data.table_1.14.0
#> [29] fansi_0.5.0       rmarkdown_2.9     purrr_0.3.4       magrittr_2.0.1   
#> [33] backports_1.2.1   ellipsis_0.3.2    htmltools_0.5.1.1 assertthat_0.2.1 
#> [37] utf8_1.2.2        stringi_1.6.2     crayon_1.4.1

Created on 2021-08-25 by the reprex package (v2.0.0)
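
A caveat on the truncation workaround in the example above: cutting every name to 30 characters can leave two columns with the same name. A minimal base R sketch that truncates and then de-duplicates is shown here. `shorten_names()` is a hypothetical helper, not part of {lightgbm}, and the 30-character limit is simply the value used in the example, not a documented threshold.

```r
# Hypothetical helper (not part of lightgbm): truncate names, then make them unique.
shorten_names <- function(nms, max_chars = 30L) {
  short <- substr(nms, 1L, max_chars)
  # make.unique() appends "_1", "_2", ... to repeated names
  make.unique(short, sep = "_")
}

# Two long names that collide once truncated, plus one short name
long_names <- c(
  strrep("a", 50L),
  paste0(strrep("a", 40L), "b"),
  "Sale_Price"
)
short_names <- shorten_names(long_names)
print(short_names)
```

Because `make.unique()` appends a suffix, de-duplicated names can end up a few characters longer than `max_chars`.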

Environment info

LightGBM version or commit hash: lightgbm_3.2.1 (R package)

Command(s) you used to install LightGBM

install.packages("lightgbm")

Additional Comments

@jameslamb
Collaborator

Thanks very much for using {lightgbm} and for this excellent issue report! We really appreciate that you took the time to create a reproducible example.

I believe this issue has been fixed on master (in #4256), and just hasn't been released to CRAN yet.

See the following test as an example.

test_that("lgb.Dataset: should be able to use and retrieve long feature names", {
    # set one feature to a value longer than the default buffer size used
    # in LGBM_DatasetGetFeatureNames_R
    feature_names <- names(iris)
    long_name <- paste0(rep("a", 1000L), collapse = "")
    feature_names[1L] <- long_name
    names(iris) <- feature_names
    # check that feature name survived the trip from R to C++ and back
    dtrain <- lgb.Dataset(
        data = as.matrix(iris[, -5L])
        , label = as.numeric(iris$Species) - 1L
    )
    dtrain$construct()
    col_names <- dtrain$get_colnames()
    expect_equal(col_names[1L], long_name)
    expect_equal(nchar(col_names[1L]), 1000L)
})

To confirm, you could try installing the R package from source.

git clone --recursive git@github.com:microsoft/LightGBM.git
cd LightGBM
sh build-cran-package.sh
R CMD INSTALL lightgbm_3.2.1.99.tar.gz
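
After installing, a quick way to confirm that R picks up the source build is to compare the installed version against the affected release. This is a rough sketch: `is_newer_than_affected()` is a hypothetical helper, and it assumes that any version above 3.2.1 (such as the 3.2.1.99 tarball built above) includes the fix.

```r
# Hypothetical helper: TRUE if `version` is strictly newer than the affected release.
# Assumption: every version above 3.2.1 contains the fix.
is_newer_than_affected <- function(version, affected = "3.2.1") {
  package_version(version) > package_version(affected)
}

# Only query the installed package if it is actually available.
if (requireNamespace("lightgbm", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("lightgbm"))
  message(
    "lightgbm ", installed, " newer than 3.2.1: ",
    is_newer_than_affected(installed)
  )
}
```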

A new release with this and many other fixes will be up on CRAN soon. I recommend subscribing to #4310 to be notified when that release goes out.

@mikemahoney218
Contributor Author

Fantastic! Thank you. I did actually search to see if this was a duplicate, just apparently wasn't very good at it 😄

Thanks for the great package!

@jameslamb
Collaborator

No problem, come back any time! 👋

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023