Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-package] enable use of trees with linear models at leaves (fixes #3319) #3699

Merged
merged 17 commits into from
Jan 18, 2021
Merged
Show file tree
Hide file tree
Changes from 15 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions R-package/configure
Original file line number Diff line number Diff line change
Expand Up @@ -1699,6 +1699,12 @@ CXX=`"${R_HOME}/bin/R" CMD config CXX11`
# LightGBM-specific flags
LGB_CPPFLAGS=""

#########
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/gha run r-valgrind

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this build (https://github.com/microsoft/LightGBM/runs/1629232486?check_suite_focus=true) failed with an error I didn't see testing the CRAN package locally

In file included from ./include/Eigen/Dense:1,
from treelearner/linear_tree_learner.cpp:7:
./include/Eigen/Core:15:10: fatal error: src/Core/util/DisableStupidWarnings.h: No such file or directory
15 | #include "src/Core/util/DisableStupidWarnings.h"

Will look into it tomorrow. Probably just a mistake I made with the include paths.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/gha run r-valgrind

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/gha run r-valgrind

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the problem is that the fatal error causes lightgbm to exit without freeing some allocated memory? Similar to this: https://stackoverflow.com/questions/1901322/valgrind-reports-memory-leak-when-assigning-a-value-to-a-string

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea! I know we ran into similar issues in the past. I'm going to try pushing a commit right now that skips those tests, to test that theory.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/gha run r-valgrind

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work @btrotta ! After skipping the one test introduced in this PR that checks for errors, valgrind is happy 😀

==2177== 
==2177== HEAP SUMMARY:
==2177==     in use at exit: 312,459,948 bytes in 55,720 blocks
==2177==   total heap usage: 2,837,681 allocs, 2,781,961 frees, 6,180,339,569 bytes allocated
==2177== 
==2177== 336 bytes in 1 blocks are possibly lost in loss record 152 of 2,721
==2177==    at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==2177==    by 0x40149CA: allocate_dtv (dl-tls.c:286)
==2177==    by 0x40149CA: _dl_allocate_tls (dl-tls.c:532)
==2177==    by 0x5706322: allocate_stack (allocatestack.c:622)
==2177==    by 0x5706322: pthread_create@@GLIBC_2.2.5 (pthread_create.c:660)
==2177==    by 0x56D4DDA: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==2177==    by 0x56CC8E0: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==2177==    by 0x15576383: LGBM_DatasetCreateFromCSC (c_api.cpp:1305)
==2177==    by 0x155A5512: LGBM_DatasetCreateFromCSC_R (lightgbm_R.cpp:91)
==2177==    by 0x4942146: R_doDotCall (dotcode.c:634)
==2177==    by 0x494CFDD: do_dotcall (dotcode.c:1281)
==2177==    by 0x49A06DF: bcEval (eval.c:7072)
==2177==    by 0x498C2C0: Rf_eval (eval.c:727)
==2177==    by 0x498F02F: R_execClosure (eval.c:1897)
==2177== 
==2177== LEAK SUMMARY:
==2177==    definitely lost: 0 bytes in 0 blocks
==2177==    indirectly lost: 0 bytes in 0 blocks
==2177==      possibly lost: 336 bytes in 1 blocks
==2177==    still reachable: 312,459,612 bytes in 55,719 blocks
==2177==                       of which reachable via heuristic:
==2177==                         newarray           : 4,264 bytes in 1 blocks
==2177==         suppressed: 0 bytes in 0 blocks
==2177== Reachable blocks (those to which a pointer was found) are not shown.
==2177== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==2177== 
==2177== For lists of detected and suppressed errors, rerun with: -s
==2177== ERROR SUMMARY: 3 errors from 2 contexts (suppressed: 0 from 0)
writing valgrind output to valgrind-logs.log
valgrind found 0 bytes definitely lost
valgrind found 0 bytes indirectly lost
valgrind found 336 bytes possibly lost

I think we should leave this skip in to get this feature merged, and I can work on removing it separately as a follow-up task.

# Eigen #
#########

LGB_CPPFLAGS="${LGB_CPPFLAGS} -DEIGEN_MPL2_ONLY"

###############
# MM_PREFETCH #
###############
Expand Down
6 changes: 6 additions & 0 deletions R-package/configure.ac
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,12 @@ CXX=`"${R_HOME}/bin/R" CMD config CXX11`
# LightGBM-specific flags
LGB_CPPFLAGS=""

#########
# Eigen #
#########

LGB_CPPFLAGS="${LGB_CPPFLAGS} -DEIGEN_MPL2_ONLY"

###############
# MM_PREFETCH #
###############
Expand Down
6 changes: 6 additions & 0 deletions R-package/configure.win
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,12 @@ CC=`"${R_EXE}" CMD config CC`
# LightGBM-specific flags
LGB_CPPFLAGS=""

#########
# Eigen #
#########

LGB_CPPFLAGS="${LGB_CPPFLAGS} -DEIGEN_MPL2_ONLY"

###############
# MM_PREFETCH #
###############
Expand Down
279 changes: 279 additions & 0 deletions R-package/tests/testthat/test_basic.R
Original file line number Diff line number Diff line change
Expand Up @@ -345,6 +345,45 @@ test_that("lightgbm.cv() gives the correct best_score and best_iter for a metric
expect_identical(cv_bst$best_score, auc_scores[which.max(auc_scores)])
})

test_that("lgb.cv() fit on linearly-relatead data improves when using linear learners", {
set.seed(708L)
.new_dataset <- function() {
X <- matrix(rnorm(1000L), ncol = 1L)
return(lgb.Dataset(
data = X
, label = 2L * X + runif(nrow(X), 0L, 0.1)
))
}

params <- list(
objective = "regression"
, verbose = -1L
, metric = "mse"
, seed = 0L
, num_leaves = 2L
)

dtrain <- .new_dataset()
cv_bst <- lgb.cv(
data = dtrain
, nrounds = 10L
, params = params
, nfold = 5L
)
expect_is(cv_bst, "lgb.CVBooster")

dtrain <- .new_dataset()
cv_bst_linear <- lgb.cv(
data = dtrain
, nrounds = 10L
, params = modifyList(params, list(linear_tree = TRUE))
, nfold = 5L
)
expect_is(cv_bst_linear, "lgb.CVBooster")

expect_true(cv_bst_linear$best_score < cv_bst$best_score)
})

context("lgb.train()")

test_that("lgb.train() works as expected with multiple eval metrics", {
Expand Down Expand Up @@ -1631,6 +1670,246 @@ test_that("early stopping works with lgb.cv()", {
)
})

context("linear learner")

test_that("lgb.train() fit on linearly-relatead data improves when using linear learners", {
set.seed(708L)
.new_dataset <- function() {
X <- matrix(rnorm(100L), ncol = 1L)
return(lgb.Dataset(
data = X
, label = 2L * X + runif(nrow(X), 0L, 0.1)
))
}

params <- list(
objective = "regression"
, verbose = -1L
, metric = "mse"
, seed = 0L
, num_leaves = 2L
)

dtrain <- .new_dataset()
bst <- lgb.train(
data = dtrain
, nrounds = 10L
, params = params
, valids = list("train" = dtrain)
)
expect_true(lgb.is.Booster(bst))

dtrain <- .new_dataset()
bst_linear <- lgb.train(
data = dtrain
, nrounds = 10L
, params = modifyList(params, list(linear_tree = TRUE))
, valids = list("train" = dtrain)
)
expect_true(lgb.is.Booster(bst_linear))

bst_last_mse <- bst$record_evals[["train"]][["l2"]][["eval"]][[10L]]
bst_lin_last_mse <- bst_linear$record_evals[["train"]][["l2"]][["eval"]][[10L]]
expect_true(bst_lin_last_mse < bst_last_mse)
})

test_that("lgb.train() w/ linear learner fails already-constructed dataset with linear=false", {
testthat::skip("Skipping this test because it causes issues for valgrind")
set.seed(708L)
params <- list(
objective = "regression"
, verbose = -1L
, metric = "mse"
, seed = 0L
, num_leaves = 2L
)

dtrain <- lgb.Dataset(
data = matrix(rnorm(100L), ncol = 1L)
, label = rnorm(100L)
)
dtrain$construct()
expect_error({
bst_linear <- lgb.train(
data = dtrain
, nrounds = 10L
, params = modifyList(params, list(linear_tree = TRUE))
, valids = list("train" = dtrain)
)
}, regexp = "Cannot change linear_tree after constructed Dataset handle")
})

test_that("lgb.train() works with linear learners even if Dataset has missing values", {
set.seed(708L)
.new_dataset <- function() {
values <- rnorm(100L)
values[sample(seq_len(length(values)), size = 10L)] <- NA_real_
X <- matrix(
data = sample(values, size = 100L)
, ncol = 1L
)
return(lgb.Dataset(
data = X
, label = 2L * X + runif(nrow(X), 0L, 0.1)
))
}

params <- list(
objective = "regression"
, verbose = -1L
, metric = "mse"
, seed = 0L
, num_leaves = 2L
)

dtrain <- .new_dataset()
bst <- lgb.train(
data = dtrain
, nrounds = 10L
, params = params
, valids = list("train" = dtrain)
)
expect_true(lgb.is.Booster(bst))

dtrain <- .new_dataset()
bst_linear <- lgb.train(
data = dtrain
, nrounds = 10L
, params = modifyList(params, list(linear_tree = TRUE))
, valids = list("train" = dtrain)
)
expect_true(lgb.is.Booster(bst_linear))

bst_last_mse <- bst$record_evals[["train"]][["l2"]][["eval"]][[10L]]
bst_lin_last_mse <- bst_linear$record_evals[["train"]][["l2"]][["eval"]][[10L]]
expect_true(bst_lin_last_mse < bst_last_mse)
})

test_that("lgb.train() works with linear learners, bagging, and a Dataset that has missing values", {
set.seed(708L)
.new_dataset <- function() {
values <- rnorm(100L)
values[sample(seq_len(length(values)), size = 10L)] <- NA_real_
X <- matrix(
data = sample(values, size = 100L)
, ncol = 1L
)
return(lgb.Dataset(
data = X
, label = 2L * X + runif(nrow(X), 0L, 0.1)
))
}

params <- list(
objective = "regression"
, verbose = -1L
, metric = "mse"
, seed = 0L
, num_leaves = 2L
, bagging_freq = 1L
, subsample = 0.8
)

dtrain <- .new_dataset()
bst <- lgb.train(
data = dtrain
, nrounds = 10L
, params = params
, valids = list("train" = dtrain)
)
expect_true(lgb.is.Booster(bst))

dtrain <- .new_dataset()
bst_linear <- lgb.train(
data = dtrain
, nrounds = 10L
, params = modifyList(params, list(linear_tree = TRUE))
, valids = list("train" = dtrain)
)
expect_true(lgb.is.Booster(bst_linear))

bst_last_mse <- bst$record_evals[["train"]][["l2"]][["eval"]][[10L]]
bst_lin_last_mse <- bst_linear$record_evals[["train"]][["l2"]][["eval"]][[10L]]
expect_true(bst_lin_last_mse < bst_last_mse)
})

test_that("lgb.train() works with linear learners and data where a feature has only 1 non-NA value", {
set.seed(708L)
.new_dataset <- function() {
values <- rep(NA_real_, 100L)
values[18L] <- rnorm(1L)
X <- matrix(
data = values
, ncol = 1L
)
return(lgb.Dataset(
data = X
, label = 2L * X + runif(nrow(X), 0L, 0.1)
))
}

params <- list(
objective = "regression"
, verbose = -1L
, metric = "mse"
, seed = 0L
, num_leaves = 2L
)

dtrain <- .new_dataset()
bst_linear <- lgb.train(
data = dtrain
, nrounds = 10L
, params = modifyList(params, list(linear_tree = TRUE))
, valids = list("train" = dtrain)
)
expect_true(lgb.is.Booster(bst_linear))
})

test_that("lgb.train() works with linear learners when Dataset has categorical features", {
set.seed(708L)
.new_dataset <- function() {
X <- matrix(numeric(200L), nrow = 100L, ncol = 2L)
X[, 1L] <- rnorm(100L)
X[, 2L] <- sample(seq_len(4L), size = 100L, replace = TRUE)
return(lgb.Dataset(
data = X
, label = 2L * X[, 1L] + runif(nrow(X), 0L, 0.1)
))
}

params <- list(
objective = "regression"
, verbose = -1L
, metric = "mse"
, seed = 0L
, num_leaves = 2L
, categorical_featurs = 1L
)

dtrain <- .new_dataset()
bst <- lgb.train(
data = dtrain
, nrounds = 10L
, params = params
, valids = list("train" = dtrain)
)
expect_true(lgb.is.Booster(bst))

dtrain <- .new_dataset()
bst_linear <- lgb.train(
data = dtrain
, nrounds = 10L
, params = modifyList(params, list(linear_tree = TRUE))
, valids = list("train" = dtrain)
)
expect_true(lgb.is.Booster(bst_linear))

bst_last_mse <- bst$record_evals[["train"]][["l2"]][["eval"]][[10L]]
bst_lin_last_mse <- bst_linear$record_evals[["train"]][["l2"]][["eval"]][[10L]]
expect_true(bst_lin_last_mse < bst_last_mse)
})

context("interaction constraints")

test_that("lgb.train() throws an informative error if interaction_constraints is not a list", {
Expand Down
24 changes: 24 additions & 0 deletions build-cran-package.sh
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,27 @@ cp \
external_libs/fmt/include/fmt/*.h \
${TEMP_R_DIR}/src/include/LightGBM/fmt/

# including only specific files from Eigen, to keep the R package
# small and avoid redistributing code with licenses incompatible with
# LightGBM's license
EIGEN_R_DIR=${TEMP_R_DIR}/src/include/Eigen
mkdir -p ${EIGEN_R_DIR}

modules="Cholesky Core Dense Eigenvalues Geometry Householder Jacobi LU QR SVD"
for eigen_module in ${modules}; do
cp eigen/Eigen/${eigen_module} ${EIGEN_R_DIR}/${eigen_module}
if [ ${eigen_module} != "Dense" ]; then
mkdir -p ${EIGEN_R_DIR}/src/${eigen_module}/
cp -R eigen/Eigen/src/${eigen_module}/* ${EIGEN_R_DIR}/src/${eigen_module}/
fi
done

mkdir -p ${EIGEN_R_DIR}/src/misc
cp -R eigen/Eigen/src/misc/* ${EIGEN_R_DIR}/src/misc/

mkdir -p ${EIGEN_R_DIR}/src/plugins
cp -R eigen/Eigen/src/plugins/* ${EIGEN_R_DIR}/src/plugins/

cd ${TEMP_R_DIR}

# Remove files not needed for CRAN
Expand Down Expand Up @@ -69,6 +90,9 @@ cd ${TEMP_R_DIR}
for file in $(find . -name '*.h' -o -name '*.hpp' -o -name '*.cpp'); do
sed \
-i.bak \
-e 's/^.*#pragma clang diagnostic.*$//' \
-e 's/^.*#pragma diag_suppress.*$//' \
-e 's/^.*#pragma GCC diagnostic.*$//' \
-e 's/^.*#pragma region.*$//' \
-e 's/^.*#pragma endregion.*$//' \
-e 's/^.*#pragma warning.*$//' \
Expand Down
Loading