Random Forest is extremely slow #749

Random Forest can be extremely slow for unknown reasons.

Specs:

To reproduce the issue (requires the Bosch dataset), run the code posted in the comments below.

Comments
@Laurae2 commented:

@guolinke There is a massive issue with MinGW on Windows. Check the performance below.

New code:

```r
setwd("E:/datasets")
sparse <- TRUE # keep this true for reproducing my results
rf <- TRUE
if (rf == FALSE) {
  params <- list(num_threads = 40,
                 learning_rate = 0.05,
                 max_depth = -1,
                 num_leaves = 4095,
                 max_bin = 255)
} else {
  params <- list(num_threads = 40,
                 learning_rate = 1,
                 max_depth = -1,
                 num_leaves = 4095,
                 max_bin = 255,
                 boosting_type = "rf",
                 bagging_freq = 1,
                 bagging_fraction = 0.632,
                 feature_fraction = ceiling(sqrt(970)) / 970)
}
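# --- Added notes on the rf parameters above ---
# boosting_type = "rf" averages bagged trees instead of boosting, so
# learning_rate = 1 (no shrinkage is applied in rf mode); rf mode also
# requires bagging to be enabled, hence bagging_freq = 1 and a
# bagging_fraction below 1.
# bagging_fraction = 0.632 approximates a bootstrap sample: about 1 - 1/e
# (~63.2%) of the rows in a same-size bootstrap are unique.
# feature_fraction = ceiling(sqrt(970)) / 970 mimics the classic
# mtry = sqrt(p) column subsampling of a random forest.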
library(data.table)
library(Matrix)
library(lightgbm)
library(R.utils)
data <- fread(file = "bosch_data.csv")
# Do xgboost / LightGBM
# When dense:
# > sum(data == 0, na.rm = TRUE)
# [1] 43574349
# > sum(is.na(data))
# [1] 929125166
# Split
if (sparse == TRUE) {
  library(recommenderlab)
  gc()
  train_1 <- dropNA(as.matrix(data[1:1000000, 1:969]))
  train_2 <- data[1:1000000, 970]$Response
  gc()
  test_1 <- dropNA(as.matrix(data[1000001:1183747, 1:969]))
  test_2 <- data[1000001:1183747, 970]$Response
  gc()
} else {
  gc()
  train_1 <- as.matrix(data[1:1000000, 1:969])
  train_2 <- data[1:1000000, 970]$Response
  gc()
  test_1 <- as.matrix(data[1000001:1183747, 1:969])
  test_2 <- data[1000001:1183747, 970]$Response
  gc()
}
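# --- Added note on the sparse path above ---
# The Bosch data is dominated by missing values (~929M NAs vs ~44M zeros per
# the counts above), so recommenderlab::dropNA stores the matrix as a sparse
# dgCMatrix with the NA cells left unstored, which cuts memory use
# substantially compared to the dense branch.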
# For LightGBM
train <- lgb.Dataset(data = train_1, label = train_2)
test <- lgb.Dataset(data = test_1, label = test_2, reference = train)
# train$construct()
# test$construct()
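# --- Added note ---
# lgb.Dataset() is lazy: binning happens when the dataset is first used.
# Uncommenting the construct() calls above forces construction here, so the
# timing below measures training only, not dataset construction.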
gc()
Laurae::timer_func_print({temp_model <- lgb.train(params = params,
                                                  data = train,
                                                  nrounds = 25,
                                                  valids = list(test = test),
                                                  objective = "binary",
                                                  metric = "auc",
                                                  verbose = 2)})
perf <- as.numeric(rbindlist(temp_model$record_evals$test$auc))
max(perf)
which.max(perf)
```
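Laurae::timer_func_print is from the author's personal Laurae helper package. If it is not installed, a minimal sketch with base R's system.time() gives comparable wall-clock timing (reusing the params, train, and test objects defined above):

```r
# Stand-in for Laurae::timer_func_print using base R timing.
elapsed <- system.time({
  temp_model <- lgb.train(params = params,
                          data = train,
                          nrounds = 25,
                          valids = list(test = test),
                          objective = "binary",
                          metric = "auc",
                          verbose = 2)
})
print(elapsed)  # wall-clock seconds are in elapsed[["elapsed"]]
```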
@guolinke commented:

@Laurae2 I don't know why MinGW is so slow ...
@Laurae2 commented:

@guolinke Now we have a good reproducible example in case someone wants to check the performance discrepancy between the Visual Studio and MinGW builds of LightGBM.
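One way to narrow down a compiler-related slowdown is to check how training time scales with num_threads: if the MinGW build gets slower (or stops scaling) as threads increase while the Visual Studio build scales normally, the gap likely sits in the OpenMP runtime rather than in the tree-building code itself. A minimal diagnostic sketch, reusing the params, train, and test objects above (the thread counts are arbitrary choices):

```r
# Hypothetical diagnostic: time the same rf training at several thread counts.
for (n_threads in c(1L, 2L, 4L, 8L, 16L, 40L)) {
  params$num_threads <- n_threads
  t <- system.time({
    model <- lgb.train(params = params,
                       data = train,
                       nrounds = 5,  # fewer rounds: only relative timings matter
                       valids = list(test = test),
                       objective = "binary",
                       metric = "auc",
                       verbose = -1)
  })
  cat(sprintf("num_threads = %2d: %.1f s elapsed\n", n_threads, t[["elapsed"]]))
}
```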