Saved model memory issue #116

Closed
Saarialho opened this issue Apr 21, 2022 · 4 comments · Fixed by #117

Comments

Saarialho commented Apr 21, 2022

The problem

I'm having trouble saving a stacked xgboost model object to a local folder for later use. The object somehow expands in size when it is read back into a new session and no longer fits in RAM. I have tried butchering it, but that does not seem to help much.

I'm not sure if this is due to stacks, butcher, or simply my approach. Here is an example with a much smaller dataset than the one I'm struggling with, but the relative increase in size is of the same magnitude.

Thanks,

Reproducible example

library(tidymodels)
library(butcher)
library(stacks)
library(finetune)
library(readr)

set.seed(123)
cars <- mtcars[rep(1:32, each = 1000),] 
split <- initial_split(cars)
train <- training(split)
test <- testing(split)

valid_metric <- 'rmse' 
metrics <- metric_set(!!sym(valid_metric))

base_spec <- 
  boost_tree( trees = 500,
              min_n = tune(),
              mtry = tune(),
              sample_size = tune(),
              learn_rate = tune(),
              loss_reduction = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("xgboost") 

base_recipe <- 
  recipe(vs ~ ., data = train %>% dplyr::slice(0))

wf <-
  workflow() %>% 
  add_recipe(base_recipe) %>% 
  add_model(base_spec) 


tuned <- tune_race_anova(
  wf,
  resamples = vfold_cv(train, v = 5),
  grid = 15,
  metrics = metrics,
  control = control_race(verbose_elim = TRUE, save_pred = TRUE, save_workflow = TRUE)
)
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> i Racing will minimize the rmse metric.
#> i Resamples are analyzed in a random order.
#> i Fold5:  9 eliminated;  6 candidates remain.
#> i Fold1:  4 eliminated;  2 candidates remain.

stacked <- 
  stacks() %>%
  add_candidates(tuned) %>%
  blend_predictions(metric  = metrics) %>%
  fit_members()


lobstr::obj_size(stacked)
#> 9,237,216 B
lobstr::obj_size(butcher(stacked))
#> 8,452,936 B

readr::write_rds(stacked, file = 'stack.rds')
readr::write_rds(butcher(stacked), file = 'stack_butch.rds')

back_in_r <- readr::read_rds('stack.rds')
back_in_r_butch <- readr::read_rds('stack_butch.rds')

lobstr::obj_size(back_in_r)
#> 462,770,208 B
lobstr::obj_size(back_in_r_butch)
#> 429,055,960 B

as.numeric(lobstr::obj_size(back_in_r_butch))/as.numeric(lobstr::obj_size(butcher(stacked))) #50x more in size
#> [1] 50.75822

Created on 2022-04-21 by the reprex package (v2.0.1)
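
One general mechanism that can produce this kind of pattern, offered as a sketch and not as a confirmed diagnosis of the stack above: lobstr::obj_size() counts a vector that is referenced from several places only once, while R's serialization writes ordinary vectors out in full each time they are reached, so the sharing is lost on reload. The toy list below is hypothetical and only stands in for a model object whose components share data internally.

```r
library(lobstr)

set.seed(1)
big <- rnorm(1e5)                           # one numeric vector of about 800 kB
shared <- list(a = big, b = big, c = big)   # three references to the same vector

# obj_size() counts the shared vector once
size_before <- obj_size(shared)

# In-memory round trip through R's serialization format
# (the same format used by saveRDS()/readRDS())
restored <- unserialize(serialize(shared, NULL))

# After the round trip each element is an independent copy
size_after <- obj_size(restored)

# Ratio of reloaded size to original size; roughly 3 if sharing was lost
as.numeric(size_after) / as.numeric(size_before)
```

If a fitted object holds many references to the same underlying data, a ratio like this can grow with the number of references, which is one way a reloaded object could end up many times larger than the original.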

Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.1.3 (2022-03-10)
#>  os       Windows 10 x64 (build 19042)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Finnish_Finland.1252
#>  ctype    Finnish_Finland.1252
#>  tz       Europe/Helsinki
#>  date     2022-04-21
#> 
#> - Packages -------------------------------------------------------------------
#>  ! package      * version     date (UTC) lib source
#>    backports      1.4.1       2021-12-13 [1] CRAN (R 4.1.2)
#>    boot           1.3-28      2021-05-03 [1] CRAN (R 4.1.3)
#>    broom        * 0.8.0       2022-04-13 [1] CRAN (R 4.1.3)
#>    butcher      * 0.1.5       2021-06-28 [1] CRAN (R 4.1.3)
#>    class          7.3-20      2022-01-16 [1] CRAN (R 4.1.3)
#>    cli            3.2.0       2022-02-14 [1] CRAN (R 4.1.3)
#>    codetools      0.2-18      2020-11-04 [1] CRAN (R 4.1.3)
#>    colorspace     2.0-3       2022-02-21 [1] CRAN (R 4.1.3)
#>    crayon         1.5.1       2022-03-26 [1] CRAN (R 4.1.3)
#>    data.table     1.14.2      2021-09-27 [1] CRAN (R 4.1.3)
#>    dials        * 0.1.1.9000  2022-04-21 [1] Github (tidymodels/dials@b198556)
#>    DiceDesign     1.9         2021-02-13 [1] CRAN (R 4.1.3)
#>    digest         0.6.29      2021-12-01 [1] CRAN (R 4.1.3)
#>    dplyr        * 1.0.8       2022-02-08 [1] CRAN (R 4.1.3)
#>    ellipsis       0.3.2       2021-04-29 [1] CRAN (R 4.1.3)
#>    evaluate       0.15        2022-02-18 [1] CRAN (R 4.1.3)
#>    fansi          1.0.3       2022-03-24 [1] CRAN (R 4.1.3)
#>    fastmap        1.1.0       2021-01-25 [1] CRAN (R 4.1.3)
#>    finetune     * 0.2.0.9000  2022-04-21 [1] Github (tidymodels/finetune@31a780c)
#>    foreach        1.5.2       2022-02-02 [1] CRAN (R 4.1.3)
#>    fs             1.5.2       2021-12-08 [1] CRAN (R 4.1.3)
#>    furrr          0.2.3       2021-06-25 [1] CRAN (R 4.1.3)
#>    future         1.24.0      2022-02-19 [1] CRAN (R 4.1.3)
#>    future.apply   1.8.1       2021-08-10 [1] CRAN (R 4.1.3)
#>    generics       0.1.2       2022-01-31 [1] CRAN (R 4.1.3)
#>    ggplot2      * 3.3.5       2021-06-25 [1] CRAN (R 4.1.3)
#>    glmnet       * 4.1-4       2022-04-15 [1] CRAN (R 4.1.3)
#>    globals        0.14.0      2020-11-22 [1] CRAN (R 4.1.1)
#>    glue           1.6.2       2022-02-24 [1] CRAN (R 4.1.3)
#>    gower          1.0.0       2022-02-03 [1] CRAN (R 4.1.2)
#>    GPfit          1.0-8       2019-02-08 [1] CRAN (R 4.1.3)
#>    gtable         0.3.0       2019-03-25 [1] CRAN (R 4.1.3)
#>    hardhat        0.2.0.9000  2022-04-21 [1] Github (tidymodels/hardhat@961c14e)
#>    highr          0.9         2021-04-16 [1] CRAN (R 4.1.3)
#>    hms            1.1.1       2021-09-26 [1] CRAN (R 4.1.3)
#>    htmltools      0.5.2       2021-08-25 [1] CRAN (R 4.1.3)
#>    infer        * 1.0.0       2021-08-13 [1] CRAN (R 4.1.3)
#>    ipred          0.9-12      2021-09-15 [1] CRAN (R 4.1.3)
#>    iterators      1.0.14      2022-02-05 [1] CRAN (R 4.1.3)
#>    jsonlite       1.8.0       2022-02-22 [1] CRAN (R 4.1.3)
#>    knitr          1.38        2022-03-25 [1] CRAN (R 4.1.3)
#>    lattice        0.20-45     2021-09-22 [1] CRAN (R 4.1.3)
#>    lava           1.6.10      2021-09-02 [1] CRAN (R 4.1.3)
#>    lhs            1.1.5       2022-03-22 [1] CRAN (R 4.1.3)
#>    lifecycle      1.0.1       2021-09-24 [1] CRAN (R 4.1.3)
#>    listenv        0.8.0       2019-12-05 [1] CRAN (R 4.1.3)
#>    lme4           1.1-29      2022-04-07 [1] CRAN (R 4.1.3)
#>    lobstr         1.1.1       2019-07-02 [1] CRAN (R 4.1.3)
#>    lubridate      1.8.0       2021-10-07 [1] CRAN (R 4.1.3)
#>    magrittr       2.0.3       2022-03-30 [1] CRAN (R 4.1.3)
#>  D MASS           7.3-55      2022-01-16 [1] CRAN (R 4.1.3)
#>  D Matrix       * 1.4-0       2021-12-08 [1] CRAN (R 4.1.3)
#>    minqa          1.2.4       2014-10-09 [1] CRAN (R 4.1.3)
#>    modeldata    * 0.1.1       2021-07-14 [1] CRAN (R 4.1.3)
#>    munsell        0.5.0       2018-06-12 [1] CRAN (R 4.1.3)
#>  D nlme           3.1-155     2022-01-16 [1] CRAN (R 4.1.3)
#>    nloptr         2.0.0       2022-01-26 [1] CRAN (R 4.1.3)
#>    nnet           7.3-17      2022-01-16 [1] CRAN (R 4.1.3)
#>    parallelly     1.31.0-9002 2022-04-21 [1] Github (HenrikBengtsson/parallelly@3960c91)
#>    parsnip      * 0.2.1.9001  2022-04-21 [1] Github (tidymodels/parsnip@5716803)
#>    pillar         1.7.0       2022-02-01 [1] CRAN (R 4.1.3)
#>    pkgconfig      2.0.3       2019-09-22 [1] CRAN (R 4.1.3)
#>    prodlim        2019.11.13  2019-11-17 [1] CRAN (R 4.1.3)
#>    purrr        * 0.3.4       2020-04-17 [1] CRAN (R 4.1.3)
#>    R6             2.5.1       2021-08-19 [1] CRAN (R 4.1.3)
#>    Rcpp           1.0.8.3     2022-03-17 [1] CRAN (R 4.1.3)
#>    readr        * 2.1.2       2022-01-30 [1] CRAN (R 4.1.3)
#>    recipes      * 0.2.0       2022-02-18 [1] CRAN (R 4.1.3)
#>    reprex         2.0.1       2021-08-05 [1] CRAN (R 4.1.3)
#>    rlang        * 1.0.2       2022-03-04 [1] CRAN (R 4.1.3)
#>    rmarkdown      2.13        2022-03-10 [1] CRAN (R 4.1.3)
#>    rpart          4.1.16      2022-01-24 [1] CRAN (R 4.1.3)
#>    rsample      * 0.1.1       2021-11-08 [1] CRAN (R 4.1.3)
#>    rstudioapi     0.13        2020-11-12 [1] CRAN (R 4.1.3)
#>    scales       * 1.2.0       2022-04-13 [1] CRAN (R 4.1.3)
#>    sessioninfo    1.2.2       2021-12-06 [1] CRAN (R 4.1.3)
#>    shape          1.4.6       2021-05-19 [1] CRAN (R 4.1.1)
#>    stacks       * 0.2.2.9001  2022-04-21 [1] Github (tidymodels/stacks@69ed8b7)
#>    stringi        1.7.6       2021-11-29 [1] CRAN (R 4.1.2)
#>    stringr        1.4.0       2019-02-10 [1] CRAN (R 4.1.3)
#>  D survival       3.2-13      2021-08-24 [1] CRAN (R 4.1.3)
#>    tibble       * 3.1.6       2021-11-07 [1] CRAN (R 4.1.3)
#>    tidymodels   * 0.2.0       2022-03-19 [1] CRAN (R 4.1.3)
#>    tidyr        * 1.2.0       2022-02-01 [1] CRAN (R 4.1.3)
#>    tidyselect     1.1.2       2022-02-21 [1] CRAN (R 4.1.3)
#>    timeDate       3043.102    2018-02-21 [1] CRAN (R 4.1.2)
#>    tune         * 0.2.0.9000  2022-04-21 [1] Github (tidymodels/tune@f7bd18c)
#>    tzdb           0.3.0       2022-03-28 [1] CRAN (R 4.1.3)
#>    usethis        2.1.5       2021-12-09 [1] CRAN (R 4.1.3)
#>    utf8           1.2.2       2021-07-24 [1] CRAN (R 4.1.3)
#>    vctrs          0.4.1       2022-04-13 [1] CRAN (R 4.1.3)
#>    withr          2.5.0       2022-03-03 [1] CRAN (R 4.1.3)
#>    workflows    * 0.2.6.9000  2022-04-21 [1] Github (tidymodels/workflows@c0e4438)
#>    workflowsets * 0.2.1       2022-03-15 [1] CRAN (R 4.1.3)
#>    xfun           0.30        2022-03-02 [1] CRAN (R 4.1.3)
#>    xgboost      * 1.6.0.1     2022-04-21 [1] local
#>    yaml           2.3.5       2022-02-21 [1] CRAN (R 4.1.2)
#>    yardstick    * 0.0.9.9000  2022-04-21 [1] Github (tidymodels/yardstick@1323a73)
#> 
#> 
#> ------------------------------------------------------------------------------

Saarialho commented Apr 21, 2022

I ran it again using the most recent development version of butcher, and the increase was still 32x. Running butcher::weigh() on the models reveals that the coefs objects increase a lot in size.
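
A minimal sketch of that kind of component-level comparison, assuming a plain lm fit as a hypothetical stand-in for the stacked model (the object names `fit`, `restored`, and `comparison` are illustrative only). It weighs each component before and after a serialization round trip and lines the two results up side by side:

```r
library(butcher)

# Hypothetical stand-in for a fitted model object
fit <- lm(mpg ~ wt + hp, data = mtcars)

# In-memory round trip, equivalent in format to saveRDS()/readRDS()
restored <- unserialize(serialize(fit, NULL))

# weigh() returns one row per component with its size in the chosen units
before <- weigh(fit, units = "KB")
after  <- weigh(restored, units = "KB")

# Join on component name so any growth shows up in adjacent columns
comparison <- merge(before, after, by = "object",
                    suffixes = c("_before", "_after"))
comparison[order(-comparison$size_after), ]
```

For a simple lm the two size columns should be close to each other; applied to the original and reloaded stack, components that balloon would sort to the top.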

Saarialho reopened this Apr 21, 2022
@simonpcouch
Collaborator

Holy smokes, that's a big increase! I was able to reproduce the issue, and it persists across a few different ways of saving. I also see that the coefs object is the main factor to blame, which is just a glmnet model fit via parsnip.

A slightly more minimal example:

library(tidymodels)
library(modeldata)
library(readr)
library(lobstr)
library(stacks)

data("lending_club")

folds <- vfold_cv(lending_club, v = 5)

lr_mod <- 
  logistic_reg(penalty = tune()) %>%
  set_engine("glmnet") %>%
  workflow(
    preprocessor = Class ~ funded_amnt + int_rate,
    spec = .
  ) %>%
  tune_grid(
    resamples = folds,
    control = control_stack_grid()
  )

lr_stack <- stacks() %>%
  add_candidates(lr_mod) %>%
  blend_predictions() %>%
  fit_members()
#> Warning: Predictions from 12 candidates were identical to those from existing
#> candidates and were removed from the data stack.

write_rds(lr_stack, file = "saved_mod.Rds")

saved_lr_stack <- read_rds("saved_mod.Rds")

obj_sizes(lr_stack, saved_lr_stack)
#> *   6,003,216 B
#> * 219,515,088 B

Created on 2022-04-21 by the reprex package (v2.0.1)

with a similar 30x+ increase.

Will come back to this in a bit—thanks for the issue. :)

@simonpcouch
Collaborator

Quick addition: this isn't just a parsnip + glmnet issue! A glmnet model fit via parsnip on its own keeps its size across a save and reload:

library(modeldata)
library(readr)
library(lobstr)
library(parsnip)

data("lending_club")

lr_mod <- logistic_reg(penalty = .5)

mod_fit <-
  lr_mod %>%
  set_engine("glmnet") %>%
  fit(Class ~ funded_amnt + int_rate, data = lending_club)

write_rds(mod_fit, file = "saved_mod.Rds")

saved_mod_fit <- read_rds("saved_mod.Rds")

obj_sizes(mod_fit, saved_mod_fit)
#> * 1,651,112 B
#> * 1,647,624 B

Created on 2022-04-21 by the reprex package (v2.0.1)

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions bot locked and limited conversation to collaborators May 10, 2022