group_bootstraps() errors when it gets original dataset in a resample #356

Closed
tjmahr opened this issue Aug 8, 2022 · 4 comments · Fixed by #357

Comments

tjmahr commented Aug 8, 2022

Thanks for adding group_bootstraps()! One quick problem from my testing: Currently, group_bootstraps() throws an error when it resamples the original dataset. This happens easily when the number of resamples is larger than the number of group combinations:

library(rsample)
library(tidyverse)

set.seed(2)
dat1 <- tibble::tibble(a = 1:20, b = letters[1:20], c = rep(1:4, 5))

d <- dat1 |> 
  group_bootstraps(c) 
#> Error in `group_bootstraps()`:
#> ! Some assessment sets contained zero rows
#> ℹ Consider using a non-grouped resampling method

Here, drawing the default 25 resamples from only 4 groups hits the problem.
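
(Back-of-the-envelope for how easily: if each resample draws the 4 groups with replacement, which I believe is roughly what group_bootstraps() does, then about 4!/4^4 ≈ 9% of resamples contain every group and so leave the assessment set empty, and that is almost guaranteed to happen at least once in 25 resamples.)

p_all_groups <- factorial(4) / 4^4  # chance a single resample of 4 groups contains every group
1 - (1 - p_all_groups)^25           # chance at least one of 25 resamples does (~0.91)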

But I don't think an error/dead end is the correct design here. There are bootstrap workflows where it is fine to resample the original data. For example, suppose you just want to fit a bunch of models on resampled groups:

  • bootstrap like 2000 datasets
  • fit a model on each one
  • estimate something on each model
  • report the mean and 95% interval of the model estimates

The assessment split data is never used in this workflow. Requiring non-empty assessment splits also means we lose the ability to average over resamples that just happen to be the apparent (original) dataset, so it's kind of a bias thing too.
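
A rough sketch of that workflow, reusing the toy dat1 from above with lm(a ~ 1) standing in for "a model" (and assuming the empty-assessment check is relaxed so the resampling call itself succeeds):

library(rsample)
library(purrr)

boots <- group_bootstraps(dat1, c, times = 2000)

estimates <- map_dbl(boots$splits, function(split) {
  fit <- lm(a ~ 1, data = analysis(split))  # fit on the resampled analysis set only
  coef(fit)[["(Intercept)"]]                # the estimate we want to bootstrap
})

# mean and 95% percentile interval of the bootstrap estimates
mean(estimates)
quantile(estimates, c(0.025, 0.975))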

Two possible suggestions:

  • allow the user to permit empty assessment splits with an argument, something like group_bootstraps(..., allow_empty_assessment = TRUE).
  • lower the condition from an error to a warning.

The second option is more dangerous because it means users will hit preventable errors when they try to use assessment(), so I like the first one better.

Created on 2022-08-08 by the reprex package (v2.0.1)

@mikemahoney218 (Member)

This makes sense to me!

It's worth noting that bootstraps() doesn't error or warn when returning a 0-row assessment set:

rsample::bootstraps(mtcars[1, ])
#> # Bootstrap sampling 
#> # A tibble: 25 × 2
#>    splits        id         
#>    <list>        <chr>      
#>  1 <split [1/0]> Bootstrap01
#>  2 <split [1/0]> Bootstrap02
#>  3 <split [1/0]> Bootstrap03
#>  4 <split [1/0]> Bootstrap04
#>  5 <split [1/0]> Bootstrap05
#>  6 <split [1/0]> Bootstrap06
#>  7 <split [1/0]> Bootstrap07
#>  8 <split [1/0]> Bootstrap08
#>  9 <split [1/0]> Bootstrap09
#> 10 <split [1/0]> Bootstrap10
#> # … with 15 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Created on 2022-08-08 by the reprex package (v2.0.1)

So there's an argument here for just not checking at all. But of course, in practice a resample that contains every observation (and so has an empty assessment set) is going to happen much less frequently when sampling individual observations than when sampling groups, so I do think either a warning or an argument would be useful.

We might also consider warning when there's 0 assessment data in a regular bootstrap rset, because that will basically not work with tune::fit_resamples().
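
(For reference, spotting empty assessment sets in an existing rset is just a matter of counting rows with assessment(); nothing beyond the current API:)

boots <- rsample::bootstraps(mtcars[1, ])
# rows in each out-of-bag assessment set; zeros are what fit_resamples() would choke on
vapply(boots$splits, function(x) nrow(rsample::assessment(x)), integer(1))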

For this function the error is in a pretty high-level spot, so I don't think this would be a hard change either:

rsample/R/boot.R

Lines 233 to 244 in 86f56df

all_assessable <- purrr::map(split_objs, function(x) nrow(assessment(x)))
if (any(all_assessable == 0)) {
  rlang::abort(
    c(
      "Some assessment sets contained zero rows",
      i = "Consider using a non-grouped resampling method"
    ),
    call = rlang::caller_env()
  )
}
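
A sketch of the less intrusive version of that check, just demoting the error to a warning (an illustration only, not a patch):

all_assessable <- purrr::map(split_objs, function(x) nrow(assessment(x)))
if (any(all_assessable == 0)) {
  # same message, but a condition users can choose to ignore
  # (the call argument from abort() is dropped in this sketch)
  rlang::warn(
    c(
      "Some assessment sets contained zero rows",
      i = "Consider using a non-grouped resampling method"
    )
  )
}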

@juliasilge what do you think?

@juliasilge (Member)

My inclination is to move from an error to a warning on group_bootstraps() and add the same warning to bootstraps(). I can see what you are saying @tjmahr about folks possibly blowing by the warning and trying to use something that involves the assessment set, but let's start with the less intrusive change and see if we get feedback about problems or otherwise decide a change to the function API is necessary.

@mikemahoney218 (Member)

Thanks for the issue @tjmahr 😄

@github-actions (bot)

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions bot locked and limited the conversation to collaborators on Aug 23, 2022.