group_bootstraps() errors when it gets original dataset in a resample #356

Closed
tjmahr opened this issue Aug 8, 2022 · 4 comments · Fixed by #357

Comments

tjmahr commented Aug 8, 2022

Thanks for adding group_bootstraps()! One quick problem from my testing: Currently, group_bootstraps() throws an error when it resamples the original dataset. This happens easily when the number of resamples is larger than the number of group combinations:

library(rsample)
library(tidyverse)

set.seed(2)
dat1 <- tibble::tibble(a = 1:20, b = letters[1:20], c = rep(1:4, 5))

d <- dat1 |> 
  group_bootstraps(c) 
#> Error in `group_bootstraps()`:
#> ! Some assessment sets contained zero rows
#> ℹ Consider using a non-grouped resampling method

Here, drawing the default 25 resamples from only 4 groups hits the problem.
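
(Back-of-the-envelope for how easily: if each resample draws the 4 groups with replacement, which I believe is roughly what group_bootstraps() does, then about 4!/4^4 ≈ 9% of resamples contain every group and so leave the assessment set empty, and that is almost guaranteed to happen at least once in 25 resamples.)

p_all_groups <- factorial(4) / 4^4  # chance a single resample of 4 groups contains every group
1 - (1 - p_all_groups)^25           # chance at least one of 25 resamples does (~0.91)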

But I don't think an error/dead end is the correct design here. There are bootstrap workflows where it is fine to resample the original data. For example, suppose you just want to fit a bunch of models on resampled groups:

  • bootstrap like 2000 datasets
  • fit a model on each one
  • estimate something on each model
  • report the mean and 95% interval of the model estimates

The assessment split data is never used in this workflow. Requiring non-empty assessment splits also means we lose the ability to average over resamples that just happen to be the apparent (original) dataset, so it's kind of a bias thing too.
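
A rough sketch of that workflow, reusing the toy dat1 from above with lm(a ~ 1) standing in for "a model" (and assuming the empty-assessment check is relaxed so the resampling call itself succeeds):

library(rsample)
library(purrr)

boots <- group_bootstraps(dat1, c, times = 2000)

estimates <- map_dbl(boots$splits, function(split) {
  fit <- lm(a ~ 1, data = analysis(split))  # fit on the resampled analysis set only
  coef(fit)[["(Intercept)"]]                # the estimate we want to bootstrap
})

# mean and 95% percentile interval of the bootstrap estimates
mean(estimates)
quantile(estimates, c(0.025, 0.975))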

Two possible suggestions:

  • allow the user to permit empty assessment splits with an argument, something like group_bootstraps(..., allow_empty_assessment = TRUE).
  • lower the condition from an error to a warning.

The second option is more dangerous because it means users will hit preventable errors when they try to use assessment(), so I like the first one better.

Created on 2022-08-08 by the reprex package (v2.0.1)

@mikemahoney218 (Member)

This makes sense to me!

It's worth noting that bootstraps() doesn't error or warn when returning a 0-row assessment set:

rsample::bootstraps(mtcars[1, ])
#> # Bootstrap sampling 
#> # A tibble: 25 × 2
#>    splits        id         
#>    <list>        <chr>      
#>  1 <split [1/0]> Bootstrap01
#>  2 <split [1/0]> Bootstrap02
#>  3 <split [1/0]> Bootstrap03
#>  4 <split [1/0]> Bootstrap04
#>  5 <split [1/0]> Bootstrap05
#>  6 <split [1/0]> Bootstrap06
#>  7 <split [1/0]> Bootstrap07
#>  8 <split [1/0]> Bootstrap08
#>  9 <split [1/0]> Bootstrap09
#> 10 <split [1/0]> Bootstrap10
#> # … with 15 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Created on 2022-08-08 by the reprex package (v2.0.1)

So there's an argument here for just not checking at all. But of course, in practice a resample that contains every observation (and so has an empty assessment set) is going to happen much less frequently when sampling individual observations than when sampling groups, so I do think either a warning or an argument would be useful.

We might also consider warning when there's 0 assessment data in a regular bootstrap rset, because that will basically not work with tune::fit_resamples().
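
(For reference, spotting empty assessment sets in an existing rset is just a matter of counting rows with assessment(); nothing beyond the current API:)

boots <- rsample::bootstraps(mtcars[1, ])
# rows in each out-of-bag assessment set; zeros are what fit_resamples() would choke on
vapply(boots$splits, function(x) nrow(rsample::assessment(x)), integer(1))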

For this function the error is in a pretty high-level spot, so I don't think this would be a hard change either:

rsample/R/boot.R

Lines 233 to 244 in 86f56df

all_assessable <- purrr::map(split_objs, function(x) nrow(assessment(x)))
if (any(all_assessable == 0)) {
  rlang::abort(
    c(
      "Some assessment sets contained zero rows",
      i = "Consider using a non-grouped resampling method"
    ),
    call = rlang::caller_env()
  )
}
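
A sketch of the less intrusive version of that check, just demoting the error to a warning (an illustration only, not a patch):

all_assessable <- purrr::map(split_objs, function(x) nrow(assessment(x)))
if (any(all_assessable == 0)) {
  # same message, but a condition users can choose to ignore
  # (the call argument from abort() is dropped in this sketch)
  rlang::warn(
    c(
      "Some assessment sets contained zero rows",
      i = "Consider using a non-grouped resampling method"
    )
  )
}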

@juliasilge what do you think?

@juliasilge (Member)

My inclination is to move from an error to a warning on group_bootstraps() and add the same warning to bootstraps(). I can see what you are saying @tjmahr about folks possibly blowing by the warning and trying to use something that involves the assessment set, but let's start with the less intrusive change and see if we get feedback about problems or otherwise decide a change to the function API is necessary.

@mikemahoney218 (Member)

Thanks for the issue @tjmahr 😄

@github-actions (bot)

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions bot locked and limited the conversation to collaborators on Aug 23, 2022.