Why is stratification minimum based on percent rather than absolute number of observations? #162

Closed
cportner opened this issue Jun 12, 2020 · 5 comments


@cportner

I have a data set with ~1,500,000 individual-level observations from 236 surveys. Since each survey is collected independently of the others, I would like to do the resampling for the bootstrap at the survey level. The problem is that although each survey contributes a substantial absolute number of observations, it makes up only approximately 0.5% of the total sample. Hence, if I try to stratify by survey using rsample and bootstraps(), I get "Too little data to stratify. Unstratified resampling will be used."

I am unsure whether the 10% minimum for resampling is for practical or statistical reasons. If it is for practical reasons, is there a way around this minimum (short of writing my own rsample from the ground up!)? If it is for statistical reasons, do you have a reference that explains it? I have not been able to find anything on this.

@topepo
Member

topepo commented Jun 12, 2020

See the discussion in #110

You might look into using group_vfold_cv() instead.

@cportner
Author

Thank you, but I thought that vfold_cv and group_vfold_cv return only part of the sample, while I need the actual resampling with replacement that comes with bootstraps.

@ecsalomon

I am having a similar issue with initial_split() when trying to stratify a single train/test split of ~35,000 observations. The smallest group in the stratification variable has about 550 observations. While this is a small group, it seems reasonable to be able to perform an 80/20 split on it.

@topepo
Member

topepo commented Sep 15, 2020

The problem is a trade-off between false- and true-positives (where the event is "too little data"). No matter where/how we draw the line, there are going to be cases where someone's legitimate data analysis needs are not met.

Luckily, you can make whatever type of rsample object that you want using make_splits() and manual_rset().

@cportner There's no reprex here but this probably does what you want (but if not, you'll get the gist of it):

library(tidymodels)
#> ── Attaching packages ───────────────────────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom     0.7.0          ✓ recipes   0.1.13    
#> ✓ dials     0.0.8.9001     ✓ rsample   0.0.7.9000
#> ✓ dplyr     1.0.2          ✓ tibble    3.0.3     
#> ✓ ggplot2   3.3.2          ✓ tidyr     1.1.2     
#> ✓ infer     0.5.2          ✓ tune      0.1.1.9000
#> ✓ modeldata 0.0.2          ✓ workflows 0.1.3.9000
#> ✓ parsnip   0.1.3          ✓ yardstick 0.0.7     
#> ✓ purrr     0.3.4
#> ── Conflicts ──────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter()  masks stats::filter()
#> x dplyr::lag()     masks stats::lag()
#> x recipes::step()  masks stats::step()

set.seed(1231)
ex_data <-
  tibble(survey = sample(letters, 500, replace = TRUE))

n <- nrow(ex_data)
all_ind <- 1:n

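# Tag each row with its position, then nest the row indices by survey
split_data <- 
  ex_data %>% 
  mutate(.row = row_number()) %>%
  group_nest(survey) %>% 
  mutate(
    # Caution: `data` holds one-column tibbles, and sample() on a data frame
    # draws columns rather than rows, so this step returns each survey's rows
    # unchanged (which is why the split sizes printed below sum to 500).
    # Something like dplyr::slice_sample(.x, prop = 1, replace = TRUE) would
    # resample the rows with replacement instead.
    sampled = map(data, ~ sample(.x, replace = TRUE)),
    # Build the index lists that make_splits() expects: the survey's rows as
    # the analysis set and every other row as the assessment set
    sampled = map(sampled, 
                  ~ list(analysis = .x$.row, 
                         assessment = all_ind[!(all_ind %in% unique(.x$.row))]
                  )
    )
  )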

splits <- map(split_data$sampled, make_splits, data = ex_data)

bt_strat_splits <- manual_rset(splits, split_data$survey)
bt_strat_splits
#> # Manual resampling 
#> # A tibble: 26 x 2
#>    splits           id   
#>    <list>           <chr>
#>  1 <split [18/482]> a    
#>  2 <split [15/485]> b    
#>  3 <split [18/482]> c    
#>  4 <split [22/478]> d    
#>  5 <split [18/482]> e    
#>  6 <split [23/477]> f    
#>  7 <split [21/479]> g    
#>  8 <split [25/475]> h    
#>  9 <split [22/478]> i    
#> 10 <split [22/478]> j    
#> # … with 16 more rows

Created on 2020-09-14 by the reprex package (v0.3.0)
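
For the bootstrap case raised at the top of the thread, one way to adapt the same make_splits() and manual_rset() idea is sketched below. It is only a sketch under assumptions, not rsample's own stratification logic: `strat_boot_indices()` and `times` are hypothetical names introduced for illustration, and each resample is built by drawing rows with replacement within every survey and pooling the draws. The rsample calls are guarded so that the index logic can be checked even without the package installed.

```r
# Hypothetical helper: one stratified bootstrap resample, expressed as row indices.
strat_boot_indices <- function(strata) {
  rows_by_stratum <- split(seq_along(strata), strata)
  analysis <- unlist(
    lapply(rows_by_stratum, function(rows) {
      # Index through sample.int() so a one-row stratum is handled correctly
      rows[sample.int(length(rows), replace = TRUE)]
    }),
    use.names = FALSE
  )
  # Out-of-bag rows: those never drawn in this resample
  assessment <- setdiff(seq_along(strata), analysis)
  list(analysis = analysis, assessment = assessment)
}

set.seed(1231)
ex_data <- data.frame(survey = sample(letters, 500, replace = TRUE))

times <- 25  # assumed number of bootstrap resamples
indices <- lapply(seq_len(times), function(i) strat_boot_indices(ex_data$survey))

# Wrap the index lists into rsplit objects and collect them into an rset
if (requireNamespace("rsample", quietly = TRUE)) {
  splits <- lapply(indices, rsample::make_splits, data = ex_data)
  ids <- sprintf("Bootstrap%02d", seq_len(times))
  bt_strat <- rsample::manual_rset(splits, ids)
}
```

Each analysis set here has the same number of rows per survey as the original data, so no survey can drop out of a resample, however small its share of the total.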

@ecsalomon see #158 and #164 for examples with an initial split.
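
As a rough illustration of the single-split case, the sketch below builds an 80/20 split stratified on a grouping column by sampling row indices within each group and handing them to make_splits(). The data, the `group` column, and the helper `strat_split_indices()` are all hypothetical, made up to resemble the situation described above; this is not the code from the linked issues.

```r
# Hypothetical helper: sample `prop` of the rows within each stratum.
strat_split_indices <- function(strata, prop = 0.8) {
  rows_by_stratum <- split(seq_along(strata), strata)
  analysis <- unlist(
    lapply(rows_by_stratum, function(rows) {
      n_keep <- floor(length(rows) * prop)
      rows[sample.int(length(rows), n_keep)]
    }),
    use.names = FALSE
  )
  list(
    analysis = sort(analysis),
    assessment = setdiff(seq_along(strata), analysis)
  )
}

set.seed(158)
# Made-up data: 35,000 rows with one small group (roughly 1.6% of rows)
dat <- data.frame(
  group = sample(c("small", "medium", "large"), 35000,
                 replace = TRUE, prob = c(0.016, 0.284, 0.7))
)

idx <- strat_split_indices(dat$group, prop = 0.8)

# Turn the indices into an rsplit and pull out the two data sets
if (requireNamespace("rsample", quietly = TRUE)) {
  manual_split <- rsample::make_splits(idx, data = dat)
  train <- rsample::analysis(manual_split)
  test <- rsample::assessment(manual_split)
}
```

Because the sampling is done group by group, every group keeps close to 80% of its rows in the analysis set regardless of how small a share of the data it represents.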

You'll need the GH version of rsample (but not any of the other devel versions that I have loaded right now).

@topepo topepo closed this as completed Sep 15, 2020
@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 21, 2021