Why is stratification minimum based on percent rather than absolute number of observations? #162

cportner · 2020-06-12T15:48:04Z

I have a data set with ~1,500,000 individual-level observations from 236 surveys. Since each survey is collected independently of the others, I would like to do the resampling for the bootstrap at the survey level. The problem is that although each survey contributes a substantial absolute number of observations, it is only approximately 0.5% of the total sample. Hence, if I try to stratify by survey using rsample and bootstraps(), I get "Too little data to stratify. Unstratified resampling will be used."

I am unsure whether the 10% minimum for resampling is for practical or statistical reasons. If it is for practical reasons, is there a way around this minimum (short of writing my own rsample from the ground up!)? If it is for statistical reasons, do you have a reference that explains it? I have not been able to find anything on this.

topepo · 2020-06-12T16:56:54Z

See the discussion in #110

You might look into using group_vfold_cv() instead.

cportner · 2020-06-12T22:50:50Z

Thank you, but I thought that vfold_cv and group_vfold_cv return only part of the sample, while I need the actual resampling with replacement that comes with bootstraps.

ecsalomon · 2020-08-18T13:49:05Z

I am having a similar issue with initial_split() when trying to stratify a single train/test split of ~35,000 observations. The smallest group in the stratification variable has about 550 observations. While this is a small group, it seems reasonable to be able to perform an 80/20 split on it.

topepo · 2020-09-15T00:16:59Z

The problem is a trade-off between false- and true-positives (where the event is "too little data"). No matter where/how we draw the line, there are going to be cases where someone's legitimate data analysis needs are not met.

Luckily, you can make whatever type of rsample object that you want using make_splits() and manual_rset().

@cportner There's no reprex here but this probably does what you want (but if not, you'll get the gist of it):

library(tidymodels)
#> ── Attaching packages ───────────────────────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom     0.7.0          ✓ recipes   0.1.13    
#> ✓ dials     0.0.8.9001     ✓ rsample   0.0.7.9000
#> ✓ dplyr     1.0.2          ✓ tibble    3.0.3     
#> ✓ ggplot2   3.3.2          ✓ tidyr     1.1.2     
#> ✓ infer     0.5.2          ✓ tune      0.1.1.9000
#> ✓ modeldata 0.0.2          ✓ workflows 0.1.3.9000
#> ✓ parsnip   0.1.3          ✓ yardstick 0.0.7     
#> ✓ purrr     0.3.4
#> ── Conflicts ──────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter()  masks stats::filter()
#> x dplyr::lag()     masks stats::lag()
#> x recipes::step()  masks stats::step()

set.seed(1231)
ex_data <-
  tibble(survey = sample(letters, 500, replace = TRUE))

n <- nrow(ex_data)
all_ind <- 1:n

split_data <- 
  ex_data %>% 
  mutate(.row = row_number()) %>%
  group_nest(survey) %>% 
  mutate(
    sampled = map(data, ~ sample(.x, replace = TRUE)),
    sampled = map(sampled, 
                  ~ list(analysis = .x$.row, 
                         assessment = all_ind[!(all_ind %in% unique(.x$.row))]
                  )
    )
  )

splits <- map(split_data$sampled, make_splits, data = ex_data)

bt_strat_splits <- manual_rset(splits, split_data$survey)
bt_strat_splits
#> # Manual resampling 
#> # A tibble: 26 x 2
#>    splits           id   
#>    <list>           <chr>
#>  1 <split [18/482]> a    
#>  2 <split [15/485]> b    
#>  3 <split [18/482]> c    
#>  4 <split [22/478]> d    
#>  5 <split [18/482]> e    
#>  6 <split [23/477]> f    
#>  7 <split [21/479]> g    
#>  8 <split [25/475]> h    
#>  9 <split [22/478]> i    
#> 10 <split [22/478]> j    
#> # … with 16 more rows

Created on 2020-09-14 by the reprex package (v0.3.0)

@ecsalomon see #158 and #164 for examples with an initial split.
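For the single train/test case, a stratified 80/20 split can also be built by hand. The following is a minimal sketch, not the code from #158 or #164: `strat_split_indices()` and the synthetic `ex_split_data` are hypothetical names made up for illustration, with group sizes chosen to mimic a 35,000-row data set whose smallest stratum has 550 rows. Only `make_splits()` comes from rsample, and that step is skipped if the package is not installed.

```r
# Hypothetical helper: sample `prop` of the rows within each stratum
# (without replacement) for the analysis set; the rest go to assessment.
strat_split_indices <- function(strata, prop = 0.8) {
  all_rows <- seq_along(strata)
  rows_by_stratum <- split(all_rows, strata)
  analysis <- unlist(
    lapply(rows_by_stratum, function(rows) {
      n_keep <- floor(length(rows) * prop)
      # Index into `rows` so that single-row strata are handled correctly.
      rows[sample.int(length(rows), n_keep)]
    }),
    use.names = FALSE
  )
  list(analysis = sort(analysis), assessment = setdiff(all_rows, analysis))
}

set.seed(158)
# Synthetic data (an assumption): one small stratum and one large one.
ex_split_data <- data.frame(
  group = rep(c("small", "large"), times = c(550, 34450))
)

split_ind <- strat_split_indices(ex_split_data$group, prop = 0.8)

# Per-stratum row counts in the analysis set.
analysis_counts <- table(ex_split_data$group[split_ind$analysis])

# Wrap the indices in an rsplit object when rsample is available.
if (requireNamespace("rsample", quietly = TRUE)) {
  strat_split <- rsample::make_splits(split_ind, data = ex_split_data)
}
```

Because the proportion is applied within each stratum, the small group is split 80/20 regardless of how little of the full data set it represents.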

You'll need the GitHub version of rsample (but not the development versions of the other packages that I have loaded right now).
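One detail worth checking in the reprex above: as far as I can tell, base R's `sample()` applied to a data frame draws columns rather than rows, which would explain why every printed split adds up to exactly 500 rows, and the result holds one split per survey rather than one per bootstrap replicate. The following is a minimal sketch of a bootstrap stratified by survey under those observations, not a definitive implementation. `strat_boot_indices()` is a hypothetical helper, and the number of replicates (5) is an arbitrary assumption. Only `make_splits()` and `manual_rset()` come from rsample, and that step is skipped if the package is not installed.

```r
# Hypothetical helper: resample row indices with replacement within
# each stratum, so every stratum keeps its original size.
strat_boot_indices <- function(strata) {
  all_rows <- seq_along(strata)
  rows_by_stratum <- split(all_rows, strata)
  analysis <- unlist(
    lapply(rows_by_stratum, function(rows) {
      # Index into `rows` so that single-row strata are handled correctly.
      rows[sample.int(length(rows), replace = TRUE)]
    }),
    use.names = FALSE
  )
  # Rows never drawn form the assessment (out-of-bag) set.
  list(analysis = analysis, assessment = setdiff(all_rows, analysis))
}

set.seed(1231)
ex_data <- data.frame(survey = sample(letters, 500, replace = TRUE))

# Five bootstrap replicates (an arbitrary number), each stratified by survey.
indices <- lapply(1:5, function(i) strat_boot_indices(ex_data$survey))

# Check that every replicate preserves the per-survey counts.
same_counts <- vapply(
  indices,
  function(ind) {
    identical(sort(ex_data$survey[ind$analysis]), sort(ex_data$survey))
  },
  logical(1)
)

# Wrap the indices in an rset object when rsample is available.
if (requireNamespace("rsample", quietly = TRUE)) {
  splits <- lapply(indices, rsample::make_splits, data = ex_data)
  bt_strat <- rsample::manual_rset(
    splits,
    sprintf("Bootstrap%d", seq_along(splits))
  )
}
```

Each replicate contains 500 analysis rows, with duplicates, and the rows that were never drawn serve as the assessment set.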

github-actions · 2021-02-21T00:56:25Z

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

topepo added the discussion label Aug 28, 2020

topepo closed this as completed Sep 15, 2020

github-actions bot locked and limited conversation to collaborators Feb 21, 2021
