Why is stratification minimum based on percent rather than absolute number of observations? #162
Comments
See the discussion in #110 You might look into using |
Thank you, but I thought that |
I am having a similar issue with |
The problem is a trade-off between false- and true-positives (where the event is "too little data"). No matter where/how we draw the line, there are going to be cases where someone's legitimate data analysis needs are not met. Luckily, you can make whatever type of

@cportner There's no reprex here but this probably does what you want (but if not, you'll get the gist of it):

library(tidymodels)
#> ── Attaching packages ───────────────────────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom 0.7.0 ✓ recipes 0.1.13
#> ✓ dials 0.0.8.9001 ✓ rsample 0.0.7.9000
#> ✓ dplyr 1.0.2 ✓ tibble 3.0.3
#> ✓ ggplot2 3.3.2 ✓ tidyr 1.1.2
#> ✓ infer 0.5.2 ✓ tune 0.1.1.9000
#> ✓ modeldata 0.0.2 ✓ workflows 0.1.3.9000
#> ✓ parsnip 0.1.3 ✓ yardstick 0.0.7
#> ✓ purrr 0.3.4
#> ── Conflicts ──────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x recipes::step() masks stats::step()
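set.seed(1231)

# Example data: 500 rows, each assigned to one of 26 "surveys"
ex_data <-
  tibble(survey = sample(letters, 500, replace = TRUE))

n <- nrow(ex_data)
all_ind <- 1:n

# One split per survey: analysis = that survey's rows, assessment = all other rows
split_data <-
  ex_data %>%
  mutate(.row = row_number()) %>%
  group_nest(survey) %>%
  mutate(
    # NOTE: sample() on a data frame samples columns, not rows; `data` has a
    # single column (.row), so the rows come back unchanged
    sampled = map(data, ~ sample(.x, replace = TRUE)),
    sampled = map(sampled,
                  ~ list(analysis = .x$.row,
                         assessment = all_ind[!(all_ind %in% unique(.x$.row))]
                  )
    )
  )

# Turn each index list into an rsplit, then collect them in an rset
splits <- map(split_data$sampled, make_splits, data = ex_data)
bt_strat_splits <- manual_rset(splits, split_data$survey)
bt_strat_splits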
#> # Manual resampling
#> # A tibble: 26 x 2
#> splits id
#> <list> <chr>
#> 1 <split [18/482]> a
#> 2 <split [15/485]> b
#> 3 <split [18/482]> c
#> 4 <split [22/478]> d
#> 5 <split [18/482]> e
#> 6 <split [23/477]> f
#> 7 <split [21/479]> g
#> 8 <split [25/475]> h
#> 9 <split [22/478]> i
#> 10 <split [22/478]> j
#> # … with 16 more rows

Created on 2020-09-14 by the reprex package (v0.3.0)

@ecsalomon see #158 and #164 for examples with an initial split. You'll need the GH version of |
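For readers who want each resample to draw rows with replacement within every survey, so that each survey keeps its original number of rows in the analysis set, here is a minimal sketch. It assumes an rsample version that provides `make_splits()` for index lists and `manual_rset()`. The helper `strat_boot_indices()`, the `n_boot` value, and the id labels are hypothetical names chosen for illustration, not part of rsample.

```r
library(rsample)

set.seed(1231)
ex_data <- data.frame(survey = sample(letters, 500, replace = TRUE))

# Hypothetical helper (not an rsample function): one bootstrap resample in
# which rows are drawn with replacement separately within each stratum.
strat_boot_indices <- function(strata) {
  rows_by_stratum <- split(seq_along(strata), strata)
  analysis <- unlist(
    lapply(rows_by_stratum, function(rows) {
      # Index into `rows` so a stratum with a single row is handled correctly
      rows[sample.int(length(rows), replace = TRUE)]
    }),
    use.names = FALSE
  )
  list(
    analysis = analysis,
    # Out-of-bag rows: everything that was never drawn
    assessment = setdiff(seq_along(strata), analysis)
  )
}

n_boot <- 25  # illustrative number of resamples
indices <- lapply(seq_len(n_boot), function(i) strat_boot_indices(ex_data$survey))

# Wrap the index lists as rsplit objects and collect them in an rset
splits <- lapply(indices, make_splits, data = ex_data)
ids <- sprintf("Bootstrap%02d", seq_len(n_boot))
bt_strat <- manual_rset(splits, ids)
```

Because the sampling happens per stratum, no minimum-proportion check is involved; whether very small strata give useful resamples is still a judgment call for the analyst.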
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue. |
I have a data set with ~1,500,000 individual-level observations from 236 surveys. Since each survey is collected independently of the other surveys, I would like to do the resampling for the bootstrap at the survey level. The problem is that although each survey contributes a substantial absolute number of observations, each survey is only approximately 0.5% of the total sample. Hence, if I try to stratify by survey using `rsample` and `bootstraps`, I get "Too little data to stratify. Unstratified resampling will be used."

I am unsure whether the 10% minimum for resampling is for practical or statistical reasons. If it is for practical reasons, is there a way around this minimum (short of writing my own `rsample` from the ground up!)? If it is for statistical reasons, do you have a reference that explains it? I have not been able to find anything on this.
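The mismatch between absolute and relative stratum size can be checked with the approximate figures quoted above. This is only a sketch: it treats the surveys as equal in size, which is an assumption made for illustration, and takes the 10% threshold as described in the question.

```r
n_obs     <- 1500000  # approximate total observations (from the question)
n_surveys <- 236      # number of surveys (from the question)
min_prop  <- 0.10     # threshold as described in the question

# Assumption for illustration: all surveys are the same size
obs_per_survey  <- n_obs / n_surveys
prop_per_survey <- 1 / n_surveys

round(obs_per_survey)
#> [1] 6356
round(100 * prop_per_survey, 2)
#> [1] 0.42
prop_per_survey < min_prop
#> [1] TRUE
```

Each survey would hold several thousand rows yet account for well under 1% of the data, which is why a percentage-based cutoff triggers the warning here.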