Stratum sample sizes aren't maintained using rsample bootstraps() function. #86

Closed
JVAdams opened this issue Mar 2, 2019 · 4 comments · Fixed by #149

Comments

@JVAdams

JVAdams commented Mar 2, 2019

When taking stratified bootstrap samples from a data frame, I expected separate bootstrap samples to be taken within each stratum, so that the resulting bootstrap sample has the same number of observations in each stratum as the original data frame. However, that is not the case when using the bootstraps() function of the rsample package. When I run this code:

library(rsample)

mydf <- data.frame(A=1:58, B=rep(1:4, c(6, 6, 23, 23)))
lboots <- bootstraps(mydf, times=3, strata="B")$splits
lbootsdf <- lapply(lboots, as.data.frame)

with(mydf, table(B))
lapply(lbootsdf, function(df) table(df$B))

These are the results I get:

B
 1  2  3  4 
 6  6 23 23 


$`1`
 1  2  3  4 
10  5 20 23 

$`2`
 1  2  3  4 
 3  8 24 23 

$`3`
 1  2  3  4 
 4  5 24 25 

I was expecting to see 6 1's, 6 2's, 23 3's, and 23 4's in each of the three bootstrap samples.

When I posted a query on Stack Overflow, joran commented that the function make_strata by default pools strata below 15% of the total, with no way to adjust that parameter from the calling functions, like bootstraps(). This pooling is not mentioned in the help documentation for the bootstraps() function.
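
For readers looking for the behavior described above, here is a minimal base R sketch that resamples with replacement inside each stratum, so every stratum keeps its original size. It does not use rsample at all, and stratified_boot() is a hypothetical helper name introduced only for this illustration.

```r
# Sketch of a per-stratum bootstrap in base R (no rsample calls).
# stratified_boot() is a hypothetical helper, not part of rsample.
stratified_boot <- function(df, strata) {
  # Row indices grouped by the value of the stratum column
  idx_by_stratum <- split(seq_len(nrow(df)), df[[strata]])
  # Resample with replacement inside each stratum, keeping its size
  boot_idx <- unlist(lapply(idx_by_stratum, function(idx) {
    idx[sample.int(length(idx), replace = TRUE)]
  }), use.names = FALSE)
  df[boot_idx, , drop = FALSE]
}

set.seed(1)
mydf <- data.frame(A = 1:58, B = rep(1:4, c(6, 6, 23, 23)))
boots <- lapply(1:3, function(i) stratified_boot(mydf, "B"))

# Each table should match the original counts: 6, 6, 23, 23
lapply(boots, function(df) table(df$B))
```

The result is a list of plain data frames rather than rsample split objects, so it would need further wrapping to be used with the rest of the rsample tooling.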

sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets 
[6] methods   base     

other attached packages:
[1] rsample_0.0.4 tidyr_0.8.2  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0       crayon_1.3.4     dplyr_0.8.0.1   
 [4] assertthat_0.2.0 R6_2.4.0         magrittr_1.5    
 [7] pillar_1.3.1     rlang_0.3.1      rstudioapi_0.9.0
[10] generics_0.0.2   tools_3.5.2      glue_1.3.0      
[13] purrr_0.3.0      yaml_2.2.0       compiler_3.5.2  
[16] pkgconfig_2.0.2  tidyselect_0.2.5 tibble_2.0.1    
@sxmorgan

I don't have a solution, but came here hoping for one because I am having the exact same issue... did you have any luck or come up with a workaround?

@JVAdams
Author

JVAdams commented Jul 24, 2019

See the answers to my Stack Overflow query.

@juliasilge
Member

The PR in #149 lowers the threshold for strata pooling to 10% of the total and adds documentation to each function so that users can be more clear on what's going on with their groups!
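
As a rough check of what the threshold change means for the example data in this issue, the sketch below compares each stratum's share of rows against 15% and 10%. It assumes the pooling rule compares a stratum's proportion of the total with the threshold, as described in this thread. It only shows which strata would be flagged and does not reproduce what rsample does with them afterwards.

```r
# Example data from the original report
mydf <- data.frame(A = 1:58, B = rep(1:4, c(6, 6, 23, 23)))

# Share of rows in each stratum (6/58 is about 0.103, 23/58 about 0.397)
prop <- prop.table(table(mydf$B))
round(prop, 3)

# Strata that fall below each threshold (assumed rule: proportion < threshold)
pooled_15 <- names(prop)[prop < 0.15]
pooled_10 <- names(prop)[prop < 0.10]

pooled_15  # "1" "2"
pooled_10  # character(0)
```

Under that assumption, the two strata with 6 rows each sit just above the 10% threshold, so they would no longer be flagged for pooling in this example.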

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

github-actions bot locked and limited conversation to collaborators on Feb 21, 2021