Modify strata_check defaults from vfold_cv #110

Closed
skinnider opened this issue Sep 12, 2019 · 6 comments
Labels: feature (a feature request or enhancement), upcoming release

Comments

@skinnider

I have a dataset with seven classes, and 20 observations in each class. I want to do three-fold cross-validation on this dataset, using stratified sampling, because when using unstratified sampling there is a chance that a particular fold will not contain any positive examples for a given class.

However, when I try to do this with vfold_cv I get the following warning:

library(tidyverse)
library(rsample)

X = matrix(rnorm(140 * 100), ncol = 100, nrow = 140)
y = rep(letters[1:7], each = 20)
dat = as.data.frame(X) %>%
  mutate(label = y)
cv = vfold_cv(dat, v = 3, strata = 'label')
#> Warning: Too little data to stratify. Unstratified resampling will be used.

Created on 2019-09-12 by the reprex package (v0.3.0)

It seems this is related to the make_strata function, specifically the default value pool = 0.15. In make_strata, pcts is a vector of length 7 where every value equals 1/7 (about 0.143), which is below pool, so all(pcts < pool) is TRUE and the function returns a single stratum:

    num_vals <- unique(x)
    n <- length(x)
    num_miss <- sum(is.na(x))
    if (length(num_vals) <= nunique | is.character(x) | is.factor(x)) {
        x <- factor(x)
        xtab <- sort(table(x))
        pcts <- xtab/n
        if (all(pcts < pool)) {
            warning("Too little data to stratify. Unstratified resampling ", 
                "will be used.", call. = FALSE)
            return(factor(rep("strata1", n)))
        }
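The check above can be reproduced in isolation. This is a minimal sketch in base R that mirrors only the quoted lines, not the full rsample function; the variable names follow the excerpt, and `pool` is set by hand to the default mentioned in this issue.

```r
# Seven balanced classes, 20 observations each (same labels as the reprex)
y <- rep(letters[1:7], each = 20)
pool <- 0.15  # default value discussed in this issue

# Class proportions, as computed in the quoted excerpt
pcts <- table(y) / length(y)
round(pcts, 3)

# Every proportion is 20/140, which is below 0.15, so the
# "too little data" branch is taken
all(pcts < pool)  # TRUE
```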

This wouldn't be an issue if I could override the default value of pool when calling vfold_cv, but at present it doesn't seem that I can. Would it be possible to pass the ellipsis through vfold_cv -> vfold_splits -> make_strata?
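Until pool can be changed, one possible workaround is to assign stratified folds by hand. The following is a rough sketch in base R, not rsample's implementation: it shuffles a balanced vector of fold ids within each class, and the names `v` and `fold` are illustrative only.

```r
set.seed(110)
y <- rep(letters[1:7], each = 20)
v <- 3  # number of folds

# Within each class, draw a shuffled, balanced vector of fold ids 1..v
fold <- ave(
  seq_along(y), y,
  FUN = function(idx) sample(rep_len(seq_len(v), length(idx)))
)

# Each class contributes 6 or 7 observations to every fold
table(y, fold)
```

The resulting `fold` vector only labels rows; it does not produce an rset object, so it would need to be used with manual subsetting rather than with functions that expect the output of vfold_cv.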

sessionInfo():

R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rsample_0.0.5     forcats_0.4.0     stringr_1.4.0     dplyr_0.8.3       purrr_0.3.2       readr_1.3.1       tidyr_0.8.99.9000
 [8] tibble_2.1.3      ggplot2_3.2.1     tidyverse_1.2.1  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2       cellranger_1.1.0 pillar_1.4.2     compiler_3.5.2   tools_3.5.2      digest_0.6.20    zeallot_0.1.0    jsonlite_1.6    
 [9] lubridate_1.7.4  lifecycle_0.1.0  nlme_3.1-141     gtable_0.3.0     lattice_0.20-38  pkgconfig_2.0.2  rlang_0.4.0      cli_1.1.0       
[17] rstudioapi_0.10  parallel_3.5.2   yaml_2.2.0       haven_2.1.1      furrr_0.1.0      withr_2.1.2      xml2_1.2.2       httr_1.4.1      
[25] globals_0.12.4   generics_0.0.2   vctrs_0.2.0      hms_0.5.1        grid_3.5.2       tidyselect_0.2.5 glue_1.3.1       listenv_0.7.0   
[33] R6_2.4.0         readxl_1.3.1     modelr_0.1.5     magrittr_1.5     codetools_0.2-16 backports_1.1.4  scales_1.0.0     rvest_0.3.4     
[41] assertthat_0.2.1 future_1.14.0    colorspace_1.4-1 stringi_1.4.3    lazyeval_0.2.2   munsell_0.5.0    broom_0.5.2      crayon_1.3.4    

@CamsterMamster

Hi Michael (@skinnider),

May I ask if you have an update on this issue you've raised? I'm working on some cross-validation with stratified sampling and I have exactly the same question.

I've implemented my own 'version' of vfold_cv->vfold_splits with a pool parameter that is passed all the way down to make_strata, and I'm happy to submit it as a PR if that would be useful.

Let me know.
Cheers,
Camille

@swt30

swt30 commented Feb 21, 2020

@skinnider, I work with Camille and we've put together PR #132 to address this. If this is still an issue for you, feel free to give it a try and let us know whether it works in your case.

@topepo
Member

topepo commented Feb 27, 2020

I'd rather try to solve this by lowering the threshold a bit (as opposed to adding another argument). With caret, we added options like this and had more problems with people passing in bad values. I'd rather lower the "line of dignity" a bit but still keep it around.
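To get a feel for what lowering the threshold would change, here is a small sketch in base R for perfectly balanced classes. The helper `all_below_pool` is hypothetical, and 0.10 is only an example of a lower value, not necessarily the one that will be chosen.

```r
# Hypothetical helper: TRUE when every class proportion of k balanced
# classes falls below `pool`, i.e. when stratification would be dropped
all_below_pool <- function(k, pool) all(rep(1 / k, k) < pool)

ks <- 2:12
res <- data.frame(
  classes      = ks,
  default_0.15 = sapply(ks, all_below_pool, pool = 0.15),
  lower_0.10   = sapply(ks, all_below_pool, pool = 0.10)
)
res
# With 0.15 the check fires from 7 balanced classes upward;
# with 0.10 it fires only from 11 classes upward.
```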

@juliasilge added the "feature" (a feature request or enhancement) label on May 1, 2020
@juliasilge
Member

Let's try lowering this default for the next release.

@juliasilge
Member

PR #149 lowers the threshold so that your example stratifies successfully and no longer gives a warning:

library(tidyverse)
library(rsample)

X <- matrix(rnorm(140 * 100), ncol = 100, nrow = 140)
y <- rep(letters[1:7], each = 20)

df <- tibble(X) %>%
  mutate(label = y)

vfold_cv(df, v = 3, strata = label)
#> #  3-fold cross-validation using stratification 
#> # A tibble: 3 x 2
#>   splits          id   
#>   <named list>    <chr>
#> 1 <split [91/49]> Fold1
#> 2 <split [91/49]> Fold2
#> 3 <split [98/42]> Fold3

Created on 2020-05-07 by the reprex package (v0.3.0)

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions bot locked and limited conversation to collaborators on Feb 21, 2021