Lower threshold for strata pooling #149

Merged
merged 6 commits into from
May 19, 2020
2 changes: 2 additions & 0 deletions NEWS.md
@@ -1,5 +1,7 @@
# rsample (development version)

* Lower threshold for pooling strata to 10% (from 15%) (#149).

# `rsample` 0.0.6

* Added `validation_set()` for making a single resample.
3 changes: 2 additions & 1 deletion R/boot.R
@@ -12,8 +12,9 @@
#' bootstrap results.
#' The `strata` argument is based on a similar argument in the random forest
#' package were the bootstrap samples are conducted *within the stratification
#' variable*. The can help ensure that the number of data points in the
#' variable*. This can help ensure that the number of data points in the
#' bootstrap sample is equivalent to the proportions in the original data set.
#' (Strata below 10% of the total are pooled together.)
#' @inheritParams vfold_cv
#' @param times The number of bootstrap samples.
#' @param strata A variable that is used to conduct stratified sampling. When
6 changes: 3 additions & 3 deletions R/initial_split.R
@@ -5,9 +5,9 @@
#' _first_ `prop` samples for training, instead of a random selection.
#' `training` and `testing` are used to extract the resulting data.
#' @details The `strata` argument causes the random sampling to be conducted
#' *within the stratification variable*. The can help ensure that the number of
#' data points in the training data is equivalent to the proportions in the
#' original data set.
#' *within the stratification variable*. This can help ensure that the number
#' of data points in the training data is equivalent to the proportions in the
#' original data set. (Strata below 10% of the total are pooled together.)
#' @inheritParams vfold_cv
#' @param prop The proportion of data to be retained for modeling/analysis.
#' @param strata A variable that is used to conduct stratified sampling to
2 changes: 1 addition & 1 deletion R/make_strata.R
@@ -66,7 +66,7 @@
#' quantile(x6, probs = (0:10)/10)
#' table(make_strata(x6, breaks = 10))
#' @export
make_strata <- function(x, breaks = 4, nunique = 5, pool = .15, depth = 20) {
make_strata <- function(x, breaks = 4, nunique = 5, pool = .1, depth = 20) {
num_vals <- unique(x)
n <- length(x)
num_miss <- sum(is.na(x))
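
To see what the new default changes in practice, here is a minimal base-R sketch. It does not reproduce the internals of `make_strata()`, which this diff does not show. The helper `below_pool()` is hypothetical, and the strict `<` comparison is an assumption based on the "Strata below 10%" wording in the updated documentation.

```r
# Hypothetical helper (not part of rsample): names of the strata whose share
# of the data falls strictly below the pooling threshold.
below_pool <- function(x, pool) {
  tab <- table(x)
  props <- as.numeric(tab) / length(x)
  names(tab)[props < pool]
}

# Three strata holding 60%, 28%, and 12% of the rows.
x <- c(rep("a", 60), rep("b", 28), rep("c", 12))

below_pool(x, pool = 0.15)  # old default: "c" is below the threshold
below_pool(x, pool = 0.10)  # new default: no stratum is below the threshold
```

Under these assumptions, a stratum holding 12% of the data would have been a pooling candidate with the old default but is kept as its own stratum with the new one.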
4 changes: 2 additions & 2 deletions R/mc.R
@@ -4,9 +4,9 @@
#' replacement) of the original data set to be used for analysis. All other
#' data points are added to the assessment set.
#' @details The `strata` argument causes the random sampling to be conducted
#' *within the stratification variable*. The can help ensure that the number of
#' *within the stratification variable*. This can help ensure that the number of
#' data points in the analysis data is equivalent to the proportions in the
#' original data set.
#' original data set. (Strata below 10% of the total are pooled together.)
#' @inheritParams vfold_cv
#' @param prop The proportion of data to be retained for modeling/analysis.
#' @param times The number of times to repeat the sampling.
2 changes: 1 addition & 1 deletion R/validation_split.R
@@ -6,7 +6,7 @@
#' @details The `strata` argument causes the random sampling to be conducted
#' *within the stratification variable*. This can help ensure that the number of
#' data points in the analysis data is equivalent to the proportions in the
#' original data set.
#' original data set. (Strata below 10% of the total are pooled together.)
#' @inheritParams vfold_cv
#' @param prop The proportion of data to be retained for modeling/analysis.
#' @param strata A variable that is used to conduct stratified sampling to
4 changes: 2 additions & 2 deletions R/vfold.R
@@ -7,9 +7,9 @@
#' to V.
#' @details
#' The `strata` argument causes the random sampling to be conducted *within
#' the stratification variable*. The can help ensure that the number of data
#' the stratification variable*. This can help ensure that the number of data
#' points in the analysis data is equivalent to the proportions in the original
#' data set.
#' data set. (Strata below 10% of the total are pooled together.)
#' When more than one repeat is requested, the basic V-fold cross-validation
#' is conducted each time. For example, if three repeats are used with `v =
#' 10`, there are a total of 30 splits which as three groups of 10 that are
12 changes: 9 additions & 3 deletions README.Rmd
@@ -13,7 +13,7 @@ knitr::opts_chunk$set(
)
```

# rsample
# rsample <a href='https://tidymodels.github.io/rsample/'><img src='man/figures/logo.png' align="right" height="139" /></a>

<!-- badges: start -->
[![R build status](https://github.com/tidymodels/rsample/workflows/R-CMD-check/badge.svg)](https://github.com/tidymodels/rsample/actions)
@@ -23,7 +23,10 @@ knitr::opts_chunk$set(
![](https://img.shields.io/badge/lifecycle-maturing-blue.svg)
<!-- badges: end -->

`rsample` contains a set of functions that can create different types of resamples and corresponding classes for their analysis.

## Overview

`rsample` contains a set of functions to create different types of resamples and corresponding classes for their analysis.
The goal is to have a modular set of methods that can be used across different R packages for:

* traditional resampling techniques for estimating the sampling distribution of a statistic and
@@ -57,8 +60,11 @@ as.numeric(lobstr::obj_size(boots)/lobstr::obj_size(LetterRecognition))
#> [1] 2.528326
```

<sup>Created on 2020-05-07 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)</sup>

The memory usage for 50 bootstrap samples is less than 3-fold more than the original data set.


## Installation

To install it, use:
@@ -84,7 +90,7 @@ We welcome contributions of all types!

If you have never made a pull request to an R package before, `rsample` is an excellent place to start. Find an [issue](https://github.com/tidymodels/rsample/issues/) with the **Beginner Friendly** tag and comment that you'd like to take it on and we'll help you get started.

We encourage typo corrections, bug reports, bug fixes and feature requests. Feedback on the clarity of the documentation is especially valuable.
We encourage typo corrections, bug reports, bug fixes, and feature requests. Feedback on the clarity of the documentation is especially valuable.


## Code of Conduct
60 changes: 22 additions & 38 deletions README.md
@@ -1,38 +1,28 @@

# rsample
# rsample <a href='https://tidymodels.github.io/rsample/'><img src='man/figures/logo.png' align="right" height="139" /></a>

<!-- badges: start -->

[![R build
status](https://github.com/tidymodels/rsample/workflows/R-CMD-check/badge.svg)](https://github.com/tidymodels/rsample/actions)
[![Codecov test
coverage](https://codecov.io/gh/tidymodels/rsample/branch/master/graph/badge.svg)](https://codecov.io/gh/tidymodels/rsample?branch=master)
[![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/rsample)](https://cran.r-project.org/package=rsample)
[![R build status](https://github.com/tidymodels/rsample/workflows/R-CMD-check/badge.svg)](https://github.com/tidymodels/rsample/actions)
[![Codecov test coverage](https://codecov.io/gh/tidymodels/rsample/branch/master/graph/badge.svg)](https://codecov.io/gh/tidymodels/rsample?branch=master)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/rsample)](https://cran.r-project.org/package=rsample)
[![Downloads](http://cranlogs.r-pkg.org/badges/rsample)](https://cran.r-project.org/package=rsample)
![](https://img.shields.io/badge/lifecycle-maturing-blue.svg)
<!-- badges: end -->

`rsample` contains a set of functions that can create different types of
resamples and corresponding classes for their analysis. The goal is to
have a modular set of methods that can be used across different R
packages for:

- traditional resampling techniques for estimating the sampling
distribution of a statistic and
- estimating model performance using a holdout set

The scope of `rsample` is to provide the basic building blocks for
creating and analyzing resamples of a data set but does not include code
for modeling or calculating statistics. The “Working with Resample Sets”
vignette gives demonstrations of how `rsample` tools can be used.
## Overview

Note that resampled data sets created by `rsample` are directly
accessible in a resampling object but do not contain much overhead in
memory. Since the original data is not modified, R does not make an
automatic copy.
`rsample` contains a set of functions to create different types of resamples and corresponding classes for their analysis.
The goal is to have a modular set of methods that can be used across different R packages for:

* traditional resampling techniques for estimating the sampling distribution of a statistic and
* estimating model performance using a holdout set

The scope of `rsample` is to provide the basic building blocks for creating and analyzing resamples of a data set but does not include code for modeling or calculating statistics. The "Working with Resample Sets" vignette gives demonstrations of how `rsample` tools can be used.

For example, creating 50 bootstraps of a data set does not create an
object that is 50-fold larger in memory:
Note that resampled data sets created by `rsample` are directly accessible in a resampling object but do not contain much overhead in memory. Since the original data is not modified, R does not make an automatic copy.

For example, creating 50 bootstraps of a data set does not create an object that is 50-fold larger in memory:

``` r
library(rsample)
@@ -56,9 +46,12 @@ as.numeric(lobstr::obj_size(boots)/lobstr::obj_size(LetterRecognition))
#> [1] 2.528326
```

<sup>Created on 2020-05-07 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)</sup>

The memory usage for 50 bootstrap samples is less than 3-fold more than
the original data set.


## Installation

To install it, use:
@@ -80,21 +73,12 @@ install_dev("rsample")

## Contributing

We welcome contributions of all types\!
We welcome contributions of all types!

If you have never made a pull request to an R package before, `rsample`
is an excellent place to start. Find an
[issue](https://github.com/tidymodels/rsample/issues/) with the
**Beginner Friendly** tag and comment that you’d like to take it on and
we’ll help you get started.
If you have never made a pull request to an R package before, `rsample` is an excellent place to start. Find an [issue](https://github.com/tidymodels/rsample/issues/) with the **Beginner Friendly** tag and comment that you'd like to take it on and we'll help you get started.

We encourage typo corrections, bug reports, bug fixes and feature
requests. Feedback on the clarity of the documentation is especially
valuable.
We encourage typo corrections, bug reports, bug fixes, and feature requests. Feedback on the clarity of the documentation is especially valuable.

## Code of Conduct

Please note that the rsample project is released with a [Contributor
Code of
Conduct](https://tidymodels.github.io/rsample/CODE_OF_CONDUCT.html). By
contributing to this project, you agree to abide by its terms.
Please note that the rsample project is released with a [Contributor Code of Conduct](https://tidymodels.github.io/rsample/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.
3 changes: 2 additions & 1 deletion man/bootstraps.Rd
6 changes: 3 additions & 3 deletions man/initial_split.Rd
2 changes: 1 addition & 1 deletion man/make_strata.Rd
4 changes: 2 additions & 2 deletions man/mc_cv.Rd
2 changes: 1 addition & 1 deletion man/validation_split.Rd
4 changes: 2 additions & 2 deletions man/vfold_cv.Rd

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion tests/testthat/test_strata.R
@@ -23,7 +23,7 @@ test_that('simple character', {
})

test_that('bad data', {
x3 <- factor(rep(LETTERS[1:10], each = 50))
x3 <- factor(rep(LETTERS[1:15], each = 50))
expect_warning(make_strata(x3))
expect_warning(make_strata(mtcars$mpg))
})
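
The fixture change follows from the arithmetic of the new threshold. Below is a rough base-R check, assuming (as the documentation wording suggests, though the diff does not show it) that a stratum counts as too small only when its share is strictly below `pool`. The helper `all_below_pool()` is hypothetical and not part of rsample.

```r
# Hypothetical helper: TRUE when every stratum's share of the data is
# strictly below the pooling threshold.
all_below_pool <- function(x, pool) {
  props <- as.numeric(table(x)) / length(x)
  all(props < pool)
}

x10 <- factor(rep(LETTERS[1:10], each = 50))  # old fixture: 10 levels, 10% each
x15 <- factor(rep(LETTERS[1:15], each = 50))  # new fixture: 15 levels, ~6.7% each

all_below_pool(x10, pool = 0.15)  # TRUE  (old default, old fixture)
all_below_pool(x10, pool = 0.10)  # FALSE (new default, old fixture)
all_below_pool(x15, pool = 0.10)  # TRUE  (new default, new fixture)
```

With ten equally sized levels, each stratum holds exactly 10% of the rows, so under this assumption the old fixture would no longer fall below the lowered threshold. Fifteen levels put each stratum at about 6.7%, which keeps the "bad data" case below it.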