---
output: github_document
---
<!-- badges: start -->
[![DOI badge](https://joss.theoj.org/papers/10.21105/joss.04471/status.svg)](https://doi.org/10.21105/joss.04471)
[![R build status](https://github.com/simonpcouch/stacks/workflows/R-CMD-check/badge.svg)](https://github.com/tidymodels/stacks/actions)
[![Codecov test coverage](https://codecov.io/gh/tidymodels/stacks/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidymodels/stacks?branch=main)
[![CRAN status](https://www.r-pkg.org/badges/version/stacks)](https://CRAN.R-project.org/package=stacks)
<!-- badges: end -->
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## stacks - tidy model stacking <a href='https://stacks.tidymodels.org'><img src='man/figures/logo.png' alt = 'A hexagonal logo. Dark blue text reads "stacks", cascading over a stack of pancakes on a light blue background.' align="right" height="280" /></a>
stacks is an R package for model stacking that aligns with the tidymodels framework. Model stacking is an ensembling method that takes the outputs of many models and combines them to generate a new model, referred to as an _ensemble_ in this package, whose predictions are informed by each of its _members_.
The process goes something like this (a minimal code sketch follows the list):
1. Define candidate ensemble members using functionality from [rsample](https://rsample.tidymodels.org/), [parsnip](https://parsnip.tidymodels.org/), [workflows](https://workflows.tidymodels.org/), [recipes](https://recipes.tidymodels.org/), and [tune](https://tune.tidymodels.org/)
2. Initialize a `data_stack` object with `stacks()`
3. Iteratively add candidate ensemble members to the `data_stack` with `add_candidates()`
4. Evaluate how to combine their predictions with `blend_predictions()`
5. Fit candidate ensemble members with non-zero stacking coefficients with `fit_members()`
6. Predict on new data with `predict()`
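As a minimal sketch of those steps, assuming `knn_res` and `lin_res` are hypothetical tuning/fitting results generated with tune on a shared set of resamples, and `new_data` is a placeholder for your own data frame:
```{r, eval = FALSE}
library(stacks)

ensemble <-
  # initialize the data stack
  stacks() |>
  # add candidate members from existing tuning/fitting results
  add_candidates(knn_res) |>
  add_candidates(lin_res) |>
  # evaluate how to combine their predictions
  blend_predictions() |>
  # fit the candidates with non-zero stacking coefficients
  fit_members()

# predict on new data
predict(ensemble, new_data)
```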
You can install the package with the following code:
```{r, eval = FALSE}
install.packages("stacks")
```
Install the development version with:
```{r, eval = FALSE}
# install.packages("pak")
pak::pak("tidymodels/stacks")
```
stacks is generalized with respect to:
* Model type: Any model type implemented in [parsnip](https://parsnip.tidymodels.org/) or its extension packages is fair game to add to a stacks model stack. [Here](https://www.tidymodels.org/find/parsnip/)'s a table of many of the model types implemented in the core tidymodels packages, which also links to an article on implementing your own model classes.
* Cross-validation scheme: Any resampling algorithm implemented in [rsample](https://rsample.tidymodels.org/) or extension packages is fair game for resampling data for use in training a model stack.
* Error metric: Any metric function implemented in [yardstick](https://yardstick.tidymodels.org/) or extension packages is fair game for evaluating model stacks and their members. That package also provides infrastructure for creating your own metric functions; a short sketch of supplying a custom metric set follows this list.
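For instance, a rough sketch of supplying a custom metric set when blending, assuming a hypothetical, already-assembled data stack named `my_data_stack`:
```{r, eval = FALSE}
library(stacks)
library(yardstick)

# `my_data_stack` is a hypothetical data stack for a regression outcome;
# the metric set guides how candidate combinations are evaluated
my_model_stack <-
  my_data_stack |>
  blend_predictions(metric = metric_set(rmse, mae))
```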
stacks uses a regularized linear model to combine predictions from ensemble members, though this model type is only one of many possible learning algorithms that could be used to fit a stacked ensemble model. For implementations of additional ensemble learning algorithms, check out [h2o](https://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/h2o.stackedEnsemble.html) and [SuperLearner](https://CRAN.R-project.org/package=SuperLearner).
Rather than diving right into the implementation, we'll focus here on how the pieces fit together, conceptually, in building an ensemble with `stacks`. See the `basics` vignette for an example of the API in action!
## a grammar
At the highest level, ensembles are formed from _model definitions_. In this package, a model definition is an instance of a minimal [workflow](https://workflows.tidymodels.org/), containing a _model specification_ (as defined in the [parsnip](https://parsnip.tidymodels.org/) package) and, optionally, a _preprocessor_ (as defined in the [recipes](https://recipes.tidymodels.org/) package). Model definitions specify the form of candidate ensemble members.
![A diagram representing "model definitions," which specify the form of candidate ensemble members. Three colored boxes represent three different model types; a K-nearest neighbors model (in salmon), a linear regression model (in yellow), and a support vector machine model (in green).](man/figures/model_defs.png)
To be used in the same ensemble, each of these model definitions must share the same _resample_. This [rsample](https://rsample.tidymodels.org/) `rset` object, when paired with the model definitions, can be used to generate the tuning/fitting results objects for the candidate _ensemble members_ with tune.
![A diagram representing "candidate members" generated from each model definition. Four salmon-colored boxes labeled "KNN" represent K-nearest neighbors models trained on the resamples with differing hyperparameters. Similarly, the linear regression model generates one candidate member, and the support vector machine model generates six.](man/figures/candidates.png)
Candidate members first come together in a `data_stack` object through the `add_candidates()` function. Principally, these objects are just [tibble](https://tibble.tidyverse.org/)s, where the first column gives the true outcome in the assessment set (the portion of the training set used for model validation), and the remaining columns give the predictions from each candidate ensemble member. (When the outcome is numeric, there's only one column per candidate ensemble member. Classification requires as many columns per candidate as there are levels in the outcome variable.) They also bring along a few extra attributes to keep track of model definitions.
![A diagram representing a "data stack," a specific kind of data frame. Colored "columns" depict, in white, the true value of the outcome variable in the validation set, followed by four columns (in salmon) representing the predictions from the K-nearest neighbors model, one column (in tan) representing the linear regression model, and six (in green) representing the support vector machine model.](man/figures/data_stack.png)
Then, the data stack can be evaluated using `blend_predictions()` to determine how best to combine the outputs from each of the candidate members. In the stacking literature, this process is commonly called _metalearning_.
The outputs of each member are likely highly correlated. Thus, depending on the degree of regularization you choose, the coefficients for the inputs of (possibly) many of the members will zero out—their predictions will have no influence on the final output, and those terms will thus be thrown out.
![A diagram representing "stacking coefficients," the coefficients of the linear model combining each of the candidate member predictions to generate the ensemble's ultimate prediction. Boxes for each of the candidate members are placed besides each other, filled in with color if the coefficient for the associated candidate member is nonzero.](man/figures/coefs.png)
These stacking coefficients determine which candidates become ensemble members. Candidates with non-zero stacking coefficients are fitted on the whole training set and, together with the coefficients, make up a `model_stack` object.
![A diagram representing the "model stack" class, which collates the stacking coefficients and members (candidate members with nonzero stacking coefficients that are trained on the full training set). The representation of the stacking coefficients is as before, where the members (shown next to their associated stacking coefficients) are colored-in pentagons. Model stacks are a list subclass.](man/figures/class_model_stack.png)
This model stack object, output by `fit_members()`, is ready to predict on new data! The trained ensemble members are often referred to as _base models_ in the stacking literature.
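For diagnostics, `predict()` on a fitted model stack can also return each member's individual predictions alongside the ensemble's. A short sketch, reusing the hypothetical `ensemble` and `new_data` objects from the earlier sketch:
```{r, eval = FALSE}
# the ensemble's prediction only
predict(ensemble, new_data)

# ensemble prediction plus each member's individual predictions
predict(ensemble, new_data, members = TRUE)
```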
The full visual outline for these steps can be found [here](https://github.com/tidymodels/stacks/blob/main/inst/figs/outline.png). The API for the package closely mirrors these ideas. See the `basics` vignette for an example of how this grammar is implemented!
## contributing
This project is released with a [Contributor Code of Conduct](https://github.com/tidymodels/stacks/blob/main/.github/CODE_OF_CONDUCT.md). By contributing to this project, you agree to abide by its terms.
- For questions and discussions about tidymodels packages, modeling, and machine learning, please [post on Posit Community](https://forum.posit.co/new-topic?category_id=15&tags=tidymodels,question).
- If you think you have encountered a bug, please [submit an issue](https://github.com/tidymodels/stacks/issues).
- Either way, learn how to create and share a [reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html) (a minimal, reproducible example), to clearly communicate about your code.
- Check out further details on [contributing guidelines for tidymodels packages](https://www.tidymodels.org/contribute/) and [how to get help](https://www.tidymodels.org/help/).
In the stacks package, some test objects take too long to build with every commit. If your contribution changes the structure of `data_stack` or `model_stack` objects, please regenerate these test objects by running the scripts in `man-roxygen/example_models.Rmd`, including the chunks with the option `eval = FALSE`.