merge pr #83: add alt text
simonpcouch authored May 20, 2021
2 parents 35cb85c + 728f89b commit 7db8d82
Showing 3 changed files with 54 additions and 30 deletions.
10 changes: 5 additions & 5 deletions README.Rmd
@@ -51,25 +51,25 @@ Rather than diving right into the implementation, we'll focus here on how the pi

At the highest level, ensembles are formed from _model definitions_. In this package, model definitions are an instance of a minimal [workflow](https://workflows.tidymodels.org/), containing a _model specification_ (as defined in the [parsnip](https://parsnip.tidymodels.org/) package) and, optionally, a _preprocessor_ (as defined in the [recipes](https://recipes.tidymodels.org/) package). Model definitions specify the form of candidate ensemble members.

![](man/figures/model_defs.png)
![A diagram representing "model definitions," which specify the form of candidate ensemble members. Three colored boxes represent three different model types; a K-nearest neighbors model (in salmon), a linear regression model (in yellow), and a support vector machine model (in green).](man/figures/model_defs.png)

To be used in the same ensemble, each of these model definitions must share the same _resample_. This [rsample](https://rsample.tidymodels.org/) `rset` object, when paired with the model definitions, can be used to generate the tuning/fitting results objects for the candidate _ensemble members_ with tune.

![](man/figures/candidates.png)
![A diagram representing "candidate members" generated from each model definition. Four salmon-colored boxes labeled "KNN" represent K-nearest neighbors models trained on the resamples with differing hyperparameters. Similarly, the linear regression model generates one candidate member, and the support vector machine model generates six.](man/figures/candidates.png)

Candidate members first come together in a `data_stack` object through the `add_candidates()` function. Principally, these objects are just [tibble](https://tibble.tidyverse.org/)s, where the first column gives the true outcome in the assessment set (the portion of the training set used for model validation), and the remaining columns give the predictions from each candidate ensemble member. (When the outcome is numeric, there's only one column per candidate ensemble member. Classification requires as many columns per candidate as there are levels in the outcome variable.) They also bring along a few extra attributes to keep track of model definitions.

![](man/figures/data_stack.png)
![A diagram representing a "data stack," a specific kind of data frame. Colored "columns" depict, in white, the true value of the outcome variable in the validation set, followed by four columns (in salmon) representing the predictions from the K-nearest neighbors model, one column (in tan) representing the linear regression model, and six (in green) representing the support vector machine model.](man/figures/data_stack.png)

Then, the data stack can be evaluated using `blend_predictions()` to determine how best to combine the outputs from each of the candidate members. In the stacking literature, this process is commonly called _metalearning_.

The outputs of each member are likely highly correlated. Thus, depending on the degree of regularization you choose, the coefficients for the inputs of (possibly) many of the members will zero out—their predictions will have no influence on the final output, and those terms will thus be thrown out.

![](man/figures/coefs.png)
![A diagram representing "stacking coefficients," the coefficients of the linear model combining each of the candidate member predictions to generate the ensemble's ultimate prediction. Boxes for each of the candidate members are placed beside each other, filled in with color if the coefficient for the associated candidate member is nonzero.](man/figures/coefs.png)

These stacking coefficients determine which candidate ensemble members will become ensemble members. Candidates with non-zero stacking coefficients are then fitted on the whole training set, altogether making up a `model_stack` object.

![](man/figures/class_model_stack.png)
![A diagram representing the "model stack" class, which collates the stacking coefficients and members (candidate members with nonzero stacking coefficients that are trained on the full training set). The representation of the stacking coefficients is as before, where the members (shown next to their associated stacking coefficients) are colored-in pentagons. Model stacks are a list subclass.](man/figures/class_model_stack.png)

This model stack object, outputted from `fit_members()`, is ready to predict on new data! The trained ensemble members are often referred to as _base models_ in the stacking literature.

62 changes: 43 additions & 19 deletions README.md
@@ -48,35 +48,33 @@ remotes::install_github("tidymodels/stacks", ref = "main")

stacks is generalized with respect to:

- Model type: Any model type implemented in
- Model type: Any model type implemented in
[parsnip](https://parsnip.tidymodels.org/) or adjacent packages is
fair game to add to a stacks model stack.
[Here](https://www.tidymodels.org/find/parsnip/)’s a table of many
of the implemented model types in the tidymodels core, with a link
there to an article about implementing your own model classes as
well.
- Cross-validation scheme: Any resampling algorithm implemented in
- Cross-validation scheme: Any resampling algorithm implemented in
[rsample](https://rsample.tidymodels.org/) or adjacent packages is
fair game for resampling data for use in training a model stack.
- Error metric: Any metric function implemented in
- Error metric: Any metric function implemented in
[yardstick](https://yardstick.tidymodels.org/) or adjacent packages
is fair game for evaluating model stacks and their members. That
package provides some infrastructure for creating your own metric
functions as well\!
functions as well!

stacks uses a regularized linear model to combine predictions from
ensemble members, though this model type is only one of many possible
learning algorithms that could be used to fit a stacked ensemble model.
For implementations of additional ensemble learning algorithms, check
out
[h2o](http://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/h2o.stackedEnsemble.html)
and
[SuperLearner](https://CRAN.R-project.org/package=SuperLearner).
and [SuperLearner](https://CRAN.R-project.org/package=SuperLearner).

Rather than diving right into the implementation, we’ll focus here on
how the pieces fit together, conceptually, in building an ensemble with
`stacks`. See the `basics` vignette for an example of the API in
action\!
`stacks`. See the `basics` vignette for an example of the API in action!

## a grammar

@@ -89,15 +87,24 @@ specification* (as defined in the
[recipes](https://recipes.tidymodels.org/) package). Model definitions
specify the form of candidate ensemble members.

![](man/figures/model_defs.png)
![A diagram representing “model definitions,” which specify the form of
candidate ensemble members. Three colored boxes represent three
different model types; a K-nearest neighbors model (in salmon), a linear
regression model (in yellow), and a support vector machine model (in
green).](man/figures/model_defs.png)

To be used in the same ensemble, each of these model definitions must
share the same *resample*. This
[rsample](https://rsample.tidymodels.org/) `rset` object, when paired
with the model definitions, can be used to generate the tuning/fitting
results objects for the candidate *ensemble members* with tune.

![](man/figures/candidates.png)
![A diagram representing “candidate members” generated from each model
definition. Four salmon-colored boxes labeled “KNN” represent K-nearest
neighbors models trained on the resamples with differing
hyperparameters. Similarly, the linear regression model generates one
candidate member, and the support vector machine model generates
six.](man/figures/candidates.png)

Candidate members first come together in a `data_stack` object through
the `add_candidates()` function. Principally, these objects are just
@@ -110,7 +117,13 @@ Classification requires as many columns per candidate as there are
levels in the outcome variable.) They also bring along a few extra
attributes to keep track of model definitions.

![](man/figures/data_stack.png)
![A diagram representing a “data stack,” a specific kind of data frame.
Colored “columns” depict, in white, the true value of the outcome
variable in the validation set, followed by four columns (in salmon)
representing the predictions from the K-nearest neighbors model, one
column (in tan) representing the linear regression model, and six (in
green) representing the support vector machine
model.](man/figures/data_stack.png)

Then, the data stack can be evaluated using `blend_predictions()` to
determine how best to combine the outputs from each of the candidate
@@ -123,43 +136,54 @@ inputs of (possibly) many of the members will zero out—their predictions
will have no influence on the final output, and those terms will thus be
thrown out.

![](man/figures/coefs.png)
![A diagram representing “stacking coefficients,” the coefficients of
the linear model combining each of the candidate member predictions to
generate the ensemble’s ultimate prediction. Boxes for each of the
candidate members are placed beside each other, filled in with color if
the coefficient for the associated candidate member is
nonzero.](man/figures/coefs.png)

These stacking coefficients determine which candidate ensemble members
will become ensemble members. Candidates with non-zero stacking
coefficients are then fitted on the whole training set, altogether
making up a `model_stack` object.

![](man/figures/class_model_stack.png)
![A diagram representing the “model stack” class, which collates the
stacking coefficients and members (candidate members with nonzero
stacking coefficients that are trained on the full training set). The
representation of the stacking coefficients is as before, where the
members (shown next to their associated stacking coefficients) are
colored-in pentagons. Model stacks are a list
subclass.](man/figures/class_model_stack.png)

This model stack object, outputted from `fit_members()`, is ready to
predict on new data\! The trained ensemble members are often referred to
predict on new data! The trained ensemble members are often referred to
as *base models* in the stacking literature.

The full visual outline for these steps can be found
[here](https://github.com/tidymodels/stacks/blob/main/inst/figs/outline.png).
The API for the package closely mirrors these ideas. See the `basics`
vignette for an example of how this grammar is implemented\!
vignette for an example of how this grammar is implemented!
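
As a rough companion to the grammar above, here is a minimal sketch of the whole pipeline. It is not taken from the package documentation: the `mtcars` data, the two model definitions, the five-fold resample, and every object name (`folds`, `rec`, `lm_res`, `knn_res`, `model_st`) are assumptions chosen for illustration, and the K-nearest neighbors definition assumes the kknn package is installed.

```r
library(tidymodels)  # parsnip, recipes, rsample, tune, workflows
library(stacks)

set.seed(1)

# The shared resample (an rsample rset) and an optional preprocessor.
folds <- vfold_cv(mtcars, v = 5)
rec   <- recipe(mpg ~ ., data = mtcars)

# Model definition 1: linear regression, which yields one candidate member.
lm_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg() %>% set_engine("lm"))

lm_res <- fit_resamples(
  lm_wf,
  resamples = folds,
  control = control_stack_resamples()  # keeps predictions and the workflow
)

# Model definition 2: K-nearest neighbors, tuned over a small grid, so it
# yields several candidate members (one per hyperparameter value).
knn_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(
    nearest_neighbor(mode = "regression", neighbors = tune()) %>%
      set_engine("kknn")
  )

knn_res <- tune_grid(
  knn_wf,
  resamples = folds,
  grid = 4,
  control = control_stack_grid()
)

# data stack -> stacking coefficients -> fitted members
model_st <- stacks() %>%
  add_candidates(lm_res) %>%
  add_candidates(knn_res) %>%
  blend_predictions() %>%
  fit_members()

# The fitted model stack predicts like any other model object.
predict(model_st, new_data = mtcars[1:3, ])
```

The `control_stack_*()` helpers matter here: they ask tune to save the assessment set predictions and the workflow, which `add_candidates()` needs in order to build the data stack.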

## contributing

This project is released with a [Contributor Code of
Conduct](https://github.com/tidymodels/stacks/blob/main/CODE_OF_CONDUCT.md).
By contributing to this project, you agree to abide by its terms.

- For questions and discussions about tidymodels packages, modeling,
- For questions and discussions about tidymodels packages, modeling,
and machine learning, please [post on RStudio
Community](https://community.rstudio.com/new-topic?category_id=15&tags=tidymodels,question).

- If you think you have encountered a bug, please [submit an
- If you think you have encountered a bug, please [submit an
issue](https://github.com/tidymodels/stacks/issues).

- Either way, learn how to create and share a
- Either way, learn how to create and share a
[reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html)
(a minimal, reproducible example), to clearly communicate about your
code.

- Check out further details on [contributing guidelines for tidymodels
- Check out further details on [contributing guidelines for tidymodels
packages](https://www.tidymodels.org/contribute/) and [how to get
help](https://www.tidymodels.org/help/).

12 changes: 6 additions & 6 deletions vignettes/basics.Rmd
@@ -75,7 +75,7 @@ Let's give this a go!

At the highest level, ensembles are formed from _model definitions_. In this package, model definitions are an instance of a minimal [`workflow`](https://workflows.tidymodels.org/), containing a _model specification_ (as defined in the [`parsnip`](https://parsnip.tidymodels.org/) package) and, optionally, a _preprocessor_ (as defined in the [`recipes`](https://recipes.tidymodels.org/) package). Model definitions specify the form of candidate ensemble members.

```{r, echo = FALSE}
```{r, echo = FALSE, fig.alt = "A diagram representing 'model definitions,' which specify the form of candidate ensemble members. Three colored boxes represent three different model types; a K-nearest neighbors model (in salmon), a linear regression model (in yellow), and a support vector machine model (in green)."}
knitr::include_graphics("https://raw.githubusercontent.com/tidymodels/stacks/main/man/figures/model_defs.png")
```

@@ -252,7 +252,7 @@ svm_res

Altogether, we've created three model definitions, where the K-nearest neighbors model definition specifies 4 model configurations, the linear regression specifies 1, and the support vector machine specifies 6.

```{r, echo = FALSE}
```{r, echo = FALSE, fig.alt = "A diagram representing 'candidate members' generated from each model definition. Four salmon-colored boxes labeled 'KNN' represent K-nearest neighbors models trained on the resamples with differing hyperparameters. Similarly, the linear regression (LM) model generates one candidate member, and the support vector machine (SVM) model generates six."}
knitr::include_graphics("https://raw.githubusercontent.com/tidymodels/stacks/main/man/figures/candidates.png")
```

@@ -262,7 +262,7 @@ With these three model definitions fully specified, we are ready to begin stacki

The first step to building an ensemble with stacks is to create a `data_stack` object—in this package, data stacks are tibbles (with some extra attributes) that contain the assessment set predictions for each candidate ensemble member.

```{r, echo = FALSE}
```{r, echo = FALSE, fig.alt = "A diagram representing a 'data stack,' a specific kind of data frame. Colored 'columns' depict, in white, the true value of the outcome variable in the validation set, followed by four columns (in salmon) representing the predictions from the K-nearest neighbors model, one column (in tan) representing the linear regression model, and six (in green) representing the support vector machine model."}
knitr::include_graphics("https://raw.githubusercontent.com/tidymodels/stacks/main/man/figures/data_stack.png")
```
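
To make the shape of a data stack concrete, the following base R sketch builds a plain data frame laid out the way the diagram describes: the true outcome first, then one prediction column per candidate member (four K-nearest neighbors, one linear regression, six support vector machine). The values are simulated and the column names are hypothetical; a real `data_stack` is a tibble created by `add_candidates()` and carries extra attributes that this sketch omits.

```r
set.seed(123)

n <- 6

# Simulated true outcome values from the assessment sets.
truth <- rnorm(n, mean = 100, sd = 20)

# Hypothetical candidate names: 4 KNN, 1 linear regression, 6 SVM.
candidates <- c(paste0("knn_", 1:4), "lm_1", paste0("svm_", 1:6))

# Each candidate's "predictions" are the truth plus noise (illustration only).
preds <- sapply(candidates, function(nm) truth + rnorm(n, sd = 5))

# Outcome column first, then one column per candidate for a numeric outcome.
# A classification outcome would need one column per candidate per class level.
data_stack_sketch <- data.frame(latency = truth, preds)

str(data_stack_sketch)
```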

@@ -308,7 +308,7 @@ tree_frogs_model_st <-

The `blend_predictions` function determines how member model output will ultimately be combined in the final prediction by fitting a LASSO model on the data stack, predicting the true assessment set outcome using the predictions from each of the candidate members. Candidates with nonzero stacking coefficients become members.

```{r, echo = FALSE}
```{r, echo = FALSE, fig.alt = "A diagram representing 'stacking coefficients,' the coefficients of the linear model combining each of the candidate member predictions to generate the ensemble's ultimate prediction. Boxes for each of the candidate members are placed beside each other, filled in with color if the coefficient for the associated candidate member is nonzero."}
knitr::include_graphics("https://raw.githubusercontent.com/tidymodels/stacks/main/man/figures/coefs.png")
```
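
The effect of the LASSO penalty on correlated candidates can be sketched outside of stacks. The example below calls glmnet directly on simulated predictions; the data, the candidate names, and the penalty value `lambda = 0.1` are all assumptions made for illustration, and this is not the exact call that `blend_predictions()` makes internally.

```r
library(glmnet)

set.seed(42)

n <- 200
truth <- rnorm(n)

# Five hypothetical candidates: noisy copies of the truth, so their
# predictions are highly correlated with one another.
noise_sd <- c(0.2, 0.25, 0.3, 1, 1.5)
preds <- sapply(noise_sd, function(s) truth + rnorm(n, sd = s))
colnames(preds) <- paste0("cand_", seq_along(noise_sd))

# alpha = 1 requests the LASSO penalty; lambda sets its strength.
fit <- glmnet(preds, truth, alpha = 1, lambda = 0.1)

# Stacking coefficients: candidates whose coefficient is exactly zero
# would be dropped from the ensemble.
coef(fit)
```

Raising `lambda` typically pushes more coefficients to zero, which is the trade-off the degree of regularization controls.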

@@ -339,13 +339,13 @@ tree_frogs_model_st <-
fit_members()
```

```{r, echo = FALSE}
```{r, echo = FALSE, fig.alt = "A diagram representing the ensemble members, where each are pentagons labeled and colored-in according to the candidate members they arose from."}
knitr::include_graphics("https://raw.githubusercontent.com/tidymodels/stacks/main/man/figures/members.png")
```

Model stacks can be thought of as a group of fitted member models and a set of instructions on how to combine their predictions.

```{r, echo = FALSE}
```{r, echo = FALSE, fig.alt = "A diagram representing the 'model stack' class, which collates the stacking coefficients and members (candidate members with nonzero stacking coefficients that are trained on the full training set). The representation of the stacking coefficients and members is as before. Model stacks are a list subclass."}
knitr::include_graphics("https://raw.githubusercontent.com/tidymodels/stacks/main/man/figures/class_model_stack.png")
```
