Quarto conversion (#320)
* Rename files to qmds

* Remove (PART) designation

* Remove yaml + update conditional content formatting for Quarto

* Move part headers to other files

* Delete _bookdown.yml and _output_yml and take notes of bits not yet used

* Add _quarto.yml, trying to capture all bookdown things

* css -> scss

* Add /quarto to gitignore

* Conditional formatting convert to quarto format

* No build tool at RStudio project level

* Rename exercise files to qmds

* Place _s in document names to be included

* child document -> includes

* Add some more needed packages

* More packages

* Move exercise data to top level

* Re-label footnotes

* Convert chunk headers to yaml style

* Fix up scss file

* Move code to the bottom of file

* Update pkg versions

* Make mainfont Atkinson Hyperlegible

* Add bibliography

* Update figure and table crossrefs and captions

* Update to native pipe

* Remove unused line

* Update freeze
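
The recurring change across these commits is the move from knitr-style chunk headers to Quarto's YAML-style chunk options. A minimal sketch of the pattern (`my-label` is a placeholder, not a chunk from the book):

```r
# knitr style (old): options inline in the chunk header
#   ```{r my-label, include = FALSE}

# Quarto style (new): a bare header, with options as `#|` YAML comments
#   ```{r}
#   #| label: my-label
#   #| include: false
```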
mine-cetinkaya-rundel authored Aug 3, 2023
1 parent d40aece commit df12ab4
Showing 620 changed files with 19,257 additions and 2,339 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -18,4 +18,5 @@ ims.toc
ims.tex
ims.pdf
ims.md
-ims.synctex(busy)
+ims.synctex(busy)
+/.quarto/
261 changes: 148 additions & 113 deletions 01-data-hello.Rmd → 01-data-hello.qmd

Large diffs are not rendered by default.

232 changes: 173 additions & 59 deletions 02-data-design.Rmd → 02-data-design.qmd

Large diffs are not rendered by default.

196 changes: 109 additions & 87 deletions 03-data-applications.Rmd → 03-data-applications.qmd

Large diffs are not rendered by default.

400 changes: 235 additions & 165 deletions 04-explore-categorical.Rmd → 04-explore-categorical.qmd

Large diffs are not rendered by default.

384 changes: 234 additions & 150 deletions 05-explore-numerical.Rmd → 05-explore-numerical.qmd

Large diffs are not rendered by default.

164 changes: 104 additions & 60 deletions 06-explore-applications.Rmd → 06-explore-applications.qmd

Large diffs are not rendered by default.

209 changes: 152 additions & 57 deletions 07-model-slr.Rmd → 07-model-slr.qmd

Large diffs are not rendered by default.

90 changes: 53 additions & 37 deletions 08-model-mlr.Rmd → 08-model-mlr.qmd
@@ -1,6 +1,7 @@
# Linear regression with multiple predictors {#model-mlr}

-```{r, include = FALSE}
+```{r}
+#| include: false
source("_common.R")
```

@@ -33,7 +34,8 @@ Based on the data in this dataset we have created two new variables: `credit_uti
We will refer to this modified dataset as `loans`.
:::

-```{r loans-data-matrix}
+```{r}
+#| label: loans-data-matrix
loans <- loans_full_schema %>%
mutate(
credit_util = total_credit_utilized / total_credit_limit,
@@ -53,7 +55,8 @@ loans %>%
full_width = FALSE)
```

-```{r loans-variables}
+```{r}
+#| label: loans-variables
loans_var_def <- tribble(
~variable, ~description,
"interest_rate", "Interest rate on the loan, in an annual percentage.",
@@ -84,7 +87,8 @@ $$\widehat{\texttt{interest_rate}} = 12.34 + 0.74 \times \texttt{bankruptcy}$$

Results of this model are shown in Table \@ref(tab:int-rate-bankruptcy).

-```{r int-rate-bankruptcy}
+```{r}
+#| label: int-rate-bankruptcy
m_bankruptcy <- lm(interest_rate ~ bankruptcy, data = loans)
m_bankruptcy %>%
@@ -116,11 +120,13 @@ Each row represents the relative difference for each level of `verified_income`.
However, we are missing one of the levels: `Not Verified`.
The missing level is called the **reference level** and it represents the default level that other levels are measured against.

-```{r include=FALSE}
+```{r}
+#| include: false
terms_chp_8 <- c("reference level")
```
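
To see the indicator coding behind a reference level, one can inspect the model matrix directly. This is an illustrative sketch, not code from the commit; it assumes the `loans` data frame created above, with `verified_income` stored as a factor:

```r
# The first factor level is the reference level; each remaining level
# gets its own 0/1 indicator column in the model matrix.
levels(loans$verified_income)
head(model.matrix(~ verified_income, data = loans))
```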

-```{r int-rate-ver-income}
+```{r}
+#| label: int-rate-ver-income
m_verified_income <- lm(interest_rate ~ verified_income, data = loans)
m_verified_income %>%
@@ -184,10 +190,10 @@ The average interest rate for these borrowers is 12.52%.
:::

::: {.guidedpractice data-latex=""}
-Compute the average interest rate for borrowers whose income source and amount are both verified.[^model-mlr-1]
+Compute the average interest rate for borrowers whose income source and amount are both verified.[^08-model-mlr-1]
:::

-[^model-mlr-1]: When `verified_income` takes a value of `Verified`, then the corresponding variable takes a value of 1 while the other is 0: $11.10 + 1.42 \times 0 + 3.25 \times 1 = 14.35.$ The average interest rate for these borrowers is 14.35%.
+[^08-model-mlr-1]: When `verified_income` takes a value of `Verified`, then the corresponding variable takes a value of 1 while the other is 0: $11.10 + 1.42 \times 0 + 3.25 \times 1 = 14.35.$ The average interest rate for these borrowers is 14.35%.
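
The same prediction can be reproduced with `predict()`, a sketch assuming the `m_verified_income` model fit in the chunk above:

```r
# Predicted interest rate when income source and amount are both verified
predict(m_verified_income,
        newdata = data.frame(verified_income = "Verified"))
# approximately 14.35, matching the hand computation in the footnote
```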

::: {.important data-latex=""}
**Predictors with several categories.**
@@ -197,10 +203,10 @@ For the last level that does not receive a coefficient, this is the reference le
:::

::: {.guidedpractice data-latex=""}
-Interpret the coefficients from the model above.[^model-mlr-2]
+Interpret the coefficients from the model above.[^08-model-mlr-2]
:::

-[^model-mlr-2]: Each of the coefficients gives the incremental interest rate for the corresponding level relative to the `Not Verified` level, which is the reference level.
+[^08-model-mlr-2]: Each of the coefficients gives the incremental interest rate for the corresponding level relative to the `Not Verified` level, which is the reference level.
For example, for a borrower whose income source and amount have been verified, the model predicts that they will have a 3.25% higher interest rate than a borrower who has not had their income source or amount verified.

The higher interest rate for borrowers who have verified their income source or amount is surprising.
@@ -212,10 +218,10 @@ For this reason, the borrower could be deemed higher risk, resulting in a higher
(What other confounding variables might explain this counter-intuitive relationship suggested by the model?)

::: {.guidedpractice data-latex=""}
-How much larger of an interest rate would we expect for a borrower who has verified their income source and amount vs a borrower whose income source has only been verified?[^model-mlr-3]
+How much larger of an interest rate would we expect for a borrower who has verified their income source and amount vs a borrower whose income source has only been verified?[^08-model-mlr-3]
:::

-[^model-mlr-3]: Relative to the `Not Verified` category, the `Verified` category has an interest rate of 3.25% higher, while the `Source Verified` category is only 1.42% higher.
+[^08-model-mlr-3]: Relative to the `Not Verified` category, the `Verified` category has an interest rate of 3.25% higher, while the `Source Verified` category is only 1.42% higher.
Thus, `Verified` borrowers will tend to get an interest rate about $3.25\% - 1.42\% = 1.83\%$ higher than `Source Verified` borrowers.

## Many predictors in a model
@@ -225,7 +231,8 @@ For example, we might like to use the full context of borrowers to predict the i
This is the strategy used in **multiple regression**.
While we remain cautious about making any causal interpretations using multiple regression on observational data, such models are a common first step in gaining insights or providing some evidence of a causal connection.

-```{r include=FALSE}
+```{r}
+#| include: false
terms_chp_8 <- c(terms_chp_8, "multiple regression")
```

@@ -259,7 +266,8 @@ We will discuss inference based on linear models in Chapter \@ref(inf-model-mlr)
We typically use a computer to minimize the sum of squares and compute point estimates, as shown in the sample output in Table \@ref(tab:loans-full).
Using this output, we identify $b_i,$ just as we did in the one-predictor case.

-```{r loans-full}
+```{r}
+#| label: loans-full
m_full <- lm(interest_rate ~ ., data = loans)
m_full %>%
@@ -316,16 +324,16 @@ A total of seven variables were used as predictors to fit this model: `verified_
:::

::: {.guidedpractice data-latex=""}
-Interpret the coefficient of the variable `credit_checks`.[^model-mlr-4]
+Interpret the coefficient of the variable `credit_checks`.[^08-model-mlr-4]
:::

-[^model-mlr-4]: All else held constant, for each additional inquiry into the applicant's credit during the last 12 months, we would expect the interest rate for the loan to be higher, on average, by 0.23 points.
+[^08-model-mlr-4]: All else held constant, for each additional inquiry into the applicant's credit during the last 12 months, we would expect the interest rate for the loan to be higher, on average, by 0.23 points.

::: {.guidedpractice data-latex=""}
-Compute the residual of the first observation in Table \@ref(tab:loans-data-matrix) using the full model.[^model-mlr-5]
+Compute the residual of the first observation in Table \@ref(tab:loans-data-matrix) using the full model.[^08-model-mlr-5]
:::

-[^model-mlr-5]: To compute the residual, we first need the predicted value, which we compute by plugging values into the equation from earlier.
+[^08-model-mlr-5]: To compute the residual, we first need the predicted value, which we compute by plugging values into the equation from earlier.
For example, $\texttt{verified_income}_{\texttt{Source Verified}}$ takes a value of 0, $\texttt{verified_income}_{\texttt{Verified}}$ takes a value of 1 (since the borrower's income source and amount were verified), $\texttt{debt_to_income}$ was 18.01, and so on.
This leads to a prediction of $\widehat{\texttt{interest_rate}}_1 = 17.84$.
The observed interest rate was 14.07%, which leads to a residual of $e_1 = 14.07 - 17.84 = -3.77$.
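
For reference, the same residual can be computed programmatically, a sketch assuming the `m_full` fit shown above:

```r
# Predicted value and residual for the first loan in the dataset
pred_1  <- predict(m_full, newdata = loans[1, ])
resid_1 <- loans$interest_rate[1] - pred_1
# equivalently: residuals(m_full)[1]
```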
@@ -347,17 +355,18 @@ The previous example describes a common issue in multiple regression: correlatio
We say the two predictor variables are collinear (pronounced as *co-linear*) when they are correlated, and this **multicollinearity** complicates model estimation.
While it is impossible to prevent multicollinearity from arising in observational data, experiments are usually designed to prevent predictors from being multicollinear.

-```{r include=FALSE}
+```{r}
+#| include: false
terms_chp_8 <- c(terms_chp_8, "multicollinearity")
```

::: {.guidedpractice data-latex=""}
The estimated value of the intercept is 1.89, and one might be tempted to make some interpretation of this coefficient, such as: it is the model's predicted interest rate when each of the variables takes a value of zero; that is, income source is not verified, the borrower has no debt (debt-to-income and credit utilization are zero), and so on.
Is this reasonable?
-Is there any value gained by making this interpretation?[^model-mlr-6]
+Is there any value gained by making this interpretation?[^08-model-mlr-6]
:::

-[^model-mlr-6]: Many of the variables do take a value 0 for at least one data point, and for those variables, it is reasonable.
+[^08-model-mlr-6]: Many of the variables do take a value 0 for at least one data point, and for those variables, it is reasonable.
However, one variable never takes a value of zero: `term`, which describes the length of the loan, in months.
If `term` is set to zero, then the loan must be paid back immediately; the borrower must give the money back as soon as they receive it, which means it is not a real loan.
Ultimately, the interpretation of the intercept in this setting is not insightful.
@@ -372,10 +381,10 @@ This equation remains valid in the multiple regression framework, but a small en

::: {.guidedpractice data-latex=""}
The variance of the residuals for the model given in the earlier Guided Practice is 18.53, and the variance of the interest rates in the data is 25.01.
-Calculate $R^2$ for this model.[^model-mlr-7]
+Calculate $R^2$ for this model.[^08-model-mlr-7]
:::

-[^model-mlr-7]: $R^2 = 1 - \frac{18.53}{25.01} = 0.2591$.
+[^08-model-mlr-7]: $R^2 = 1 - \frac{18.53}{25.01} = 0.2591$.

This strategy for estimating $R^2$ is acceptable when there is just a single variable.
However, it becomes less helpful when there are many variables.
@@ -401,33 +410,35 @@ where $n$ is the number of observations used to fit the model and $k$ is the num
Remember that a categorical predictor with $p$ levels will contribute $p - 1$ to the number of variables in the model.
:::
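
For reference, the adjusted $R^2$ used in the footnotes below follows the standard form

$$R_{adj}^2 = 1 - \frac{s_{\text{residuals}}^2}{s_{\text{outcome}}^2} \times \frac{n-1}{n-k-1},$$

which applies a larger correction as the number of predictors $k$ grows.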

-```{r include=FALSE}
+```{r}
+#| include: false
terms_chp_8 <- c(terms_chp_8, "adjusted R-squared")
```

Because $k$ is never negative, the adjusted $R^2$ will be smaller (often just a little smaller) than the unadjusted $R^2$.
The reasoning behind the adjusted $R^2$ lies in the **degrees of freedom** associated with each variance estimate; for the residual variance, the degrees of freedom are $n - k - 1$ in the multiple regression context.
If we were to make predictions for *new data* using our current model, we would find that the unadjusted $R^2$ would tend to be slightly overly optimistic, while the adjusted $R^2$ formula helps correct this bias.

-```{r include=FALSE}
+```{r}
+#| include: false
terms_chp_8 <- c(terms_chp_8, "degrees of freedom")
```

::: {.guidedpractice data-latex=""}
There were $n = 10{,}000$ loans in the dataset and $k = 9$ predictor variables in the model.
-Use $n$, $k$, and the variances from the earlier Guided Practice to calculate $R_{adj}^2$ for the interest rate model.[^model-mlr-8]
+Use $n$, $k$, and the variances from the earlier Guided Practice to calculate $R_{adj}^2$ for the interest rate model.[^08-model-mlr-8]
:::

-[^model-mlr-8]: $R_{adj}^2 = 1 - \frac{18.53}{25.01}\times \frac{10000-1}{10000-9-1} = 0.2584$.
+[^08-model-mlr-8]: $R_{adj}^2 = 1 - \frac{18.53}{25.01}\times \frac{10000-1}{10000-9-1} = 0.2584$.
While the difference is very small, it will be important when we fine-tune the model in the next section.

::: {.guidedpractice data-latex=""}
Suppose you added another predictor to the model, but the variance of the errors $Var(e_i)$ didn't go down.
What would happen to the $R^2$?
-What would happen to the adjusted $R^2$?[^model-mlr-9]
+What would happen to the adjusted $R^2$?[^08-model-mlr-9]
:::

-[^model-mlr-9]: The unadjusted $R^2$ would stay the same and the adjusted $R^2$ would go down.
+[^08-model-mlr-9]: The unadjusted $R^2$ would stay the same and the adjusted $R^2$ would go down.
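
This follows directly from the formula above: adding a predictor increases $k$ while, by assumption, the residual variance is unchanged, so the correction factor grows,

$$\frac{n-1}{n-(k+1)-1} > \frac{n-1}{n-k-1},$$

and a larger quantity is subtracted from 1.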

Adjusted $R^2$ could also have been used in Chapter \@ref(model-slr) where we introduced regression models with a single predictor.
However, when there is only $k = 1$ predictor, adjusted $R^2$ is very close to regular $R^2$, so this nuance isn't typically important for models with a single predictor.
@@ -439,14 +450,16 @@ Sometimes including variables that are not evidently important can actually redu
In this section, we discuss model selection strategies, which will help us eliminate variables from the model that are found to be less important.
It's common (and hip, at least in the statistical world) to refer to models that have undergone such variable pruning as **parsimonious**.

-```{r include=FALSE}
+```{r}
+#| include: false
terms_chp_8 <- c(terms_chp_8, "parsimonious")
```

In practice, the model that includes all available predictors is often referred to as the **full model**.
The full model may not be the best model, and if it isn't, we want to identify a smaller model that is preferable.

-```{r include=FALSE}
+```{r}
+#| include: false
terms_chp_8 <- c(terms_chp_8, "full model")
```

@@ -455,7 +468,8 @@ terms_chp_8 <- c(terms_chp_8, "full model")
Two common strategies for adding or removing variables in a multiple regression model are called backward elimination and forward selection.
These techniques are often referred to as **stepwise selection** strategies, because they add or delete one variable at a time as they "step" through the candidate predictors.

-```{r include=FALSE}
+```{r}
+#| include: false
terms_chp_8 <- c(terms_chp_8, "stepwise selection")
```

@@ -464,7 +478,8 @@ terms_chp_8 <- c(terms_chp_8, "stepwise selection")
**Forward selection** is the reverse of the backward elimination technique.
Instead of eliminating variables one at a time, we add variables one at a time until we cannot find any variables that improve the model any further.

-```{r include=FALSE}
+```{r}
+#| include: false
terms_chp_8 <- c(terms_chp_8, "backward elimination", "forward selection")
```
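
As an illustration, one step of backward elimination guided by adjusted $R^2$ might look like the sketch below. This is not code from the book; the predictor list is assumed from the full model described earlier, and `broom::glance()` is used as in the chunks above:

```r
library(broom)

predictors <- c("verified_income", "debt_to_income", "credit_util",
                "bankruptcy", "term", "issue_month", "credit_checks")

# Adjusted R^2 of the full model
adj_r2_full <- glance(lm(reformulate(predictors, "interest_rate"),
                         data = loans))$adj.r.squared

# Refit the model, leaving out one predictor at a time
adj_r2_drop <- sapply(predictors, function(v) {
  glance(lm(reformulate(setdiff(predictors, v), "interest_rate"),
            data = loans))$adj.r.squared
})

# If the best reduced model beats the full model, drop that predictor
# and repeat; otherwise stop and keep the current model.
adj_r2_drop[which.max(adj_r2_drop)] > adj_r2_full
```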

@@ -477,7 +492,8 @@ Adjusted $R^2$ describes the strength of a model fit, and it is a useful tool fo
Let's consider two models, which are shown in Table \@ref(tab:loans-full-for-model-selection) and Table \@ref(tab:loans-full-except-issue-month).
The first table summarizes the full model since it includes all predictors, while the second does not include the `issue_month` variable.

-```{r loans-full-for-model-selection}
+```{r}
+#| label: loans-full-for-model-selection
options(digits = 6) # to get more digits
m_full_r_sq_adj <- glance(m_full)$adj.r.squared %>% round(4)
options(digits = 3) # to get back to default set in _common.R
@@ -503,7 +519,8 @@ m_full_w_rsq %>%
row_spec(11:12, italic = TRUE)
```

-```{r loans-full-except-issue-month}
+```{r}
+#| label: loans-full-except-issue-month
m_full_minus_issue_month <- lm(interest_rate ~ . - issue_month, data = loans)
options(digits = 6) # to get more digits
@@ -704,6 +721,5 @@ make_terms_table(terms_chp_8)
Answers to odd numbered exercises can be found in Appendix \@ref(exercise-solutions-08).

::: {.exercises data-latex=""}
-```{r exercises-08, child = "exercises/08-ex-model-mlr.Rmd"}
-```
+{{< include exercises/_08-ex-model-mlr.qmd >}}
:::
