Quarto conversion (#320)
* Rename files to qmds

* Remove (PART) designation

* Remove yaml + update conditional content formatting for Quarto

* Move part headers to other files

* Delete _bookdown.yml and _output_yml and take notes of bits not yet used

* Add _quarto.yml, trying to capture all bookdown things

* css -> scss

* Add /quarto to gitignore

* Conditional formatting convert to quarto format

* No build tool at RStudio project level

* Rename exercise files to qmds

* Place _s in document names to be included

* child document -> includes

* Add some more needed packages

* More packages

* Move exercise data to top level

* Re-label footnotes

* Convert chunk headers to yaml style

* Fix up scss file

* Move code to the bottom of file

* Update pkg versions

* Make mainfont Atkinson Hyperlegible

* Add bibliography

* Update figure and table crossrefs and captions

* Update to native pipe

* Remove unused line

* Update freeze
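
The recurring change across these commits is the move from knitr-style chunk headers to Quarto's YAML-style chunk options. A minimal sketch of the pattern (`my-label` is a placeholder, not a chunk from the book):

```r
# knitr style (old): options inline in the chunk header
#   ```{r my-label, include = FALSE}

# Quarto style (new): a bare header, with options as `#|` YAML comments
#   ```{r}
#   #| label: my-label
#   #| include: false
```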
mine-cetinkaya-rundel authored Aug 3, 2023
1 parent d40aece commit df12ab4
Showing 620 changed files with 19,257 additions and 2,339 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -18,4 +18,5 @@ ims.toc
ims.tex
ims.pdf
ims.md
-ims.synctex(busy)
+ims.synctex(busy)
+/.quarto/
261 changes: 148 additions & 113 deletions 01-data-hello.Rmd → 01-data-hello.qmd

Large diffs are not rendered by default.

232 changes: 173 additions & 59 deletions 02-data-design.Rmd → 02-data-design.qmd

Large diffs are not rendered by default.

196 changes: 109 additions & 87 deletions 03-data-applications.Rmd → 03-data-applications.qmd

Large diffs are not rendered by default.

400 changes: 235 additions & 165 deletions 04-explore-categorical.Rmd → 04-explore-categorical.qmd

Large diffs are not rendered by default.

384 changes: 234 additions & 150 deletions 05-explore-numerical.Rmd → 05-explore-numerical.qmd

Large diffs are not rendered by default.

164 changes: 104 additions & 60 deletions 06-explore-applications.Rmd → 06-explore-applications.qmd

Large diffs are not rendered by default.

209 changes: 152 additions & 57 deletions 07-model-slr.Rmd → 07-model-slr.qmd

Large diffs are not rendered by default.

90 changes: 53 additions & 37 deletions 08-model-mlr.Rmd → 08-model-mlr.qmd
@@ -1,6 +1,7 @@
# Linear regression with multiple predictors {#model-mlr}

-```{r, include = FALSE}
+```{r}
+#| include: false
source("_common.R")
```

@@ -33,7 +34,8 @@ Based on the data in this dataset we have created two new variables: `credit_uti
We will refer to this modified dataset as `loans`.
:::

-```{r loans-data-matrix}
+```{r}
+#| label: loans-data-matrix
loans <- loans_full_schema %>%
mutate(
credit_util = total_credit_utilized / total_credit_limit,
@@ -53,7 +55,8 @@ loans %>%
full_width = FALSE)
```

-```{r loans-variables}
+```{r}
+#| label: loans-variables
loans_var_def <- tribble(
~variable, ~description,
"interest_rate", "Interest rate on the loan, in an annual percentage.",
@@ -84,7 +87,8 @@ $$\widehat{\texttt{interest_rate}} = 12.34 + 0.74 \times \texttt{bankruptcy}$$

Results of this model are shown in Table \@ref(tab:int-rate-bankruptcy).

-```{r int-rate-bankruptcy}
+```{r}
+#| label: int-rate-bankruptcy
m_bankruptcy <- lm(interest_rate ~ bankruptcy, data = loans)
m_bankruptcy %>%
@@ -116,11 +120,13 @@ Each row represents the relative difference for each level of `verified_income`.
However, we are missing one of the levels: `Not Verified`.
The missing level is called the **reference level** and it represents the default level that other levels are measured against.

-```{r include=FALSE}
+```{r}
+#| include: false
terms_chp_8 <- c("reference level")
```
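
To see the indicator coding behind a reference level, one can inspect the model matrix directly. This is an illustrative sketch, not code from the commit; it assumes the `loans` data frame created above, with `verified_income` stored as a factor:

```r
# The first factor level is the reference level; each remaining level
# gets its own 0/1 indicator column in the model matrix.
levels(loans$verified_income)
head(model.matrix(~ verified_income, data = loans))
```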

-```{r int-rate-ver-income}
+```{r}
+#| label: int-rate-ver-income
m_verified_income <- lm(interest_rate ~ verified_income, data = loans)
m_verified_income %>%
@@ -184,10 +190,10 @@ The average interest rate for these borrowers is 12.52%.
:::

::: {.guidedpractice data-latex=""}
-Compute the average interest rate for borrowers whose income source and amount are both verified.[^model-mlr-1]
+Compute the average interest rate for borrowers whose income source and amount are both verified.[^08-model-mlr-1]
:::

-[^model-mlr-1]: When `verified_income` takes a value of `Verified`, then the corresponding variable takes a value of 1 while the other is 0: $11.10 + 1.42 \times 0 + 3.25 \times 1 = 14.35.$ The average interest rate for these borrowers is 14.35%.
+[^08-model-mlr-1]: When `verified_income` takes a value of `Verified`, then the corresponding variable takes a value of 1 while the other is 0: $11.10 + 1.42 \times 0 + 3.25 \times 1 = 14.35.$ The average interest rate for these borrowers is 14.35%.
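
The same prediction can be reproduced with `predict()`, a sketch assuming the `m_verified_income` model fit in the chunk above:

```r
# Predicted interest rate when income source and amount are both verified
predict(m_verified_income,
        newdata = data.frame(verified_income = "Verified"))
# approximately 14.35, matching the hand computation in the footnote
```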

::: {.important data-latex=""}
**Predictors with several categories.**
@@ -197,10 +203,10 @@ For the last level that does not receive a coefficient, this is the reference le
:::

::: {.guidedpractice data-latex=""}
-Interpret the coefficients from the model above.[^model-mlr-2]
+Interpret the coefficients from the model above.[^08-model-mlr-2]
:::

-[^model-mlr-2]: Each of the coefficients gives the incremental interest rate for the corresponding level relative to the `Not Verified` level, which is the reference level.
+[^08-model-mlr-2]: Each of the coefficients gives the incremental interest rate for the corresponding level relative to the `Not Verified` level, which is the reference level.
For example, for a borrower whose income source and amount have been verified, the model predicts that they will have a 3.25% higher interest rate than a borrower who has not had their income source or amount verified.

The higher interest rate for borrowers who have verified their income source or amount is surprising.
@@ -212,10 +218,10 @@ For this reason, the borrower could be deemed higher risk, resulting in a higher
(What other confounding variables might explain this counter-intuitive relationship suggested by the model?)

::: {.guidedpractice data-latex=""}
-How much larger of an interest rate would we expect for a borrower who has verified their income source and amount vs a borrower whose income source has only been verified?[^model-mlr-3]
+How much larger of an interest rate would we expect for a borrower who has verified their income source and amount vs a borrower whose income source has only been verified?[^08-model-mlr-3]
:::

-[^model-mlr-3]: Relative to the `Not Verified` category, the `Verified` category has an interest rate of 3.25% higher, while the `Source Verified` category is only 1.42% higher.
+[^08-model-mlr-3]: Relative to the `Not Verified` category, the `Verified` category has an interest rate of 3.25% higher, while the `Source Verified` category is only 1.42% higher.
Thus, `Verified` borrowers will tend to get an interest rate about $3.25\% - 1.42\% = 1.83\%$ higher than `Source Verified` borrowers.

## Many predictors in a model
@@ -225,7 +231,8 @@ For example, we might like to use the full context of borrowers to predict the i
This is the strategy used in **multiple regression**.
While we remain cautious about making any causal interpretations using multiple regression on observational data, such models are a common first step in gaining insights or providing some evidence of a causal connection.

-```{r include=FALSE}
+```{r}
+#| include: false
terms_chp_8 <- c(terms_chp_8, "multiple regression")
```

@@ -259,7 +266,8 @@ We will discuss inference based on linear models in Chapter \@ref(inf-model-mlr)
We typically use a computer to minimize the sum of squares and compute point estimates, as shown in the sample output in Table \@ref(tab:loans-full).
Using this output, we identify $b_i,$ just as we did in the one-predictor case.

-```{r loans-full}
+```{r}
+#| label: loans-full
m_full <- lm(interest_rate ~ ., data = loans)
m_full %>%
@@ -316,16 +324,16 @@ A total of seven variables were used as predictors to fit this model: `verified_
:::

::: {.guidedpractice data-latex=""}
-Interpret the coefficient of the variable `credit_checks`.[^model-mlr-4]
+Interpret the coefficient of the variable `credit_checks`.[^08-model-mlr-4]
:::

-[^model-mlr-4]: All else held constant, for each additional inquiry into the applicant's credit during the last 12 months, we would expect the interest rate for the loan to be higher, on average, by 0.23 points.
+[^08-model-mlr-4]: All else held constant, for each additional inquiry into the applicant's credit during the last 12 months, we would expect the interest rate for the loan to be higher, on average, by 0.23 points.

::: {.guidedpractice data-latex=""}
-Compute the residual of the first observation in Table \@ref(tab:loans-data-matrix) using the full model.[^model-mlr-5]
+Compute the residual of the first observation in Table \@ref(tab:loans-data-matrix) using the full model.[^08-model-mlr-5]
:::

-[^model-mlr-5]: To compute the residual, we first need the predicted value, which we compute by plugging values into the equation from earlier.
+[^08-model-mlr-5]: To compute the residual, we first need the predicted value, which we compute by plugging values into the equation from earlier.
For example, $\texttt{verified_income}_{\texttt{Source Verified}}$ takes a value of 0, $\texttt{verified_income}_{\texttt{Verified}}$ takes a value of 1 (since the borrower's income source and amount were verified), $\texttt{debt_to_income}$ was 18.01, and so on.
This leads to a prediction of $\widehat{\texttt{interest_rate}}_1 = 17.84$.
The observed interest rate was 14.07%, which leads to a residual of $e_1 = 14.07 - 17.84 = -3.77$.
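
For reference, the same residual can be computed programmatically, a sketch assuming the `m_full` fit shown above:

```r
# Predicted value and residual for the first loan in the dataset
pred_1  <- predict(m_full, newdata = loans[1, ])
resid_1 <- loans$interest_rate[1] - pred_1
# equivalently: residuals(m_full)[1]
```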
@@ -347,17 +355,18 @@ The previous example describes a common issue in multiple regression: correlatio
We say the two predictor variables are collinear (pronounced as *co-linear*) when they are correlated, and this **multicollinearity** complicates model estimation.
While it is impossible to prevent multicollinearity from arising in observational data, experiments are usually designed to prevent predictors from being multicollinear.

-```{r include=FALSE}
+```{r}
+#| include: false
terms_chp_8 <- c(terms_chp_8, "multicollinearity")
```

::: {.guidedpractice data-latex=""}
The estimated value of the intercept is 1.89, and one might be tempted to make some interpretation of this coefficient, such as: it is the model's predicted interest rate when each of the variables takes a value of zero; that is, income source is not verified, the borrower has no debt (debt-to-income and credit utilization are zero), and so on.
Is this reasonable?
-Is there any value gained by making this interpretation?[^model-mlr-6]
+Is there any value gained by making this interpretation?[^08-model-mlr-6]
:::

-[^model-mlr-6]: Many of the variables do take a value 0 for at least one data point, and for those variables, it is reasonable.
+[^08-model-mlr-6]: Many of the variables do take a value 0 for at least one data point, and for those variables, it is reasonable.
However, one variable never takes a value of zero: `term`, which describes the length of the loan, in months.
If `term` is set to zero, then the loan must be paid back immediately; the borrower must give the money back as soon as they receive it, which means it is not a real loan.
Ultimately, the interpretation of the intercept in this setting is not insightful.
@@ -372,10 +381,10 @@ This equation remains valid in the multiple regression framework, but a small en

::: {.guidedpractice data-latex=""}
The variance of the residuals for the model given in the earlier Guided Practice is 18.53, and the variance of the interest rates in the data is 25.01.
-Calculate $R^2$ for this model.[^model-mlr-7]
+Calculate $R^2$ for this model.[^08-model-mlr-7]
:::

-[^model-mlr-7]: $R^2 = 1 - \frac{18.53}{25.01} = 0.2591$.
+[^08-model-mlr-7]: $R^2 = 1 - \frac{18.53}{25.01} = 0.2591$.

This strategy for estimating $R^2$ is acceptable when there is just a single variable.
However, it becomes less helpful when there are many variables.
@@ -401,33 +410,35 @@ where $n$ is the number of observations used to fit the model and $k$ is the num
Remember that a categorical predictor with $p$ levels will contribute $p - 1$ to the number of variables in the model.
:::
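
For reference, the adjusted $R^2$ used in the footnotes below follows the standard form

$$R_{adj}^2 = 1 - \frac{s_{\text{residuals}}^2}{s_{\text{outcome}}^2} \times \frac{n-1}{n-k-1},$$

which applies a larger correction as the number of predictors $k$ grows.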

-```{r include=FALSE}
+```{r}
+#| include: false
terms_chp_8 <- c(terms_chp_8, "adjusted R-squared")
```

Because $k$ is never negative, the adjusted $R^2$ will be smaller (often just a little smaller) than the unadjusted $R^2$.
The reasoning behind the adjusted $R^2$ lies in the **degrees of freedom** associated with each variance estimate; for the residual variance, the degrees of freedom are $n - k - 1$ in the multiple regression context.
If we were to make predictions for *new data* using our current model, we would find that the unadjusted $R^2$ would tend to be slightly overly optimistic, while the adjusted $R^2$ formula helps correct this bias.

-```{r include=FALSE}
+```{r}
+#| include: false
terms_chp_8 <- c(terms_chp_8, "degrees of freedom")
```

::: {.guidedpractice data-latex=""}
There were $n = 10{,}000$ loans in the dataset and $k = 9$ predictor variables in the model.
-Use $n$, $k$, and the variances from the earlier Guided Practice to calculate $R_{adj}^2$ for the interest rate model.[^model-mlr-8]
+Use $n$, $k$, and the variances from the earlier Guided Practice to calculate $R_{adj}^2$ for the interest rate model.[^08-model-mlr-8]
:::

-[^model-mlr-8]: $R_{adj}^2 = 1 - \frac{18.53}{25.01}\times \frac{10000-1}{10000-9-1} = 0.2584$.
+[^08-model-mlr-8]: $R_{adj}^2 = 1 - \frac{18.53}{25.01}\times \frac{10000-1}{10000-9-1} = 0.2584$.
While the difference is very small, it will be important when we fine-tune the model in the next section.

::: {.guidedpractice data-latex=""}
Suppose you added another predictor to the model, but the variance of the errors $Var(e_i)$ didn't go down.
What would happen to the $R^2$?
-What would happen to the adjusted $R^2$?[^model-mlr-9]
+What would happen to the adjusted $R^2$?[^08-model-mlr-9]
:::

-[^model-mlr-9]: The unadjusted $R^2$ would stay the same and the adjusted $R^2$ would go down.
+[^08-model-mlr-9]: The unadjusted $R^2$ would stay the same and the adjusted $R^2$ would go down.
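
This follows directly from the formula above: adding a predictor increases $k$ while, by assumption, the residual variance is unchanged, so the correction factor grows,

$$\frac{n-1}{n-(k+1)-1} > \frac{n-1}{n-k-1},$$

and a larger quantity is subtracted from 1.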

Adjusted $R^2$ could also have been used in Chapter \@ref(model-slr) where we introduced regression models with a single predictor.
However, when there is only $k = 1$ predictor, adjusted $R^2$ is very close to regular $R^2$, so this nuance isn't typically important for models with a single predictor.
@@ -439,14 +450,16 @@ Sometimes including variables that are not evidently important can actually redu
In this section, we discuss model selection strategies, which will help us eliminate variables from the model that are found to be less important.
It's common (and hip, at least in the statistical world) to refer to models that have undergone such variable pruning as **parsimonious**.

-```{r include=FALSE}
+```{r}
+#| include: false
terms_chp_8 <- c(terms_chp_8, "parsimonious")
```

In practice, the model that includes all available predictors is often referred to as the **full model**.
The full model may not be the best model, and if it isn't, we want to identify a smaller model that is preferable.

-```{r include=FALSE}
+```{r}
+#| include: false
terms_chp_8 <- c(terms_chp_8, "full model")
```

@@ -455,7 +468,8 @@ terms_chp_8 <- c(terms_chp_8, "full model")
Two common strategies for adding or removing variables in a multiple regression model are called backward elimination and forward selection.
These techniques are often referred to as **stepwise selection** strategies, because they add or delete one variable at a time as they "step" through the candidate predictors.

-```{r include=FALSE}
+```{r}
+#| include: false
terms_chp_8 <- c(terms_chp_8, "stepwise selection")
```

@@ -464,7 +478,8 @@ terms_chp_8 <- c(terms_chp_8, "stepwise selection")
**Forward selection** is the reverse of the backward elimination technique.
Instead of eliminating variables one at a time, we add variables one at a time until we cannot find any variables that improve the model any further.

-```{r include=FALSE}
+```{r}
+#| include: false
terms_chp_8 <- c(terms_chp_8, "backward elimination", "forward selection")
```
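
As an illustration, one step of backward elimination guided by adjusted $R^2$ might look like the sketch below. This is not code from the book; the predictor list is assumed from the full model described earlier, and `broom::glance()` is used as in the chunks above:

```r
library(broom)

predictors <- c("verified_income", "debt_to_income", "credit_util",
                "bankruptcy", "term", "issue_month", "credit_checks")

# Adjusted R^2 of the full model
adj_r2_full <- glance(lm(reformulate(predictors, "interest_rate"),
                         data = loans))$adj.r.squared

# Refit the model, leaving out one predictor at a time
adj_r2_drop <- sapply(predictors, function(v) {
  glance(lm(reformulate(setdiff(predictors, v), "interest_rate"),
            data = loans))$adj.r.squared
})

# If the best reduced model beats the full model, drop that predictor
# and repeat; otherwise stop and keep the current model.
adj_r2_drop[which.max(adj_r2_drop)] > adj_r2_full
```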

@@ -477,7 +492,8 @@ Adjusted $R^2$ describes the strength of a model fit, and it is a useful tool fo
Let's consider two models, which are shown in Table \@ref(tab:loans-full-for-model-selection) and Table \@ref(tab:loans-full-except-issue-month).
The first table summarizes the full model since it includes all predictors, while the second does not include the `issue_month` variable.

-```{r loans-full-for-model-selection}
+```{r}
+#| label: loans-full-for-model-selection
options(digits = 6) # to get more digits
m_full_r_sq_adj <- glance(m_full)$adj.r.squared %>% round(4)
options(digits = 3) # to get back to default set in _common.R
@@ -503,7 +519,8 @@ m_full_w_rsq %>%
row_spec(11:12, italic = TRUE)
```

-```{r loans-full-except-issue-month}
+```{r}
+#| label: loans-full-except-issue-month
m_full_minus_issue_month <- lm(interest_rate ~ . - issue_month, data = loans)
options(digits = 6) # to get more digits
@@ -704,6 +721,5 @@ make_terms_table(terms_chp_8)
Answers to odd numbered exercises can be found in Appendix \@ref(exercise-solutions-08).

::: {.exercises data-latex=""}
-```{r exercises-08, child = "exercises/08-ex-model-mlr.Rmd"}
-```
+{{< include exercises/_08-ex-model-mlr.qmd >}}
:::
