diff --git a/highdim/dimension-reduction.qmd b/highdim/dimension-reduction.qmd index 35cee92..bf1f1a6 100644 --- a/highdim/dimension-reduction.qmd +++ b/highdim/dimension-reduction.qmd @@ -1,8 +1,8 @@ # Dimension reduction -A typical machine learning task involves working with a large number of predictors, which makes visualization somewhat challenging. We have shown methods for visualizing univariate and paired data, but plots that reveal relationships between many variables are more complicated in higher dimensions. For example, to compare each of the 784 features in our predicting digits example, we would have to create, for example, 306,936 scatterplots. Creating one single scatter-plot of the data is impossible due to the high dimensionality. +A typical machine learning task involves working with a large number of predictors, which makes visualization somewhat challenging. We have shown methods for visualizing univariate and paired data, but plots that reveal relationships between many variables are more complicated in higher dimensions. For example, to compare each of the 784 features in our predicting digits example, we would have to create 306,936 scatterplots. Creating one single scatterplot of the data is impossible due to the high dimensionality. -Here we describe powerful techniques useful for exploratory data analysis, among other things, generally referred to as *dimension reduction*. The general idea is to reduce the dimension of the dataset while preserving important characteristics, such as the distance between features or observations. With fewer dimensions, visualization then becomes more feasible. The general technique behind it all, the singular value decomposition, is also useful in other contexts. Principal component analysis (PCA) is the approach we will be showing. We will motivate the ideas with a simple illustrative example and present some mathematical concepts needed to understand PCA. We finish the chapter with demonstrating the use of PCA in two more complex datasets. +Here we describe powerful techniques useful for exploratory data analysis, among other things, generally referred to as *dimension reduction*. The general idea is to reduce the dimension of the dataset while preserving important characteristics, such as the distance between features or observations. With fewer dimensions, visualization then becomes more feasible. The general technique behind it all, the singular value decomposition, is also useful in other contexts. Principal component analysis (PCA) is the approach we will be showing. We will motivate the ideas with a simple illustrative example and present some mathematical concepts needed to understand PCA. We finish the chapter by demonstrating the use of PCA in two more complex datasets. ## Motivation: preserving distance @@ -30,7 +30,7 @@ lines(x[c(2, 51),], col = "red", lwd = 2) points(x[c(1, 2, 51),], pch = 16) ``` -Our features are $n$ two-dimensional points, the two heights. For illustrative purposes, we will act as if visualizing two dimensions is too challenging and we want to explore the data through a histogram of a one-dimensional variable. We therefore want to reduce the dimensions from two to one, but still be able to understand important characteristics of the data, for example that the observations cluster into two groups: adults and children. 
To show the ideas presented here are generally useful, we will standardize the data so that observations are in standard units rather than inches:
+Our features are $n$ two-dimensional points, the two heights. For illustrative purposes, we will pretend that visualizing two dimensions is too challenging and we want to explore the data through a histogram of a one-dimensional variable. We therefore want to reduce the dimensions from two to one, but still be able to understand important characteristics of the data, in particular that the observations cluster into two groups: adults and children. To show the ideas presented here are generally useful, we will standardize the data so that observations are in standard units rather than inches:

```{r}
library(matrixStats)
x <- sweep(x, 2, colMeans(x))
x <- sweep(x, 2, colSds(x), "/")
```

-In the figure above we show the distance between observation 1 and 2 (blue), and observation 1 and 51 (red). Note that the blue line is shorter, which implies 1 and 2 are closer.
+In the figure above, we show the distance between observation 1 and 2 (blue), and observation 1 and 51 (red). Note that the blue line is shorter, which implies that 1 and 2 are closer.

We can compute these distances using `dist`:

@@ -56,13 +56,13 @@ Let's start with the naive approach of simply removing one of the two dimensions
z <- x[,1]
```

-To make the distances comparable, we divide the sum of squares by the number of dimensions being added. So for the two dimensional case we have
+To make the distances comparable, we divide the sum of squares by the number of dimensions being added. So for the two dimensional case, we have:

$$
\sqrt{ \frac{1}{2} \sum_{j=1}^2 (x_{1,j}-x_{2,j})^2 },
$$

-so to make the distances comparable we divide by $\sqrt{2}$:
+so, to compare with the one-dimensional summary, we divide the two-dimensional distance by $\sqrt{2}$:

```{r}
#| eval: false
@@ -76,9 +76,9 @@ plot(dist(x) / sqrt(2), dist(z))
abline(0, 1, col = "red")
```

-This one number summary does ok at preserving distances, but, can we pick a one-dimensional summary that makes the approximation even better?
+This one-number summary does a reasonable job of preserving distances, but can we pick a one-dimensional summary that improves the approximation?

-If we look back at the scatterplot and visualize a line between any pair of points, the length of this line is the distance between the two points. These lines tend to go along the direction of the diagonal. We will learn that we can *rotate* the points in a way that preserve the distance between points, while increasing the variability in one dimension and reducing it on the other. By doing this, we keep more of the *information* about distances in the first dimension. In the next section we describe a mathematical approach that permits us to find rotations that preserve distance between points. We can then find the rotation that maximizes the variability in the first dimension.
+If we look back at the scatterplot and visualize a line between any pair of points, the length of this line is the distance between the two points. These lines tend to go along the direction of the diagonal. We will learn that we can *rotate* the points in a way that preserves the distance between points, while increasing the variability in one dimension and reducing it in the other. Using this method, we keep more of the *information* about distances in the first dimension. 
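To preview where this is going, here is a short sketch (using the standardized `x` defined above; the specific -45 degree angle is justified later in the chapter) of what such a rotation accomplishes:

```{r}
#| eval: false
# Rotate the two standardized height columns by -45 degrees (A is an
# illustrative rotation matrix; the chapter derives it formally below).
theta <- 2*pi*-45/360
A <- matrix(c(cos(theta), -sin(theta), sin(theta), cos(theta)), 2, 2)
z <- x %*% A
# Most of the variability now sits in the first column...
colSums(z^2)/sum(z^2)
# ...so the first rotated coordinate alone approximates the distances well.
plot(dist(x), dist(z[,1]))
abline(0, 1, col = "red")
```
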
In the next section, we describe a mathematical approach that permits us to find rotations that preserve distance between points. We can then find the rotation that maximizes the variability in the first dimension. ## Rotations @@ -88,9 +88,9 @@ $$ x_1 = r \cos\phi, \,\, x_2 = r \sin\phi $$ -with $r$ the length of the hypotenuse and $\phi$ the angel between the hypotenuse and the x-axis. +with $r$ the length of the hypotenuse and $\phi$ the angle between the hypotenuse and the x-axis. -We can *rotate* the point $(x_1, x_2)^\top$ around a circle with center $(0,0)^\top$ and radius $r$ by an angle $\theta$ by changing the angle in the previous equation to $\phi + \theta$: +FIX We can *rotate* the point $(x_1, x_2)^\top$ around a circle with center $(0,0)^\top$ and radius $r$ by an angle $\theta$ by changing the angle in the previous equation to $\phi + \theta$: $$ z_1 = r \cos(\phi+ \theta), \,\, @@ -125,7 +125,7 @@ text(cos(phi + theta), sin(phi + theta), draw.circle(theta, start = phi, r = 0.3) ``` -We can use trigonometric identities to rewrite $(z_1, z_2)$ in the following way: +We can use trigonometric identities to rewrite $(z_1, z_2)$ as follows: $$ \begin{align} @@ -151,11 +151,11 @@ lines(z[c(2,51),], col = "red", lwd = 2) points(z[c(1,2,51),], pch = 16) ``` -Note that while the variability of $x_1$ and $x_2$ are similar, the variability of $z_1$ is much larger than the variability of $z_2$. Also note that the distances between points appear to be preserved. In the next sections, we show, mathematically, that this in fact the case. +Note that while the variability of $x_1$ and $x_2$ are similar, the variability of $z_1$ is much larger than the variability of $z_2$. Also, notice that the distances between points appear to be preserved. In the next sections, we show mathematically that this in fact the case. ## Linear transformations -Any time a matrix $\mathbf{X}$ is multiplied by another matrix $\mathbf{A}$, we refer to the product $\mathbf{Z} = \mathbf{X}\mathbf{A}$ as a linear transformation of $\mathbf{X}$. Below we show that the rotations described above are a linear transformation. To see this, note that for any row $i$, the first entry was: +Any time a matrix $\mathbf{X}$ is multiplied by another matrix $\mathbf{A}$, we refer to the product $\mathbf{Z} = \mathbf{X}\mathbf{A}$ as a linear transformation of $\mathbf{X}$. Below, we show that the rotations described above are a linear transformation. 
To see this, note that for any row $i$, the first entry was: $$z_{i,1} = a_{1,1} x_{i,1} + a_{2,1} x_{i,2}$$ @@ -183,7 +183,7 @@ x_1\\x_2 \end{pmatrix} $$ -An advantage of using linear algebra is that we can write the transformation for the entire dataset by saving all observations in a $N \times 2$ matrix +An advantage of using linear algebra is that we can write the transformation for the entire dataset by saving all observations in a $N \times 2$ matrix: $$ \mathbf{X} \equiv @@ -199,7 +199,7 @@ x_{n,1}&x_{n,2} \end{bmatrix} $$ -We can then obtained the rotated values $\mathbf{z}_i$ for each row $i$ by applying a *linear transformation* of $X$: +We can then obtain the rotated values $\mathbf{z}_i$ for each row $i$ by applying a *linear transformation* of $X$: $$ \mathbf{Z} = \mathbf{X} \mathbf{A} @@ -217,7 +217,7 @@ a_{2,1}&a_{2,2} $$ -If we define +If we define: ```{r} theta <- 2*pi*-45/360 #convert to radians @@ -234,11 +234,11 @@ rotate <- function(x, theta){ } ``` -The columns of $\mathbf{A}$ are referred to as _directions_ because if we draw a vector from $(0,0)$ to $(a_{1,j}, a_{2,j})$ it points in the direction of the line that will become the $j-th$ dimension. +The columns of $\mathbf{A}$ are referred to as _directions_ because if we draw a vector from $(0,0)$ to $(a_{1,j}, a_{2,j})$, it points in the direction of the line that will become the $j-th$ dimension. -Another advantage of linear algebra is that if we can find the inverse matrix of $\mathbf{A}^\top$ we can convert $\mathbf{Z}$ back to $\mathbf{X}$ again using a linear transformation. +Another advantage of linear algebra is that if we can find the inverse matrix of $\mathbf{A}^\top$, we can convert $\mathbf{Z}$ back to $\mathbf{X}$ again using a linear transformation. -In this particular case we can use trigonometry to show that +In this particular case, we can use trigonometry to show that: $$ x_{i,1} = b_{1,1} z_{i,1} + b_{2,1} z_{i,2}\\ @@ -247,7 +247,7 @@ $$ with $b_{2,1} = \cos\theta$, $b_{2,1} = \sin\theta$, $b_{1,2} = -\sin\theta$, and $b_{2,2} = \cos\theta$. -This implies that +This implies that: $$ \mathbf{X} = \mathbf{Z} @@ -261,7 +261,7 @@ $$ \mathbf{Z} \mathbf{A}^\top = \mathbf{X} \mathbf{A}\mathbf{A}^\top\ = \mathbf{X} $$ -and therefore that $\mathbf{A}^\top$ is the inverse of $\mathbf{A}$. This also implies that all the information in $\mathbf{X}$ is included in the rotation $\mathbf{Z}$, and it can be retrieved via a linear transformation. A consequence is that for any rotation the distances are preserved. Here is an example for a 30 degree rotation, but it works for any angle: +and therefore that $\mathbf{A}^\top$ is the inverse of $\mathbf{A}$. This also implies that all the information in $\mathbf{X}$ is included in the rotation $\mathbf{Z}$, and it can be retrieved via a linear transformation. A consequence is that for any rotation the distances are preserved. Here is an example for a 30 degree rotation, although it works for any angle: ```{r} all.equal(as.matrix(dist(rotate(x, 30))), as.matrix(dist(x))) @@ -280,10 +280,10 @@ $$ with $\mathbf{z}_h$ and $\mathbf{z}_i$ the $p \times 1$ column vectors stored in the $h$-th and $i$-th rows of $\mathbf{X}$, respectively. ::: callout-note -Remember that we represent the rows of a matrix as column vectors. 
This explains why we use $\mathbf{A}$ when showing the multiplication for the matrix $\mathbf{Z}=\mathbf{X}\mathbf{A}$ but transpose the operation when showing the transformation for just one observation: $\mathbf{z}_i = \mathbf{A}^\top\mathbf{x}_i$ +Remember that we represent the rows of a matrix as column vectors. This explains why we use $\mathbf{A}$ when showing the multiplication for the matrix $\mathbf{Z}=\mathbf{X}\mathbf{A}$, but transpose the operation when showing the transformation for just one observation: $\mathbf{z}_i = \mathbf{A}^\top\mathbf{x}_i$ ::: -Using linear algebra, we can rewrite the quantity above as +Using linear algebra, we can rewrite the quantity above as: $$ ||\mathbf{z}_h - \mathbf{z}_i|| = @@ -291,9 +291,9 @@ $$ (\mathbf{x}_h - \mathbf{x}_i)^\top \mathbf{A} \mathbf{A}^\top (\mathbf{x}_h - \mathbf{x}_i) $$ -Note that if $\mathbf{A} \mathbf{A} ^\top= \mathbf{I}$ then the distance between the $h$th and $i$th rows is the same for the original and transformed data. +Note that if $\mathbf{A} \mathbf{A} ^\top= \mathbf{I}$, then the distance between the $h$th and $i$th rows is the same for the original and transformed data. -We refer to transformation with the property $\mathbf{A} \mathbf{A}^\top = \mathbf{I}$ as an *orthogonal transformations* and they are guaranteed to preserves the distance between any two points. +We refer to transformation with the property $\mathbf{A} \mathbf{A}^\top = \mathbf{I}$ as *orthogonal transformations*. These are guaranteed to preserve the distance between any two points. We previously demonstrated our rotation has this property. We can confirm using R: @@ -301,7 +301,7 @@ We previously demonstrated our rotation has this property. We can confirm using A %*% t(A) ``` -Notice that $\mathbf{A}$ being orthogonal also guarantees that the total sum of squares (TSS) of $\mathbf{X}$, defined as $\sum_{i=1}^n \sum_{j=1}^p x_{i,j}^2$ is equal to the total sum of squares of the rotation $\mathbf{Z} = \mathbf{X}\mathbf{A}^\top$. To show this notice that if we denote the rows of $\mathbf{Z}$ as $\mathbf{z}_1, \dots, \mathbf{z}_n$, then sum of squares can be written as: +Notice that $\mathbf{A}$ being orthogonal also guarantees that the total sum of squares (TSS) of $\mathbf{X}$, defined as $\sum_{i=1}^n \sum_{j=1}^p x_{i,j}^2$ is equal to the total sum of squares of the rotation $\mathbf{Z} = \mathbf{X}\mathbf{A}^\top$. To illustrate, observe that if we denote the rows of $\mathbf{Z}$ as $\mathbf{z}_1, \dots, \mathbf{z}_n$, then sum of squares can be written as: $$ \sum_{1=1}^n ||\mathbf{z}_i||^2 = \sum_{i=1}^n ||\mathbf{A}^\top\mathbf{x}_i||^2 = \sum_{i=1}^n \mathbf{x}_i^\top \mathbf{A}\mathbf{A}^\top \mathbf{x}_i = \sum_{i=1}^n \mathbf{x}_i^\top\mathbf{x}_i = \sum_{i=1}^n||\mathbf{x}_i||^2 @@ -316,25 +316,25 @@ sum(x^2) sum(z^2) ``` -This can be interpreted as a consequence of the fact that orthogonal transformation guarantee that all the information is preserved. +This can be interpreted as a consequence of the fact that an orthogonal transformation guarantees that all the information is preserved. -However, although the total is preserved, the sum of squares for the individual columns changes. Here we compute the proportion of TSS attributed to each column, referred to as the *variance explained* or *variance captured* by each column, for $\mathbf{X}$ +However, although the total is preserved, the sum of squares for the individual columns changes. 
Here we compute the proportion of TSS attributed to each column, referred to as the *variance explained* or *variance captured* by each column, for $\mathbf{X}$:

```{r}
colSums(x^2)/sum(x^2)
```

-and $\mathbf{Z}$
+and $\mathbf{Z}$:

```{r}
colSums(z^2)/sum(z^2)
```

-In the next section we describe how this last mathematical result can be useful.
+In the next section, we describe how this last mathematical result can be useful.

## Principal Component Analysis (PCA) {#sec-pca}

-We have established that orthogonal transformations preserve the the distance between observations and the total sum of squares. We have also established that, while the TSS remains the same, the way this total is distributed across the columns can change.
+We have established that orthogonal transformations preserve the distance between observations and the total sum of squares. We have also established that, while the TSS remains the same, the way this total is distributed across the columns can change.

The general idea behind Principal Component Analysis (PCA) is to try to find orthogonal transformations that concentrate the variance explained in the first few columns. We can then focus on these few columns, effectively reducing the dimension of the problem. In our specific example, we are looking for the rotation that maximizes the variance explained in the first column. The following code performs a grid search across rotations from -90 to 0:

@@ -353,7 +353,7 @@ variance_explained <- v[1,]/sum(x^2)
plot(angles, variance_explained, type = "l")
```

-We find that a -45 degree rotation appears to achieve the maximum, with over 98% of the total variability explained by the first dimension. We denote this rotation matrix with $\mathbf{V}$:
+We find that a -45 degree rotation appears to achieve the maximum, with over 98% of the total variability explained by the first dimension. We denote this rotation matrix with $\mathbf{V}$:

```{r}
theta <- 2*pi*-45/360 #convert to radians
V <- matrix(c(cos(theta), -sin(theta), sin(theta), cos(theta)), 2, 2)
```

-We can rotate the entire dataset using
+We can rotate the entire dataset using:

$$
\mathbf{Z} = \mathbf{X}\mathbf{V}
$$

@@ -457,7 +457,7 @@ hist(x[,2], breaks = seq(-4,4,0.5))
hist(z[,1], breaks = seq(-4,4,0.5))
```

-We can visualize these to see how the first component summarizes the data. In the plot below red represents high values and blue negative values:
+We can visualize these to see how the first component summarizes the data. In the plot below, red represents high values and blue negative values:

```{r illustrate-pca-twin-heights, echo=FALSE, height = 5, out.width="70%"}
illustrate_pca <- function(x, flip=1,
@@ -499,7 +499,7 @@ illustrate_pca(x, flip = -1)
```

-This idea generalizes to dimensions higher than 2. As we did in our two dimensional example, we start by finding the $p\times1$ vector $\mathbf{v}_1$ with$||\mathbf{v}_1||=1$ that maximizes $||\mathbf{X} \mathbf{v}_1||$. $\mathbf{X} \mathbf{a}_1$ is the first PC. To find the second PC, we subtract the variation explained by first PC from $\mathbf{X}$:
+This idea generalizes to dimensions higher than 2. As in our two dimensional example, we start by finding the $p\times 1$ vector $\mathbf{v}_1$ with $||\mathbf{v}_1||=1$ that maximizes $||\mathbf{X} \mathbf{v}_1||$. $\mathbf{X} \mathbf{v}_1$ is the first PC. 
To find the second PC, we subtract the variation explained by the first PC from $\mathbf{X}$:

$$
\mathbf{r} = \mathbf{X} - \mathbf{X} \mathbf{v}_1 \mathbf{v}_1^\top
$$

@@ -515,25 +515,25 @@ $$
\mathbf{Z} = \mathbf{X}\mathbf{V}
$$

-The ideas of distance preervation extends to higher dimensions. For a multidimensional matrix with $p$ columns, the $\mathbf{A}$ transformation preserves distance between rows, but with the variance exaplined by the columns in decreasing order.
-If the variances of the columns $\mathbf{Z}_j$, $j>k$ are very small, these dimensions have little to contribute to the distance calculation and we can approximate distance between any two points with just $k$ dimensions. If $k$ is much smaller than $p$, then we can achieve a very efficient summary of our data.
+The idea of distance preservation extends to higher dimensions. For a multidimensional matrix with $p$ columns, the transformation $\mathbf{A}$ preserves the distance between rows, but with the variance explained by the columns in decreasing order.
+If the variances of the columns $\mathbf{Z}_j$, $j>k$ are very small, these dimensions have little to contribute to the distance calculation and we can approximate the distance between any two points with just $k$ dimensions. If $k$ is much smaller than $p$, then we can achieve a very efficient summary of our data.

:::{.callout-warning}
-Notice that the solution to this maximization problem is not unique because $||\mathbf{X} \mathbf{v}|| = ||-\mathbf{X} \mathbf{v}||$. Also note that if we multiply a column of $\mathbf{A}$ by $-1$ we still represent $\mathbf{X}$ as $\mathbf{Z}\mathbf{V}^\top$ as long as we also multiple the corresponsing column of $\matbf{V}$ by -1. This implies that sign of each column of the rotation $\mathbf{V}$ and principal component matrix $\mathbf{Z}$ is arbitrary.
+Notice that the solution to this maximization problem is not unique because $||\mathbf{X} \mathbf{v}|| = ||-\mathbf{X} \mathbf{v}||$. Also, note that if we multiply a column of $\mathbf{Z}$ by $-1$, we still represent $\mathbf{X}$ as $\mathbf{Z}\mathbf{V}^\top$ as long as we also multiply the corresponding column of $\mathbf{V}$ by $-1$. This implies that the sign of each column of the rotation $\mathbf{V}$ and principal component matrix $\mathbf{Z}$ is arbitrary.
:::

-In R we can find the principal components of any matrix with the function `prcomp`:
+In R, we can find the principal components of any matrix with the function `prcomp`:

```{r}
pca <- prcomp(x, center = FALSE)
```

-Note that default behavior is to center the columns of `x` before computing the PCs, an operation we don't need because our matrix is scaled.
+Keep in mind that the default behavior is to center the columns of `x` before computing the PCs, an operation we don't need because our matrix has already been centered and scaled.

The object `pca` includes the rotated data $\mathbf{Z}$ in `pca$x` and the rotation $\mathbf{V}$ in `pca$rotation`. 
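Since the rotation returned by `prcomp` is orthogonal, we can check it the same way we checked our hand-built rotation (a quick sketch, assuming the `pca` object computed above):

```{r}
#| eval: false
crossprod(pca$rotation)  # should be, up to numerical error, the 2 x 2 identity
```
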
-We can see that columns of the `pca$rotation` are indeed the rotation obtained with -45 (remember the sign is arbitrary)
+We can see that the columns of `pca$rotation` are indeed the rotation obtained with -45 degrees (remember the sign is arbitrary):

```{r}
pca$rotation
```

@@ -552,7 +552,7 @@ summary(pca)
```

-We also see that we can transform between `x` ($\mathbf{X}$) and `pca$x` ($\mathbf{Z}$) as explained with mathematical formulas above:
+We also see that we can transform between `x` ($\mathbf{X}$) and `pca$x` ($\mathbf{Z}$) using the linear transformations described above:

```{r}
all.equal(pca$x, x %*% pca$rotation)
```

@@ -570,9 +570,9 @@ The iris data is a widely used example in data analysis courses. It includes fou
names(iris)
```

-If you print `iris$Species` you will see that the data is ordered by the species.
+If you print `iris$Species`, you will see that the data is ordered by the species.

-If we visualize the distances we can clearly see the three species with one species very different from the other two:
+If we visualize the distances, we can clearly see the three species with one species very different from the other two:

```{r, eval=FALSE}
x <- iris[,1:4] |> as.matrix()
d <- dist(x)
@@ -593,14 +593,14 @@ Our features matrix has four dimensions, but three are very correlated:
cor(x)
```

-If we apply PCA, we should be able to approximate this distance with just two dimensions, compressing the highly correlated dimensions. Using the `summary` function we can see the variability explained by each PC:
+If we apply PCA, we should be able to approximate this distance with just two dimensions, compressing the highly correlated dimensions. Using the `summary` function, we can see the variability explained by each PC:

```{r}
pca <- prcomp(x)
summary(pca)
```

-The first two dimensions account for almot 98% of the variability. Thus we should be able to approximate the distance very well with two dimensions. We confirm this by computing the distance from first two dimensions and comparing to the original:
+The first two dimensions account for almost 98% of the variability. Thus, we should be able to approximate the distance very well with two dimensions. We confirm this by computing the distance from the first two dimensions and comparing it to the original:

```{r, eval = FALSE}
d_approx <- dist(pca$x[, 1:2])
@@ -623,23 +623,23 @@ data.frame(pca$x[,1:2], Species = iris$Species) |>
coord_fixed(ratio = 1)
```

-We color the the observation by their labels and notice that with these two dimensions we achieve almost perfect separation.
+We color the observations by their labels and notice that, with these two dimensions, we achieve almost perfect separation.

-Looking more closely at the resulting PCs and rotaions:
+Looking more closely at the resulting PCs and rotations:

```{r illustrate-pca-twin-heights-iris, echo=FALSE, fig.height = 6, out.width="70%"}
rafalib::mypar()
illustrate_pca(x)
```

-we learn that the first PC is obtained by taking a weighted avarge of sepal length, petal length, and petal width, since these are red in first column, and subtracts a weighted sepal width, since this is blue. The second PC is a weighted average of weighted average of petal length and petal width minus a weighted average of sepal length and petal width.
+we learn that the first PC is obtained by taking a weighted average of sepal length, petal length, and petal width (shown in red in the first column) and subtracting a weighted sepal width (shown in blue). 
The second PC is a weighted average of petal length and petal width, minus a weighted average of sepal length and petal width.

### MNIST example

The written digits example has 784 features. Is there any room for data reduction? We will use PCA to answer this.

-Let's load the data if not already loaded:
+If not already loaded, let's begin by loading the data:

```{r}
library(dslabs)
@@ -648,7 +648,7 @@ if (!exists("mnist")) mnist <- read_mnist()
```

Because the pixels are so small, we expect pixels close to each other on the grid to be correlated, meaning that dimension reduction should be possible.

-Let's compute the PCs. This will take a few seconds as it is a rather large matrix.
+Let's compute the PCs. This will take a few seconds as it is a rather large matrix:

```{r, cache=TRUE}
pca <- prcomp(mnist$train$images)
@@ -663,29 +663,30 @@ plot(pca$sdev^2/sum(pca$sdev^2), xlab = "PC", ylab = "Variance explained")
plot(pca$sdev^2/sum(pca$sdev^2), xlab = "PC", ylab = "Variance explained")
```

-We can see that the first few PCs already explain a large percent of the variability:
+We can see that the first few PCs already explain a large percent of the variability.

-And just by looking at the first two PCs we already see information about the labels. Here is a random sample of 500 digits:
+Simply by looking at the first two PCs, we already see information about the labels. Here is a random sample of 500 digits:

-```{r mnist-pca-1-2-scatter}
+```{r mnist-pca-1-2-scatter, echo=FALSE}
data.frame(PC1 = pca$x[,1], PC2 = pca$x[,2], label = factor(mnist$train$label)) |>
-  sample_n(500) |>
+  dplyr::sample_n(500) |>
  ggplot(aes(PC1, PC2, fill = label)) +
  geom_point(cex = 3, pch = 21)
```

-We can also *see* the linear combinations on the grid to get an idea of how pixels are getting to compute the first four principal components:
+We can also *see* the linear combinations on the grid to get an idea of how the pixels are weighted in each of the first four principal components:

```{r mnist-pca-1-4, echo = FALSE, out.width="100%", fig.width=6, fig.height=1.75}
library(RColorBrewer)
tmp <- lapply( c(1:4,781:784), function(i){
    expand.grid(Row = 1:28, Column = 1:28) |>
-      mutate(id = i, label = paste0("PC",i),
+      dplyr::mutate(id = i, label = paste0("PC",i),
             value = pca$rotation[,i])
})
tmp <- Reduce(rbind, tmp)
-tmp |> filter(id <= 4) |>
+tmp |>
+  dplyr::filter(id <= 4) |>
  ggplot(aes(Row, Column, fill = value)) +
  geom_raster() +
  scale_y_reverse() +
@@ -693,7 +694,7 @@ tmp |> filter(id <= 4) |>
  facet_wrap(~label, nrow = 1)
```

-We can clearly see that first PC appears to be separating the 1s (red) from the 0s (blue). We can kind of make out numbers in the other three PCs as well. By looking at the PCs stratified by digit we get futher insights. For example, we see that the second PC separates 4s, 7s, and 9s from the rest:
+We can clearly see that the first PC appears to be separating the 1s (red) from the 0s (blue). We can vaguely discern numbers in the other three PCs as well. By looking at the PCs stratified by digit, we get further insights; one way to produce such a stratified view is sketched below. 
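A stratified view can be produced with a simple boxplot (a sketch using base R and the `pca` object computed above; the figure that follows shows this kind of stratification):

```{r}
#| eval: false
# Distribution of the second PC for each digit class
boxplot(pca$x[,2] ~ factor(mnist$train$labels), xlab = "digit", ylab = "PC2")
```
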
For example, we see that the second PC separates 4s, 7s, and 9s from the rest: ```{r digit-pc-boxplot} #| echo: false @@ -705,7 +706,7 @@ data.frame(label = factor(mnist$train$labels), PC2 = pca$x[,2]) |> We can also confirm that the lower variance PCs appear related to unimportant variability, mainly smudges in the corners: ```{r mnist-pca-last,, echo = FALSE, out.width="100%", fig.width=6, fig.height=1.75} -tmp |> filter(id > 5) |> +tmp |> dplyr::filter(id > 5) |> ggplot(aes(Row, Column, fill = value)) + geom_raster() + scale_y_reverse() + @@ -721,11 +722,11 @@ tmp |> filter(id > 5) |> dim(tissue_gene_expression$x) ``` -We want to get an idea of which observations are close to each other, but the predictors are 500-dimensional so plotting is difficult. Plot the first two principal components with color representing tissue type. +We hope to get an idea of which observations are close to each other, but the predictors are 500-dimensional so plotting is difficult. Plot the first two principal components with color representing tissue type. 2\. The predictors for each observation are measured on the same measurement device (a gene expression microarray) after an experimental procedure. A different device and procedure is used for each observation. This may introduce biases that affect all predictors for each observation in the same way. To explore the effect of this potential bias, for each observation, compute the average across all predictors and then plot this against the first PC with color representing tissue. Report the correlation. -3\. We see an association with the first PC and the observation averages. Redo the PCA but only after removing the center. +3\. We see an association with the first PC and the observation averages. Redo the PCA, but only after removing the center. 4\. For the first 10 PCs, make a boxplot showing the values for each tissue. diff --git a/highdim/intro-highdim.qmd b/highdim/intro-highdim.qmd index 4d45215..fab1b1a 100644 --- a/highdim/intro-highdim.qmd +++ b/highdim/intro-highdim.qmd @@ -1,8 +1,8 @@ # High dimensional data {.unnumbered} -There is a variety of computational techniques and statistical concepts that are useful for analysis of datasets for which each observation is associated with a large number of numerical variables. In this chapter we provide a basic introduction to these techniques and concepts by describing matrix operations in R, dimension reduction, regularization, and matrix factorization. Handwritten digits data and movie recommendation systems serve as motivating examples. +There is a variety of computational techniques and statistical concepts that are useful for analysis of datasets for which each observation is associated with a large number of numerical variables. In this chapter, we provide a basic introduction to these techniques and concepts by describing matrix operations in R, dimension reduction, regularization, and matrix factorization. Handwritten digits data and movie recommendation systems serve as motivating examples. -A task that serves as motivation for this part of the book is quantifying the similarity between any two observations. For example, we might want to know how much two handwritten digits look like each other. However, note that each observations is associated with $28 \times 28 = 784$ pixels so we can't simply use subtraction as we would do if our data was one dimensional. +A task that serves as motivation for this part of the book is quantifying the similarity between any two observations. 
For example, we might want to know how much two handwritten digits look like each other. However, note that each observation is associated with $28 \times 28 = 784$ pixels so we can't simply use subtraction as we would if our data was one dimensional.

Instead, we will define observations as *points* in a *high-dimensional* space and mathematically define a *distance*. Many machine learning techniques, discussed in the next part of the book, require this calculation.

-Additionally, this part of the book discusses dimension reduction. Here we search of data summaries that result in more manageable lower dimension versions of the data, but preserve most or all the *information* we need. Here too we can use distance between observations as specific challenge: we will reduce the dimensions summarize the data into lower dimensions, but in a way that preserves the distance between any two observations. We use *linear algebra* as a mathematical foundation for all the techniques presented here.
+Additionally, this part of the book discusses dimension reduction. Here we search for data summaries that result in more manageable, lower-dimensional versions of the data, but that preserve most or all of the *information* we need. Here too, distance between observations provides a specific challenge: we will summarize the data in fewer dimensions, but in a way that preserves the distance between any two observations. We use *linear algebra* as a mathematical foundation for all the techniques presented here.

diff --git a/highdim/linear-algebra.qmd b/highdim/linear-algebra.qmd
index 863b41c..09a7c95 100644
--- a/highdim/linear-algebra.qmd
+++ b/highdim/linear-algebra.qmd
@@ -12,9 +12,9 @@ Linear algebra is the main mathematical technique used to describe and motivate

## Matrix multiplication

-A commonly used operation in data analysis is matrix multiplication. Here we define and motivate the operation.
+A commonly used operation in data analysis is matrix multiplication. Here, we define and motivate the operation.

-Linear algebra was born from mathematicians developing systematic ways to solve systems of linear equations, for example
+Linear algebra originated from mathematicians developing systematic ways to solve systems of linear equations. For example:

$$
\begin{align}
@@ -26,7 +26,7 @@ $$

Mathematicians figured out that by representing these linear systems of equations using matrices and vectors, predefined algorithms could be designed to solve any system of linear equations. A basic linear algebra class will teach some of these algorithms, such as Gaussian elimination, the Gauss-Jordan elimination, and the LU and QR decompositions. These methods are usually covered in detail in university level linear algebra courses.

-To explain matrix multiplication, define two matrices $\mathbf{A}$ and $\mathbf{B}$
+To explain matrix multiplication, define two matrices: $\mathbf{A}$ and $\mathbf{B}$

$$
\mathbf{A} =
@@ -44,7 +44,7 @@ b_{n1}&b_{n2}&\dots&b_{np}
\end{pmatrix}
$$

-and define the product of matrices $\mathbf{A}$ and $\mathbf{B}$ as the matrix $\mathbf{C} = \mathbf{A}\mathbf{B}$ that has entries $c_{ij}$ equal to the sum of the component-wise product of the $i$th row of $\mathbf{A}$ with the $j$th column of $\mathbf{B}$. 
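Stated as a formula (with $n$ denoting the number of columns of $\mathbf{A}$, which must match the number of rows of $\mathbf{B}$), each entry of the product is:

$$
c_{ij} = \sum_{k=1}^n a_{ik} b_{kj}
$$
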
Using R code we can define $\mathbf{C}= \mathbf{A}\mathbf{B}$ as follows: +and define the product of matrices $\mathbf{A}$ and $\mathbf{B}$ as the matrix $\mathbf{C} = \mathbf{A}\mathbf{B}$ that has entries $c_{ij}$ equal to the sum of the component-wise product of the $i$th row of $\mathbf{A}$ with the $j$th column of $\mathbf{B}$. Using R code, we can define $\mathbf{C}= \mathbf{A}\mathbf{B}$ as follows: ```{r, eval=FALSE} m <- nrow(A) @@ -123,7 +123,7 @@ x_n \end{pmatrix} $$ -and rewriting the equation simply as +and rewriting the equation simply as: $$ \mathbf{A}\mathbf{x} = \mathbf{b} @@ -144,16 +144,17 @@ solve(A, b) ``` :::{.callout-note} -The function `solve` works well when dealing with small to medium-sized matrices with a similar range for each column and not too many 0s. The function `qr.solve` can be used when this is not the case. +The function `solve` works well when dealing with small to medium-sized matrices with a similar range for each column and not too many 0s. The function `qr.solve` can be used when this is not the case. ::: ## The identity matrix -The identity matrix, represented with a bold $\mathbf{I}$, is like the number 1, but for matrices: if you multiply a matrix by the identity matrix, you get back the matrix. +FIX The identity matrix, represented with a bold $\mathbf{I}$, is like the number 1, but for matrices: if you multiply a matrix by the identity matrix, you get back the matrix. +FIX: $$ \mathbf{I}\mathbf{x} = \mathbf{x} -$$ If you do some math with the definition of matrix multiplication you will realize that $\mathbf{1}$ is a matrix with the same number of rows and columns (refereed to as square matrix) with 0s everywhere except the diagonal: +$$ If you do some math with the definition of matrix multiplication, you will realize that $\mathbf{1}$ is a matrix with the same number of rows and columns (referred to as square matrix) with 0s everywhere except the diagonal: $$ \mathbf{I}=\begin{pmatrix} @@ -162,7 +163,7 @@ $$ \vdots&\vdots&\ddots&\vdots\\ 0&0&\dots&1 \end{pmatrix} -$$ It also implies that due to the definition of an inverse matrix we have +$$ It also implies that, due to the definition of an inverse matrix, we have: $$ \mathbf{A}^{-1}\mathbf{A} = \mathbf{1} @@ -178,7 +179,7 @@ solve(A) %*% b Many of the analyses we perform with high-dimensional data relate directly or indirectly to distance. For example, most machine learning techniques rely on being able to define distances between observations, using features or predictors. Clustering algorithms, for example, search of observations that are *similar*. But what does this mean mathematically? -To define distance, we introduce another linear algebra concept: the *norm*. Recall that a point in two dimensions can represented in polar coordinates as: +To define distance, we introduce another linear algebra concept: the *norm*. Recall that a point in two dimensions can be represented in polar coordinates as: ```{r, echo=FALSE, fig.asp=0.7} draw.circle <- function(angle, start = 0, center = c(0,0), r = 0.25){ @@ -200,7 +201,7 @@ text(cos(theta), sin(theta), expression('(' * x[1] * ',' * x[2] * ') = (' * pha draw.circle(theta) ``` -with $\theta = \arctan{\frac{x2}{x1}}$ and $r = \sqrt{x_1^2 + x_2^2}$. If we think of the point as two dimensional column vector $\mathbf{x} = (x_1, x_2)^\top$, $r$ defines the norm of $\mathbf{x}$. The norm can be thought of as the *size* of the two-dimensional vector disregarding the direction: if we change the angle, the vector changes but the size does not. 
The point of defining the norm is that we can extrapolated the concept of *size* to higher dimensions. Specifically, we write the norm for any vector $\mathbf{x}$ as: +with $\theta = \arctan{\frac{x2}{x1}}$ and $r = \sqrt{x_1^2 + x_2^2}$. If we think of the point as two dimensional column vector $\mathbf{x} = (x_1, x_2)^\top$, $r$ defines the norm of $\mathbf{x}$. The norm can be thought of as the *size* of the two-dimensional vector disregarding the direction: if we change the angle, the vector changes but the size does not. The point of defining the norm is that we can extrapolate the concept of *size* to higher dimensions. Specifically, we write the norm for any vector $\mathbf{x}$ as: $$ ||\mathbf{x}|| = \sqrt{x_1^2 + x_2^2 + \dots + x_p^2} @@ -212,7 +213,7 @@ $$ ||\mathbf{x}||^2 = \mathbf{x}^\top\mathbf{x} $$ -To define distance, suppose we have two two-dimensional points $\mathbf{x}_1$ and $\mathbf{x}_2$. We can define how similar they are by simply using euclidean distance. +To define distance, suppose we have two two-dimensional points: $\mathbf{x}_1$ and $\mathbf{x}_2$. We can define how similar they are by simply using euclidean distance: ```{r, echo=FALSE, fig.asp=0.7} rafalib::mypar() @@ -262,15 +263,15 @@ We can compute the distances between each pair using the definitions we just lea c(sum((x_1 - x_2)^2), sum((x_1 - x_3)^2), sum((x_2 - x_3)^2)) |> sqrt() ``` -In R, the function `crossprod(x)` is convenient for computing norms it multiplies `t(x)` by `x` +In R, the function `crossprod(x)` is convenient for computing norms. It multiplies `t(x)` by `x`: ```{r} c(crossprod(x_1 - x_2), crossprod(x_1 - x_3), crossprod(x_2 - x_3)) |> sqrt() ``` -Note `crossprod` takes a matrix as the first argument and therefore the vectors used here are being coerced into single column matrices. Also note that `crossprod(x,y)` multiples `t(x)` by `y`. +Note `crossprod` takes a matrix as the first argument. As a result, the vectors used here are being coerced into single column matrices. Also, note that `crossprod(x,y)` multiples `t(x)` by `y`. -We can see that the distance is smaller between the first two. This is in agreement with the fact that the first two are 2s and the third is a 7. +We can see that the distance is smaller between the first two. This agrees with the fact that the first two are 2s and the third is a 7. ```{r} y[c(6, 17, 16)] @@ -289,7 +290,7 @@ There are several machine learning related functions in R that take objects of c d ``` -We can quickly see an image of the distances between observations using this function. As an example, we compute the distance between each of the first 300 observations and then make an image: +FIX [alternate for sent that follows: The ?? function allows us to quickly see an image of distances between observations.] We can quickly see an image of the distances between observations using this function. As an example, we compute the distance between each of the first 300 observations and then make an image: ```{r distance-image, fig.width = 4, fig.height = 4, eval=FALSE} d <- dist(x[1:300,]) @@ -318,7 +319,7 @@ image(as.matrix(d)[order(y[1:300]), order(y[1:300])]) We can think of all predictors $(x_{i,1}, \dots, x_{i,p})^\top$ for all observations $i=1,\dots,n$ as $n$ $p$-dimensional points. A *space* can be thought of as the collection of all possible points that should be considered for the data analysis in question. This includes points we could see, but have not been observed yet. 
In the case of the handwritten digits, we can think of the predictor space as any point $(x_{1}, \dots, x_{p})^\top$ as long as each entry $x_i, \, i = 1, \dots, p$ is between 0 and 255. -Some Machine Learning algorithms also define subspaces. A common approach is to define neighborhoods of points that are close to a *center*. We can do this by selecting a center $\mathbf{x}_0$, a minimum distance $r$, and defining the subspace as the collection of points $\mathbf{x}$ that satisfy +FIX Some Machine Learning algorithms also define subspaces. A common approach is to define neighborhoods of points that are close to a *center*. We can do this by selecting a center $\mathbf{x}_0$, a minimum distance $r$, and defining the subspace as the collection of points $\mathbf{x}$ that satisfy: $$ || \mathbf{x} - \mathbf{x}_0 || \leq r. @@ -330,7 +331,7 @@ Other machine learning algorithms partition the predictor space into non-overlap ## Exercises -1\. Generate two matrix, `A` and `B`, containing randomly generated and normally distributed numbers. The dimensions of these two matrices should $4 \times 3$ and $3 \times 6$, respectively. Confirm that `C <- A %*% B` produce the same results as: +1\. Generate two matrix, `A` and `B`, containing randomly generated and normally distributed numbers. The dimensions of these two matrices should be $4 \times 3$ and $3 \times 6$, respectively. Confirm that `C <- A %*% B` produces the same results as: ```{r, eval=FALSE} m <- nrow(A) @@ -354,7 +355,7 @@ x + y + z + w &= 10\\ \end{align} $$ -3\. Define `x` +3\. Define `x`: ```{r} #| eval: false @@ -364,7 +365,7 @@ x <- mnist$train$images[1:300,] y <- mnist$train$labels[1:300] ``` -and compute the distance matrix +and compute the distance matrix: ```{r} #| eval: false @@ -373,8 +374,8 @@ d <- dist(x) class(d) ``` -Generate a boxplot showing the distances for the second row of `d` stratified by digits. Do not include the distance to itself which we know it is 0. Can you predict what digit is represented by the second row of `x`? +Generate a boxplot showing the distances for the second row of `d` stratified by digits. Do not include the distance to itself, which we know is 0. Can you predict what digit is represented by the second row of `x`? -4\. Use the `apply` function and matrix algebra to compute the distance between the second digit `mnist$train$images[4,]` and all other digits represented in `mnist$train$images`. Then generate as boxplot as in exercise 2 and predict what digit is the fourth row. +4\. Use the `apply` function and matrix algebra to compute the distance between the second digit `mnist$train$images[4,]` and all other digits represented in `mnist$train$images`. Then generate a boxplot as in exercise 2 and predict what digit is the fourth row. 5\. Compute the distance between each feature and the feature representing the middle pixel (row 14 column 14). Create an image plot of where the distance is shown with color in the pixel position. diff --git a/highdim/matrices-in-R.qmd b/highdim/matrices-in-R.qmd index b5649ff..2c7de01 100644 --- a/highdim/matrices-in-R.qmd +++ b/highdim/matrices-in-R.qmd @@ -1,16 +1,16 @@ # Matrices in R -When the number of variables associated with each observation is large and they can all be represented as a number, it is often more convenient to store them in a matrix and perform the analysis with linear algebra operations, rather than storing in a data frame and performing the analysis with **tidyverse** or **data.table** functions. 
With matrices, variables for each observation are stored in a row, resulting in a matrix with as many columns as variables. In statistics we refer to values represented in the rows of the matrix as the *covariates* or *pedictors* and in machine learning we refer to them as the *features*. +When the number of variables associated with each observation is large and they can all be represented as a number, it is often more convenient to store them in a matrix and perform the analysis with linear algebra operations, rather than storing them in a data frame and performing the analysis with **tidyverse** or **data.table** functions. With matrices, variables for each observation are stored in a row, resulting in a matrix with as many columns as variables. In statistics, we refer to values represented in the rows of the matrix as the *covariates* or *pedictors* and, in machine learning, we refer to them as the *features*. -In linear algebra we have three types of objects: scalars, vectors, and matrices. We have already learned about vectors in R, and, although there is no data type for scalars, we can represent them as vectors of length 1. In this chapter we learn how to work with matrices in R and relate them to linear algebra notation and concepts. +In linear algebra, we have three types of objects: scalars, vectors, and matrices. We have already learned about vectors in R, and, although there is no data type for scalars, we can represent them as vectors of length 1. In this chapter, we learn how to work with matrices in R and relate them to linear algebra notation and concepts. ## Case study: MNIST {#sec-mnist} -An example comes from handwritten digits. The first step in handling mail received in the post office is sorting letters by zip code: +FIX An example comes from handwritten digits. The first step in handling mail received in the post office is to sort letters by zip code: ![](../ml/img/how-to-write-a-address-on-an-envelope-how-to-write-the-address-on-an-envelope-write-address-on-envelope-india-finishedenvelope-x69070.png){width="40%" style="display:block; margin:auto;"} -In the Machine Learning part of this book we will describe how we can build computer algorithms to read handwritten digits, which robots then use to sort the letters. To build these algorithms, we first need to collect data, which in this case is a high-dimensional dataset. +In the Machine Learning part of this book, we will describe how we can build computer algorithms to read handwritten digits, which robots then use to sort the letters. To do this, we first need to collect data, which in this case is a high-dimensional dataset. The MNIST dataset was generated by digitizing thousands of handwritten digits, already read and annotated by humans[^matrices-in-r-1]. Below are three images of written digits. @@ -43,7 +43,7 @@ tmp |> ggplot(aes(Row, Column, fill = value)) + facet_grid(.~label) ``` -For each digitized image, indexed by $i$, we are provided 784 variables and a categorical outcome, or *label*, representing which digit among $0, 1, 2, 3, 4, 5, 6, 7 , 8,$ and $9$ the image is representing. Let's load the data using the **dslabs** package: +For each digitized image, indexed by $i$, we are provided with 784 variables and a categorical outcome, or *label*, representing the digit among $0, 1, 2, 3, 4, 5, 6, 7 , 8,$ and $9$ that the image is representing. 
Let's load the data using the **dslabs** package: ```{r echo=FALSE} library(tidyverse) @@ -73,15 +73,15 @@ table(mnist$train$labels) To motivate the use of matrices in R, we will pose six tasks related to the handwritten digits data and then show the fast and simple code that solves them. -1\. Visualize the original image. The pixel intensities are provided as rows in a matrix. We will show how to conver these to a matrix that we can visualize. +1\. Visualize the original image. The pixel intensities are provided as rows in a matrix. We will show how to convert these to a matrix that we can visualize. 2\. Do some digits require more ink to write than others? We will study the distribution of the total pixel darkness and how it varies by digits. 3\. Are some pixels uninformative? We will study the variation of each pixel across digits and remove predictors (columns) associated with pixels that don't change much and thus can't provide much information for classification. -4\. Can we remove smudges? We will first, look at the distribution of all pixel values. Then we will use this to pick a cutoff to define unwritten space. Then, set anything below that cutoff to 0. +4\. Can we remove smudges? We will first look at the distribution of all pixel values. Next, we will use this to pick a cutoff to define unwritten space. Then, we set anything below that cutoff to 0. -5\. Binarize the data. First, we will look at the distribution of all pixel values. We will then use this to pick a cutoff to distinguish between writing and no writing. Then, we will convert all entries into either 1 or 0. +5\. Binarize the data. First, we will look at the distribution of all pixel values. We will then use this to pick a cutoff to distinguish between writing and no writing. Afterward, we will convert all entries into either 1 or 0. 6\. Standardize the digits. We will scale each of the predictors in each entry to have the same average and standard deviation. @@ -98,9 +98,9 @@ y <- mnist$train$labels ## Dimensions of a matrix -The dimension of a matrix isan important characteristic needed to assure that certain linear algebra operations can be performed. The dimension, is a two-number summary defined as the number of rows $\times$ the number of columns. +The dimension of a matrix is an important characteristic needed to assure that certain linear algebra operations can be performed. The dimension is a two-number summary defined as the number of rows $\times$ the number of columns. -The `nrow` function tells us how how many rows tha matrix has: +The `nrow` function tells us how many rows that matrix has: ```{r} nrow(x) @@ -121,9 +121,9 @@ dim(x) ``` -## Creating a a matrix +## Creating a matrix -In R we can create a matrix using the `matrix` function. The first argument is a vector containing the elements that will fill up the matrix. The second and third arguments determine the number of row and columns, respectively. So a typical way to create a matrix is to first obtain a vector of numbers containing the elements of the matrix and feeding it to the `matrix` function. For example, to create a $100 \times 2$ matrix of normally distributed random variables we write: +In R, we can create a matrix using the `matrix` function. The first argument is a vector containing the elements that will fill up the matrix. The second and third arguments determine the number of row and columns, respectively. 
So a typical way to create a matrix is to first obtain a vector of numbers containing the elements of the matrix and feeding it to the `matrix` function. For example, to create a $100 \times 2$ matrix of normally distributed random variables, we write: ```{r} z <- matrix(rnorm(100*2), 100, 2) @@ -136,7 +136,7 @@ Note that by default the matrix is filled in column by column: matrix(1:15, 3, 5) ``` -To fill the matrix row by row we can use `byrow` argument: +To fill the matrix row by row, we can use the `byrow` argument: ```{r} matrix(1:15, 3, 5, byrow = TRUE) @@ -150,7 +150,7 @@ as.vector(matrix(1:15, 3, 5)) :::{callout-warning} -If the product of columns and rows does not match the length of the vector provided in the first argument, `matrix` recycles values. If the length of the vector is a sub-multiple or multiple of the number of rows this happens **without warning**: +If the product of columns and rows does not match the length of the vector provided in the first argument, `matrix` recycles values. If the length of the vector is a sub-multiple or multiple of the number of rows, this happens **without warning**: ```{r} matrix(1:3, 3, 5) @@ -159,20 +159,20 @@ matrix(1:3, 3, 5) ## Subsetting -We can extract a specific entry from matrix, for example the 300th row and 100th column, we use write: +FIX To extract a specific entry from a matrix, for example the 300th row and 100th column, we use write: ```{r, eval=FALSE} x[300,100] ``` -We can extract subsets of the matrices by using vectors of indexes. As an example, we can extract the first 100 pixles from the first 300 observations like this: +We can extract subsets of the matrices by using vectors of indexes. FIX For example, we can extract the first 100 pixels from the first 300 observations like this: and rows like this: ```{r, eval=FALSE} x[1:300,1:100] ``` -To extract an entire row or subset of rows, we leave the column dimension blank. So the following code returns all the pixes for the first 300 observations: +To extract an entire row or subset of rows, we leave the column dimension blank. So the following code returns all the pixels for the first 300 observations: ```{r, eval=FALSE} x[1:300,] @@ -186,13 +186,13 @@ x[,1:100] ``` :::{.callout-warning} -If we subset just one row or just one column, the resulting object is no longer a matrix. Here is an example: +If we subset just one row or just one column, the resulting object is no longer a matrix. For example: ```{r} dim(x[300,]) ``` -To avoid this we can use the `drop` argument: +To avoid this, we can use the `drop` argument: ```{r} dim(x[100,,drop = FALSE]) @@ -201,19 +201,19 @@ dim(x[100,,drop = FALSE]) ### Task 1: Visualize the original image {.unnumbered} -As an example, let's try to visualize the third observation. From the label we know this is a: +For instance, let's try to visualize the third observation. From the label, we know this is a: ```{r} mnist$train$label[3] ``` -The third row of the matrix `x[3,]` contains the 784 pixels intesities. We can assume these were entered in order and convert them back to a $28 \times 28$ matrix using: +The third row of the matrix `x[3,]` contains the 784 pixel intensities. We can assume these were entered in order and convert them back to a $28 \times 28$ matrix using: ```{r} grid <- matrix(x[3,], 28, 28) ``` -To visualize the data we can use `image`, which shows an image of its third argument, with the first two arguments to determine the position on the x and y axes, respectively. 
Because the top of this plot is pixel 1, which is shown at the bottom, the image is flipped. To code below includes code showing how to flip it back: +FIX To visualize the data, we can use `image`, which shows an image of its third argument, with the first two arguments to determine the position on the x and y axes, respectively. Because the top of this plot is pixel 1, which is shown at the bottom, the image is flipped. FIX To code below includes code showing how to flip it back: ```{r, eval=FALSE} image(1:28, 1:28, grid) @@ -263,7 +263,7 @@ x_p \end{bmatrix} $$ -To distinguish between features associated with the observations $i=1,\dots,n$ we add an index: +To distinguish between features associated with the observations $i=1,\dots,n$, we add an index: $$ \mathbf{x}_i = \begin{bmatrix} @@ -275,7 +275,7 @@ x_{i,p} $$ ::: {callout-warning} -Bold lower case letter are also commonly used to represent matrix columns rather than rows. This can be confusing because $\mathbf{x}_1$ can represent either the first row or the first column of $\mathbf{X}$. One way to distinguish is using notation similar to computer code: using the colon $:$ to represent *all*. So $\mathbf{X}_{1,:}$ is a row, the first row and all the columns, and $\mathbf{X}_{:,1}$ is a column, the first column and all the rows. Another approach is to distinguish by the letter used to index, with $i$ used for rows and $j$ used for columns. So $\mathbf{x}_i$ is the $i$th row and $\mathbf{x}_j$ is the $j$th column. With this approach it is important to clarify which dimension, row or column, is being represented. Further confusion can arise because, as discussed, it is common to represent all vectors as 1 column matrices, including the rows of a matrix. +Bold lower case letters are also commonly used to represent matrix columns rather than rows. This can be confusing because $\mathbf{x}_1$ can represent either the first row or the first column of $\mathbf{X}$. One way to distinguish is to use notation similar to computer code: using the colon $:$ to represent *all*. FIX So $\mathbf{X}_{1,:}$ is a row, the first row and all the columns, and $\mathbf{X}_{:,1}$ is a column, the first column and all the rows. Another approach is to distinguish by the letter used to index, with $i$ used for rows and $j$ used for columns. So $\mathbf{x}_i$ is the $i$th row and $\mathbf{x}_j$ is the $j$th column. With this approach, it is important to clarify which dimension, row or column is being represented. Further confusion can arise because, as aforementioned, it is common to represent all vectors as one column matrices, including the rows of a matrix. ::: ## The transpose @@ -326,21 +326,21 @@ $$ A common operation with matrices is to apply the same function to each row or to each column. For example, we may want to compute row averages and standard deviations. The `apply` function lets you do this. The first argument is the matrix, the second is the dimension, 1 for rows, 2 for columns, and the third is the function to be applied. -So, for example, to compute the averages and standard deviations of each row we write: +So, for example, to compute the averages and standard deviations of each row, we write: ```{r} avgs <- apply(x, 1, mean) sds <- apply(x, 1, sd) ``` -To compute these for the columns we simply change the 1 to a 2: +To compute these for the columns, we simply change the 1 to a 2: ```{r} avgs <- apply(x, 1, mean) sds <- apply(x, 1, sd) ``` -Because these operations are so common, special functions are available to perform them. 
So, for example, the functions `rowMeans` computes the average of each row
+Because these operations are so common, special functions are available to perform them. So, for example, the function `rowMeans` computes the average of each row:
```{r, message=FALSE}
avg <- rowMeans(x)
@@ -353,7 +353,7 @@ library(matrixStats)
sds <- rowSds(x)
```
-The functions `colMeans`, and `colSds` provide the version for columns. For more fast implementations look at the functions available in **matrixStats**.
+The functions `colMeans` and `colSds` provide the versions for columns. For additional fast implementations, consider the functions available in the **matrixStats** package.
### Task 2: Do some digits require more ink to write than others? {.unnumbered}
@@ -364,20 +364,20 @@ avg <- rowMeans(x)
boxplot(avg ~ y)
```
-From this plot we see that, not surprisingly, 1s use less ink than the other digits.
+From this plot, we see that, not surprisingly, 1s use less ink than other digits.
## Conditional filtering
-One of the advantages of matrices operations over **tidyverse** operations, is that we can easily select columns based on summaries of the columns.
+One of the advantages of matrix operations over **tidyverse** operations is that we can easily select columns based on summaries of the columns.
-Note that logical filters can be used to subset matrices in a similar way in which they can be used to subset vectors. Here is a simple examples subsetting columns with logicals:
+Note that logical filters can be used to subset matrices in a similar way in which they can be used to subset vectors. Here is a simple example subsetting columns with logicals:
```{r}
matrix(1:15, 3, 5)[,c(FALSE, TRUE, TRUE, FALSE, TRUE)]
```
-This implies that we can select rows with conditional expression. Here is practical example that removes all observations containing at least one `NA`:
+This implies that we can also select rows with conditional expressions. The following is a practical example that removes all observations containing at least one `NA`:
```{r}
#| eval: false
@@ -395,7 +395,7 @@ x[!rowAnyNAs(x),]
### Task 3: Are some pixels uninformative? {.unnumbered}
-We can use these ideas to remove columns associated with pixels that don't change much and thus do not informing digit classification. We will quantify the variation of each pixel with its standard deviation across all entries. Since each column represents a pixel, we use the `colSds` function from the **matrixStats** package:
+We can use these ideas to remove columns associated with pixels that don't change much and thus do not inform digit classification. We will quantify the variation of each pixel with its standard deviation across all entries. Since each column represents a pixel, we use the `colSds` function from the **matrixStats** package:
```{r}
sds <- colSds(x)
@@ -439,7 +439,7 @@ Only the columns for which the standard deviation is above 60 are kept, which re
## Indexing with matrices
-A operation that facilitates efficient coding is that we can change entries of a matrix based on conditionals applied to that same matrix. Here is a simple example:
+An operation that facilitates efficient coding is that we can change the entries of a matrix based on a conditional applied to that same matrix. Here is a simple example:
To see what this does, we look at a smaller matrix:
@@ -482,7 +482,7 @@ new_x[new_x < 50] <- 0
### Task 5: Binarizing the data {.unnumbered}
-The histogram above seems to suggest that this data is mostly binary. A pixel either has ink or does not.
Using what we have learned, we can binarize the data using just matrix operations:
+The histogram above seems to suggest that this data is mostly binary. A pixel either has ink or does not. Applying what we have learned, we can binarize the data using just matrix operations:
```{r}
bin_x <- x
@@ -526,7 +526,7 @@
$$
The same holds true for other arithmetic operations.
-The function `sweep` facilitates this type of operation. It works similarly to `apply`. It takes each entry of a vector and applies an arithmetic operation to the corresponding row. Subtraction is the default artihmetic operation. So, for example, to center each row around the avarage we can use:
+The function `sweep` facilitates this type of operation. It works similarly to `apply`. It takes each entry of a vector and applies an arithmetic operation to the corresponding row. Subtraction is the default arithmetic operation. So, for example, to center each row around the average, we can use:
```{r}
#| eval: false
@@ -536,14 +536,13 @@ sweep(x, 1, rowMeans(x))
### Task 6: Standardize the digits {.unnumbered}
-The way R vectorizes arithmetic opearions implies that we can scale each row of a matrix like this:
+The way R vectorizes arithmetic operations implies that we can scale each row of a matrix as follows:
```{r, eval=FALSE}
(x - rowMeans(x))/rowSds(x)
```
-If you want to scale each column, be careful since this approach does not work for columns.
-For columns we can `sweep`:
+However, this approach does not work for columns. For columns, we can use `sweep`:
```{r, eval=FALSE}
x_mean_0 <- sweep(x, 2, colMeans(x))
@@ -555,13 +554,13 @@ To divide by the standard deviation, we change the default arithmetic operation
x_standardized <- sweep(x_mean_0, 2, colSds(x), FUN = "/")
```
-In R, if you add, subtract, multiple or divide two matrices, the operation is done elementwise. For example, if two matrices are stored in `x` and `y`, then
+In R, if you add, subtract, multiply, or divide two matrices, the operation is done elementwise. For example, if two matrices are stored in `x` and `y`, then:
```{r, eval=FALSE}
x*y
```
-does not result in matrix multiplication. Instead, the entry in row $i$ and column $j$ of this product is the product of the entryin row $i$ and column $j$ of `x` and `y`, respectively.
+does not result in matrix multiplication. Instead, the entry in row $i$ and column $j$ of this product is the product of the entries in row $i$ and column $j$ of `x` and `y`, respectively.
## Exercises
@@ -577,4 +576,4 @@ does not result in matrix multiplication. Instead, the entry in row $i$ and colu
6\. Compute the average of each column of `x`.
-7\. For each digit in the MNIST training data, compute the proportion of pixels that are in a *grey area*, defined as values between 50 and 205. Make boxplot by digit class. Hint: use logical operators and `rowMeans`.
+7\. For each digit in the MNIST training data, compute the proportion of pixels that are in a *grey area*, defined as values between 50 and 205. Make a boxplot by digit class. Hint: use logical operators and `rowMeans`.
diff --git a/highdim/matrix-factorization.qmd b/highdim/matrix-factorization.qmd index f300d30..458bf78 100644 --- a/highdim/matrix-factorization.qmd +++ b/highdim/matrix-factorization.qmd @@ -48,7 +48,7 @@ lambda <- lambdas[which.min(rmses)] b_reg <- sums / (n + lambda) ``` -In the previous chapter we described how the model: +In the previous chapter, we described how the model: $$ Y_{i,j} = \mu + \alpha_i + \beta_j + \varepsilon_{i,j} @@ -56,7 +56,7 @@ $$ can be used to model movie ratings and help make useful predictions. This model accounts for user differences through $\alpha_i$ and movie effects through $\beta_j$. However, the model ignores an important source of information related to the fact that groups of movies, have similar rating patterns and groups of users have similar rating patterns as well. -To see an example of this we compute residuals +To see an example of this, we compute residuals: $$ r_{i,j} = y_{i,j} - (\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j) @@ -92,7 +92,7 @@ rm(rr) We see that the correlation varies from very strong to negative across the other three movies. -In this chapter we introduce Factor Analysis, an approach that permits us to model these correlations and improve our prediction, and the Singular Value Decomposition, which permtis us to fit the model. As we will see, this approach is related to principal component analysis (PCA). We describe these concepts in the context of movie recommendation systems. +In this chapter, we introduce Factor Analysis, an approach that permits us to model these correlations and improve our prediction, and the Singular Value Decomposition, which permits us to fit the model. As we will see, this approach is related to principal component analysis (PCA). We describe these concepts in the context of movie recommendation systems. ## Factor analysis {#sec-factor-analysis} @@ -111,13 +111,13 @@ rownames(q) <- c(short_names) e <- p %*% t(q) + matrix(rnorm(nrow(p)*nrow(q), 0, 0.5), nrow(p), nrow(q)) ``` -We start with a simple illustration. We simulate $\varepsilon_{i,j}$ for 6 movies and 120 users and save it in `e`. If we examine the correlation we notice there a pattern: +We start with a simple illustration. We simulate $\varepsilon_{i,j}$ for 6 movies and 120 users and save it in `e`. If we examine the correlation, we notice a pattern: ```{r} cor(e) ``` -It seems their is positive correlation within mob and romance movies, and negative across the two genres. In statistics, we define _factors_ as unobserved or _latent_ variables that are inferred from the patterns of correlations or associations between the observed variables. We can quantify a factor that distinguishes between mob and romance movies with: +It seems there is positive correlation within mob and romance movies, and negative across the two genres. In statistics, we define _factors_ as unobserved or _latent_ variables that are inferred from the patterns of correlations or associations between the observed variables. We can quantify a factor that distinguishes between mob and romance movies with: ```{r} q <- c(-1, -1, -1, 1, 1, 1) @@ -129,10 +129,10 @@ To determine which users prefer each genre, we can fit a linear model to each us p <- apply(e, 1, function(y) lm(y~q-1)$coef) ``` -Note we use the `-1` because the errors have mean 0 and we don't need an intercept. +Notice we use the `-1` because the errors have mean 0 and we don't need an intercept. :::{.callout-note} -There is a much faster way to make this computation using linear algebra. 
This is because the `lm` function is computing the least squares estimates by taking the derivative of the sum of squares, equaling it to 0, and noting the solution $\hat{\boldsymbol{\beta}}$ satisfies
+There is a much faster way to make this computation using linear algebra. This is because the `lm` function computes the least squares estimates by taking the derivative of the sum of squares, setting it equal to 0, and noting that the solution $\hat{\boldsymbol{\beta}}$ satisfies:
$$
(\mathbf{q}^\top\mathbf{q}) \, \hat{\boldsymbol{\beta}} = \mathbf{q}^\top \mathbf{y}
@@ -145,13 +145,13 @@ p <- t(qr.solve(crossprod(q)) %*% t(q) %*% t(e))
```
:::
-We can see that there type of users:
+We can see that there are three types of users:
```{r}
hist(p, breaks = seq(-2,2,0.1))
```
-love mob and hate romance, don't care, and love romance and hate mob movies. We see that we approximate $\varepsilon_{i,j}$ with $p_iq_j$. We convert the vectors to matrices to be able to use linear algebra:
+those who love mob movies and hate romance movies, those who don't care, and those who love romance movies and hate mob movies. We see that we approximate $\varepsilon_{i,j}$ with $p_iq_j$. We convert the vectors to matrices to be able to use linear algebra:
```{r}
p <- matrix(p); q <- matrix(q)
@@ -177,7 +177,7 @@ And obtain estimates for each user:
p <- t(apply(e, 1, function(y) lm(y~q-1)$coef))
```
-We use the transpose becuase `apply` binds results into columns and we want a row for each user.
+We use the transpose because `apply` binds results into columns and we want a row for each user.
Our approximation based on two factors does a even better job of predicting how our residuals deviate from 0:
```{r}
plot(p %*% t(q), e)
```
-This analysis provides insights into the process generating our data. Note that it also provides compression: the $\120 \times 6$ matrix $\boldsymbol{\varepsilon}$, with 720 observation, is well approximated by a matrix multiplication of a $120 \times 2$ matrix $\mathbf{P}$ and a $6 \times 2$ matrix $\mathbf{Q}$, a total of 252 parameters.
+This analysis provides insights into the process generating our data. Note that it also provides compression: the $120 \times 6$ matrix $\boldsymbol{\varepsilon}$, with 720 observations, is well approximated by the matrix multiplication of a $120 \times 2$ matrix $\mathbf{P}$ and a $6 \times 2$ matrix $\mathbf{Q}$, a total of 252 parameters.
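+As a quick sanity check of this count (a sketch assuming `p` and `q` still hold the $120 \times 2$ and $6 \times 2$ matrices estimated above):
+```{r}
+# Total number of entries in P and Q: 120*2 + 6*2 = 252
+length(p) + length(q)
+```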
Our approximation with two factors can be written as:
$$
\varepsilon_{i,j} \approx p_{i,1}q_{j,1} + p_{i,2}q_{j,2} \mbox{ or } \boldsymbol{\varepsilon} = \mathbf{P}\mathbf{Q}^\top
$$
-In our example with simulated data we deduced the factors $\mathbf{p}_1$ and $\mathbf{p}_2$ from the sample correlation and our knowledge of movies. These ended up working well. However, in general deducing factors is not this easy. Furthermore, factors that provide good approximation might be more complicated than containing just two values. For example, _The Godfather III_ might be consider both a mob and romance movie and we would not know what value to assign it in `q`.
+In our example with simulated data, we deduced the factors $\mathbf{p}_1$ and $\mathbf{p}_2$ from the sample correlation and our knowledge of movies. These ended up working well. However, in general, deducing factors is not this easy. Furthermore, factors that provide a good approximation might be more complicated than containing just two values. For example, _The Godfather III_ might be considered both a mob and a romance movie, and we would not know what value to assign it in `q`.
-So, can we estimate the factors? A challenge is that if $\mathbf{P}$ is unknown our model is no longer linear: we can use `lm` to estimate both $\mathbf{P}$ and $\mathbf{Q}$. In the next section we describe a technique that permits us to estimate to this.
+So, can we estimate the factors? A challenge is that if $\mathbf{P}$ is unknown, our model is no longer linear: we can't use `lm` to estimate both $\mathbf{P}$ and $\mathbf{Q}$. In the next section, we describe a technique that permits us to do this.
## Connection to PCA
-Notice that if we perform PCA on the matrix $\boldsymbol{\varepsilon}$ we obtain a transformation $\mathbf{V}$ that permits us to rewrite
+Notice that if we perform PCA on the matrix $\boldsymbol{\varepsilon}$, we obtain a transformation $\mathbf{V}$ that permits us to rewrite:
$$
\boldsymbol{\varepsilon} = \mathbf{Z} \mathbf{V}^\top
@@ -214,13 +214,13 @@ Let's perform PCA and examine the results:
pca <- prcomp(e, center = FALSE)
```
-First notice that the first two PCs explain over 95% of the variability:
+First, notice that the first two PCs explain over 95% of the variability:
```{r}
pca$sdev^2/sum(pca$sdev^2)
```
-Next notice that the first column of $\mathbf{V}$
+Next, notice that the first column of $\mathbf{V}$:
```{r}
pca$rotation[,1]
@@ -228,7 +228,7 @@ pca$rotation[,1]
```
is assigning positive values to the mob movies and negative values to the romance movies.
-The second column
+The second column:
```{r}
pca$rotation[,2]
@@ -236,8 +236,7 @@ pca$rotation[,2]
```
is coding for Al Pacino movies.
-PCA is automatically finding what we deduced with our knowledge of movies.
-This is not a coincidence.
+PCA is automatically finding what we deduced with our knowledge of movies. This is not a coincidence.
@@ -245,26 +244,26 @@
Assume that data $\mathbf{Y}$ follows the model:
$$
Y_{i,j} = \sum_{k=1}^K p_{i,k}q_{j,k} + \varepsilon_{i,j}
$$
-If we define the matrices $\mathbf{Y}$ and $\boldsymbol{\varepsilon}$ to have $y_{i,j}$ and $\varepsilon_{i,j}$ in the $i$th row and $j$th column, respectively, and $\mathbf{P}$ and $\mathbf{Q}$ to have entries $p_{i,k}$ and $q_{i,k}$ in the $i$th and $k$th column, respecively, we can rewrite the model as:
+If we define the matrices $\mathbf{Y}$ and $\boldsymbol{\varepsilon}$ to have $y_{i,j}$ and $\varepsilon_{i,j}$ in the $i$th row and $j$th column, respectively, $\mathbf{P}$ to have entry $p_{i,k}$ in the $i$th row and $k$th column, and $\mathbf{Q}$ to have entry $q_{j,k}$ in the $j$th row and $k$th column, we can rewrite the model as:
$$
\mathbf{Y} = \mathbf{P}\mathbf{Q} ^\top + \boldsymbol{\varepsilon}
$$
-Notice this model is not identifiable since we can multiply the $\mathbf{P}$ by any positive constant and obtain the same model by dividing $\mathbf{Q}$ by this same constant. To avoid this we impose the constrain that the matrix $\mathbf{Q}$ is orthogonal:
+Notice this model is not identifiable since we can multiply $\mathbf{P}$ by any positive constant and obtain the same model by dividing $\mathbf{Q}$ by this same constant. To avoid this, we impose the constraint that the matrix $\mathbf{Q}$ is orthogonal:
$$
\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}
$$
-The first $K$ columns of the principal components and assocaited rotation provide estimates of $\mathbf{P}$ and $\mathbf{Q}$ respectively.
+The first $K$ columns of the principal components and the associated rotation provide estimates of $\mathbf{P}$ and $\mathbf{Q}$, respectively.
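+To make the connection concrete, here is a minimal sketch, assuming $K=2$ and using the `pca` object computed above, of how these estimates can be extracted and checked:
+```{r}
+# The first two principal components estimate P and the first two columns of
+# the rotation estimate Q (both up to sign flips).
+p_hat <- pca$x[, 1:2]
+q_hat <- pca$rotation[, 1:2]
+crossprod(q_hat)             # the orthogonality constraint holds by construction
+plot(p_hat %*% t(q_hat), e)  # and P Q' approximates the matrix of residuals
+```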
## Case study: movie recommendations
-Note that if we look at the correlation structure of the movies for which we simulated data in the previous sections we see structure as well:
+Note that if we look at the correlation structure of the movies for which we simulated data in the previous sections, we see structure as well:
```{r}
#| echo: false
@@ -286,9 +285,9 @@ Y_{i,j} = \mu + \alpha_i + \beta_j + \sum_{k=1}^K p_{i,k}q_{j,k} +\varepsilon_{i
$$
-Unfortunately, we can't fit this model with `prcomp` due to the missing values. We introduce the **missMDA** package that provides an approach `imputePCA` to fit such models when entries of the matrix are missing, a very common occurrence in movie recommendations. Also, because there are small sample sizes for several movie pairs it is useful to regularize the $p$s. The `imputePCA` function also permits regularization.
+Unfortunately, we can't fit this model with `prcomp` due to the missing values. We introduce the **missMDA** package, which provides the function `imputePCA` to fit such models when matrix entries are missing, a very common occurrence in movie recommendations. Also, because there are small sample sizes for several movie pairs, it is useful to regularize the $p$s. The `imputePCA` function also permits regularization.
-We estimate use the estimates for $\mu$, the $\alpha$s and the $\beta$s from the previous chapter and estimate two factors (`ncp = 2`) to movies rated more than 25 times (and Scent of a Woman), using the same penalty we used for the $\beta$s (`coeff.ridge = `) for the penalization.
+We use the estimates of $\mu$, the $\alpha$s, and the $\beta$s from the previous chapter and estimate two factors (`ncp = 2`) for movies rated more than 25 times (and _Scent of a Woman_), using the same penalty we used for the $\beta$s (via the `coeff.ridge` argument) for the penalization.
```{r}
library(missMDA)
@@ -297,7 +296,7 @@ ind <- colSums(!is.na(y)) >= 25 | colnames(y) == "3252"
imputed <- imputePCA(r[,ind], ncp = 2, coeff.ridge = lambda)
```
-To see how much we improve our previous prediction we construct a matrix with the ratings in the test set:
+To see how much we improve our previous prediction, we construct a matrix with the ratings in the test set:
```{r}
y_test <- select(test_set, movieId, userId, rating) |>
@@ -327,8 +326,7 @@ We see that our prediction improves:
rmse(y_test - pred[rownames(y_test), colnames(y_test)])
```
-We note that further improvements can be obtained by optimizing the regularization penalty, considering more factors, and accounting for the fact
-that missing that is likely more predictive of not liking the movie.
+We note that further improvements can be obtained by optimizing the regularization penalty, considering more factors, and accounting for the fact that a missing rating is likely more predictive of the user not liking the movie.
### Visualizing factors
@@ -346,7 +344,7 @@ v <- pca$rotation
rownames(v) <- with(movie_map, title[match(colnames(r[,ind]), movieId)])
```
-and visually explore the results, we see that the mob discussed above are close to each other, as are the romance movie without Al Pacino.
+and visually explore the results, we see that the mob movies discussed above are close to each other, as are the romance movies without Al Pacino.
```{r movies-pca, echo=FALSE, warning=FALSE}
@@ -365,7 +363,7 @@ pcs |> ggplot(aes(PC1, PC2)) + geom_point(alpha = 1, color = "grey") +
data = highlight, size = 2, color = "blue")
```
-By looking at the highest and lowest values for the first principal component we see a meaningful pattern. The first PC shows the difference between Hollywood blockbusters on one side:
+By looking at the highest and lowest values for the first principal component, we see a meaningful pattern. The first PC shows the difference between Hollywood blockbusters on one side:
```{r, echo=FALSE}
pcs |> select(name, PC1) |> arrange(PC1) |> slice(1:10) |> pull(name)
@@ -379,18 +377,18 @@ pcs |> select(name, PC1) |> arrange(desc(PC1)) |> slice(1:10) |> pull(name)
## Singular Value Decomposition
-The analysis performed here with PCA is often performed with a related technique called the Singular Value Decomposition (SVD). The SVD theorem states that any $N\times p$ matrix can be written as
+The analysis performed here with PCA is often performed with a related technique called the Singular Value Decomposition (SVD). The SVD theorem states that any $N\times p$ matrix can be written as:
$$
\mathbf{Y} = \mathbf{U}\mathbf{D}\mathbf{V}^\top
$$
-With $\mathbf{U}$ and orthogonal $N\times p$ matrix, $\mathbf{V}$ an orthogonal $p \times p$ matrix, and $\mathbf{D}$ a diagonal matrix with $d_{1,1} \geq d_{2,2} \geq \dots \geq d_{p,p}$. SVD is related to PCA because we can show that $\mathbf{V}$ is the rotation that gives us principal components. This implies that $\mathbf{U}\mathbf{D}$ are the principal components. This also implies that if you square the diagonal entries of $\mathbf{D}$ you obtain the sum of squares of the principal components since
+With $\mathbf{U}$ an orthogonal $N\times p$ matrix, $\mathbf{V}$ an orthogonal $p \times p$ matrix, and $\mathbf{D}$ a diagonal matrix with $d_{1,1} \geq d_{2,2} \geq \dots \geq d_{p,p}$. SVD is related to PCA because we can show that $\mathbf{V}$ is the rotation that gives us the principal components. This implies that $\mathbf{U}\mathbf{D}$ are the principal components. This also implies that if you square the diagonal entries of $\mathbf{D}$, you obtain the sums of squares of the principal components since:
$$
-\mathbf{U}^\top\mathbf{D}\mathbf{U} = \mathbf{D}^2
+(\mathbf{U}\mathbf{D})^\top\mathbf{U}\mathbf{D} = \mathbf{D}^2
$$
-In, we can can obtained the SVD using the function `svd`. To see the connection to PCA notice that:
+In R, we can obtain the SVD using the function `svd`. To see the connection to PCA, notice that:
```{r}
x <- matrix(rnorm(1000), 100, 10)
@@ -402,14 +400,14 @@ all.equal(pca$sdev^2, s$d^2/(nrow(x) - 1))
all.equal(pca$x, s$u %*% diag(s$d), check.attributes = FALSE)
```
-In the exercises we show that `s$u %*% diag(s$d)` can be computed more efficiently as `sweep(s$u, 2, s$d, "*")`.
+In the exercises, we show that `s$u %*% diag(s$d)` can be computed more efficiently as `sweep(s$u, 2, s$d, "*")`.
## Exercises
In this exercise set, we use the singular value decomposition (SVD) to estimate factors in an example related to the first application of factor analysis: finding factors related to student performance in school.
-We construct a dataset that represents grade scores for 100 students in 24 different subjects. The overall average has been removed so this data represents the percentage point each student received above or below the average test score. So a 0 represents an average grade (C), a 25 is a high grade (A+), and a -25 represents a low grade (F).
You can simulate the data like this: +We construct a dataset that represents grade scores for 100 students in 24 different subjects. The overall average has been removed so this data represents the percentage points each student received above or below the average test score. So a 0 represents an average grade (C), a 25 is a high grade (A+), and a -25 represents a low grade (F). You can simulate the data like this: ```{r, eval=FALSE} @@ -426,7 +424,7 @@ colnames(y) <- c(paste(rep("Math",k), 1:k, sep="_"), paste(rep("Arts",k), 1:k, sep="_")) ``` -Our goal is to describe the student performances as succinctly as possible. For example, we want to know if these test results are all just random independent numbers. Are all students just about as good? Does being good in one subject imply you will be good in another? How does the SVD help with all this? We will go step by step to show that with just three relatively small pairs of vectors we can explain much of the variability in this $100 \times 24$ dataset. +Our goal is to describe the student performances as succinctly as possible. For example, we want to know if these test results are all simply random independent numbers. Are all students just about as good? Does being good in one subject imply one will be good in another? How does the SVD help with all this? We will go step by step to show that with just three relatively small pairs of vectors, we can explain much of the variability in this $100 \times 24$ dataset. You can visualize the 24 test scores for the 100 students by plotting an image: @@ -447,7 +445,7 @@ my_image(y) 1\. How would you describe the data based on this figure? a. The test scores are all independent of each other. -b. The students that test well are at the top of the image and there seem to be three groupings by subject. +b. The students that test well are at the top of the image and there seems to be three groupings by subject. c. The students that are good at math are not good at science. d. The students that are good at math are not good at humanities. @@ -462,12 +460,12 @@ axis(side = 2, 1:ncol(y), rev(colnames(y)), las = 2) Which of the following best describes what you see? a. The test scores are independent. -b. Math and science are highly correlated but the humanities are not. -c. There is high correlation between tests in the same subject but no correlation across subjects. +b. Math and science are highly correlated, but the humanities are not. +c. There is high correlation between tests in the same subject, but no correlation across subjects. d. There is a correlation among all tests, but higher if the tests are in science and math and even higher within each subject. -3\. Remember that orthogonality means that $U^{\top}U$ and $V^{\top}V$ are equal to the identity matrix. This implies that we can also rewrite the decomposition as +3\. Remember that orthogonality means that $U^{\top}U$ and $V^{\top}V$ are equal to the identity matrix. This implies that we can also rewrite the decomposition as: $$ \mathbf{Y V} = \mathbf{U D} \mbox{ or } \mathbf{U}^{\top}\mathbf{Y} = \mathbf{D V}^{\top}$$ @@ -496,14 +494,14 @@ plot `ss_y` against the column number and then do the same for `ss_yv`. What do 5\. We see that the variability of the columns of $\mathbf{YV}$ is decreasing. Furthermore, we see that, relative to the first three, the variability of the columns beyond the third is almost 0. Now notice that we didn't have to compute `ss_yv` because we already have the answer. How? 
Remember that $\mathbf{YV} = \mathbf{UD}$ and because $\mathbf{U}$ is orthogonal, we know that the sum of squares of the columns of $\mathbf{UD}$ are the diagonal entries of $\mathbf{D}$ squared. Confirm this by plotting the square root of `ss_yv` versus the diagonal entries of $\mathbf{D}$.
-6\. From the above we know that the sum of squares of the columns of $\mathbf{Y}$ (the total sum of squares) add up to the sum of `s$d^2` and that the transformation $\mathbf{YV}$ gives us columns with sums of squares equal to `s$d^2`. Now compute what percent of the total variability is explained by just the first three columns of $\mathbf{YV}$.
+6\. From the above, we know that the sums of squares of the columns of $\mathbf{Y}$ (the total sum of squares) add up to the sum of `s$d^2`, and that the transformation $\mathbf{YV}$ gives us columns with sums of squares equal to `s$d^2`. Now compute what percent of the total variability is explained by just the first three columns of $\mathbf{YV}$.
-7\. We see that almost 99% of the variability is explained by the first three columns of $\mathbf{YV} = \mathbf{UD}$. So we get the sense that we should be able to explain much of the variability and structure we found while exploring the data with a few columns. Before we continue, let's show a useful computational trick to avoid creating the matrix `diag(s$d)`. To motivate this, we note that if we write $\mathbf{U}$ out in its columns $[\mathbf{u}_1, \mathbf{u}_2, \dots, \mathbf{u}_p]$ then $\mathbf{UD}$ is equal to
+7\. We see that almost 99% of the variability is explained by the first three columns of $\mathbf{YV} = \mathbf{UD}$. So we get the sense that we should be able to explain much of the variability and structure we found while exploring the data with a few columns. Before we continue, let's show a useful computational trick to avoid creating the matrix `diag(s$d)`. To motivate this, we note that if we write $\mathbf{U}$ out in its columns $[\mathbf{u}_1, \mathbf{u}_2, \dots, \mathbf{u}_p]$, then $\mathbf{UD}$ is equal to:
$$\mathbf{UD} = [\mathbf{u}_1 d_{1,1}, \mathbf{u}_2 d_{2,2}, \dots, \mathbf{u}_p d_{p,p}]$$
-Use the `sweep` function to compute $UD$ without constructing `diag(s$d)` nor matrix multiplication.
+Use the `sweep` function to compute $\mathbf{UD}$ without constructing `diag(s$d)` and without using matrix multiplication.
@@ -513,7 +511,7 @@ Use the `sweep` function to compute $UD$ without constructing `diag(s$d)` nor ma
my_image(y)
```
-in which we can see that the student to student variability is quite large and that it appears that students that are good in one subject are good in all. This implies that the average (across all subjects) for each student should explain a lot of the variability. Compute the average score for each student and plot it against $\mathbf{u}_1 d_{1,1}$ and describe what you find.
+in which we can see that the student-to-student variability is quite large and that it appears that students that are good in one subject are good in all. This implies that the average (across all subjects) for each student should explain a lot of the variability. Compute the average score for each student and plot it against $\mathbf{u}_1 d_{1,1}$, and describe what you find.
@@ -522,12 +520,12 @@ in which we can see that the student to student variability is quite large and t
$$ \mathbf{U D V}^{\top} = (-\mathbf{U}) D (-\mathbf{V})^{\top} $$
-With this in mind we see that the first column of $\mathbf{UD}$ is almost identical to the average score for each student except for the sign.
+With this in mind, we see that the first column of $\mathbf{UD}$ is almost identical to the average score for each student except for the sign. This implies that multiplying $\mathbf{Y}$ by the first column of $\mathbf{V}$ must be performing a similar operation to taking the average. Make an image plot of $\mathbf{V}$ and describe the first column relative to others and how this relates to taking an average. -10\. We already saw that we can rewrite $UD$ as +10\. We already saw that we can rewrite $UD$ as: $$ \mathbf{u}_1 d_{1,1} + \mathbf{u}_2 d_{2,2} + \dots + \mathbf{u}_p d_{p,p}$ @@ -537,14 +535,14 @@ with $\mathbf{u}_j$ the j-th column of $\mathbf{U}$. This implies that we can re $$\mathbf{Y} = \mathbf{u}_1 d_{1,1} \mathbf{v}_1 ^{\top} + \mathbf{u}_2 d_{2,2} \mathbf{v}_2 ^{\top} + \dots + \mathbf{u}_p d_{p,p} \mathbf{v}_p ^{\top}$$ -with $\mathbf{V}_j$ the jth column of $\mathbf{V}$. Plot $\mathbf{u}_1$, then plot $\mathbf{v}_1^{\top}$ using the same range for the y-axis limits, then make an image of $\mathbf{u}_1 d_{1,1} \mathbf{v}_1 ^{\top}$ and compare it to the image of $\mathbf{Y}$. Hint: use the `my_image` function defined above and use the `drop=FALSE` argument to assure the subsets of matrices are matrices. +with $\mathbf{V}_j$ the jth column of $\mathbf{V}$. Plot $\mathbf{u}_1$, then plot $\mathbf{v}_1^{\top}$ using the same range for the y-axis limits. Then make an image of $\mathbf{u}_1 d_{1,1} \mathbf{v}_1 ^{\top}$ and compare it to the image of $\mathbf{Y}$. Hint: use the `my_image` function defined above and use the `drop=FALSE` argument to assure the subsets of matrices are matrices. 11\. We see that with just a vector of length 100, a scalar, and a vector of length 24, we actually come close to reconstructing the original $100 \times 24$ matrix. This is our first matrix factorization: $$ \mathbf{Y} \approx d_{1,1} \mathbf{u}_1 \mathbf{v}_1^{\top}$$ -We know it explains `s$d[1]^2/sum(s$d^2) * 100` percent of the total variability. Our approximation only explains the observation that good students tend to be good in all subjects. But another aspect of the original data that our approximation does not explain was the higher similarity we observed within subjects. We can see this by computing the difference between our approximation and original data and then computing the correlations. You can see this by running this code: +We know it explains `s$d[1]^2/sum(s$d^2) * 100` percent of the total variability. Our approximation only explains the observation that good students tend to be good in all subjects. But another aspect of the original data that our approximation does not explain was the higher similarity we observed within subjects. We can see this by computing the difference between our approximation and original data and then computing the correlations. You can see this by running this code: ```{r, eval=FALSE} resid <- y - with(s,(u[,1, drop=FALSE]*d[1]) %*% t(v[,1, drop=FALSE])) @@ -559,7 +557,7 @@ Now that we have removed the overall student effect, the correlation plot reveal $$ \mathbf{Y} \approx d_{1,1} \mathbf{u}_1 \mathbf{v}_1^{\top} + d_{2,2} \mathbf{u}_2 \mathbf{v}_2^{\top} $$ -We know it will explain +We know it will explain: ```{r,eval=FALSE} sum(s$d[1:2]^2)/sum(s$d^2) * 100 @@ -573,7 +571,7 @@ my_image(cor(resid), zlim = c(-1,1)) axis(side = 2, 1:ncol(y), rev(colnames(y)), las = 2) ``` -and see that the structure that is left is driven by the differences between math and science. 
Confirm this by plotting $\mathbf{u}_3$, then plot $\mathbf{v}_3^{\top}$ using the same range for the y-axis limits, then make an image of $\mathbf{u}_3 d_{3,3} \mathbf{v}_3 ^{\top}$ and compare it to the image of `resid`. +and see that the structure that is left is driven by the differences between math and science. Confirm this by plotting $\mathbf{u}_3$, then plot $\mathbf{v}_3^{\top}$ using the same range for the y-axis limits, then make an image of $\mathbf{u}_3 d_{3,3} \mathbf{v}_3 ^{\top}$ and compare it to the image of `resid`. 13\. The third column clearly relates to a student's difference in ability in math and science. We can see this most clearly from the plot of `s$v[,3]`. Adding the matrix we obtain with these two columns will help with our approximation: @@ -599,7 +597,7 @@ We no longer see structure in the residuals: they seem to be independent of each $$ \mathbf{Y} = \mathbf{u}_1 \mathbf{v}_1^{\top} + d_{2,2} \mathbf{u}_2 \mathbf{v}_2^{\top} + d_{3,3} \mathbf{u}_3 \mathbf{v}_3^{\top} + \varepsilon$$ -with $\varepsilon$ a matrix of independent identically distributed errors. This model is useful because we summarize of $100 \times 24$ observations with $3 \times (100+24+1) = 375$ numbers. Furthermore, the three components of the model have useful interpretations: 1) the overall ability of a student, 2) the difference in ability between the math/sciences and arts, and 3) the remaining differences between the three subjects. The sizes $d_{1,1}, d_{2,2}$ and $d_{3,3}$ tell us the variability explained by each component. Finally, note that the components $d_{j,j} \mathbf{u}_j \mathbf{v}_j^{\top}$ are equivalent to the jth principal component. +with $\varepsilon$ a matrix of independent identically distributed errors. This model is useful because we summarize $100 \times 24$ observations with $3 \times (100+24+1) = 375$ numbers. Furthermore, the three components of the model have useful interpretations: 1) the overall ability of a student, 2) the difference in ability between the math/sciences and arts, and 3) the remaining differences between the three subjects. The sizes $d_{1,1}, d_{2,2}$ and $d_{3,3}$ tell us the variability explained by each component. Finally, note that the components $d_{j,j} \mathbf{u}_j \mathbf{v}_j^{\top}$ are equivalent to the jth principal component. Finish the exercise by plotting an image of $Y$, an image of $d_{1,1} \mathbf{u}_1 \mathbf{v}_1^{\top} + d_{2,2} \mathbf{u}_2 \mathbf{v}_2^{\top} + d_{3,3} \mathbf{u}_3 \mathbf{v}_3^{\top}$ and an image of the residuals, all with the same `zlim`. diff --git a/highdim/regularization.qmd b/highdim/regularization.qmd index 516b282..0412474 100644 --- a/highdim/regularization.qmd +++ b/highdim/regularization.qmd @@ -2,9 +2,9 @@ ## Case study: recommendation systems {#sec-recommendation-systems} -Recommendation systems, like the one used by Amazon, operate by analyzing the ratings that customers give to various products. These ratings form a large dataset. The system uses this data to predict how likely a specific user is to favorably rate a particular product. For example, if the system predicts that a user is likely to give a high rating to a certain book or gadget, it will recommend that item to them. In essence, the system tries to guess which products a user will like based on the ratings provided by them and other customers for various items. This approach helps in personalizing recommendations to suit individual preferences. 
+Recommendation systems, such as the one used by Amazon, operate by analyzing the ratings that customers give to various products. These ratings form a large dataset. The system uses this data to predict how likely a specific user is to favorably rate a particular product. For example, if the system predicts that a user is likely to give a high rating to a certain book or gadget, it will recommend that item to them. In essence, the system tries to guess which products a user will like based on the ratings provided by them and other customers for various items. This approach helps in personalizing recommendations to suit individual preferences. -During its initial years of operation, Netflix used a 5-star recommendation system. One star suggests it is not a good movie, whereas five stars suggests it is an excellent movie. Here, we provide the basics of how these recommendations are made, motivated by some of the approaches taken by the winners of the _Netflix challenges_. +During its initial years of operation, Netflix used a 5-star recommendation system. One star suggested it was not a good movie, whereas five stars suggested it was an excellent movie. Here, we provide the basics of how these recommendations are made, motivated by some of the approaches taken by the winners of the _Netflix challenges_. In October 2006, Netflix offered a challenge to the data science community: improve our recommendation algorithm by 10% and win a million dollars. In September 2009, the winners were announced^[http://bits.blogs.nytimes.com/2009/09/21/netflix-awards-1-million-prize-and-starts-a-new-contest/]. You can read a summary of how the winning algorithm was put together here: [http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/](http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/) @@ -24,7 +24,7 @@ movielens |> as_tibble() |> head(5) Each row represents a rating given by one user to one movie. -It will later be convenient that our `userId` and `movieId` are factors so we change that: +It will later be convenient that our `userId` and `movieId` are factors, so we change that: ```{r} movielens <- mutate(movielens, userId = factor(userId), movieId = factor(movieId)) @@ -37,7 +37,7 @@ We can see the number of unique users that provided ratings and how many unique movielens |> summarize(n_distinct(userId), n_distinct(movieId)) ``` -If we multiply those two numbers, we get a number larger than 5 million, yet our data table has about 100,000 rows. This implies that not every user rated every movie. So we can think of these data as a very large matrix, with users on the rows and movies on the columns, with many empty cells. Here is the matrix for six users and four movies. +If we multiply those two numbers, we get a number larger than 5 million, yet our data table has about 100,000 rows. This implies that not every user rated every movie. We can think of these data as a very large matrix, with users on the rows and movies on the columns, with many empty cells. Here is the matrix for six users and four movies: ```{r, echo=FALSE} keep <- movielens |> @@ -62,7 +62,7 @@ if (knitr::is_html_output()) { } ``` -You can think of the task of a recommendation system as filling in the `NA`s in the table above. To see how _sparse_ the matrix is, here is the matrix for a random sample of 100 movies and 100 users with yellow indicating a user/movie combination for which we have a rating. +You can think of the task of a recommendation system as filling in the `NA`s in the table above. 
To see how _sparse_ the matrix is, here is the matrix for a random sample of 100 movies and 100 users with yellow indicating a user/movie combination for which we have a rating:
```{r sparsity-of-movie-recs, echo=FALSE, fig.width=3, fig.height=3, out.width="40%"}
users <- sample(unique(movielens$userId), 100)
@@ -81,7 +81,7 @@ movielens |>
Let's look at some of the general properties of the data to better understand the challenges.
-The first thing we notice is that some movies get rated more than others. Below is the distribution. This should not surprise us given that there are blockbuster movies watched by millions and artsy, independent movies watched by just a few. Our second observation is that some users are more active than others at rating movies:
+The first thing we notice is that some movies get rated more than others. Below is the distribution. This is not surprising given that there are blockbuster movies watched by millions and artsy, independent movies watched by just a few. Our second observation is that some users are more active than others at rating movies:
```{r movie-id-and-user-hists, echo=FALSE, fig.width=6, fig.height=3}
p1 <- movielens |>
@@ -102,9 +102,9 @@ gridExtra::grid.arrange(p2, p1, ncol = 2)
```
-We need to build an algorithm with the collected data that will then be applied outside our control when users look for movie recommendations. To test our idea we will split the data into a training set, which we will use to develop our approach, and a test set in which we will compute the accuracy of our predictions.
+We need to build an algorithm with the collected data that will then be applied outside our control when users look for movie recommendations. To test our idea, we will split the data into a training set, which we will use to develop our approach, and a test set in which we will compute the accuracy of our predictions.
-We will do this only for users that have have provided at least 100 ratings.
+We will do this only for users that have provided at least 100 ratings.
```{r}
movielens <- movielens |>
@@ -113,7 +113,7 @@ movielens <- movielens |>
ungroup()
```
-For each one of these users we will take split their ratings into 80% for training and 20% for testing.
+For each one of these users, we will split their ratings into 80% for training and 20% for testing.
```{r}
set.seed(2006)
@@ -125,13 +125,13 @@ test_set <- movielens[test_ind,]
train_set <- movielens[-test_ind,]
```
-To make sure we don't include movies in the training set that are not in the training sets, we remove entries using the `semi_join` function:
+To make sure the test set does not include movies that are not in the training set, we remove those entries using the `semi_join` function:
```{r}
test_set <- test_set |> semi_join(train_set, by = "movieId")
```
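+In case `semi_join` is unfamiliar, here is a small sketch with made-up toy tables (not part of the movielens data) showing that it keeps only the rows of its first argument whose keys appear in the second:
+```{r}
+# Hypothetical example: only movieIds "1" and "3" appear in both tables,
+# so semi_join keeps just those two rows of toy_test.
+toy_test <- tibble(movieId = c("1", "2", "3"), rating = c(5, 3, 4))
+toy_train <- tibble(movieId = c("1", "3"), rating = c(4, 2))
+semi_join(toy_test, toy_train, by = "movieId")
+```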
-We will use the array representation described in @sec-anova, for the training data: we denote ranking for movie $j$ by user $i$ as $y_{i,j}$. To create this matrix we use `pivot_wider`:
+We will use the array representation described in @sec-anova for the training data: we denote the rating for movie $j$ by user $i$ as $y_{i,j}$. To create this matrix, we use `pivot_wider`:
```{r}
y <- select(train_set, movieId, userId, rating) |>
@@ -140,7 +140,7 @@ y <- select(train_set, movieId, userId, rating) |>
as.matrix()
```
-We be able to map movie ids to titles we create a lookup table:
+To be able to map movie ids to titles, we create a lookup table:
```{r}
movie_map <- train_set |> select(movieId, title) |> distinct(movieId, .keep_all = TRUE)
@@ -150,7 +150,7 @@ Note that two different movies can have the same title. For example, our dataset
## Loss function {#sec-netflix-loss-function}
-The Netflix challenge decided on a winner based on the root mean squared error (RMSE) computed on the test set. Specifically, if $y_{i,j}$ is the rating for movie $j$ by user $i$ **in the test set** and $\hat{y}_{i,j}$ is our prediction based on the training set, RMSE was defined as
+The Netflix challenge decided on a winner based on the root mean squared error (RMSE) computed on the test set. Specifically, if $y_{i,j}$ is the rating for movie $j$ by user $i$ **in the test set** and $\hat{y}_{i,j}$ is our prediction based on the training set, RMSE was defined as:
$$
\mbox{RMSE} = \sqrt{\frac{1}{N} \sum_{i,j}^{N} (y_{i,j} - \hat{y}_{i,j})^2}
@@ -164,15 +164,15 @@ We can interpret the RMSE similarly to a standard deviation: it is the typical e
rmse <- function(r) sqrt(mean(r^2))
```
-In this chapter and the nextwe introduce two concepts, regularization and matrix factorization, that were used by the winners of the Netflix challenge to obtain the winning RMSE.
+In this chapter and the next, we introduce two concepts, regularization and matrix factorization, that were used by the winners of the Netflix challenge to obtain the winning RMSE.
:::{.callout-note}
-In @sec-cross-validation we provide a formal discussion of the mean squared error.
+In @sec-cross-validation, we provide a formal discussion of the mean squared error.
:::
## A first model
-Let's start by building the simplest possible recommendation system: we predict the same rating for all movies regardless of user. What number should this prediction be? We can use a model based approach to answer this. A model that assumes the same rating for all movies and users with all the differences explained by random variation would look like this:
+Let's start by building the simplest possible recommendation system: we predict the same rating for all movies regardless of user. What number should this prediction be? We can use a model-based approach to answer this. A model that assumes the same rating for all movies and users, with all the differences explained by random variation, would look as follows:
$$
@@ -185,7 +185,7 @@ with $\varepsilon_{i,j}$ independent errors sampled from the same distribution c
mu <- mean(y, na.rm = TRUE)
```
-If we predict all unknown ratings with $\hat{\mu}$ we obtain the following RMSE:
+If we predict all unknown ratings with $\hat{\mu}$, we obtain the following RMSE:
```{r}
rmse(test_set$rating - mu)
@@ -214,14 +214,14 @@ If we visualize the average rating for each user:
hist(rowMeans(y, na.rm = TRUE), nclass = 30)
```
-we notice that there is substantial variability across users: some users are very cranky and others love most movies. To account for this we can use a linear model with a _treatment effect_ $\alpha_i$ for each user. The sum $\mu+\alpha_i$ can be interpreted as the typical rating user $i$ gives to movies.
We can write the model as:
+we notice that there is substantial variability across users: some users are very cranky and others love most movies. To account for this, we can use a linear model with a _treatment effect_ $\alpha_i$ for each user. The sum $\mu+\alpha_i$ can be interpreted as the typical rating user $i$ gives to movies. We can write the model as:
$$
Y_{i,j} = \mu + \alpha_i + \varepsilon_{i,j}
$$
-Statistics textbooks refer to the $\alpha$s as a treatment effects. In the Netflix challenge papers they refer to them as _bias_.
+Statistics textbooks refer to the $\alpha$s as treatment effects. In the Netflix challenge papers, they refer to them as _bias_.
We can again use least squares to estimate the $\alpha_i$ in the following way:
```{r, eval=FALSE}
fit <- lm(rating ~ userId, data = train_set)
```
-Because there are hundreds of $\alpha_i$ as each movie gets one, the `lm()` function will be very slow here. In this case, we can show that the least squares estimate $\hat{\alpha}_i$ is just the average of $y_{i,j} - \hat{\mu}$ for each user $i$. So we can compute them this way:
+Because there are hundreds of $\alpha_i$, as each user gets one, the `lm()` function will be very slow here. In this case, we can show that the least squares estimate $\hat{\alpha}_i$ is just the average of $y_{i,j} - \hat{\mu}$ for each user $i$. So we can compute them this way:
```{r}
a <- rowMeans(y - mu, na.rm = TRUE)
```
Note that going forward, we drop the `hat` notation in the code to represent estimates.
-Let's see how much our prediction improves once we use $\hat{y}_{i,j} = \hat{\mu} + \hat{\alpha}_i$. Because we know ratings can't be below 0.5 or above 5 we define the function `clamp`
+Let's see how much our prediction improves once we use $\hat{y}_{i,j} = \hat{\mu} + \hat{\alpha}_i$. Because we know ratings can't be below 0.5 or above 5, we define the function `clamp`:
```{r}
clamp <- function(x, min = 0.5, max = 5) pmax(pmin(x, max), min)
@@ -261,7 +261,7 @@ We already see an improvement. But can we make it better?
## Movie effects
-We know from experience that some movies are just generally rated higher than others. We can use a linear models with a _treatment effect_ $\beta_j$ for each movie, which can be interpreted as movie effect or the difference between the average ranking for movie $j$ and the overall average $\mu$:
+We know from experience that some movies are just generally rated higher than others. We can use a linear model with a _treatment effect_ $\beta_j$ for each movie, which can be interpreted as the movie effect or the difference between the average rating for movie $j$ and the overall average $\mu$:
$$
Y_{i,j} = \mu + \alpha_i + \beta_j +\varepsilon_{i,j}
$$
fit <- lm(rating ~ userId + movieId, data = train_set)
```
However, this code generates a very large matrix with all the indicator variables needed to represent all the movies and the code will take time to run.
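+To get a sense of why, here is a sketch (not evaluated) of how one could inspect the size of the design matrix that `lm` would have to construct:
+```{r, eval=FALSE}
+# One indicator column per user and per movie makes this matrix very wide.
+dim(model.matrix(~ userId + movieId, data = train_set))
+```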
-We instead, use an approximation by computing the least square estimate $\hat{\mu}$ and $\hat{\alpha}_i$ first, and then estimating $\hat{\beta}_j$ as the average of the residuals $y_{i,j} - \hat{\mu} - \hat{\alpha}_i$:
+We instead use an approximation by first computing the least squares estimates $\hat{\mu}$ and $\hat{\alpha}_i$, and then estimating $\hat{\beta}_j$ as the average of the residuals $y_{i,j} - \hat{\mu} - \hat{\alpha}_i$:
```{r}
b <- colMeans(y - mu - a, na.rm = TRUE)
@@ -299,7 +299,7 @@ rmse2 <- test_set |>
## Penalized least squares
-If we look at the top movies based on our estimates of the movie effect $\hat{\beta}_j$ we find that they all obscure movies with just one rating:
+If we look at the top movies based on our estimates of the movie effect $\hat{\beta}_j$, we find that they are all obscure movies with just one rating:
```{r}
n <- colSums(!is.na(y))
@@ -308,8 +308,7 @@ filter(movie_map, movieId %in% names(b)[ind]) |> pull(title)
n[ind]
```
-Do we really think these are the top movies in our database?
-The one of these that appears in our test set receives a terrible rating:
+Do we really think these are the top movies in our database? The one of these that appears in our test set receives a terrible rating:
```{r}
filter(test_set, movieId %in% names(b)[ind]) |>
@@ -318,20 +317,18 @@ filter(test_set, movieId %in% names(b)[ind]) |>
```
-Large estimates, negative or positive, should not trusted when based on a small number of ratings. Because large errors can increase our RMSE, we rather be conservative when unsure.
+Large estimates, negative or positive, should not be trusted when based on a small number of ratings. Because large errors can increase our RMSE, we would rather be conservative when unsure.
In previous sections, we computed standard error and constructed confidence intervals to account for different levels of uncertainty. However, when making predictions, we need one number, one prediction, not an interval. For this, we introduce the concept of regularization.
-Regularization permits us to penalize large estimates that
-are formed using small sample sizes. It has commonalities with the
-Bayesian approach that shrunk predictions described in @sec-bayesian-statistics.
+Regularization permits us to penalize large estimates that are formed using small sample sizes. It has commonalities with the Bayesian approach that shrunk predictions described in @sec-bayesian-statistics.
-The general idea behind regularization is to constrain the total variability of the effect sizes. Why does this help? Consider a case in which we have movie $j=1$ with 100 user ratings and 4 movies $j=2,3,4,5$ with just one user rating. Suppose we know the average rating is, say, $\mu = 3$. If we use least squares, the estimate for the first movie effect is the average of 100 user ratings, which we expect to be a quite precise. However, the estimate for movies 2, 3, 4, and 5 will be based on one observation. Note that because the average is based on a single observation, the error for $j=2,3,4,5$ is 0, but we don't expect to be this lucky next time, when asked to predict. In fact, ignoring the one user and guessing that movies 2,3,4, and 5 are just average movies might provide a better prediction. The general idea of penalized regression is to control the total variability of the movie effects: $\sum_{j=1}^5 \beta_j^2$.
Specifically, instead of minimizing the least squares equation, we minimize an equation that adds a penalty:
+The general idea behind regularization is to constrain the total variability of the effect sizes. Why does this help? Consider a case in which we have movie $j=1$ with 100 user ratings and 4 movies $j=2,3,4,5$ with just one user rating. Suppose we know the average rating is, say, $\mu = 3$. If we use least squares, the estimate for the first movie effect is the average of 100 user ratings, which we expect to be quite precise. However, the estimate for movies 2, 3, 4, and 5 will be based on one observation. Note that because the average is based on a single observation, the error for $j=2,3,4,5$ is 0, but we don't expect to be this lucky next time, when asked to predict. In fact, ignoring the one user and guessing that movies 2, 3, 4, and 5 are just average movies might provide a better prediction. The general idea of penalized regression is to control the total variability of the movie effects: $\sum_{j=1}^5 \beta_j^2$. Specifically, instead of minimizing the least squares equation, we minimize an equation that adds a penalty:
$$
-\sum_{i,j} \left(y_{u,i} - \mu - \alpha_i - \beta_j \right)^2 + \lambda \sum_{j} \beta_j^2
+\sum_{i,j} \left(y_{i,j} - \mu - \alpha_i - \beta_j \right)^2 + \lambda \sum_{j} \beta_j^2
$$
-The first term is just the sum of squares and the second is a penalty that gets larger when many $\beta_i$s are large. Using calculus we can actually show that the values of $\beta_i$ that minimize this equation are:
+The first term is just the sum of squares and the second is a penalty that gets larger when many $\beta_j$s are large. Using calculus, we can actually show that the values of $\beta_j$ that minimize this equation are:
$$
-\hat{\beta}_j(\lambda) = \frac{1}{\lambda + n_j} \sum_{i=1}^{n_i} \left(Y_{i,j} - \mu - \alpha_i\right)
+\hat{\beta}_j(\lambda) = \frac{1}{\lambda + n_j} \sum_{i=1}^{n_j} \left(Y_{i,j} - \mu - \alpha_i\right)
$$
where $n_j$ is the number of ratings made for movie $j$.
:::{.callout-note}
-When we estimate the parameters of a linear model with penalized least squares, we refer to the approach as _ridge regression_. The `lm.ridge` function in the **MASS** package can perform the estimation. We don't use it here due to the large numbers of parameters associated with movie effects.
+When we estimate the parameters of a linear model with penalized least squares, we refer to the approach as _ridge regression_. The `lm.ridge` function in the **MASS** package can perform the estimation. We don't use it here due to the large number of parameters associated with the movie effects.
:::
-This approach will have our desired effect: when our sample size $n_j$ is very large, we obtain a stable estimate and the penalty $\lambda$ is effectively ignored since $n_j+\lambda \approx n_j$. However, when the $n_j$ is small, then the estimate $\hat{\beta}_i(\lambda)$ is shrunken towards 0. The larger the $\lambda$, the more we shrink.
+This approach will have our desired effect: when our sample size $n_j$ is very large, we obtain a stable estimate and the penalty $\lambda$ is effectively ignored since $n_j+\lambda \approx n_j$. However, when $n_j$ is small, the estimate $\hat{\beta}_j(\lambda)$ is shrunken towards 0. The larger the $\lambda$, the more we shrink.
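+To see the effect of the penalty numerically, here is a small sketch with made-up values (an average residual of 2 per rating and $\lambda = 3$), just to illustrate the formula above:
+```{r}
+# With few ratings the estimate is pulled strongly towards 0; with many
+# ratings the penalty becomes negligible.
+n_j <- c(1, 10, 100, 1000)
+data.frame(n_j = n_j, shrunk_estimate = 2*n_j/(n_j + 3))
+```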
Here we will simply compute the RMSE for different values of $\lambda$ to illustrate the effect:

```{r}
n <- colSums(!is.na(y))
@@ -368,7 +365,7 @@ plot(lambdas, rmses, type = "l")

The minimum is obtained for $\lambda=$ `r lambdas[which.min(rmses)]`.

-Using this$\lambda$ we can compute the regularized estimates and add to our table of estimates:
+Using this $\lambda$, we can compute the regularized estimates and add to our table of estimates:

```{r}
lambda <- lambdas[which.min(rmses)]
@@ -401,7 +398,7 @@ filter(test_set, movieId %in% names(b)[ind]) |>

These make more sense with some movies that are watched more and have more ratings in the training set.

:::{.callout-note}
-Notice Swinger has a lower rating than the other top 10, yet a large movie effect estimate. This is due to the fact that it was rated by harsher users
+Notice _Swinger_ has a lower rating than the other top 10, yet a large movie effect estimate. This is due to the fact that it was rated by harsher users.
:::

Note that regularization improves our RMSE:

@@ -445,12 +442,12 @@ if (knitr::is_html_output()) {

## Exercises

-1\. For the `movielens` data, compute the number of ratings for each movie and then plot it against the year the movie came out. Use the square root transformation on the counts.
+1\. For the `movielens` data, compute the number of ratings for each movie and then plot it against the year the movie was released. Use the square root transformation on the counts.

-2\. We see that, on average, movies that came out after 1993 get more ratings. We also see that with newer movies, starting in 1993, the number of ratings decreases with year: the more recent a movie is, the less time users have had to rate it.
+2\. We see that, on average, movies that were released after 1993 get more ratings. We also see that with newer movies, starting in 1993, the number of ratings decreases with year: the more recent a movie is, the less time users have had to rate it.

-Among movies that came out in 1993 or later, what are the 25 movies with the most ratings per year? Also report their average rating.
+Among movies that came out in 1993 or later, what are the 25 movies with the most ratings per year? Also, report their average rating.

3\. From the table constructed in the previous example, we see that the most rated movies tend to have above average ratings. This is not surprising: more people watch popular movies. To confirm this, stratify the post 1993 movies by ratings per year and compute their average ratings. Make a plot of average rating versus ratings per year and show an estimate of the trend.

@@ -546,13 +543,13 @@ Let's use regularization to pick the best schools. Remember regularization _shri

overall <- mean(sapply(scores, mean))
```

-and then define, for each school, how it deviates from that average. Write code that estimates the score above average for each school but dividing by $n + \lambda$ instead of $n$, with $n$ the school size and $\lambda$ a regularization parameter. Try $\lambda = 3$.
+and then define, for each school, how it deviates from that average. Write code that estimates the score above average for each school, but dividing by $n + \lambda$ instead of $n$, with $n$ the school size and $\lambda$ a regularization parameter. Try $\lambda = 3$.

-15\. Notice that this improves things a bit. The number of small schools that are not highly ranked is now 4. Is there a better $\lambda$? Find the $\lambda$ that minimizes the RMSE = $1/100 \sum_{i=1}^{100} (\mbox{quality} - \mbox{estimate})^2$.
+15\. 
Notice that this improves things a bit. The number of small schools that are not highly ranked is now 4. Is there a better $\lambda$? Find the $\lambda$ that minimizes the RMSE = $1/100 \sum_{i=1}^{100} (\mbox{quality} - \mbox{estimate})^2$. 16\. Rank the schools based on the average obtained with the best $\alpha$. Note that no small school is incorrectly included. -17\. A common mistake to make when using regularization is shrinking values towards 0 that are not centered around 0. For example, if we don't subtract the overall average before shrinking, we actually obtain a very similar result. Confirm this by re-running the code from exercise 6 but without removing the overall mean. +17\. A common mistake to make when using regularization is shrinking values towards 0 that are not centered around 0. For example, if we don't subtract the overall average before shrinking, we actually obtain a very similar result. Confirm this by re-running the code from exercise 6, but without removing the overall mean. diff --git a/ml/algorithms.qmd b/ml/algorithms.qmd index 7444440..18f6f40 100644 --- a/ml/algorithms.qmd +++ b/ml/algorithms.qmd @@ -1,331 +1,56 @@ -# Examples of algorithms +# Examples of algorithms {#sec-example-alogirhms} -There are dozens of machine learning algorithms. Here we provide a few examples spanning rather different approaches. Throughout the chapter we will be using the two predictor digits data introduced in Section @sec-two-or-seven to demonstrate how the algorithms work. +There are hundreds of machine learning algorithms. Here we provide a few examples spanning rather different approaches. Throughout the chapter we will be using the two predictor digits data introduced in @sec-two-or-seven to demonstrate how the algorithms work. We focus on the concepts and ideas behind the algorithms using illustrative datasets from the **dslabs** package. -```{r warning = FALSE, message = FALSE} +```{r warning = FALSE, message = FALSE, cache=FALSE} library(tidyverse) -library(dslabs) library(caret) +library(dslabs) ``` - -## Linear regression - -Linear regression can be considered a machine learning algorithm. In Section @sec-two-or-seven we demonstrated how linear regression can be too rigid to be useful. This is generally true, but for some challenges it works rather well. It also serves as a baseline approach: if you can't beat it with a more complex approach, you probably want to stick to linear regression. To quickly make the connection between regression and machine learning, we will reformulate Galton's study with heights, a continuous outcome. - -```{r, message = FALSE, warning = FALSE} -library(HistData) -set.seed(1983) -galton_heights <- GaltonFamilies |> - filter(gender == "male") |> - group_by(family) |> - sample_n(1) |> - ungroup() |> - select(father, childHeight) |> - rename(son = childHeight) -``` - -Suppose you are tasked with building a machine learning algorithm that predicts the son's height $Y$ using the father's height $X$. Let's generate testing and training sets: - -```{r, message = FALSE, warning = FALSE} -y <- galton_heights$son -test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE) -train_set <- galton_heights |> slice(-test_index) -test_set <- galton_heights |> slice(test_index) -``` - -In this case, if we were just ignoring the father's height and guessing the son's height, we would guess the average height of sons. 
- -```{r} -m <- mean(train_set$son) -m -``` - -Our root mean squared error is: - -```{r} -sqrt(mean((m - test_set$son)^2)) -``` - -Can we do better? In the regression chapter, we learned that if the pair $(X,Y)$ follow a bivariate normal distribution, the conditional expectation (what we want to estimate) is equivalent to the regression line: - -$$ -f(x) = \mbox{E}( Y \mid X = x ) = \beta_0 + \beta_1 x -$$ - -In Section @sec-lse we introduced least squares as a method for estimating the slope $\beta_0$ and intercept $\beta_1$: - -```{r} -fit <- lm(son ~ father, data = train_set) -fit$coef -``` - -This gives us an estimate of the conditional expectation: - -$$ \hat{f}(x) = 35 + 0.5 x $$ - -We can see that this does indeed provide an improvement over our guessing approach. - -```{r} -y_hat <- fit$coef[1] + fit$coef[2]*test_set$father -sqrt(mean((y_hat - test_set$son)^2)) -``` - - -### The `predict` function - -The `predict` function is very useful for machine learning applications. This function takes a fitted object from functions such as `lm` or `glm` (we learn about `glm` soon) and a data frame with the new predictors for which to predict. So in our current example, we would use `predict` like this: - -```{r} -y_hat <- predict(fit, test_set) -``` - -Using `predict`, we can get the same results as we did previously: - -```{r} -y_hat <- predict(fit, test_set) -sqrt(mean((y_hat - test_set$son)^2)) -``` - -`predict` does not always return objects of the same types; it depends on what type of object is sent to it. To learn about the specifics, you need to look at the help file specific for the type of fit object that is being used. The `predict` is actually a special type of function in R (called a _generic function_) that calls other functions depending on what kind of object it receives. So if `predict` receives an object coming out of the `lm` function, it will call `predict.lm`. If it receives an object coming out of `glm`, it calls `predict.glm`. These two functions are similar but different. You can learn more about the differences by reading the help files: - -```{r, eval = FALSE} -?predict.lm -?predict.glm -``` - -There are many other versions of `predict` and many machine learning algorithms have a `predict` function. - -:::{.callout-note} -You are ready to do exercises 1 - 8. -::: - - +Then in @sec-ml-in-practice we show an efficient way to implement these ideas using the **caret** package. ## Logistic regression -The regression approach can be extended to categorical data. In this section we first illustrate how, for binary data, one can simply assign numeric values of 0 and 1 to the outcomes $y$, and apply regression as if the data were continuous. We will then point out a limitation with this approach and introduce _logistic regression_ as a solution. Logistic regression is a specific case of a set of _generalized linear models_. To illustrate logistic regression, we will apply it to our previous predicting sex example defined in Section @sec-training-test. - -```{r, echo = FALSE} - -y <- heights$sex - -set.seed(2007) -test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE) -test_set <- heights[test_index, ] -train_set <- heights[-test_index, ] -``` - -If we define the outcome $Y$ as 1 for females and 0 for males, and $X$ as the height, we are interested in the conditional probability: - -$$ -\mbox{Pr}( Y = 1 \mid X = x) -$$ - -As an example, let's provide a prediction for a student that is 66 inches tall. 
What is the conditional probability of being female if you are 66 inches tall? In our dataset, we can estimate this by rounding to the nearest inch and computing: - -```{r} -train_set |> - filter(round(height) == 66) |> - summarize(y_hat = mean(sex == "Female")) -``` - -To construct a prediction algorithm, we want to estimate the proportion of the population that is female for any given height $X = x$, which we write as the conditional probability described above: $\mbox{Pr}( Y = 1 | X = x)$. Let's see what this looks like for several values of $x$ (we will remove strata of $x$ with few data points): - -```{r height-and-sex-conditional-probabilities} -heights |> - mutate(x = round(height)) |> - group_by(x) |> - filter(n() >= 10) |> - summarize(prop = mean(sex == "Female")) |> - ggplot(aes(x, prop)) + - geom_point() -``` - -Since the results from the plot above look close to linear, and it is the only approach we currently know, we will try regression. We assume that: - -$$p(x) = \mbox{Pr}( Y = 1 | X = x) = \beta_0 + \beta_1 x$$ - -Note: because $p_0(x) = 1 - p_1(x)$, we will only estimate $p_1(x)$ and drop the $_1$ index. - -If we convert the factors to 0s and 1s, we can estimate $\beta_0$ and $\beta_1$ with least squares. - -```{r} -lm_fit <- mutate(train_set, y = as.numeric(sex == "Female")) |> - lm(y ~ height, data = _) -``` - - -Once we have estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, we can obtain an actual prediction. Our estimate of the conditional probability $p(x)$ is: +In @sec-two-or-seven we used linear regression to predict classes by fitting the model $$ -\hat{p}(x) = \hat{\beta}_0+ \hat{\beta}_1 x +p(\mathbf{x}) = \mbox{Pr}(Y=1 \mid X_1=x_1 , X_2 = x_2) = +\beta_0 + \beta_1 x_1 + \beta_2 x_2 $$ - -To form a prediction, we define a _decision rule_: predict female if $\hat{p}(x) > 0.5$. We can compare our predictions to the outcomes using: - -```{r} -p_hat <- predict(lm_fit, test_set) -y_hat <- ifelse(p_hat > 0.5, "Female", "Male") |> factor() -confusionMatrix(y_hat, test_set$sex)$overall[["Accuracy"]] -``` - -We see this method does substantially better than guessing. - -### Generalized linear models - -The function $\beta_0 + \beta_1 x$ can take any value including negatives and values larger than 1. In fact, the estimate $\hat{p}(x)$ computed in the linear regression section does indeed become negative. - -```{r regression-prediction} -heights |> - mutate(x = round(height)) |> - group_by(x) |> - filter(n() >= 10) |> - summarize(prop = mean(sex == "Female")) |> - ggplot(aes(x, prop)) + - geom_point() + - geom_abline(intercept = lm_fit$coef[1], slope = lm_fit$coef[2]) -``` - -The range is: +using least squares after assigning numeric values of 0 and 1 to the outcomes $y$, and applied regression as if the data were continuous. A obvious problem with this approach is that $\hat{p}(\mathbf{x})$ can be negative and larger than 1: ```{r} -range(p_hat) +fit_lm <- lm(y ~ x_1 + x_2, data = mutate(mnist_27$train,y = ifelse(y == 7, 1, 0))) +range(fit_lm$fitted) ``` -But we are estimating a probability: $\mbox{Pr}( Y = 1 \mid X = x)$ which is constrained between 0 and 1. - -The idea of generalized linear models (GLM) is to 1) define a distribution of $Y$ that is consistent with it's possible outcomes and -2) find a function $g$ so that $g(\mbox{Pr}( Y = 1 \mid X = x))$ can be modeled as a linear combination of predictors. -Logistic regression is the most commonly used GLM. 
It is an extension of linear regression that assures that the estimate of $\mbox{Pr}( Y = 1 \mid X = x)$ is between 0 and 1. This approach makes use of the _logistic_ transformation defined as: +To avoid this we can apply the approach described in @sec-glm that is more appropriate for binary data. We write the model like this: -$$ g(p) = \log \frac{p}{1-p}.$$ - -This logistic transformation converts probability to log odds. As discussed in the data visualization lecture, the odds tell us how much more likely it is something will happen compared to not happening. $p = 0.5$ means the odds are 1 to 1, thus the odds are 1. If $p = 0.75$, the odds are 3 to 1. A nice characteristic of this transformation is that it converts probabilities to be symmetric around 0. Here is a plot of $g(p)$ versus $p$: - -```{r p-versus-logistic-of-p, echo = FALSE} -p <- seq(0.01,.99,len = 100) -qplot(p, log( p/(1 - p) ), geom = "line") -``` - -With _logistic regression_, we model the conditional probability directly with: $$ -g\left\{ \mbox{Pr}(Y = 1 \mid X = x) \right\} = \beta_0 + \beta_1 x +\log \frac{p(\mathbf{x})}{1-p(\mathbf{x})} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 $$ - -With this model, we can no longer use least squares. Instead we compute the _maximum likelihood estimate_ (MLE). You can learn more about this concept in a statistical theory textbook^[http://www.amazon.com/Mathematical-Statistics-Analysis-Available-Enhanced/dp/0534399428]. - -In R, we can fit the logistic regression model with the function `glm`: generalized linear models. This function is more general than logistic regression so we need to specify the model we want through the `family` parameter: - -```{r} -glm_fit <- train_set |> - mutate(y = as.numeric(sex == "Female")) |> - glm(y ~ height, data = _, family = "binomial") -``` - -We can obtain prediction using the predict function: - -```{r} -p_hat_logit <- predict(glm_fit, newdata = test_set, type = "response") -``` - -When using `predict` with a `glm` object, we have to specify that we want `type = "response"` if we want the conditional probabilities, since the default is to return the logistic transformed values. - -This model fits the data slightly better than the line: - -```{r conditional-prob-glm-fit, echo = FALSE } -tmp <- heights |> - mutate(x = round(height)) |> - group_by(x) |> - filter(n() >= 10) |> - summarize(prop = mean(sex == "Female")) -logistic_curve <- data.frame(x = seq(min(tmp$x), max(tmp$x))) |> - mutate(p_hat = plogis(glm_fit$coef[1] + glm_fit$coef[2]*x)) -tmp |> - ggplot(aes(x, prop)) + - geom_point() + - geom_line(data = logistic_curve, - mapping = aes(x, p_hat), lty = 2) -``` - -Because we have an estimate $\hat{p}(x)$, we can obtain predictions: - -```{r} -y_hat_logit <- ifelse(p_hat_logit > 0.5, "Female", "Male") |> factor() -confusionMatrix(y_hat_logit, test_set$sex)$overall[["Accuracy"]] -``` - -The resulting predictions are similar. 
This is because the two estimates of $p(x)$ are larger than 1/2 in about the same region of x:

-```{r glm-prediction, echo = FALSE}
-data.frame(x = seq(min(tmp$x), max(tmp$x))) |>
- mutate(logistic = plogis(glm_fit$coef[1] + glm_fit$coef[2]*x),
- regression = lm_fit$coef[1] + lm_fit$coef[2]*x) |>
- gather(method, p_x, -x) |>
- ggplot(aes(x, p_x, color = method)) +
- geom_line() +
- geom_hline(yintercept = 0.5, lty = 5)
-```
-
-Both linear and logistic regressions provide an estimate for the conditional expectation:
-
-$$
-\mbox{E}(Y \mid X = x)
-$$
-which in the case of binary data is equivalent to the conditional probability:
-
-$$
-\mbox{Pr}(Y = 1 \mid X = x)
-$$
-
-
-### Logistic regression with more than one predictor
-
-In this section we apply logistic regression to the two or seven data introduced in Section @sec-two-or-seven. In this case, we are interested in estimating a conditional probability that depends on two variables. The standard logistic regression model in this case will assume that
-
-$$
-g\{p(x_1, x_2)\}= g\{\mbox{Pr}(Y = 1 \mid X_1 = x_1 , X_2 = x_2)\} =
-\beta_0 + \beta_1 x_1 + \beta_2 x_2
-$$
-
-with $g(p) = \log \frac{p}{1-p}$ the logistic function described in the previous section. To fit the model we use the following code:
-
-```{r}
+```{r, echo=FALSE}
fit_glm <- glm(y ~ x_1 + x_2, data = mnist_27$train, family = "binomial")
p_hat_glm <- predict(fit_glm, mnist_27$test, type = "response")
y_hat_glm <- factor(ifelse(p_hat_glm > 0.5, 7, 2))
-confusionMatrix(y_hat_glm, mnist_27$test$y)$overall["Accuracy"]
```

-Comparing to the results we obtained in Section @sec-two-or-seven, we see that logistic regression performs similarly to regression.
-This is not surprising, given that the estimate of $\hat{p}(x_1, x_2)$ looks similar as well:
+We can then find the _maximum likelihood estimates_ (MLE) of the model parameters and predict using the estimate $\hat{p}(\mathbf{x})$ to obtain an accuracy of `r confusionMatrix(y_hat_glm, mnist_27$test$y)$overall["Accuracy"]`. We see that logistic regression performs similarly to regression. 
This is not surprising, given that the estimate of $\hat{p}(\mathbf{x})$ looks similar as well:

-```{r, echo = FALSE}
-# We use this function to plot the estimated conditional probabilities
-plot_cond_prob <- function(p_hat = NULL){
- tmp <- mnist_27$true_p
- if(!is.null(p_hat)){
- tmp <- mutate(tmp, p = p_hat)
- }
- tmp |> ggplot(aes(x_1, x_2, z = p, fill = p)) +
- geom_raster(show.legend = FALSE) +
- scale_fill_gradientn(colors = c("#F8766D","white","#00BFC4")) +
- stat_contour(breaks = c(0.5),color = "black")
-}
-```

-```{r logistic-p-hat}
+```{r logistic-p-hat, echo = FALSE}
p_hat <- predict(fit_glm, newdata = mnist_27$true_p, type = "response")
mnist_27$true_p |> mutate(p_hat = p_hat) |>
- ggplot(aes(x_1, x_2, z = p_hat, fill = p_hat)) +
- geom_raster() +
+ ggplot(aes(x_1, x_2, z = p_hat)) +
+ geom_raster(aes(fill = p_hat)) +
 scale_fill_gradientn(colors = c("#F8766D","white","#00BFC4")) +
 stat_contour(breaks = c(0.5), color = "black")
```

-Just like regression, the decision rule is a line, a fact that can be corroborated mathematically since
+Just like regression, the decision rule is a line, a fact that can be corroborated mathematically: defining $g(x) = \log \{x/(1-x)\}$, we have:

$$
g^{-1}(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2) = 0.5 \implies
@@ -333,14 +58,14 @@ g^{-1}(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2) = 0.5 \implies
x_2 = -\hat{\beta}_0/\hat{\beta}_2 -\hat{\beta}_1/\hat{\beta}_2 x_1
$$

-Thus $x_2$ is a linear function of $x_1$. This implies that, just like regression, our logistic regression approach has no chance of capturing the non-linear nature of the true $p(x_1,x_2)$. Once we move on to more complex examples, we will see that linear regression and generalized linear regression are limited and not flexible enough to be useful for most machine learning challenges. The new techniques we learn are essentially approaches to estimating the conditional probability in a way that is more flexible.
-
+Thus, just like with regression, $x_2$ is a linear function of $x_1$. This implies that our logistic regression approach has no chance of capturing the non-linear nature of the true $p(\mathbf{x})$. We now describe some techniques that estimate the conditional probability in a way that is more flexible.

:::{.callout-note}
-You can now do exeercises 9 - 11.
+You are ready to do exercises 1 - 11.
:::

+

## k-nearest neighbors

```{r, echo = FALSE, warning = FALSE, message = FALSE}
@@ -355,35 +80,18 @@ plot_cond_prob <- function(p_hat = NULL){
 if (!is.null(p_hat)) {
 tmp <- mutate(tmp, p = p_hat)
 }
- tmp |> ggplot(aes(x_1, x_2, z = p, fill = p)) +
- geom_raster(show.legend = FALSE) +
+ tmp |> ggplot(aes(x_1, x_2, z = p)) +
+ geom_raster(aes(fill = p), show.legend = FALSE) +
 scale_fill_gradientn(colors = c("#F8766D","white","#00BFC4")) +
 stat_contour(breaks = c(0.5),color = "black")
}
```

-We introduced the kNN algorithm in Section @sec-knn-cv-intro) and demonstrated how we use cross validation to pick $k$ in Section \@ref(caret-cv). Here we quickly review how we fit a kNN model using the **caret** package. 
In Section \@ref(caret-cv we introduced the following code to fit a kNN model:
-
-```{r}
-train_knn <- train(y ~ ., method = "knn",
- data = mnist_27$train,
- tuneGrid = data.frame(k = seq(9, 71, 2)))
+```{r, echo=FALSE}
+train_knn <- knn3(y ~ ., k = 31, data = mnist_27$train)
```

-We saw that the parameter that maximized the estimated accuracy was:
-
-```{r}
-train_knn$bestTune
-```
-
-This model improves the accuracy over regression and logistic regression:
-
-```{r}
-confusionMatrix(predict(train_knn, mnist_27$test, type = "raw"),
- mnist_27$test$y)$overall["Accuracy"]
-```
-
-A plot of the estimated conditional probability shows that the kNN estimate is flexible enough and does indeed capture the shape of the true conditional probability.
+We introduced the kNN algorithm in @sec-knn-cv-intro. In @sec-mse-estimates we noted that $k=31$ provided the highest accuracy in the test set. Using $k=31$ we obtain an accuracy of `r confusionMatrix(predict(train_knn, mnist_27$test, type = "class"),mnist_27$test$y)$overall["Accuracy"]`, an improvement over regression. A plot of the estimated conditional probability shows that the kNN estimate is flexible enough and does indeed capture the shape of the true conditional probability.

```{r best-knn-fit, echo = FALSE, out.width = "100%"}
p1 <- plot_cond_prob() + ggtitle("True conditional probability")
@@ -399,10 +107,9 @@ grid.arrange(p2, p1, nrow = 1)

You are ready to do exercises 12 - 13.
:::

-
## Generative models

-We have described how, when using squared loss, the conditional expectation/probabilities provide the best approach to developing a decision rule. In a binary case, the smallest true error we can achieve is determined by Bayes' rule, which is a decision rule based on the true conditional probability:
+We have described how, when using squared loss, the conditional expectation provides the best approach to developing a decision rule. In a binary case, the smallest true error we can achieve is determined by Bayes' rule, which is a decision rule based on the true conditional probability:

$$
p(\mathbf{x}) = \mbox{Pr}(Y = 1 \mid \mathbf{X}=\mathbf{x})
$$

@@ -421,19 +128,15 @@ p(\mathbf{x}) = \mbox{Pr}(Y = 1|\mathbf{X}=\mathbf{x}) = \frac{f_{\mathbf{X}|Y =
{ f_{\mathbf{X}|Y = 0}(\mathbf{x})\mbox{Pr}(Y = 0) + f_{\mathbf{X}|Y = 1}(\mathbf{x})\mbox{Pr}(Y = 1) }
$$

+with $f_{\mathbf{X}|Y = 1}$ and $f_{\mathbf{X}|Y = 0}$ representing the distribution functions of the predictor $\mathbf{X}$ for the two classes $Y = 1$ and $Y = 0$. The formula implies that if we can estimate these conditional distributions of the predictors, we can develop a powerful decision rule. However, this is a big _if_.

-with $f_{\mathbf{X}|Y = 1}$ and $f_{\mathbf{X}|Y = 0}$ representing the distribution functions of the predictor $\mathbf{X}$ for the two classes $Y = 1$ and $Y = 0$. The formula implies that if we can estimate these conditional distributions of the predictors, we can develop a powerful decision rule. However, this is a big _if_. As we go forward, we will encounter examples in which $\mathbf{X}$ has many dimensions and we do not have much information about the distribution. In these cases, Naive Bayes will be practically impossible to implement. However, there are instances in which we have a small number of predictors (not much more than 2) and many categories in which generative models can be quite powerful. We describe two specific examples and use our previously described case studies to illustrate them. 
+As we go forward, we will encounter examples in which $\mathbf{X}$ has many dimensions and we do not have much information about the distribution. In these cases, Naive Bayes will be practically impossible to implement. However, there are instances in which we have a small number of predictors (not much more than 2) and many categories in which generative models can be quite powerful. We describe two specific examples and use our previously described case studies to illustrate them. Let's start with a very simple and uninteresting, yet illustrative, case: the example related to predicting sex from height. ```{r, warning = FALSE, message = FALSE} -library(tidyverse) -library(caret) - -library(dslabs) - -y <- heights$height set.seed(1995) +y <- heights$height test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE) train_set <- heights |> slice(-test_index) test_set <- heights |> slice(test_index) @@ -442,9 +145,7 @@ test_set <- heights |> slice(test_index) In this case, the Naive Bayes approach is particularly appropriate because we know that the normal distribution is a good approximation for the conditional distributions of height given sex for both classes $Y = 1$ (female) and $Y = 0$ (male). This implies that we can approximate the conditional distributions $f_{X|Y = 1}$ and $f_{X|Y = 0}$ by simply estimating averages and standard deviations from the data: ```{r} -params <- train_set |> - group_by(sex) |> - summarize(avg = mean(height), sd = sd(height)) +params <- train_set |> group_by(sex) |> summarize(avg = mean(height), sd = sd(height)) params ``` @@ -459,16 +160,14 @@ Now we can use our estimates of average and standard deviation to get an actual ```{r} x <- test_set$height - f0 <- dnorm(x, params$avg[2], params$sd[2]) f1 <- dnorm(x, params$avg[1], params$sd[1]) - p_hat_bayes <- f1*pi / (f1*pi + f0*(1 - pi)) ``` -Our Naive Bayes estimate $\hat{p}(x)$ looks a lot like our logistic regression estimate: +Our Naive Bayes estimate $\hat{p}(x)$ looks a lot like a logistic regression estimate: -```{r conditional-prob-glm-fit-2, echo = FALSE } +```{r conditional-prob-glm-fit-2, echo = FALSE} tmp <- heights |> mutate(x = round(height)) |> group_by(x) |> @@ -498,7 +197,7 @@ $$ { \hat{f}_{X|Y = 0}(x)(1-\hat{\pi}) + \hat{f}_{X|Y = 1}(x)\hat{\pi} } $$ -As we discussed earlier, our sample has a much lower prevalence, `r signif(pi,2)`, than the general population. So if we use the rule $\hat{p}(x)>0.5$ to predict females, our accuracy will be affected due to the low sensitivity: +As we discussed earlier, our sample has a much lower prevalence, `r signif(pi,2)`, than the general population. 
So if we use the rule $\hat{p}(x) > 0.5$ to predict females, our accuracy will be affected due to the low sensitivity:

```{r}
y_hat_bayes <- ifelse(p_hat_bayes > 0.5, "Female", "Male")
@@ -517,7 +216,7 @@ The Naive Bayes approach gives us a direct way to correct this since we can simp

```{r}
p_hat_bayes_unbiased <- f1 * 0.5 / (f1 * 0.5 + f0 * (1 - 0.5))
-y_hat_bayes_unbiased <- ifelse(p_hat_bayes_unbiased> 0.5, "Female", "Male")
+y_hat_bayes_unbiased <- ifelse(p_hat_bayes_unbiased > 0.5, "Female", "Male")
```

Note the difference in sensitivity with a better balance:

@@ -530,18 +229,15 @@ specificity(factor(y_hat_bayes_unbiased), factor(test_set$sex))

The new rule also gives us a very intuitive cutoff between 66-67, which is about the middle of the female and male average heights:

```{r naive-with-good-prevalence}
-qplot(x, p_hat_bayes_unbiased, geom = "line") +
- geom_hline(yintercept = 0.5, lty = 2) +
- geom_vline(xintercept = 67, lty = 2)
+plot(x, p_hat_bayes_unbiased)
+abline(h = 0.5, lty = 2)
+abline(v = 67, lty = 2)
```

### Quadratic discriminant analysis

Quadratic Discriminant Analysis (QDA) is a version of Naive Bayes in which we assume that the distributions $f_{\mathbf{X}|Y = 1}(\mathbf{x})$ and $f_{\mathbf{X}|Y = 0}(\mathbf{x})$ are multivariate normal. The simple example we described in the previous section is actually QDA. Let's now look at a slightly more complicated case: the 2 or 7 example.

-```{r}
-```
-
In this case, we have two predictors so we assume each one is bivariate normal. This implies that we need to estimate two averages, two standard deviations, and a correlation for each case $Y = 1$ and $Y = 0$. Once we have these, we can approximate the distributions $f_{X_1,X_2|Y = 1}$ and $f_{X_1, X_2|Y = 0}$. We can easily estimate parameters from the data:

```{r}
@@ -553,57 +249,49 @@ params <- mnist_27$train |>
params
```

-Here we provide a visual way of showing the approach. We plot the data and use contour plots to give an idea of what the two estimated normal densities look like (we show the curve representing a region that includes 95% of the points):
+With these estimates in place, all we need is the prevalence $\pi$ to compute
+
+$$
+\hat{p}(\mathbf{x})= \frac{\hat{f}_{\mathbf{X}|Y = 1}(\mathbf{x}) \hat{\pi}}
+{ \hat{f}_{\mathbf{X}|Y = 0}(\mathbf{x})(1-\hat{\pi}) + \hat{f}_{\mathbf{X}|Y = 1}(\mathbf{x})\hat{\pi} }
+$$
+
+Note that the densities $f$ are bivariate normal distributions. Here we provide a visual way of showing the approach. We plot the data and use contour plots to give an idea of what the two estimated normal densities look like (we show the curve representing a region that includes 95% of the points):

-```{r qda-explained}
+```{r qda-explained, echo=FALSE}
mnist_27$train |> mutate(y = factor(y)) |>
 ggplot(aes(x_1, x_2, fill = y, color = y)) +
 geom_point(show.legend = FALSE) +
 stat_ellipse(type = "norm", lwd = 1.5)
```

-This defines the following estimate of $f(x_1, x_2)$. 
-
-We can use the `train` function from the __caret__ package to fit the model and obtain predictors:
-
-```{r}
-library(caret)
-train_qda <- train(y ~ ., method = "qda", data = mnist_27$train)
```
-We see that we obtain relatively good accuracy:
+We can fit QDA using the `qda` function from the **MASS** package:

+```{r, message=FALSE, cache=FALSE}
+train_qda <- MASS::qda(y ~ ., data = mnist_27$train)
+y_hat <- predict(train_qda, mnist_27$test)$class
```

+We see that we obtain relatively good accuracy:

```{r}
-y_hat <- predict(train_qda, mnist_27$test)
-confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"]
+confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"]
```

-The estimated conditional probability looks relatively good, although it does not fit as well as the kernel smoothers:
+The conditional probability looks relatively good, although it does not fit as well as the kernel smoothers:

-```{r, echo = FALSE}
-plot_cond_prob <- function(p_hat = NULL){
- tmp <- mnist_27$true_p
- if(!is.null(p_hat)){
- tmp <- mutate(tmp, p = p_hat)
- }
- tmp |> ggplot(aes(x_1, x_2, z = p, fill = p)) +
- geom_raster(show.legend = FALSE) +
- scale_fill_gradientn(colors = c("#F8766D","white","#00BFC4")) +
- stat_contour(breaks = c(0.5),color = "black")
-}
-```

```{r qda-estimate, echo = FALSE, out.width = "100%", warning = FALSE, message = FALSE}
library(gridExtra)
p1 <- plot_cond_prob() + ggtitle("True conditional probability")
-p2 <- plot_cond_prob(predict(train_qda, newdata = mnist_27$true_p, type = "prob")[,2]) +
+p2 <- plot_cond_prob(predict(train_qda, newdata = mnist_27$true_p, type = "prob")$posterior[,2]) +
 ggtitle("QDA")
grid.arrange(p2, p1, nrow = 1)
```

-One reason QDA does not work as well as the kernel methods is perhaps because the assumption of normality does not quite hold. Although for the 2s it seems reasonable, for the 7s it does seem to be off. Notice the slight curvature in the points for the 7s:
+One reason QDA does not work as well as the kernel methods is that the assumption of normality does not quite hold. Although for the 2s it seems reasonable, for the 7s it does seem to be off. Notice the slight curvature in the points for the 7s:

```{r qda-does-not-fit, out.width = "100%"}
mnist_27$train |> mutate(y = factor(y)) |>
@@ -613,19 +301,14 @@ mnist_27$train |> mutate(y = factor(y)) |>
 facet_wrap(~y)
```

-
-QDA can work well here, but it becomes harder to use as the number of predictors increases. Here we have 2 predictors and had to compute 4 means, 4 SDs, and 2 correlations. How many parameters would we have if instead of 2 predictors, we had 10?
-The main problem comes from estimating correlations for 10 predictors. With 10, we have 45 correlations for each class. In general, the formula is $K\times p(p-1)/2$, which gets big fast. Once the number of parameters approaches the size of our data, the method becomes impractical due to overfitting. 
+QDA can work well here, but it becomes harder to use as the number of predictors increases. Here we have 2 predictors and had to compute 4 means, 4 SDs, and 2 correlations. Notice that if we have 10 predictors, we have 45 correlations for each class. In general, the formula is $K\times p(p-1)/2$, which gets big fast. Once the number of parameters approaches the size of our data, the method becomes impractical due to overfitting. 
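
To see how quickly this count grows, here is a quick sketch that simply evaluates the formula above for a few values of $p$ (the helper function `n_correlations` is ours, introduced only for illustration):

```{r}
## number of correlation parameters estimated with K classes and p predictors
n_correlations <- function(p, K = 2) K*p*(p - 1)/2
n_correlations(p = c(2, 10, 100, 784))
```

With $p = 784$, as in the digits data, the count is already in the hundreds of thousands.
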
### Linear discriminant analysis

-A relatively simple solution to the problem of having too many parameters is to assume that the correlation structure is the same for all classes, which reduces the number of parameters we need to estimate.
-
-In this case, we would compute just one pair of standard deviations and one correlation,
-
-and the distributions looks like this:

-```{r lda-explained, echo = FALSE}
tmp <- lapply(1:2, function(i){
 with(params[i,], MASS::mvrnorm(1000, mu = c(avg_1, avg_2), Sigma = matrix(c(sd_1^2, sd_1*sd_2*r, sd_1*sd_2*r, sd_2^2), 2, 2))) |>
 as.data.frame() |>
@@ -652,13 +330,16 @@ mnist_27$train |> mutate(y = factor(y)) |>
 stat_ellipse(aes(x_1, x_2, color = y), data = tmp, type = "norm", lwd = 1.5)
```

-Now the size of the ellipses as well as the angle are the same. This is because they have the same standard deviations and correlations.
+We can fit LDA using the **MASS** `lda` function:
+
+```{r, echo=FALSE}
+train_lda <- MASS::lda(y ~ ., data = mnist_27$train)
+y_hat <- predict(train_lda, mnist_27$test)$class
+```

-We can fit the LDA model using __caret__:
+Now the size of the ellipses as well as the angles are the same. This is because they are assumed to have the same standard deviations and correlations. Although this added constraint lowers the number of parameters, the rigidity lowers our accuracy to:

```{r}
-train_lda <- train(y ~ ., method = "lda", data = mnist_27$train)
-y_hat <- predict(train_lda, mnist_27$test)
confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"]
```

@@ -669,7 +350,7 @@ train_lda <- train(y ~ ., method = "lda", data = mnist_27$train)

p1 <- plot_cond_prob() + ggtitle("True conditional probability")

-p2 <- plot_cond_prob(predict(train_lda, newdata = mnist_27$true_p, type = "prob")[,2]) +
+p2 <- plot_cond_prob(predict(train_lda, newdata = mnist_27$true_p, type = "prob")$posterior[,2]) +
 ggtitle("LDA")

grid.arrange(p2, p1, nrow = 1)
@@ -682,7 +363,7 @@ In the case of LDA, the lack of flexibility does not permit us to capture the no

The normal density is:

$$
-p(x) = \frac{1}{\sqrt{2\pi} \sigma} \exp\left\{ - \frac{(x-\mu)^2}{\sigma^2}\right\}
+f(x) = \frac{1}{\sqrt{2\pi} \sigma} \exp\left\{ - \frac{(x-\mu)^2}{\sigma^2}\right\}
$$

If we remove the constant $1/(\sqrt{2\pi} \sigma)$ and then take the log, we get:

@@ -693,131 +374,10 @@ $$

which is the negative of a distance squared scaled by the standard deviation. For higher dimensions, the same is true except the scaling is more complex and involves correlations.

-### Case study: more than three classes
-
-We can generate an example with three categories like this:
-```{r}
-if (!exists("mnist")) mnist <- read_mnist()
-set.seed(3456)
-index_127 <- sample(which(mnist$train$labels %in% c(1,2,7)), 2000)
-y <- mnist$train$labels[index_127]
-x <- mnist$train$images[index_127,]
-index_train <- createDataPartition(y, p = 0.8, list = FALSE)
-## get the quadrants
-row_column <- expand.grid(row = 1:28, col = 1:28)
-upper_left_ind <- which(row_column$col <= 14 & row_column$row <= 14)
-lower_right_ind <- which(row_column$col > 14 & row_column$row > 14)
-## binarize the values. 
Above 200 is ink, below is no ink -x <- x > 200 -## proportion of pixels in lower right quadrant -x <- cbind(rowSums(x[ ,upper_left_ind])/rowSums(x), - rowSums(x[ ,lower_right_ind])/rowSums(x)) -##save data -train_set <- data.frame(y = factor(y[index_train]), - x_1 = x[index_train,1], x_2 = x[index_train,2]) -test_set <- data.frame(y = factor(y[-index_train]), - x_1 = x[-index_train,1], x_2 = x[-index_train,2]) -``` - -Here is the training data: - -```{r mnist-27-training-data} -train_set |> ggplot(aes(x_1, x_2, color = y)) + geom_point() -``` - -We can use the __caret__ package to train the QDA model: - -```{r} -train_qda <- train(y ~ ., method = "qda", data = train_set) -``` - -Now we estimate three conditional probabilities (although they have to add to 1): - -```{r} -predict(train_qda, test_set, type = "prob") |> head() -``` - -Our predictions are one of the three classes: - -```{r} -predict(train_qda, test_set) |> head() -``` - -The confusion matrix is therefore a 3 by 3 table: - -```{r} -confusionMatrix(predict(train_qda, test_set), test_set$y)$table -``` - -The accuracy is `r caret::confusionMatrix(predict(train_qda, test_set), test_set$y)$overall["Accuracy"]` - -Note that for sensitivity and specificity, we have a pair of values for **each** class. To define these terms, we need a binary outcome. We therefore have three columns: one for each class as the positives and the other two as the negatives. - -To visualize what parts of the region are called 1, 2, and 7 we now need three colors: - -```{r three-classes-plot, echo = FALSE} -GS <- 150 -new_x <- expand.grid(x_1 = seq(min(train_set$x_1), max(train_set$x_1), len = GS), - x_2 = seq(min(train_set$x_2), max(train_set$x_2), len = GS)) -new_x |> mutate(y_hat = predict(train_qda, new_x)) |> - ggplot(aes(x_1, x_2, color = y_hat, z = as.numeric(y_hat))) + - geom_point(size = 0.5, pch = 16) + - stat_contour(breaks = c(1.5, 2.5),color = "black") + - guides(colour = guide_legend(override.aes = list(size = 2))) -``` - - -```{r, echo = FALSE} -train_lda <- train(y ~ ., method = "lda", data = train_set) -``` - -The accuracy for LDA, -`r caret::confusionMatrix(predict(train_lda, test_set), test_set$y)$overall["Accuracy"]`, -is much worse because the model is more rigid. This is what the decision rule looks like: - -```{r lda-too-rigid, echo = FALSE} -new_x |> mutate(y_hat = predict(train_lda, new_x)) |> - ggplot(aes(x_1, x_2, color = y_hat, z = as.numeric(y_hat))) + - geom_point(size = 0.5, pch = 16) + - stat_contour(breaks = c(1.5, 2.5),color = "black") + - guides(colour = guide_legend(override.aes = list(size = 2))) -``` - -The results for kNN - -```{r} -train_knn <- train(y ~ ., method = "knn", data = train_set, - tuneGrid = data.frame(k = seq(15, 51, 2))) -``` - -are much better with an accuracy of -`r caret::confusionMatrix(predict(train_knn, test_set), test_set$y)$overall["Accuracy"]`. The decision rule looks like this: - -```{r three-classes-knn-better, echo = FALSE} -new_x |> mutate(y_hat = predict(train_knn, new_x)) |> - ggplot(aes(x_1, x_2, color = y_hat, z = as.numeric(y_hat))) + - geom_point(size = 0.5, pch = 16) + - stat_contour(breaks = c(1.5, 2.5),color = "black") + - guides(colour = guide_legend(override.aes = list(size = 2))) -``` - -Note that one of the limitations of generative models here is due to the lack of fit of the normal assumption, in particular for class 1. 
- -```{r three-classes-lack-of-fit} -train_set |> mutate(y = factor(y)) |> - ggplot(aes(x_1, x_2, fill = y, color = y)) + - geom_point(show.legend = FALSE) + - stat_ellipse(type = "norm") -``` - -Generative models can be very powerful, but only when we are able to successfully approximate the joint distribution of predictors conditioned on each class. - - :::{.callout-note} - You are now ready to do exercises 14-22. - +::: ## Classification and regression trees (CART) {#sec-trees} @@ -830,9 +390,9 @@ img_path <- "img" We described how methods such as LDA and QDA are not meant to be used with many predictors $p$ because the number of parameters that we need to estimate becomes too large. For example, with the digits example $p = 784$, we would have over 600,000 parameters with LDA, and we would multiply that by the number of classes for QDA. Kernel methods such as kNN or local regression do not have model parameters to estimate. However, they also face a challenge when multiple predictors are used due to what is referred to as the *curse of dimensionality*. The *dimension* here refers to the fact that when we have $p$ predictors, the distance between two observations is computed in $p$-dimensional space. -A useful way of understanding the curse of dimensionality is by considering how large we have to make a span/neighborhood/window to include a given percentage of the data. Remember that with larger neighborhoods, our methods lose flexibility. +A useful way of understanding the curse of dimensionality is by considering how large we have to make a span/neighborhood/window to include a given percentage of the data. Remember that with larger neighborhoods, our methods lose flexibility, and to be flexible we need to keep the neighborhoods small. -For example, suppose we have one continuous predictor with equally spaced points in the \[0,1\] interval and we want to create windows that include 1/10th of data. Then it's easy to see that our windows have to be of size 0.1: +To see how this becomes an issue for higher dimensions, suppose we have one continuous predictor with equally spaced points in the [0,1] interval and we want to create windows that include 1/10th of data. Then it's easy to see that our windows have to be of size 0.1: ```{r curse-of-dim, echo = FALSE, out.width = "50%", fig.height = 1.5} rafalib::mypar() @@ -868,10 +428,10 @@ points(x[25],y[25], cex = 0.5, pch = 4) Using the same logic, if we want to include 10% of the data in a three-dimensional space, then the side of each cube is $\sqrt[3]{.10} \approx 0.464$. In general, to include 10% of the data in a case with $p$ dimensions, we need an interval with each side of size $\sqrt[p]{.10}$ of the total. This proportion gets close to 1 quickly, and if the proportion is 1 it means we include all the data and are no longer smoothing. -```{r curse-of-dim-4, message = FALSE, message = FALSE} -library(tidyverse) +```{r curse-of-dim-4, message = FALSE, message = FALSE, echo = FALSE} p <- 1:100 -qplot(p, .1^(1/p), ylim = c(0,1)) +data.frame(p=p, size = .1^(1/p)) |> ggplot(aes(p, size)) + geom_line() + ylim(c(0,1)) + + xlab("p") + ylab("0.1^(1/p)") ``` By the time we reach 100 predictors, the neighborhood is no longer very local, as each side covers almost the entire dataset. 
@@ -883,8 +443,6 @@ Here we look at a set of elegant and versatile methods that adapt to higher dime To motivate this section, we will use a new dataset that includes the breakdown of the composition of olive oil into 8 fatty acids: ```{r} -library(tidyverse) -library(dslabs) names(olive) ``` @@ -900,17 +458,14 @@ We remove the `area` column because we won't use it as a predictor. olive <- select(olive, -area) ``` -Let's very quickly try to predict the region using kNN: - -```{r olive-knn, warning = FALSE, message = FALSE} +```{r olive-knn, warning = FALSE, message = FALSE, echo=FALSE} library(caret) fit <- train(region ~ ., method = "knn", tuneGrid = data.frame(k = seq(1, 15, 2)), data = olive) -ggplot(fit) ``` -We see that using just one neighbor, we can predict relatively well. However, a bit of data exploration reveals that we should be able to do even better. For example, if we look at the distribution of each predictor stratified by region we see that eicosenoic is only present in Southern Italy and that linoleic separates Northern Italy from Sardinia. +Using kNN we can achieve a test set accuracy of `r fit$results[1,2]`. However, a bit of data exploration reveals that we should be able to do even better. For example, if we look at the distribution of each predictor stratified by region we see that eicosenoic is only present in Southern Italy and that linoleic separates Northern Italy from Sardinia. ```{r olive-eda, fig.height = 3, fig.width = 6, echo = FALSE} olive |> gather(fatty_acid, percentage, -region) |> @@ -931,10 +486,11 @@ olive |> color = "black", lty = 2) ``` -In Section @sec-predictor-space we define predictor spaces. The predictor space here consists of eight-dimensional points with values between 0 and 100. In the plot above, we show the space defined by the two predictors eicosenoic and linoleic, and, by eye, we can construct a prediction rule that partitions the predictor space so that each partition contains only outcomes of a one category. This in turn can be used to define an algorithm with perfect accuracy. Specifically, we define the following decision rule. If eicosenoic is larger than 0.065, predict Southern Italy. If not, then if linoleic is larger than $10.535$, predict Sardinia, and if lower, predict Northern Italy. We can draw this decision tree like this: +In Section @sec-predictor-space we defined _predictor spaces_. The predictor space here consists of eight-dimensional points with values between 0 and 100. In the plot above, we show the space defined by the two predictors eicosenoic and linoleic, and, by eye, we can construct a prediction rule that partitions the predictor space so that each partition contains only outcomes of a one category. + +This in turn can be used to define an algorithm with perfect accuracy. Specifically, we define the following decision rule: if eicosenoic is larger than 0.065, predict Southern Italy. If not, then if linoleic is larger than $10.535$, predict Sardinia, and if lower, predict Northern Italy. We can draw this decision tree like this: ```{r olive-tree, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 4.5, out.width = "50%"} -library(caret) library(rpart) rafalib::mypar() train_rpart <- train(region ~ ., method = "rpart", data = olive) @@ -945,32 +501,29 @@ text(train_rpart$finalModel, cex = 0.75) Decision trees like this are often used in practice. 
For example, to decide on a person's risk of poor outcome after having a heart attack, doctors use the following: -```{r, echo = FALSE, out.width = "50%"} -# source https://www.researchgate.net/profile/Douglas_Walton/publication/228297479/figure/fig1/AS:301828123185152@1448972840090/Decision-Tree-for-Heart-Attack-Victim-adapted-from-Gigerenzer-et-al-1999-4.png -knitr::include_graphics(file.path(img_path,"Decision-Tree-for-Heart-Attack-Victim-adapted-from-Gigerenzer-et-al-1999-4.png")) -``` +![](img/Decision-Tree-for-Heart-Attack-Victim-adapted-from-Gigerenzer-et-al-1999-4.png) (Source: Walton 2010 Informal Logic, Vol. 30, No. 2, pp. 159-184[^trees-1].) [^trees-1]: https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID1759289_code1486039.pdf?abstractid = 1759289&mirid = 1&type = 2 -A tree is basically a flow chart of yes or no questions. The general idea of the methods we are describing is to define an algorithm that uses data to create these trees with predictions at the ends, referred to as *nodes*. Regression and decision trees operate by predicting an outcome variable $Y$ by partitioning the predictors. +A tree is basically a flow chart of yes or no questions. The general idea of the methods we are describing is to define an algorithm that uses data to create these trees with predictions at the ends, referred to as *nodes*. Regression and decision trees operate by predicting an outcome variable $y$ by partitioning the predictors. ### Regression trees -When the outcome is continuous, we call the method a *regression* tree. To introduce regression trees, we will use the 2008 poll data used in previous sections to describe the basic idea of how we build these algorithms. As with other machine learning algorithms, we will try to estimate the conditional expectation $f(x) = \mbox{E}(Y | X = x)$ with $Y$ the poll margin and $x$ the day. +When using trees, and the outcome is continuous, we call the approach a *regression* tree. To introduce regression trees, we will use the 2008 poll data used in previous sections to describe the basic idea of how we build these algorithms. As with other machine learning algorithms, we will try to estimate the conditional expectation $f(x) = \mbox{E}(Y | X = x)$ with $Y$ the poll margin and $x$ the day. -```{r polls-2008-again} -qplot(day, margin, data = polls_2008) +```{r polls-2008-again, echo=FALSE} +ggplot(data = polls_2008, aes(day, margin)) + geom_point() ``` -The general idea here is to build a decision tree and, at the end of each *node*, obtain a predictor $\hat{y}$. A mathematical way to describe this is to say that we are partitioning the predictor space into $J$ non-overlapping regions, $R_1, R_2, \ldots, R_J$, and then for any predictor $x$ that falls within region $R_j$, estimate $f(x)$ with the average of the training observations $y_i$ for which the associated predictor $x_i$ is also in $R_j$. +The general idea here is to build a decision tree and, at the end of each *node*, obtain a predictor $\hat{y}$. A mathematical way to describe this is: we are partitioning the predictor space into $J$ non-overlapping regions, $R_1, R_2, \ldots, R_J$, and then for any predictor $x$ that falls within region $R_j$, estimate $f(x)$ with the average of the training observations $y_i$ for which the associated predictor $x_i$ is also in $R_j$. But how do we decide on the partition $R_1, R_2, \ldots, R_J$ and how do we choose $J$? Here is where the algorithm gets a bit complicated. -Regression trees create partitions recursively. 
We start the algorithm with one partition, the entire predictor space. In our simple first example, this space is the interval \[-155, 1\]. But after the first step we will have two partitions. After the second step we will split one of these partitions into two and will have three partitions, then four, then five, and so on. We describe how we pick the partition to further partition, and when to stop, later. +Regression trees create partitions recursively. We start the algorithm with one partition, the entire predictor space. In our simple first example, this space is the interval \[-155, 1\]. But after the first step we will have two partitions. After the second step we will split one of these partitions into two and will have three partitions, then four, and so on. We describe how we pick the partition to further partition, and when to stop, later. -Once we select a partition $\mathbf{x}$ to split in order to create the new partitions, we find a predictor $j$ and value $s$ that define two new partitions, which we will call $R_1(j,s)$ and $R_2(j,s)$, that split our observations in the current partition by asking if $x_j$ is bigger than $s$: +For each existing partition, we find a predictor $j$ and value $s$ that define two new partitions, which we will call $R_1(j,s)$ and $R_2(j,s)$, that split our observations in the current partition by asking if $x_j$ is bigger than $s$: $$ R_1(j,s) = \{\mathbf{x} \mid x_j < s\} \mbox{ and } R_2(j,s) = \{\mathbf{x} \mid x_j \geq s\} @@ -1017,15 +570,13 @@ polls_2008 |> geom_step(aes(day, y_hat), col = "red") ``` -Note that the algorithm stopped partitioning at 8. Now we explain how this decision is made. - -First we need to define the term *complexity parameter* (cp). Every time we split and define two new partitions, our training set RSS decreases. This is because with more partitions, our model has more flexibility to adapt to the training data. In fact, if you split until every point is its own partition, then RSS goes all the way down to 0 since the average of one value is that same value. To avoid this, the algorithm sets a minimum for how much the RSS must improve for another partition to be added. This parameter is referred to as the *complexity parameter* (cp). The RSS must improve by a factor of cp for the new partition to be added. Large values of cp will therefore force the algorithm to stop earlier which results in fewer nodes. +Note that the algorithm stopped partitioning at 8. The decision is made based on a measure referred to as *complexity parameter* (cp). Every time we split and define two new partitions, our training set RSS decreases. This is because with more partitions, our model has more flexibility to adapt to the training data. In fact, if you split until every point is its own partition, then RSS goes all the way down to 0 since the average of one value is that same value. To avoid this, the algorithm sets a minimum for how much the RSS must improve for another partition to be added. This parameter is referred to as the *complexity parameter* (cp). The RSS must improve by a factor of cp for the new partition to be added. Large values of cp will therefore force the algorithm to stop earlier which results in fewer nodes. However, cp is not the only parameter used to decide if we should partition a current partition or not. Another common parameter is the minimum number of observations required in a partition before partitioning it further. The argument used in the `rpart` function is `minsplit` and the default is 20. 
The `rpart` implementation of regression trees also permits users to determine a minimum number of observations in each node. The argument is `minbucket` and defaults to `round(minsplit/3)`. As expected, if we set `cp = 0` and `minsplit = 2`, then our prediction is as flexible as possible and our predictor is our original data: -```{r polls-2008-tree-over-fit} +```{r polls-2008-tree-over-fit, echo=FALSE} fit <- rpart(margin ~ ., data = polls_2008, control = rpart.control(cp = 0, minsplit = 2)) polls_2008 |> @@ -1037,33 +588,19 @@ polls_2008 |> Intuitively we know that this is not a good approach as it will generally result in over-training. These `cp`, `minsplit`, and `minbucket`, three parameters can be used to control the variability of the final predictors. The larger these values are the more data is averaged to compute a predictor and thus reduce variability. The drawback is that it restricts flexibility. -So how do we pick these parameters? We can use cross validation, described in Chapter @sec-cross-validation, just like with any tuning parameter. Here is an example of using cross validation to choose cp. -```{r polls-2008-tree-train} +```{r polls-2008-tree-train, echo=FALSE} library(caret) train_rpart <- train(margin ~ ., method = "rpart", tuneGrid = data.frame(cp = seq(0, 0.05, len = 25)), data = polls_2008) -ggplot(train_rpart) -``` - -To see the resulting tree, we access the `finalModel` and plot it: - -```{r, eval = FALSE} -plot(train_rpart$finalModel, margin = 0.1) -text(train_rpart$finalModel, cex = 0.75) ``` -```{r polls-2008-final-model, fig.height = 5, out.width = "80%", echo = FALSE} -rafalib::mypar() -plot(train_rpart$finalModel, margin = 0.1) -text(train_rpart$finalModel, cex = 0.75) -``` -And because we only have one predictor, we can actually plot $\hat{f}(x)$: +So how do we pick these parameters? We can use cross validation, just like with any tuning parameter. Here is the resulting tree when we use cross validation to choose cp: -```{r polls-2008-final-fit} +```{r polls-2008-final-fit, echo=FALSE} polls_2008 |> mutate(y_hat = predict(train_rpart)) |> ggplot() + @@ -1071,9 +608,10 @@ polls_2008 |> geom_step(aes(day, y_hat), col = "red") ``` -Note that if we already have a tree and want to apply a higher cp value, we can use the `prune` function. We call this *pruning* a tree because we are snipping off partitions that do not meet a `cp` criterion. We previously created a tree that used a `cp = 0` and saved it to `fit`. We can prune it like this: +Note that if we already have a tree and want to apply a higher cp value, we can use the `prune` function. We call this *pruning* a tree because we are snipping off partitions that do not meet a `cp` criterion. 
Here is an example where we create a tree with `cp = 0` and then prune it back:

```{r polls-2008-prune}
+fit <- rpart(margin ~ ., data = polls_2008, control = rpart.control(cp = 0))
 pruned_fit <- prune(fit, cp = 0.01)
```

@@ -1099,40 +637,17 @@ $$
\mbox{entropy}(j) = -\sum_{k = 1}^K \hat{p}_{j,k}\log(\hat{p}_{j,k}), \mbox{ with } 0 \times \log(0) \mbox{ defined as }0
$$

-Let us look at how a classification tree performs on the digits example we examined before by using this code to run the algorithm and plot the resulting accuracy:

-```{r, echo = FALSE}
-library(dslabs)
-```
-
-```{r, echo = FALSE}
-plot_cond_prob <- function(p_hat = NULL){
-  tmp <- mnist_27$true_p
-  if(!is.null(p_hat)){
-    tmp <- mutate(tmp, p = p_hat)
-  }
-  tmp |> ggplot(aes(x_1, x_2, z = p, fill = p)) +
-    geom_raster(show.legend = FALSE) +
-    scale_fill_gradientn(colors = c("#F8766D","white","#00BFC4")) +
-    stat_contour(breaks = c(0.5),color = "black")
-}
-```
-
-```{r mnist-27-tree}
+```{r, echo=FALSE}
train_rpart <- train(y ~ .,
                     method = "rpart",
                     tuneGrid = data.frame(cp = seq(0.0, 0.1, len = 25)),
                     data = mnist_27$train)
-plot(train_rpart)
-```
-
-The accuracy achieved by this approach is better than what we got with regression, but is not as good as what we achieved with kernel methods:
-
-```{r}
y_hat <- predict(train_rpart, mnist_27$test)
-confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"]
```

+If we fit a classification tree to the 2 or 7 example, we achieve an accuracy of `r confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"]`, which is better than what we obtained with regression, but not as good as what we achieved with kernel methods.
+
The plot of the estimated conditional probability shows us the limitations of classification trees:

```{r rf-cond-prob, echo = FALSE, out.width = "100%", warning = FALSE, message = FALSE}
@@ -1149,13 +664,13 @@ Note that with decision trees, it is difficult to make the boundaries smooth sin

Classification trees have certain advantages that make them very useful. They are highly interpretable, even more so than linear models. They are easy to visualize (if small enough). Finally, they can model human decision processes and don't require use of dummy predictors for categorical variables. On the other hand, the approach via recursive partitioning can easily over-train and is therefore a bit harder to train than, for example, linear regression or kNN. Furthermore, in terms of accuracy, it is rarely the best performing method since it is not very flexible and is highly unstable to changes in training data. Random forests, explained next, improve on several of these shortcomings.

-### Random forests
+## Random forests

Random forests are a **very popular** machine learning approach that addresses the shortcomings of decision trees using a clever idea. The goal is to improve prediction performance and reduce instability by *averaging* multiple decision trees (a forest of trees constructed with randomness). It has two features that help accomplish this.

The first step is *bootstrap aggregation* or *bagging*. The general idea is to generate many predictors, each using regression or classification trees, and then forming a final prediction based on the average prediction of all these trees. To assure that the individual trees are not the same, we use the bootstrap to induce randomness. These two features combined explain the name: the bootstrap makes the individual trees **randomly** different, and the combination of trees is the **forest**. The specific steps are as follows.
-1\. Build $B$ decision trees using the training set. We refer to the fitted models as $T_1, T_2, \dots, T_B$. We later explain how we ensure they are different.
+1\. Build $B$ decision trees using the training set. We refer to the fitted models as $T_1, T_2, \dots, T_B$.

2\. For every observation in the test set, form a prediction $\hat{y}_j$ using tree $T_j$.

@@ -1171,7 +686,7 @@ To illustrate how the first steps can result in smoother estimates we will demon

```{r polls-2008-rf, message = FALSE, warning = FALSE}
library(randomForest)
-fit <- randomForest(margin~., data = polls_2008)
+fit <- randomForest(margin ~ ., data = polls_2008)
```

Note that if we apply the function `plot` to the resulting object, stored in `fit`, we see how the error rate of our algorithm changes as we add trees.

@@ -1190,7 +705,7 @@ We can see that in this case, the accuracy improves as we add more trees until a

The resulting estimate for this random forest can be seen like this:

-```{r polls-2008-rf-fit}
+```{r polls-2008-rf-fit, echo=FALSE}
polls_2008 |>
  mutate(y_hat = predict(fit, newdata = polls_2008)) |>
  ggplot() +
@@ -1287,17 +802,13 @@ if(knitr::is_html_output()){
}
```

-Here is the random forest fit for our digits example based on two predictors:

```{r mnits-27-rf-fit}
library(randomForest)
train_rf <- randomForest(y ~ ., data = mnist_27$train)
-
-confusionMatrix(predict(train_rf, mnist_27$test),
-                mnist_27$test$y)$overall["Accuracy"]
```

-Here is what the conditional probabilities look like:
+The accuracy of the random forest fit for our 2 or 7 example is `r confusionMatrix(predict(train_rf, mnist_27$test), mnist_27$test$y)$overall["Accuracy"]`. Here is what the conditional probabilities look like:

```{r cond-prob-rf, echo = FALSE, out.width = "100%"}
p1 <- plot_cond_prob() + ggtitle("True conditional probability")

@@ -1308,29 +819,12 @@ p2 <- plot_cond_prob(predict(train_rf, newdata = mnist_27$true_p, type = "prob")

grid.arrange(p2, p1, nrow = 1)
```

-Visualizing the estimate shows that, although we obtain high accuracy, it appears that there is room for improvement by making the estimate smoother. This could be achieved by changing the parameter that controls the minimum number of data points in the nodes of the tree. The larger this minimum, the smoother the final estimate will be. We can train the parameters of the random forest. Below, we use the **caret** package to optimize over the minimum node size. Because, this is not one of the parameters that the **caret** package optimizes by default we will write our own code:
-
-```{r acc-versus-nodesize, cache = TRUE}
-nodesize <- seq(1, 51, 10)
-acc <- sapply(nodesize, function(ns){
-  train(y ~ ., method = "rf", data = mnist_27$train,
-        tuneGrid = data.frame(mtry = 2),
-        nodesize = ns)$results$Accuracy
-})
-qplot(nodesize, acc)
-```
-
-We can now fit the random forest with the optimized minimun node size to the entire training data and evaluate performance on the test data.

-```{r}
-train_rf_2 <- randomForest(y ~ ., data = mnist_27$train,
-                           nodesize = nodesize[which.max(acc)])
-
-confusionMatrix(predict(train_rf_2, mnist_27$test),
-                mnist_27$test$y)$overall["Accuracy"]
+```{r, echo=FALSE}
+train_rf_2 <- randomForest(y ~ ., data = mnist_27$train, nodesize = 31)
```

-The selected model improves accuracy and provides a smoother estimate.
+Visualizing the estimate shows that, although we obtain high accuracy, it appears that there is room for improvement by making the estimate smoother.
This could be achieved by changing the parameter that controls the minimum number of data points in the nodes of the tree. The larger this minimum, the smoother the final estimate will be. If we use a nodesize of 31, the number of neighbors we used with kNN, we get an accuracy of `r confusionMatrix(predict(train_rf_2, mnist_27$test), mnist_27$test$y)$overall["Accuracy"]`. The selected model improves accuracy and provides a smoother estimate:

```{r cond-prob-final-rf, echo = FALSE, out.width = "100%"}
p1 <- plot_cond_prob() + ggtitle("True conditional probability")

@@ -1341,11 +835,8 @@ p2 <- plot_cond_prob(predict(train_rf_2, newdata = mnist_27$true_p, type = "prob

grid.arrange(p2, p1, nrow = 1)
```

-Note that we can avoid writing our own code by using other random forest implementations as described in the **caret** manual[^trees-2].
-
-[^trees-2]: http://topepo.github.io/caret/available-models.html
-
-Random forest performs better in all the examples we have considered. However, a disadvantage of random forests is that we lose interpretability. An approach that helps with interpretability is to examine *variable importance*. To define *variable importance* we count how often a predictor is used in the individual trees. You can learn more about *variable importance* in an advanced machine learning book[^trees-3]. The **caret** package includes the function `varImp` that extracts variable importance from any model in which the calculation is implemented. We give an example on how we use variable importance in the next section.
+Random forest performs better than trees in all the examples we have considered. However, a disadvantage of random forests is that we lose interpretability. An approach that helps with interpretability is to examine *variable importance*. To define *variable importance* we count how often a predictor is used in the individual trees. You can learn more about *variable importance* in an advanced machine learning book[^trees-3]. The **caret** package includes the function `varImp` that extracts variable importance from any model in which the calculation is implemented. We give an example of how we use variable importance in the next section.

[^trees-3]: https://web.stanford.edu/\~hastie/Papers/ESLII.pdf

@@ -1360,9 +851,10 @@ dat <- MASS::mvrnorm(n = 100, c(69, 69), Sigma) |>
  data.frame() |> setNames(c("x", "y"))
```

-Use the __caret__ package to partition into a test and training set of equal size. Train a linear model and report the RMSE. Repeat this exercise 100 times and make a histogram of the RMSEs and report the average and standard deviation. Hint: adapt the code shown earlier like this:
+Use the __caret__ package to partition into a test and training set of equal size. Train a linear model and report the RMSE. Repeat this exercise 100 times and make a histogram of the RMSEs and report the average and standard deviation. Hint: adapt the code shown earlier like this

```{r, eval = FALSE}
+library(caret)
y <- dat$y
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
train_set <- dat |> slice(-test_index)
@@ -1470,86 +962,63 @@ dat$train |> ggplot(aes(x, color = y)) + geom_density()

Compare the accuracy of linear regression and logistic regression.

-
10\. Repeat the simulation from exercise 1 100 times and compare the average accuracy for each method and notice they give practically the same answer.

-
11\. Generate 25 different datasets changing the difference between the two classes: `delta <- seq(0, 3, len = 25)`. Plot accuracy versus `delta`.

+12\. 
If we add 1s to our 2 or 7 examples, we get data that looks like this:

-12\. Earlier we used logistic regression to predict sex from height. Use kNN to do the same. Use the code described in this chapter to select the $F_1$ measure and plot it against $k$. Compare to the $F_1$ of about 0.6 we obtained with regression.
-
-13\. Load the following dataset:
-
-```{r, eval = FALSE}
-```
-
-This dataset includes a matrix `x`:
-
-```{r, eval = FALSE}
-dim(tissue_gene_expression$x)
+```{r, echo=FALSE}
+library(dslabs)
+mnist_127$train |> ggplot(aes(x_1, x_2, color = y)) + geom_point()
```

-with the gene expression measured on 500 genes for 189 biological samples representing seven different tissues. The tissue type is stored in `y`:
+Fit QDA using the `qda` function in the **MASS** package, then create a confusion matrix for the predictions on the test set. Which of the following best describes the confusion matrix:

-```{r, eval = FALSE}
-table(tissue_gene_expression$y)
-```
+a\. It is a two-by-two table.
+b\. Because we have three classes, it is a two-by-three table.
+c\. Because we have three classes, it is a three-by-three table.
+d\. Confusion matrices only make sense when the outcomes are binary.

-Split the data in training and test sets, then use kNN to predict tissue type and see what accuracy you obtain. Try it for $k = 1, 3, \dots, 11$.

-14\. We are going to apply LDA and QDA to the `tissue_gene_expression` dataset. We will start with simple examples based on this dataset and then develop a realistic example.
+13\. The `byClass` component returned by the `confusionMatrix` object provides the sensitivity and specificity for each class. Because these terms only make sense when the data are binary, each row represents the sensitivity and specificity when a particular class is 1 (positives) and the other two are considered 0s (negatives). Based on the values returned by `confusionMatrix`, which of the following is the most common mistake:

+a\. Calling 1s either a 2 or 7.
+b\. Calling 2s either a 1 or 7.
+c\. Calling 7s either a 1 or 2.
+d\. All mistakes are equally common.

-Create a dataset with just the classes "cerebellum" and "hippocampus" (two parts of the brain) and a predictor matrix with 10 randomly selected columns.
+14\. Create a grid of `x_1` and `x_2` using

```{r, eval = FALSE}
-set.seed(1993)
-tissues <- c("cerebellum", "hippocampus")
-ind <- which(tissue_gene_expression$y %in% tissues)
-y <- droplevels(tissue_gene_expression$y[ind])
-x <- tissue_gene_expression$x[ind, ]
-x <- x[, sample(ncol(x), 10)]
+GS <- 150
+new_x <- with(mnist_127$train,
+              expand.grid(x_1 = seq(min(x_1), max(x_1), len = GS),
+                          x_2 = seq(min(x_2), max(x_2), len = GS)))
```

-Use the `train` function to estimate the accuracy of LDA.
-
-
-15\. In this case, LDA fits two 10-dimensional normal distributions. Look at the fitted model by looking at the `finalModel` component of the result of train. Notice there is a component called `means` that includes the estimate `means` of both distributions. Plot the mean vectors against each other and determine which predictors (genes) appear to be driving the algorithm.
-
-
-16\. Repeat exercises 1 with QDA. Does it have a higher accuracy than LDA?
-

+then visualize the decision rule by coloring the regions of the Cartesian plane to represent the label that would be called in that region.

-17\. Are the same predictors (genes) driving the algorithm? Make a plot as in exercise 2.
+14\. Repeat exercise 13 but for LDA. Which of the following explains why LDA has worse accuracy:

+a\. 
LDA separates the space with lines, making it too rigid.
+b\. LDA divides the space into two and there are three classes.
+c\. LDA is very similar to QDA; the difference is due to chance.
+d\. LDA can't be used with more than two classes.

-18\. One thing we see in the previous plot is that the value of predictors correlate in both groups: some predictors are low in both groups while others are high in both groups. The mean value of each predictor, `colMeans(x)`, is not informative or useful for prediction, and often for interpretation purposes it is useful to center or scale each column. This can be achieved with the `preProcessing` argument in `train`. Re-run LDA with `preProcessing = "scale"`. Note that accuracy does not change but see how it is easier to identify the predictors that differ more between groups in the plot made in exercise 4.
+15\. Now repeat exercise 13 for kNN with $k=31$ and compute and compare the overall accuracy for all three methods.

-19\. In the previous exercises we saw that both approaches worked well. Plot the predictor values for the two genes with the largest differences between the two groups in a scatterplot to see how they appear to follow a bivariate distribution as assumed by the LDA and QDA approaches. Color the points by the outcome.
+16\. To understand how a simple method like kNN can outperform a model that explicitly tries to emulate Bayes' rule, explore the conditional distributions of `x_1` and `x_2` to see if the normal approximation holds. Generative models can be very powerful, but only when we are able to successfully approximate the joint distribution of predictors conditioned on each class.

-20\. Now we are going to increase the complexity of the challenge slightly: we will consider all the tissue types.
-
-```{r, eval = FALSE}
-set.seed(1993)
-y <- tissue_gene_expression$y
-x <- tissue_gene_expression$x
-x <- x[, sample(ncol(x), 10)]
-```
-
-What accuracy do you get with LDA?
-
-
-21\. We see that the results are slightly worse. Use the `confusionMatrix` function to learn what type of errors we are making.
+17\. Earlier we used logistic regression to predict sex from height. Use kNN to do the same. Use the code described in this chapter to select the $F_1$ measure and plot it against $k$. Compare to the $F_1$ of about 0.6 we obtained with regression.

-22\. Plot an image of the centers of the seven 10-dimensional normal distributions.
-
-23\. Create a simple dataset where the outcome grows 0.75 units on average for every increase in a predictor:
+18\. Create a simple dataset where the outcome grows 0.75 units on average for every unit increase in a predictor:

```{r, eval = FALSE}
n <- 1000
@@ -1561,48 +1030,40 @@ dat <- data.frame(x = x, y = y)
```

Use `rpart` to fit a regression tree and save the result to `fit`.

-24\. Plot the final tree so that you can see where the partitions occurred.
-
-25\. Make a scatterplot of `y` versus `x` along with the predicted values based on the fit.
-
-26\. Now model with a random forest instead of a regression tree using `randomForest` from the **randomForest** package, and remake the scatterplot with the prediction line.
+19\. Plot the final tree so that you can see where the partitions occurred.

-27\. Use the function `plot` to see if the random forest has converged or if we need more trees.
+20\. Make a scatterplot of `y` versus `x` along with the predicted values based on the fit.

-28\. It seems that the default values for the random forest result in an estimate that is too flexible (not smooth). 
Re-run the random forest but this time with `nodesize` set at 50 and `maxnodes` set at 25. Remake the plot. +21\. Now model with a random forest instead of a regression tree using `randomForest` from the **randomForest** package, and remake the scatterplot with the prediction line. -29\. We see that this yields smoother results. Let's use the `train` function to help us pick these values. From the **caret** manual[^trees-4] we see that we can't tune the `maxnodes` parameter or the `nodesize` argument with `randomForest`, so we will use the **Rborist** package and tune the `minNode` argument. Use the `train` function to try values `minNode <- seq(5, 250, 25)`. See which value minimizes the estimated RMSE. +22\. Use the function `plot` to see if the random forest has converged or if we need more trees. -[^trees-4]: https://topepo.github.io/caret/available-models.html +23\. It seems that the default values for the random forest result in an estimate that is too flexible (not smooth). Re-run the random forest but this time with `nodesize` set at 50 and `maxnodes` set at 25. Remake the plot. -30\. Make a scatterplot along with the prediction from the best fitted model. -31\. Use the `rpart` function to fit a classification tree to the `tissue_gene_expression` dataset. Use the `train` function to estimate the accuracy. Try out `cp` values of `seq(0, 0.05, 0.01)`. Plot the accuracy to report the results of the best model. +24\. This **dslabs* dataset includes the `tissue_gene_expression` with a matrix `x`: -32\. Study the confusion matrix for the best fitting classification tree. What do you observe happening for placenta? - -33\. Notice that placentas are called endometrium more often than placenta. Note also that the number of placentas is just six, and that, by default, `rpart` requires 20 observations before splitting a node. Thus it is not possible with these parameters to have a node in which placentas are the majority. Rerun the above analysis but this time permit `rpart` to split any node by using the argument `control = rpart.control(minsplit = 0)`. Does the accuracy increase? Look at the confusion matrix again. +```{r, eval = FALSE} +library(dslabs) +dim(tissue_gene_expression$x) +``` -34\. Plot the tree from the best fitting model obtained in exercise 11. +with the gene expression measured on 500 genes for 189 biological samples representing seven different tissues. The tissue type is stored in `tissue_gene_expression$y`. -35\. We can see that with just six genes, we are able to predict the tissue type. Now let's see if we can do even better with a random forest. Use the `train` function and the `rf` method to train a random forest. Try out values of `mtry` ranging from, at least, `seq(50, 200, 25)`. What `mtry` value maximizes accuracy? To permit small `nodesize` to grow as we did with the classification trees, use the following argument: `nodesize = 1`. This will take several seconds to run. If you want to test it out, try using smaller values with `ntree`. Set the seed to 1990. +```{r, eval=FALSE} +table(tissue_gene_expression$y) +``` -36\. Use the function `varImp` on the output of `train` and save it to an object called `imp`. -37\. The `rpart` model we ran above produced a tree that used just six predictors. Extracting the predictor names is not straightforward, but can be done. 
If the output of the call to train was `fit_rpart`, we can extract the names like this: +Fit a random forest using the `randomForest` function the package **randomForest** then use the `varImp` function to see which are the top 10 most predictive genes. Make a histogram of the reported importance to get an idea of the distribution of the importance values. -```{r, eval = FALSE} -ind <- !(fit_rpart$finalModel$frame$var == "") -tree_terms <- - fit_rpart$finalModel$frame$var[ind] |> - unique() |> - as.character() -tree_terms +```{r} +library(randomForest) +fit_rf <- with(tissue_gene_expression, randomForest(x, y)) +vi <- varImp(fit_rf) +vi |> top_n(10, Overall) |> arrange(desc(Overall)) +hist(vi$Overall, breaks = seq(0,4,0.1)) ``` -What is the variable importance in the random forest call for these predictors? Where do they rank? - -38\. Advanced: Extract the top 50 predictors based on importance, take a subset of `x` with just these predictors and apply the function `heatmap` to see how these genes behave across the tissues. We will introduce the `heatmap` function in Chapter @sec-clustering. - diff --git a/ml/clustering.qmd b/ml/clustering.qmd index 4de8649..3674426 100644 --- a/ml/clustering.qmd +++ b/ml/clustering.qmd @@ -2,13 +2,7 @@ The algorithms we have described up to now are examples of a general approach referred to as _supervised_ machine learning. The name comes from the fact that we use the outcomes in a training set to _supervise_ the creation of our prediction algorithm. There is another subset of machine learning referred to as _unsupervised_. In this subset we do not necessarily know the outcomes and instead are interested in discovering groups. These algorithms are also referred to as _clustering_ algorithms since predictors are used to define _clusters_. -In the two examples we have shown here, clustering would not be very useful. In the first example, if we are simply given the heights we will not be able to discover two groups, males and females, because the intersection is large. In the second example, we can see from plotting the predictors that discovering the two digits, two and seven, will be challenging: - -```{r mnist-27-unsupervised, message=FALSE, warning=FALSE, cache=FALSE} -library(tidyverse) -library(dslabs) -mnist_27$train |> ggplot(aes(x_1, x_2, color = y)) + geom_point() -``` +In the two examples we have shown here, clustering would not be very useful. In the first example, if we are simply given the heights we will not be able to discover two groups, males and females, because the intersection is large. In the second example, we can see from plotting the predictors that discovering the two digits, two and seven, will be challenging. However, there are applications in which unsupervised learning can be a powerful technique, in particular as an exploratory tool. @@ -16,16 +10,19 @@ A first step in any clustering algorithm is defining a distance between observat We will construct a simple example based on movie ratings. Here we quickly construct a matrix `x` that has ratings for the 50 movies with the most ratings. 
-```{r}
+```{r, cache=FALSE}
+library(tidyverse)
+library(janitor)
+library(dslabs)
 top_movies <- movielens |> count(movieId) |> top_n(50, n) |> pull(movieId)
-top_users <- filter(movielens, movieId %in% top_movies) |> count(userId) |> top_n(50, n) |> pull(userId)
-x <- movielens |> filter(movieId %in% top_movies & userId %in% top_users) |>
+x <- movielens |> filter(movieId %in% top_movies)
+top_users <- x |> count(userId) |> top_n(50, n) |> pull(userId)
+x <- x |> filter(userId %in% top_users) |>
   select(title, userId, rating) |>
-  pivot_wider(names_from = userId, values_from = rating)
-row_names <- str_remove(x$title, ": Episode") |> str_trunc(20)
-x <- x[,-1] |> as.matrix()
-x <- sweep(x, 1, rowMeans(x, na.rm = TRUE))
-rownames(x) <- row_names
+  pivot_wider(names_from = userId, values_from = rating) |>
+  column_to_rownames("title") |>
+  as.matrix()
+rownames(x) <- str_remove(rownames(x), ": Episode") |> str_trunc(20)
 ```

We want to use these data to find out if there are clusters of movies based on the ratings from `r ncol(x)` movie raters. A first step is to find the distance between each pair of movies using the `dist` function:

@@ -71,13 +68,15 @@ names(groups)[groups == 4]

And group 6 appears to be nerd movies:

```{r}
-names(groups)[groups == 9]
+names(groups)[groups == 6]
```

-We can change the size of the group by either making `k` larger or `h` smaller. We can also explore the data to see if there are clusters of movie raters.
+We can change the size of the group by either making `k` larger or `h` smaller.
+
+Note that we can also explore the data to see if there are clusters of movie raters.

```{r}
-h_2 <- dist(t(x)) |> hclust()
+h_2 <- hclust(dist(t(x)))
```

@@ -111,6 +110,7 @@ k <- kmeans(x_0, centers = 10, nstart = 25)

A powerful visualization tool for discovering clusters or patterns in your data is the heatmap. The idea is simple: plot an image of your data matrix with colors used as the visual cue and both the columns and rows ordered according to the results of a clustering algorithm. We will demonstrate this with the `tissue_gene_expression` dataset. We will scale the rows of the gene expression matrix. The first step is to compute:

+
```{r}
x <- sweep(tissue_gene_expression$x, 2, colMeans(tissue_gene_expression$x))
h_1 <- hclust(dist(x))
@@ -132,7 +132,7 @@ heatmap(x, col = RColorBrewer::brewer.pal(11, "Spectral"))

We do not show the results of the heatmap function because there are too many features for the plot to be useful. We will therefore filter some columns and remake the plots.

-## Filtering features
+### Filtering features

If the information about clusters is included in just a few features, including all the features can add enough noise that detecting clusters becomes challenging. One simple approach to try to remove features with no information is to only include those with high variance. In the movie example, a user with low variance in their ratings is not really informative: all the movies seem about the same to them. Here is an example of how we can include only the features with high variance.

@@ -144,6 +144,7 @@ o <- order(sds, decreasing = TRUE)[1:25]
 heatmap(x[,o], col = RColorBrewer::brewer.pal(11, "Spectral"))
```

+Note that there are several other heatmap functions in R. A popular example is `heatmap.2` in the **gplots** package.
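For instance, here is a minimal sketch of how the same filtered matrix could be passed to `heatmap.2`. This assumes the **gplots** package is installed; the `trace = "none"` argument simply removes the trace lines that `heatmap.2` draws by default:

```{r, eval = FALSE}
library(gplots)
# Same high-variance columns selected above, with the same color palette as before
heatmap.2(x[, o], col = RColorBrewer::brewer.pal(11, "Spectral"), trace = "none")
```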
## Exercises diff --git a/ml/conditionals.qmd b/ml/conditionals.qmd index 36f15c8..962c9bd 100644 --- a/ml/conditionals.qmd +++ b/ml/conditionals.qmd @@ -1,35 +1,42 @@ # Conditional probabilities and expectations -In machine learning applications, we rarely can predict outcomes perfectly. For example, spam detectors often miss emails that are clearly spam, Siri often misunderstands the words we are saying, and your bank at times thinks your card was stolen when it was not. The most common reason for not being able to build perfect algorithms is that it is impossible. To see this, note that most datasets will include groups of observations with the same exact observed values for all predictors, but with different outcomes. Because our prediction rules are functions, equal inputs (the predictors) implies equal outputs (the predictions). Therefore, for a challenge in which the same predictors are associated with different outcomes across different individual observations, it is impossible to predict correctly for all these cases. We saw a simple example of this in the previous section: for any given height $x$, you will have both males and females that are $x$ inches tall. +In machine learning applications, we rarely can predict outcomes perfectly. For example, spam detectors often miss emails that are clearly spam, Siri often misunderstands the words we are saying, and your bank at times thinks your card was stolen when it was not. The most common reason for not being able to build perfect algorithms is that it is impossible. To see this, note that most datasets will include groups of observations with the same exact observed values for all predictors, but with different outcomes. + +Because our prediction rules are functions, equal inputs (the predictors) implies equal outputs (the predictions). Therefore, for a challenge in which the same predictors are associated with different outcomes across different individual observations, it is impossible to predict correctly for all these cases. We saw a simple example of this in the previous section: for any given height $x$, you will have both males and females that are $x$ inches tall. However, none of this means that we can't build useful algorithms that are much better than guessing, and in some cases better than expert opinions. To achieve this in an optimal way, we make use of probabilistic representations of the problem based on the ideas presented in Section @sec-conditional-expectation. Observations with the same observed values for the predictors may not all be the same, but we can assume that they all have the same probability of this class or that class. We will write this idea out mathematically for the case of categorical data. ## Conditional probabilities -We use the notation $(X_1 = x_1,\dots,X_p=x_p)$ to represent the fact that we have observed values $x_1,\dots,x_p$ for covariates $X_1, \dots, X_p$. This does not imply that the outcome $Y$ will take a specific value. Instead, it implies a specific probability. In particular, we denote the _conditional probabilities_ for each class $k$: +We use the notation $(X_1 = x_1,\dots,X_p=x_p)$ to represent the fact that we have observed values $x_1,\dots,x_p$ for covariates $X_1, \dots, X_p$. This does not imply that the outcome $Y$ will take a specific value. Instead, it implies a specific probability. 
In particular, we denote the _conditional probabilities_ for each class $k$ with: $$ \mbox{Pr}(Y=k \mid X_1 = x_1,\dots,X_p=x_p), \, \mbox{for}\,k=1,\dots,K $$ -To avoid writing out all the predictors, we will use the bold letters like this: $\mathbf{X} \equiv (X_1,\dots,X_p)$ and $\mathbf{x} \equiv (x_1,\dots,x_p)$. We will also use the following notation for the conditional probability of being class $k$: +To avoid writing out all the predictors, we will use the bold letters like this: $\mathbf{X} \equiv (X_1,\dots,X_p)^\top$ and $\mathbf{x} \equiv (x_1,\dots,x_p)^\top$. We will also use the following notation for the conditional probability of being class $k$: $$ p_k(\mathbf{x}) = \mbox{Pr}(Y=k \mid \mathbf{X}=\mathbf{x}), \, \mbox{for}\, k=1,\dots,K $$ +Notice that the $p_k(\mathbf{x})$ have to add up to 1 for each $\mathbf{x}$, so once we know $K-1$, we know all. When the outcome is binary, we only need to know 1, so we drop the $k$ and use the notation $p(\mathbf{x}) = \mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x})$. + +:::{.callout-note} +Do not be confused by the fact that we use $p$ for two different things: the conditional probability $p(\mathbf{x})$ and the number of predictors $p$. +::: -Note: We will be using the $p(x)$ notation to represent conditional probabilities as functions of the predictors. Do not confuse it with the $p$ that represents the number of predictors. +These probabilities guide the construction of an algorithm that makes the best prediction: for any given $\mathbf{x}$, we will predict the class $k$ with the largest probability among $p_1(x), p_2(x), \dots p_K(x)$. In mathematical notation, we write it like this: -These probabilities guide the construction of an algorithm that makes the best prediction: for any given $\mathbf{x}$, we will predict the class $k$ with the largest probability among $p_1(x), p_2(x), \dots p_K(x)$. In mathematical notation, we write it like this: $\hat{Y} = \max_k p_k(\mathbf{x})$. +$$\hat{Y} = \max_k p_k(\mathbf{x})$$ -In machine learning, we refer to this as _Bayes' Rule_. But keep in mind that this is a theoretical rule since in practice we don't know $p_k(\mathbf{x}), k=1,\dots,K$. In fact, estimating these conditional probabilities can be thought of as the main challenge of machine learning. The better our probability estimates $\hat{p}_k(\mathbf{x})$, the better our predictor: +In machine learning, we refer to this as _Bayes' Rule_. But this is a theoretical rule since, in practice, we don't know $p_k(\mathbf{x}), k=1,\dots,K$. In fact, estimating these conditional probabilities can be thought of as the main challenge of machine learning. The better our probability estimates $\hat{p}_k(\mathbf{x})$, the better our predictor $\hat{Y}$. -$$\hat{Y} = \max_k \hat{p}_k(\mathbf{x})$$ +So how well we predict depends on two things: 1) how close are the $\max_k p_k(\mathbf{x})$ to 1 or 0 (perfect certainty) +and 2) how close our estimates $\hat{p}_k(\mathbf{x})$ are to $p_k(\mathbf{x})$. We can't do anything about the first restriction as it is determined by the nature of the problem, so our energy goes into finding ways to best estimate conditional probabilities. -So what we will predict depends on two things: 1) how close are the $\max_k p_k(\mathbf{x})$ to 1 or 0 (perfect certainty) -and 2) how close our estimates $\hat{p}_k(\mathbf{x})$ are to $p_k(\mathbf{x})$. 
We can't do anything about the first restriction as it is determined by the nature of the problem, so our energy goes into finding ways to best estimate conditional probabilities. The first restriction does imply that we have limits as to how well even the best possible algorithm can perform. You should get used to the idea that while in some challenges we will be able to achieve almost perfect accuracy, with digit readers for example, in others our success is restricted by the randomness of the process, with movie recommendations for example. +The first restriction does imply that we have limits as to how well even the best possible algorithm can perform. You should get used to the idea that while in some challenges we will be able to achieve almost perfect accuracy, with digit readers for example, in others, our success is restricted by the randomness of the process, with movie recommendations for example. -Before we continue, it is important to remember that defining our prediction by maximizing the probability is not always optimal in practice and depends on the context. As discussed above, sensitivity and specificity may differ in importance. But even in these cases, having a good estimate of the $p_k(x), k=1,\dots,K$ will suffice for us to build optimal prediction models, since we can control the balance between specificity and sensitivity however we wish. For instance, we can simply change the cutoffs used to predict one outcome or the other. In the plane example, we may ground the plane anytime the probability of malfunction is higher than 1 in a million as opposed to the default 1/2 used when error types are equally undesired. +It is important to remember that defining our prediction by maximizing the probability is not always optimal in practice and depends on the context. As discussed in @sec-evaluation-metrics, sensitivity and specificity may differ in importance. But even in these cases, having a good estimate of the $p_k(x), k=1,\dots,K$ will suffice for us to build optimal prediction models, since we can control the balance between specificity and sensitivity however we wish. For instance, we can simply change the cutoffs used to predict one outcome or the other. In the plane example, we may ground the plane anytime the probability of malfunction is higher than 1 in a million as opposed to the default 1/2 used when error types are equally undesired. ## Conditional expectations @@ -46,7 +53,7 @@ As a result, we often only use the expectation to denote both the conditional pr Just like with categorical outcomes, in most applications the same observed predictors do not guarantee the same continuous outcomes. Instead, we assume that the outcome follows the same conditional distribution. We will now explain why we use the conditional expectation to define our predictors. -## Conditional expectation minimizes squared loss function +## Conditional expectations minimizes squared loss function Why do we care about the conditional expectation in machine learning? This is because the expected value has an attractive mathematical property: it minimizes the MSE. Specifically, of all possible predictions $\hat{Y}$, @@ -60,7 +67,11 @@ $$ f(\mathbf{x}) \equiv \mbox{E}( Y \mid \mathbf{X}=\mathbf{x} ) $$ -for any set of features $\mathbf{x} = (x_1, \dots, x_p)$. Of course this is easier said than done, since this function can take any shape and $p$ can be very large. Consider a case in which we only have one predictor $x$. 
The expectation $\mbox{E}\{ Y \mid X=x \}$ can be any function of $x$: a line, a parabola, a sine wave, a step function, anything. It gets even more complicated when we consider instances with large $p$, in which case $f(\mathbf{x})$ is a function of a multidimensional vector $\mathbf{x}$. For example, in our digit reader example $p = 784$! **The main way in which competing machine learning algorithms differ is in their approach to estimating this expectation.** +for any set of features $\mathbf{x} = (x_1, \dots, x_p)^\top$. + +This is easier said than done, since this function can take any shape and $p$ can be very large. Consider a case in which we only have one predictor $x$. The expectation $\mbox{E}\{ Y \mid X=x \}$ can be any function of $x$: a line, a parabola, a sine wave, a step function, anything. It gets even more complicated when we consider instances with large $p$, in which case $f(\mathbf{x})$ is a function of a multidimensional vector $\mathbf{x}$. For example, in our digit reader example $p = 784$! + +The main way in which competing machine learning algorithms differ is in their approach to estimating this conditional expectation. ## Exercises diff --git a/ml/cross-validation.qmd b/ml/cross-validation.qmd index 496637a..e3bff2a 100644 --- a/ml/cross-validation.qmd +++ b/ml/cross-validation.qmd @@ -1,66 +1,50 @@ # Cross validation {#sec-cross-validation} -In this chapter we introduce cross validation, one of the most important ideas in machine learning. Here we focus on the conceptual and mathematical aspects. We will describe how to implement cross validation in practice with the **caret** package later, in Section @sec-caret-cv) in the next chapter. To motivate the concept, we will use the two predictor digits data presented in Section \@ref(two-or-seven) and introduce, for the first time, an actual machine learning algorithm: k-nearest neighbors (kNN. +In this chapter we introduce cross validation, one of the most important ideas in machine learning. Here we focus on the conceptual and mathematical aspects. We will describe how to implement cross validation in practice with the **caret** package later in @sec-caret-cv. To motivate the concept, we will use the two predictor digits data presented in \@ref-two-or-seven and introduce an actual machine learning algorithm: k-nearest neighbors (kNN), to demonstrate the ideas. ## Motivation with k-nearest neighbors {#sec-knn-cv-intro} -Let's start by loading the data and showing a plot of the predictors with outcome represented with color. - -```{r mnist-27-data, warning = FALSE, message = FALSE} -library(tidyverse) -library(dslabs) -mnist_27$test |> ggplot(aes(x_1, x_2, color = y)) + geom_point() -``` - -We will use these data to estimate the conditional probability function +We are interested in estimating the conditional probability function $$ -p(x_1, x_2) = \mbox{Pr}(Y = 1 \mid X_1 = x_1 , X_2 = x_2). +p(\mathbf{x}) = \mbox{Pr}(Y = 1 \mid X_1 = x_1 , X_2 = x_2). $$ -as defined in Section @sec-smoothing-ml-connection). With k-nearest neighbors (kNN) we estimate $p(x_1, x_2)$ in a similar way to bin smoothing. However, as we will see, kNN is easier to adapt to multiple dimensions. First we define the distance between all observations based on the features. Then, for any point $(x_1,x_2)$ for which we want an estimate of $p(x_1, x_2)$, we look for the $k$ nearest points to $(x_1,x_2)$ and then take an average of the 0s and 1s associated with these points. 
We refer to the set of points used to compute the average as the *neighborhood*. Due to the connection we described earlier between conditional expectations and conditional probabilities, this gives us a $\hat{p}(x_1,x_2$, just like the bin smoother gave us an estimate of a trend. As with bin smoothers, we can control the flexibility of our estimate, in this case through the $k$ parameter: larger $k$s result in smoother estimates, while smaller $k$s result in more flexible and more wiggly estimates.

-To implement the algorithm, we can use the `knn3` function from the **caret** package. Looking at the help file for this package, we see that we can call it in one of two ways. We will use the first in which we specify a *formula* and a data frame. The data frame contains all the data to be used. The formula has the form `outcome ~ predictor_1 + predictor_2 + predictor_3` and so on. Therefore, we would type `y ~ x_1 + x_2`. If we are going to use all the predictors, we can use the `.` like this `y ~ .`. For this function, we also need to pick a parameter: the number of neighbors to include. Let's start with the default $k = 5$. The final call looks like this:
+as defined in @sec-smoothing-ml-connection.

-```{r, echo = FALSE, message = FALSE, warning = FALSE, cache = FALSE}
-library(caret)
-```
+With k-nearest neighbors (kNN) we estimate $p(\mathbf{x})$ in a similar way to bin smoothing. First we define the distance between all observations based on the features. Then, for any point $\mathbf{x}_0$ we estimate $p(\mathbf{x})$ by identifying the $k$ nearest points to $\mathbf{x}_0$ and then taking an average of the $y$s associated with these points. We refer to the set of points used to compute the average as the *neighborhood*.

-```{r}
+Due to the connection we described earlier between conditional expectations and conditional probabilities, this gives us $\hat{p}(\mathbf{x}_0)$, just like the bin smoother gave us an estimate of a trend. As with bin smoothers, we can control the flexibility of our estimate through the $k$ parameter: larger $k$s result in smoother estimates, while smaller $k$s result in more flexible and wiggly estimates.
+
+To implement the algorithm, we can use the `knn3` function from the **caret** package. Looking at the help file for this package, we see that we can call it in one of two ways. We will use the first in which we specify a *formula* and a data frame. The data frame contains all the data to be used. The formula has the form `outcome ~ predictor_1 + predictor_2 + predictor_3` and so on. Therefore, we would type `y ~ x_1 + x_2`. If we are going to use all the variables in the data frame, we can use the `.` like this `y ~ .`. We also need to pick $k$, which is set to `k = 5` by default. The final call looks like this:
+
+```{r, message = FALSE, warning = FALSE}
+library(dslabs)
 library(caret)
 knn_fit <- knn3(y ~ ., data = mnist_27$train, k = 5)
 ```

In this case, since our dataset is balanced and we care just as much about sensitivity as we do about specificity, we will use accuracy to quantify performance.

-The `predict` function for `knn` produces a probability for each class. We keep the probability of being a 7 as the estimate $\hat{p}(x_1, x_2)$
+The `predict` function for `knn` produces a probability for each class. 
We keep the probability of being a 7 as the estimate $\hat{p}(\mathbf{x})$ ```{r} y_hat_knn <- predict(knn_fit, mnist_27$test, type = "class") confusionMatrix(y_hat_knn, mnist_27$test$y)$overall["Accuracy"] ``` -In Section @sec-two-or-seven we used linear regression to generate an estimate. - -```{r} -fit_lm <- mnist_27$train |> - mutate(y = ifelse(y == 7, 1, 0)) |> - lm(y ~ x_1 + x_2, data = _) -p_hat_lm <- predict(fit_lm, mnist_27$test) -y_hat_lm <- factor(ifelse(p_hat_lm > 0.5, 7, 2)) -confusionMatrix(y_hat_lm, mnist_27$test$y)$overall["Accuracy"] -``` - -And we see that kNN, with the default parameter, already beats regression. To see why this is the case, we will plot $\hat{p}(x_1, x_2)$ and compare it to the true conditional probability $p(x_1, x_2)$: +We see that kNN, with the default parameter, already beats regression. To see why this is the case, we plot $\hat{p}(\mathbf{x})$ and compare it to the true conditional probability $p(\mathbf{x})$: -```{r, echo = FALSE} +```{r, echo = FALSE, message=FALSE, warning=FALSE, cache=FALSE} +library(tidyverse) # We use this function to plot the estimated conditional probabilities plot_cond_prob <- function(p_hat = NULL){ tmp <- mnist_27$true_p if (!is.null(p_hat)) { tmp <- mutate(tmp, p = p_hat) } - tmp |> ggplot(aes(x_1, x_2, z = p, fill = p)) + - geom_raster(show.legend = FALSE) + + tmp |> ggplot(aes(x_1, x_2, z = p)) + + geom_raster(aes(fill = p), show.legend = FALSE) + scale_fill_gradientn(colors = c("#F8766D", "white", "#00BFC4")) + stat_contour(breaks = c(0.5), color = "black") } @@ -76,7 +60,7 @@ library(gridExtra) grid.arrange(p2, p1, nrow = 1) ``` -We see that kNN better adapts to the non-linear shape of $p(x_1, x_2)$. However, our estimate has some islands of blue in the red area, which intuitively does not make much sense. This is due to what we call *over-training*. We describe over-training in detail below. Over-training is the reason that we have higher accuracy in the train set compared to the test set: +We see that kNN better adapts to the non-linear shape of $p(\mathbf{x})$. However, our estimate has some islands of blue in the red area, which intuitively does not make much sense. We notice that we have higher accuracy in the train set compared to the test set: ```{r} y_hat_knn <- predict(knn_fit, mnist_27$train, type = "class") @@ -86,11 +70,13 @@ y_hat_knn <- predict(knn_fit, mnist_27$test, type = "class") confusionMatrix(y_hat_knn, mnist_27$test$y)$overall["Accuracy"] ``` -## Over-training +This is due to what we call *over-training*. -Over-training is at its worst when we set $k = 1$. With $k = 1$, the estimate for each $(x_1, x_2)$ in the training set is obtained with just the $y$ corresponding to that point. In this case, if the $(x_1, x_2)$ are unique, we will obtain perfect accuracy in the training set because each point is used to predict itself. Remember that if the predictors are not unique and have different outcomes for at least one set of predictors, then it is impossible to predict perfectly. +### Over-training -Here we fit a kNN model with $k = 1$: +With kNN, over-training is at its worst when we set $k = 1$. With $k = 1$, the estimate for each $\mathbf{x}$ in the training set is obtained with just the $y$ corresponding to that point. In this case, if the $x_1$ and $x_2$ are unique, we will obtain perfect accuracy in the training set because each point is used to predict itself. 
Remember that if the predictors are not unique and have different outcomes for at least one set of predictors, then it is impossible to predict perfectly. + +Here we fit a kNN model with $k = 1$ and confirm we get close to perfect accuracy in the training set: ```{r} knn_fit_1 <- knn3(y ~ ., data = mnist_27$train, k = 1) @@ -98,7 +84,7 @@ y_hat_knn_1 <- predict(knn_fit_1, mnist_27$train, type = "class") confusionMatrix(y_hat_knn_1, mnist_27$train$y)$overall[["Accuracy"]] ``` -However, the test set accuracy is actually worse than regression: +But in the test set, accuracy is actually worse than what we obtained with regression: ```{r} y_hat_knn_1 <- predict(knn_fit_1, mnist_27$test, type = "class") @@ -131,11 +117,11 @@ grid.arrange(p1, p2, nrow = 1) The black curves denote the decision rule boundaries. -The estimate $\hat{p}(x_1, x_2)$ follows the training data too closely (left). You can see that in the training set, boundaries have been drawn to perfectly surround a single red point in a sea of blue. Because most points $(x_1, x_2)$ are unique, the prediction is either 1 or 0 and the prediction for that point is the associated label. However, once we introduce the training set (right), we see that many of these small islands now have the opposite color and we end up making several incorrect predictions. +The estimate $\hat{p}(\mathbf{x})$ follows the training data too closely (left). You can see that in the training set, boundaries have been drawn to perfectly surround a single red point in a sea of blue. Because most points $\mathbf{x}$ are unique, the prediction is either 1 or 0 and the prediction for that point is the associated label. However, once we introduce the training set (right), we see that many of these small islands now have the opposite color and we end up making several incorrect predictions. -## Over-smoothing +### Over-smoothing -Although not as badly as with the previous examples, we saw that with $k = 5$ we also over-trained. Hence, we should consider a larger $k$. Let's try, as an example, a much larger number: $k = 401$. +Although not as badly as with $k=1$, we saw that with $k = 5$ we also over-trained. Hence, we should consider a larger $k$. Let's try, as an example, a much larger number: $k = 401$. ```{r} knn_fit_401 <- knn3(y ~ ., data = mnist_27$train, k = 401) @@ -146,6 +132,7 @@ confusionMatrix(y_hat_knn_401, mnist_27$test$y)$overall["Accuracy"] This turns out to be similar to regression: ```{r mnist-27-glm-est, echo = FALSE, out.width="100%"} +fit_lm <- lm(y ~ ., data = mutate(mnist_27$train, y=y == "7")) p_hat <- predict(fit_lm, newdata = mnist_27$true_p) p_hat <- scales::squish(p_hat, c(0, 1)) p1 <- plot_cond_prob(p_hat) + @@ -159,86 +146,70 @@ grid.arrange(p1, p2, nrow = 1) This size of $k$ is so large that it does not permit enough flexibility. We call this *over-smoothing*. -## Picking the $k$ in kNN +### Parameter tuning -So how do we pick $k$? In principle we want to pick the $k$ that maximizes accuracy, or minimizes the expected MSE as defined in @sec-loss-function. The goal of cross validation is to estimate these quantities for any given algorithm and set of tuning parameters such as $k$. To understand why we need a special method to do this let's repeat what we did above but for different values of $k$: +So how do we pick $k$? In principle we want to pick the $k$ that maximizes accuracy, or minimizes the expected MSE as defined in @sec-mse. 
The goal of cross validation is to estimate these quantities for any given algorithm and set of tuning parameters such as $k$. To understand why we need a special method to do this, let's repeat what we did above, comparing the training set and test set accuracy, but for different values of $k$. We can plot the accuracy estimates for each value of $k$:

-```{r}
+```{r, echo=FALSE, warning = FALSE, message = FALSE}
 ks <- seq(3, 251, 2)
-```
-
-We do this using `map_df` function to repeat the above for each one.
-
-```{r, warning = FALSE, message = FALSE}
-library(purrr)
-accuracy <- map_df(ks, function(k){
+accuracy <- sapply(ks, function(k){
  fit <- knn3(y ~ ., data = mnist_27$train, k = k)

  y_hat <- predict(fit, mnist_27$train, type = "class")
  cm_train <- confusionMatrix(y_hat, mnist_27$train$y)
-  train_error <- cm_train$overall["Accuracy"]
+  train_error <- cm_train$overall[["Accuracy"]]

  y_hat <- predict(fit, mnist_27$test, type = "class")
  cm_test <- confusionMatrix(y_hat, mnist_27$test$y)
-  test_error <- cm_test$overall["Accuracy"]
+  test_error <- cm_test$overall[["Accuracy"]]

-  tibble(train = train_error, test = test_error)
+  c(train = train_error, test = test_error)
})
```

-Note that we estimate accuracy by using both the training set and the test set. We can now plot the accuracy estimates for each value of $k$:
-
```{r accuracy-vs-k-knn, echo = FALSE}
-accuracy |> mutate(k = ks) |>
-  gather(set, accuracy, -k) |>
-  mutate(set = factor(set, levels = c("train", "test"))) |>
+data.frame(k = ks, train = accuracy["train",], test = accuracy["test",]) |>
+  pivot_longer(-k, values_to = "accuracy", names_to = "set") |>
+  mutate(set = factor(set, levels = c("test", "train"))) |>
  ggplot(aes(k, accuracy, color = set)) +
  geom_line() +
  geom_point()
```

-First, note that the estimate obtained on the training set is generally lower than the estimate obtained with the test set, with the difference larger for smaller values of $k$. This is due to over-training. Also note that the accuracy versus $k$ plot is quite jagged. We do not expect this because small changes in $k$ should not affect the algorithm's performance too much. The jaggedness is explained by the fact that the accuracy is computed on a sample and therefore is a random variable. This demonstrates why we prefer to minimize the expected loss rather than the loss we observe with one dataset.
+First, note that the accuracy estimate obtained on the training set is generally higher than the estimate obtained with the test set, with the difference larger for smaller values of $k$. This is due to over-training.

-If we were to use these estimates to pick the $k$ that maximizes accuracy, we would use the estimates built on the test data:
+So do we simply pick the $k$ that maximizes accuracy and report this accuracy? There are two problems with this approach:

-```{r}
-ks[which.max(accuracy$test)]
-max(accuracy$test)
-```
+1. The accuracy versus $k$ plot is quite jagged. We do not expect this because small changes in $k$ should not affect the algorithm's performance so much. The jaggedness is explained by the fact that the accuracy is computed on a sample and therefore is a random variable.

-Another reason we need a better estimate of accuracy is that if we use the test set to pick this $k$, we should not expect the accompanying accuracy estimate to extrapolate to the real world. This is because even here we broke a golden rule of machine learning: we selected the $k$ using the test set. Cross validation also provides an estimate that takes this into account.
+2. 
Although for each $k$ we estimated the MSE using the test set, we used the test set to pick the best $k$. As a result, we should not expect this minimum test set accuracy to extrapolate to the real world.

-## Mathematical description of cross validation
+Cross validation provides a solution to both these problems.

-In Section @sec-loss-function, we described that a common goal of machine learning is to find an algorithm that produces predictors $\hat{Y}$ for an outcome $Y$ that minimizes the MSE:
+## Mathematical description of cross validation

-$$
-\mbox{MSE} = \mbox{E}\left\{ \frac{1}{N}\sum_{i = 1}^N (\hat{Y}_i - Y_i)^2 \right\}
-$$
+In the previous section we introduced kNN as an example to motivate the topic of this chapter. In this particular case, there is just one parameter, $k$, that affects the performance of the algorithm. However, in general, machine learning algorithms may have multiple parameters, so we use the notation $\lambda$ to represent any set of parameters needed to define a machine learning algorithm. We also introduce notation to distinguish the predictions obtained with each set of parameters, $\hat{y}(\lambda)$, and the MSE for this choice, $\text{MSE}(\lambda)$. Our goal is to find the $\lambda$ that minimizes $\text{MSE}(\lambda)$. Cross validation is a method that helps us estimate $\text{MSE}(\lambda)$.

-When all we have at our disposal is one dataset, we can estimate the MSE with the observed MSE like this:
+An intuitive first attempt is the apparent error defined in @sec-mse and used in the previous section:

 $$
-\hat{\mbox{MSE}} = \frac{1}{N}\sum_{i = 1}^N (\hat{y}_i - y_i)^2
-$$
-
-These two are often referred to as the *true error* and *apparent error*, respectively.
-
-There are two important characteristics of the apparent error we should always keep in mind:
-
-1. Because our data is random, the apparent error is a random variable. For example, the dataset we have may be a random sample from a larger population. An algorithm may have a lower apparent error than another algorithm due to luck.
+\hat{\mbox{MSE}}(\lambda) = \frac{1}{N}\sum_{i = 1}^N \left\{\hat{y}_i(\lambda) - y_i\right\}^2
+$$

-2. If we train an algorithm on the same dataset that we use to compute the apparent error, we might be overtraining. In general, when we do this, the apparent error will be an underestimate of the true error. We will see an extreme example of this with k-nearest neighbors.
+As noted in the previous section, this estimate is a random variable based on just one test set, with enough variability to affect the choice of the best $\lambda$ substantially.

-Cross validation is a technique that permits us to alleviate both these problems. To understand cross validation, it helps to think of the true error, a theoretical quantity, as the average of many apparent errors obtained by applying the algorithm to $B$ new random samples of the data, none of them used to train the algorithm. As shown in a previous chapter, we think of the true error as:
+Now imagine a world in which we could obtain data over and over again, say from new random samples. We could take a very large number of new samples $B$, split them into training and test sets, and define

 $$
-\frac{1}{B} \sum_{b = 1}^B \frac{1}{N}\sum_{i = 1}^N \left(\hat{y}_i^b - y_i^b\right)^2
-$$
+\frac{1}{B} \sum_{b=1}^B \frac{1}{N}\sum_{i=1}^N \left\{\hat{y}_i^b(\lambda) - y_i^b\right\}^2
+$$

-with $B$ a large number that can be thought of as practically infinite. 
As already mentioned, this is a theoretical quantity because we only have available one set of outcomes: $y_1, \dots, y_n$. Cross validation is based on the idea of imitating the theoretical setup above as best we can with the data we have. To do this, we have to generate a series of different random samples. There are several approaches we can use, but the general idea for all of them is to randomly generate smaller datasets that are not used for training, and instead used to estimate the true error.
+with $y_i^b$ the $i$th observation in sample $b$ and $\hat{y}_{i}^b(\lambda)$ the prediction obtained with the algorithm defined by parameter $\lambda$ and trained independently of $y_i^b$. The law of large numbers tells us that as $B$ becomes larger, this quantity gets closer and closer to $\text{MSE}(\lambda)$. This is of course a theoretical consideration as we rarely get access to more than one dataset for algorithm development, but the concept inspires the idea behind cross-validation.

-## K-fold cross validation
+The general idea is to generate a series of different random samples from the data at hand. There are several approaches to doing this, but the general idea for all of them is to randomly generate
+several smaller datasets that are not used for training, and instead use them to estimate the MSE.
+
+## Cross validation in practice

```{r, include = FALSE}
if (knitr::is_html_output()) {
@@ -249,21 +220,15 @@ }
```

-The first one we describe is *K-fold cross validation*. Generally speaking, a machine learning challenge starts with a dataset (blue in the image below). We need to build an algorithm using this dataset that will eventually be used in completely independent datasets (yellow).
+Generally speaking, we are provided a dataset (blue) and we need to build an algorithm, using this dataset, that will eventually be used in completely independent datasets (yellow) that we might not even see.

```{r, echo = FALSE}
knitr::include_graphics("img/cv-1.png")
```

-But we don't get to see these independent datasets.
-
-```{r, echo = FALSE}
-knitr::include_graphics("img/cv-2.png")
-```
-
-So to imitate this situation, we carve out a piece of our dataset and pretend it is an independent dataset: we divide the dataset into a *training set* (blue) and a *test set* (red). We will train our algorithm exclusively on the training set and use the test set only for evaluation purposes.
+So to imitate this situation, we start by carving out a piece of our dataset and pretend it is an independent dataset: we divide the dataset into a *training set* (blue) and a *test set* (red). We will train the entirety of our algorithm, including the choice of parameter $\lambda$, exclusively on the training set and use the test set only for evaluation purposes.

-We usually try to select a small piece of the dataset so that we have as much data as possible to train. However, we also want the test set to be large so that we obtain a stable estimate of the loss without fitting an impractical number of models. Typical choices are to use 10%-20% of the data for testing.
+We usually try to select a small piece of the dataset so that we have as much data as possible to train. However, we also want the test set to be large so that we obtain a stable estimate of the MSE without fitting an impractical number of models. Typical choices are to use 10%-20% of the data for testing.
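+To make this initial split concrete, here is a minimal sketch (illustrative only; it uses the `heights` data from **dslabs** rather than the data analyzed in this chapter) of how the **caret** function `createDataPartition` can carve out a 20% test set:
+
+```{r}
+#| eval: false
+library(dslabs)
+library(caret)
+set.seed(2007)
+# reserve 20% of the observations for testing, stratified by the outcome
+test_index <- createDataPartition(heights$sex, times = 1, p = 0.2, list = FALSE)
+test_set <- heights[test_index, ]
+train_set <- heights[-test_index, ]
+```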
```{r, echo = FALSE} knitr::include_graphics("img/cv-3.png") @@ -271,33 +236,34 @@ knitr::include_graphics("img/cv-3.png") Let's reiterate that it is indispensable that we not use the test set at all: not for filtering out rows, not for selecting features, nothing! -Now this presents a new problem because for most machine learning algorithms we need to select parameters, for example the number of neighbors $k$ in k-nearest neighbors. Here, we will refer to the set of parameters as $\lambda$. We need to optimize algorithm parameters without using our test set and we know that if we optimize and evaluate on the same dataset, we will overtrain. This is where cross validation is most useful. +But then how do we optimize $\lambda$? In cross validation we achieve this by splitting the training set into two,the training and validation sets. + + +```{r, echo = FALSE} +knitr::include_graphics("img/cv-4.png") +``` -For each set of algorithm parameters being considered, we want an estimate of the MSE and then we will choose the parameters with the smallest MSE. Cross validation provides this estimate. +We will do this many times assuring that the estimates of MSE obtained in each dataset are independent from each other. There are several proposed methods for doing this. Here we describe one of these approaches, K-fold cross validation, in detail to provide the general idea used in all approaches. -First, before we start the cross validation procedure, it is important to fix all the algorithm parameters. Although we will train the algorithm on the set of training sets, the parameters $\lambda$ will be the same across all training sets. We will use $\hat{y}_i(\lambda)$ to denote the predictors obtained when we use parameters $\lambda$. +### K-fold cross validation -So, if we are going to imitate this definition: +As a reminder we are going to imitate the concept used when introducing this version of the MSE: $$ \mbox{MSE}(\lambda) = \frac{1}{B} \sum_{b = 1}^B \frac{1}{N}\sum_{i = 1}^N \left(\hat{y}_i^b(\lambda) - y_i^b\right)^2 $$ -we want to consider datasets that can be thought of as an independent random sample and we want to do this several times. With K-fold cross validation, we do it $K$ times. In the cartoons, we are showing an example that uses $K = 5$. +We want to generate datasets that can be thought of as an independent random sample and we want to do this $B$ times. With K-fold cross validation, we do it $K$ times. In the illustrations, we are showing an example that uses $K = 5$. We will eventually end up with $K$ samples, but let's start by describing how to construct the first: we simply pick $M = N/K$ observations at random (we round if $M$ is not a round number) and think of these as a random sample $y_1^b, \dots, y_M^b$, with $b = 1$. We call this the validation set: -```{r, echo = FALSE} -knitr::include_graphics("img/cv-4.png") -``` - Now we can fit the model in the training set, then compute the apparent error on the independent set: $$ \hat{\mbox{MSE}}_b(\lambda) = \frac{1}{M}\sum_{i = 1}^M \left(\hat{y}_i^b(\lambda) - y_i^b\right)^2 $$ -Note that this is just one sample and will therefore return a noisy estimate of the true error. This is why we take $K$ samples, not just one. In K-cross validation, we randomly split the observations into $K$ non-overlapping sets: +As a reminder, this is just one sample and will therefore return a noisy estimate of the true error. 
In K-cross validation, we randomly split the observations into $K$ non-overlapping sets: ```{r, echo = FALSE} knitr::include_graphics("img/cv-5.png") @@ -311,19 +277,27 @@ $$ and obtain an estimate of our loss. A final step would be to select the $\lambda$ that minimizes the MSE. +### How many folds? + +Now how do we pick the cross validation $K$? Large values of $K$ are preferable because the training data better imitates the original dataset. However, larger values of $K$ will have much slower computation time: for example, 100-fold cross validation will be 10 times slower than 10-fold cross validation. For this reason, the choices of $K = 5$ and $K = 10$ are popular. + +One way we can improve the variance of our final estimate is to take more samples. To do this, we would no longer require the training set to be partitioned into non-overlapping sets. Instead, we would just pick $K$ sets of some size at random. + +### Estimate MSE of our optimized algorithm + We have described how to use cross validation to optimize parameters. However, we now have to take into account the fact that the optimization occurred on the training data and therefore we need an estimate of our final algorithm based on data that was not used to optimize the choice. Here is where we use the test set we separated early on: ```{r, echo = FALSE} knitr::include_graphics("img/cv-6.png") ``` -We can do cross validation again: +We can actually do cross validation again: ```{r, echo = FALSE} knitr::include_graphics("img/cv-7.png") ``` -and obtain a final estimate of our expected loss. However, note that this means that our entire compute time gets multiplied by $K$. You will soon learn that performing this task takes time because we are performing many complex computations. As a result, we are always looking for ways to reduce this time. For the final evaluation, we often just use the one test set. +and obtain a final estimate of our expected loss. However, note that last cross validation iteration means that our entire compute time gets multiplied by $K$. You will soon learn that fitting each algorithm takes time because we are performing many complex computations. As a result, we are always looking for ways to reduce this time. For the final evaluation, we often just use the one test set. Once we are satisfied with this model and want to make it available to others, we could refit the model on the entire dataset, without changing the optimized parameters. @@ -331,96 +305,48 @@ Once we are satisfied with this model and want to make it available to others, w knitr::include_graphics("img/cv-8.png") ``` -Now how do we pick the cross validation $K$? Large values of $K$ are preferable because the training data better imitates the original dataset. However, larger values of $K$ will have much slower computation time: for example, 100-fold cross validation will be 10 times slower than 10-fold cross validation. For this reason, the choices of $K = 5$ and $K = 10$ are popular. - -One way we can improve the variance of our final estimate is to take more samples. To do this, we would no longer require the training set to be partitioned into non-overlapping sets. Instead, we would just pick $K$ sets of some size at random. - -One popular version of this technique, at each fold, picks observations at random with replacement (which means the same observation can appear twice). This approach has some advantages (not discussed here) and is generally referred to as the *bootstrap*. 
In fact, this is the default approach in the **caret** package. We describe how to implement cross validation with the **caret** package in the next chapter. In the next section, we include an explanation of how the bootstrap works in general.
-
-:::{.callout-note}
-You are ready to try exercises 1 through 8
-:::

```{r, include = FALSE}
knitr::opts_chunk$set(out.width = "70%", out.extra = NULL)
```

-## Bootstrap
-
-Suppose the income distribution of your population is as follows:
-
-```{r income-distribution}
-set.seed(1995)
-n <- 10^6
-income <- 10^(rnorm(n, log10(45000), log10(3)))
-qplot(log10(income), bins = 30, color = I("black"))
-```
-
-The population median is:
+### Bootstrap in cross-validation

-```{r}
-m <- median(income)
-m
-```
+Typically, cross-validation involves partitioning the original dataset into a training set to train the model and a testing set to evaluate it. With the bootstrap approach, based on the ideas described in @sec-bootstrap, you can create multiple different training datasets via bootstrapping. This method is sometimes called bootstrap aggregating or bagging. We provide a step-by-step description.

-Suppose we don't have access to the entire population, but want to estimate the median $m$. We take a sample of 100 and estimate the population median $m$ with the sample median $M$:
+From the original dataset, we create a large number of bootstrap samples. Each bootstrap sample is created by randomly selecting observations with replacement, usually the same size as the original dataset. For each bootstrap sample, fit the model and compute the MSE estimate on the observations not selected in the random sampling, referred to as the _out-of-bag observations_.
+These out-of-bag observations serve a similar role to a test set in standard cross-validation.

-```{r}
-N <- 100
-X <- sample(income, N)
-median(X)
-```
-
-Can we construct a confidence interval? What is the distribution of $M$ ?
+We then average the MSEs obtained in the out-of-bag observations from each bootstrap sample to estimate the model's performance.

-Because we are simulating the data, we can use a Monte Carlo simulation to learn the distribution of $M$.
+This approach is actually the default approach in the **caret** package. We describe how to implement cross validation with the **caret** package in the next chapter. In the next section, we include an explanation of how the bootstrap works in general.

-```{r median-is-normal, message = FALSE, warning = FALSE, out.width="100%", fig.width = 6, fig.height = 3}
-library(gridExtra)
-B <- 10^4
-M <- replicate(B, {
- X <- sample(income, N)
- median(X)
-})
-p1 <- qplot(M, bins = 30, color = I("black"))
-p2 <- qplot(sample = scale(M), xlab = "theoretical", ylab = "sample") +
- geom_abline()
-grid.arrange(p1, p2, ncol = 2)
-```
-
-If we know this distribution, we can construct a confidence interval. The problem here is that, as we have already described, in practice we do not have access to the distribution. In the past, we have used the Central Limit Theorem, but the CLT we studied applies to averages and here we are interested in the median. We can see that the 95% confidence interval based on CLT
-
-```{r}
-median(X) + 1.96 * sd(X) / sqrt(N) * c(-1, 1)
-```
+### Comparison of MSE estimates {#sec-mse-estimates}

-is quite different from the confidence interval we would generate if we know the actual distribution of $M$:
+In @sec-knn-cv-intro we computed an estimate of MSE based just on the provided test set (shown in red in the plot below).
Here we show how the cross-validation techniques described above help reduce variability. The green curve below shows the results of applying K-fold cross validation with 10 folds, leaving out 10% of the data for validation. We can see that the variance is reduced substantially. The blue curve is the result of using 100 bootstrap samples to estimate MSE. The variability is reduced even further, but at the cost of increasing computation time by a factor of 10.

```{r}
-quantile(M, c(0.025, 0.975))
-```
-
-The bootstrap permits us to approximate a Monte Carlo simulation without access to the entire distribution. The general idea is relatively simple. We act as if the observed sample is the population. We then sample (with replacement) datasets, of the same sample size as the original dataset. Then we compute the summary statistic, in this case the median, on these *bootstrap samples*.
-
-Theory tells us that, in many situations, the distribution of the statistics obtained with bootstrap samples approximate the distribution of our actual statistic. This is how we construct bootstrap samples and an approximate distribution:
-
-```{r}
-B <- 10^4
-M_star <- replicate(B, {
- X_star <- sample(X, N, replace = TRUE)
- median(X_star)
-})
-```
-
-Note a confidence interval constructed with the bootstrap is much closer to one constructed with the theoretical distribution:
-
-```{r}
-quantile(M_star, c(0.025, 0.975))
+set.seed(2023-11-30)
+boot <- train(y~., method = "knn", tuneGrid = data.frame(k=ks),
+ data = mnist_27$train,
+ trControl = trainControl(number = 100))
+cv <- train(y ~ ., method = "knn",
+ tuneGrid = data.frame(k = ks),
+ data = mnist_27$train,
+ trControl = trainControl(method = "cv",
+ number = 10, p = .9))
+
+data.frame(k = ks, naive = accuracy["test",],
+ cv = cv$results[,2],
+ boot = boot$results[,2]) |>
+ pivot_longer(-k, values_to = "accuracy", names_to = "set") |>
+ mutate(set = factor(set, levels = c("naive", "cv", "boot"),
+ labels = c("Simple", "K-fold", "Bootstrap"))) |>
+ ggplot(aes(k, accuracy, color = set)) +
+ geom_line()
```

-For more on the Bootstrap, including corrections one can apply to improve these confidence intervals, please consult the book *An introduction to the bootstrap* by Efron, B., & Tibshirani, R. J.
-*Note that we can use ideas similar to those used in the bootstrap in cross validation: instead of dividing the data into equal partitions, we simply bootstrap many times.*

## Exercises

@@ -478,26 +404,3 @@ How many times do `3`, `4`, and `7` appear in the first re-sampled index?

10\. We see that some numbers appear more than once and others appear no times. This has to be this way for each dataset to be independent. Repeat the exercise for all the re-sampled indexes.

-11\. Generate a random dataset like this:
-
-```{r, eval = FALSE}
-y <- rnorm(100, 0, 1)
-```
-
-Estimate the 75th quantile, which we know is:
-
-```{r, eval = FALSE}
-qnorm(0.75)
-```
-
-with the sample quantile:
-
-```{r, eval = FALSE}
-quantile(y, 0.75)
-```
-
-Run a Monte Carlo simulation to learn the expected value and standard error of this random variable.
-
-12\. In practice, we can't run a Monte Carlo simulation because we don't know if `rnorm` is being used to simulate the data. Use the bootstrap to estimate the standard error using just the initial sample `y`. Use 10 bootstrap samples.
-
-13\. Redo exercise 12, but with 10,000 bootstrap samples.
diff --git a/ml/evaluation-metrics.qmd b/ml/evaluation-metrics.qmd index f78daa3..5c0262f 100644 --- a/ml/evaluation-metrics.qmd +++ b/ml/evaluation-metrics.qmd @@ -1,8 +1,8 @@ -# Evaluation metrics +# Evaluation metrics {#sec-evaluation-metrics} Before we start describing approaches to optimize the way we build algorithms, we first need to define what we mean when we say one approach is better than another. In this section, we focus on describing ways in which machine learning algorithms are evaluated. Specifically, we need to quantify what we mean by "better". -For our first introduction to machine learning concepts, we will start with a boring and simple example: how to predict sex using height. As we explain machine learning step by step, this example will let us set down the first building block. Soon enough, we will be attacking more interesting challenges. We use the __caret__ package, which has several useful functions for building and assessing machine learning methods and we introduce in more detail in Section @sec-caret, and for a first example, we use the height data in dslabs. +For our first introduction to machine learning concepts, we will start with a boring and simple example: how to predict sex using height. As we explain machine learning step by step, this example will let us set down the first building block. Soon enough, we will be attacking more interesting challenges. We use the __caret__ package, which has several useful functions for building and assessing machine learning methods and we introduce in more detail in @sec-caret, and for a first example, we use the height data in dslabs. ```{r, message=FALSE, warning=FALSE, cache=FALSE} library(tidyverse) @@ -78,8 +78,7 @@ heights |> group_by(sex) |> summarize(mean(height), sd(height)) But how do we make use of this insight? Let's try another simple approach: predict `Male` if height is within two standard deviations from the average male: ```{r} -y_hat <- ifelse(x > 62, "Male", "Female") |> - factor(levels = levels(test_set$sex)) +y_hat <- factor(ifelse(x > 62, "Male", "Female"), levels(test_set$sex)) ``` The accuracy goes up from 0.50 to about 0.80: @@ -95,8 +94,7 @@ Here we examine the accuracy of 10 different cutoffs and pick the one yielding t ```{r} cutoff <- seq(61, 70) accuracy <- map_dbl(cutoff, function(x){ - y_hat <- ifelse(train_set$height > x, "Male", "Female") |> - factor(levels = levels(test_set$sex)) + y_hat <- factor(ifelse(train_set$height > x, "Male", "Female"), levels = levels(test_set$sex)) mean(y_hat == train_set$sex) }) ``` @@ -138,34 +136,36 @@ We see that it is a bit lower than the accuracy observed for the training set, b The prediction rule we developed in the previous section predicts `Male` if the student is taller than `r best_cutoff` inches. Given that the average female is about `r best_cutoff` inches, this prediction rule seems wrong. What happened? If a student is the height of the average female, shouldn't we predict `Female`? -Generally speaking, overall accuracy can be a deceptive measure. To see this, we will start by constructing what is referred to as the _confusion matrix_, which basically tabulates each combination of prediction and actual value. We can do this in R using the function `table`: +Generally speaking, overall accuracy can be a deceptive measure. To see this, we will start by constructing what is referred to as the _confusion matrix_, which basically tabulates each combination of prediction and actual value. 
We can do this in R simply using `table(predicted = y_hat, actual = test_set$sex)`.
+
+However, the `confusionMatrix` function from the **caret** package computes the confusion matrix and much more:

```{r}
-table(predicted = y_hat, actual = test_set$sex)
+cm <- confusionMatrix(data = y_hat, reference = test_set$sex)
+cm$table
```

If we study this table closely, it reveals a problem. If we compute the accuracy separately for each sex, we get:

```{r}
-test_set |>
- mutate(y_hat = y_hat) |>
- group_by(sex) |>
- summarize(accuracy = mean(y_hat == sex))
+cm$byClass[c("Sensitivity", "Specificity")]
```

-There is an imbalance in the accuracy for males and females: too many females are predicted to be male. We are calling almost half of the females male! How can our overall accuracy be so high then? This is because the _prevalence_ of males in this dataset is high. These heights were collected from three data sciences courses, two of which had more males enrolled:
+In the next section, we explain that these two are equivalent to the accuracy computed for females and males, respectively.
+
+We notice an imbalance: too many females are predicted to be male. We are calling almost half of the females male! How can our overall accuracy be so high then? This is because the _prevalence_ of males in this dataset is high. These heights were collected from three data sciences courses, two of which had more males enrolled:

```{r}
-prev <- mean(y == "Male")
-prev
+cm$byClass["Prevalence"]
```

-So when computing overall accuracy, the high percentage of mistakes made for females is outweighed by the gains in correct calls for men. **This can actually be a big problem in machine learning.** If your training data is biased in some way, you are likely to develop algorithms that are biased as well. The fact that we used a test set does not matter because it is also derived from the original biased dataset. This is one of the reasons we look at metrics other than overall accuracy when evaluating a machine learning algorithm.
+So when computing overall accuracy, the high percentage of mistakes made for females is outweighed by the gains in correct calls for men. This type of bias can actually be a big problem in practice. If your training data is biased in some way, you are likely to develop algorithms that are biased as well. The fact that we used a test set does not matter because it is also derived from the original biased dataset. This is one of the reasons we look at metrics other than overall accuracy when evaluating a machine learning algorithm.

There are several metrics that we can use to evaluate an algorithm in a way that prevalence does not cloud our assessment, and these can all be derived from the confusion matrix. A general improvement to using overall accuracy is to study _sensitivity_ and _specificity_ separately.

-## Sensitivity and specificity
+## Sensitivity and specificity {#sec-senistivity-and-specificity}

To define sensitivity and specificity, we need a binary outcome. When the outcomes are categorical, we can define these terms for a specific category. In the digits example, we can ask for the specificity in the case of correctly predicting 2 as opposed to some other digit. Once we specify a category of interest, then we can talk about positive outcomes, $Y=1$, and negative outcomes, $Y=0$.
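+Before formalizing these definitions in the table below, here is a small sketch (it assumes the `y_hat` and `test_set` objects created above and treats `Female`, the first factor level, as the positive class) that computes sensitivity and specificity directly from the entries of the confusion matrix:
+
+```{r}
+#| eval: false
+tab <- table(predicted = y_hat, actual = test_set$sex)
+tp <- tab["Female", "Female"]  # true positives
+fn <- tab["Male", "Female"]    # false negatives
+tn <- tab["Male", "Male"]      # true negatives
+fp <- tab["Female", "Male"]    # false positives
+tp / (tp + fn)                 # sensitivity, also called the true positive rate
+tn / (tn + fp)                 # specificity, also called the true negative rate
+```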
@@ -208,12 +208,7 @@ sensitivity | TPR | Recall | $\frac{\mbox{TP}}{\mbox{TP} + \mbox{FN}}$ | $\mbox{ specificity | TNR | 1-FPR | $\frac{\mbox{TN}}{\mbox{TN}+\mbox{FP}}$ | $\mbox{Pr}(\hat{Y}=0 \mid Y=0)$ | specificity | PPV | Precision | $\frac{\mbox{TP}}{\mbox{TP}+\mbox{FP}}$ | $\mbox{Pr}(Y=1 \mid \hat{Y}=1)$| -Here TPR is True Positive Rate, FPR is False Positive Rate, and PPV is Positive Predictive Value. -The __caret__ function `confusionMatrix` computes all these metrics for us once we define what category "positive" is. The function expects factors as input, and the first level is considered the positive outcome or $Y=1$. In our example, `Female` is the first level because it comes before `Male` alphabetically. If you type this into R you will see several metrics including accuracy, sensitivity, specificity, and PPV. - -```{r} -cm <- confusionMatrix(data = y_hat, reference = test_set$sex) -``` +The __caret__ function `confusionMatrix` computes all these metrics for us once we define which category is the "positive" (Y=1). The function expects factors as input, and the first level is considered the positive outcome or $Y=1$. In our example, `Female` is the first level because it comes before `Male` alphabetically. If you type this into R you will see several metrics including accuracy, sensitivity, specificity, and PPV. You can acceess these directly, for example, like this: @@ -222,12 +217,12 @@ cm$overall["Accuracy"] cm$byClass[c("Sensitivity","Specificity", "Prevalence")] ``` -We can see that the high overall accuracy is possible despite relatively low sensitivity. As we hinted at above, the reason this happens is because of the low prevalence (0.23): the proportion of females is low. Because prevalence is low, failing to predict actual females as females (low sensitivity) does not lower the accuracy as much as failing to predict actual males as males (low specificity). This is an example of why it is important to examine sensitivity and specificity and not just accuracy. Before applying this algorithm to general datasets, we need to ask ourselves if prevalence will be the same. +We can see that the high overall accuracy is possible despite relatively low sensitivity. As we hinted at above, the reason this happens is because of the low prevalence (0.23): the proportion of females is low. Because prevalence is low, failing to predict actual females as females (low sensitivity) does not lower the overall accuracy as much as failing to predict actual males as males (low specificity). This is an example of why it is important to examine sensitivity and specificity and not just accuracy. Before applying this algorithm to general datasets, we need to ask ourselves if prevalence will be the same. ## Balanced accuracy and $F_1$ score -Although we usually recommend studying both specificity and sensitivity, very often it is useful to have a one-number summary, for example for optimization purposes. One metric that is preferred over overall accuracy is the average of specificity and sensitivity, referred to as _balanced accuracy_. Because specificity and sensitivity are rates, it is more appropriate to compute the _harmonic_ average. In fact, the _$F_1$-score_, a widely used one-number summary, is the harmonic average of precision and recall: +Although we usually recommend studying both specificity and sensitivity, very often it is useful to have a one-number summary, for example, for optimization purposes. 
One metric that is preferred over overall accuracy is the average of specificity and sensitivity, referred to as _balanced accuracy_. Because specificity and sensitivity are rates, it is more appropriate to compute the _harmonic_ average. In fact, the _$F_1$-score_, a widely used one-number summary, is the harmonic average of precision and recall:

$$
\frac{1}{\frac{1}{2}\left(\frac{1}{\mbox{recall}} +
@@ -259,8 +254,7 @@ Let's rebuild our prediction algorithm, but this time maximizing the F-score ins

```{r}
cutoff <- seq(61, 70)
F_1 <- map_dbl(cutoff, function(x){
- y_hat <- ifelse(train_set$height > x, "Male", "Female") |>
- factor(levels = levels(test_set$sex))
+ y_hat <- factor(ifelse(train_set$height > x, "Male", "Female"), levels(test_set$sex))
F_meas(data = y_hat, reference = factor(train_set$sex))
})
```
@@ -295,17 +289,28 @@ sensitivity(data = y_hat, reference = test_set$sex)
specificity(data = y_hat, reference = test_set$sex)
```

-We now see that we do much better than guessing, that both sensitivity and specificity are relatively high, and that we have built our first machine learning algorithm. It takes height as a predictor and predicts female if you are 65 inches or shorter.
-
+We now see that we do much better than guessing, and that both sensitivity and specificity are relatively high.

## Prevalence matters in practice

-A machine learning algorithm with very high sensitivity and specificity may not be useful in practice when prevalence is close to either 0 or 1. To see this, consider the case of a doctor that specializes in a rare disease and is interested in developing an algorithm for predicting who has the disease. The doctor shares data with you and you then develop an algorithm with very high sensitivity. You explain that this means that if a patient has the disease, the algorithm is very likely to predict correctly. You also tell the doctor that you are also concerned because, based on the dataset you analyzed, 1/2 the patients have the disease: $\mbox{Pr}(\hat{Y}=1)$. The doctor is neither concerned nor impressed and explains that what is important is the precision of the test: $\mbox{Pr}(Y=1 | \hat{Y}=1)$. Using Bayes theorem, we can connect the two measures:
+A machine learning algorithm with very high TPR and TNR may not be useful in practice when prevalence is close to either 0 or 1. To see this, consider the case of a doctor that specializes in a rare disease and is interested in developing an algorithm for predicting who has the disease.
+
+The doctor shares with you a dataset with about 1/2 cases and 1/2 controls, along with some predictors. You then develop an algorithm with TPR = 0.99 and TNR = 0.99. You are excited to explain to the doctor that this means that if a patient has the disease, the algorithm is very likely to predict correctly. The doctor is not impressed and explains that your TNR is way too low for this algorithm to be used in practice. This is because this is a rare disease, with a prevalence in the general population of 0.5%.
The doctor reminds you of Bayes formula:
+
+$$ \mbox{Pr}(Y = 1\mid \hat{Y}=1) = \mbox{Pr}(\hat{Y}=1 \mid Y=1) \frac{\mbox{Pr}(Y=1)}{\mbox{Pr}(\hat{Y}=1)} \implies \text{Precision} = \text{TPR} \times \frac{\text{Prevalence}}{\text{TPR}\times \text{Prevalence} + \text{FPR}\times(1-\text{Prevalence})} \approx 0.33 $$

-$$ \mbox{Pr}(Y = 1\mid \hat{Y}=1) = \mbox{Pr}(\hat{Y}=1 \mid Y=1) \frac{\mbox{Pr}(Y=1)}{\mbox{Pr}(\hat{Y}=1)}$$
+Here is a plot of precision as a function of prevalence when TPR and TNR are both 0.99:

-The doctor knows that the prevalence of the disease is 5 in 1,000, which implies that $\mbox{Pr}(Y=1) \, / \,\mbox{Pr}(\hat{Y}=1) = 1/100$ and therefore the precision of your algorithm is less than 0.01. The doctor does not have much use for your algorithm.
+```{r,echo=FALSE}
+tpr <- 0.99; fpr <- 0.01
+prevalence <- seq(0,1,len = 100)
+data.frame(Prevalence = prevalence,
+ Precision = tpr * prevalence / (tpr*prevalence + fpr*(1 - prevalence))) |>
+ ggplot(aes(Prevalence, Precision)) + geom_line() +
+ ggtitle("Precision when TPR = 0.99 and TNR = 0.99 as a function of prevalence")
+```
+Although your algorithm has a precision of about 99% on the data you trained on, where the prevalence is 50%, if applied to the general population its precision would be just 33%. The doctor can't use an algorithm in which only 33% of the people receiving a positive test actually have the disease. Note that even if your algorithm had perfect sensitivity, the precision would still be around 33%. So you need to greatly decrease your FPR for the algorithm to be useful in practice.

## ROC and precision-recall curves

@@ -327,13 +332,13 @@ Remember that for each of these parameters, we can get a different sensitivity a

A widely used plot that does this is the _receiver operating characteristic_ (ROC) curve. If you are wondering where this name comes from, you can consult the ROC Wikipedia page^[https://en.wikipedia.org/wiki/Receiver_operating_characteristic].

-The ROC curve plots sensitivity (TPR) versus 1 - specificity or the false positive rate (FPR). Here we compute the TPR and FPR needed for different probabilities of guessing male:
+The ROC curve plots sensitivity, represented as the TPR, versus 1 - specificity, represented as the false positive rate (FPR). Here we compute the TPR and FPR needed for different probabilities of guessing male:

```{r roc-1}
probs <- seq(0, 1, length.out = 10)
guessing <- map_df(probs, function(p){
y_hat <-
- sample(c("Male", "Female"), n, replace = TRUE, prob = c(p, 1 - p)) |>
+ sample(c("Male", "Female"), nrow(test_set), replace = TRUE, prob = c(p, 1 - p)) |>
factor(levels = c("Female", "Male"))
list(method = "Guessing",
FPR = 1 - specificity(y_hat, test_set$sex),
@@ -367,7 +372,7 @@ tmp_1 <- map_df(cutoffs, function(x){
})
tmp_2 <- map_df(probs, function(p){
y_hat <-
- sample(c("Male", "Female"), n, replace = TRUE, prob = c(p, 1 - p)) |>
+ sample(c("Male", "Female"), nrow(test_set), replace = TRUE, prob = c(p, 1 - p)) |>
factor(levels = c("Female", "Male"))
list(method = "Guessing",
cutoff = round(p,1),
@@ -382,7 +387,7 @@ bind_rows(tmp_1, tmp_2) |>
geom_text_repel(nudge_x = 0.01, nudge_y = -0.01, show.legend = FALSE)
```

-We can see that we obtain higher sensitivity with this approach for all values of specificity, which implies it is in fact a better method. Note that ROC curves for guessing always fall on the identiy line. Also note that when making ROC curves, it is often nice to add the cutoff associated with each point.
+We can see that we obtain higher sensitivity with this approach for all values of specificity, which implies it is in fact a better method. Note that ROC curves for guessing always fall on the identity line. Also note that when making ROC curves, it is often nice to add the cutoff associated with each point. The packages __pROC__ and __plotROC__ are useful for generating these plots. @@ -390,7 +395,7 @@ ROC curves have one weakness and it is that neither of the measures plotted depe ```{r precision-recall-1, warning=FALSE, message=FALSE, echo=FALSE} -guessing <- map_df(probs, function(p){ +guessing <- map_df(probs[-1], function(p){ y_hat <- sample(c("Male", "Female"), length(test_index), replace = TRUE, prob = c(p, 1 - p)) |> factor(levels = c("Female", "Male")) @@ -399,7 +404,7 @@ guessing <- map_df(probs, function(p){ precision = precision(y_hat, test_set$sex)) }) -height_cutoff <- map_df(cutoffs, function(x){ +height_cutoff <- map_df(cutoffs[-1], function(x){ y_hat <- ifelse(test_set$height > x, "Male", "Female") |> factor(levels = c("Female", "Male")) list(method = "Height cutoff", @@ -408,7 +413,7 @@ height_cutoff <- map_df(cutoffs, function(x){ }) tmp_1 <- bind_rows(guessing, height_cutoff) |> mutate(Positive = "Y = 1 if Female") -guessing <- map_df(probs, function(p){ +guessing <- map_df(rev(probs)[-1], function(p){ y_hat <- sample(c("Male", "Female"), length(test_index), replace = TRUE, prob = c(p, 1 - p)) |> factor(levels = c("Male", "Female")) @@ -417,7 +422,7 @@ guessing <- map_df(probs, function(p){ precision = precision(y_hat, relevel(test_set$sex, "Male", "Female"))) }) -height_cutoff <- map_df(cutoffs, function(x){ +height_cutoff <- map_df(rev(cutoffs)[-1], function(x){ y_hat <- ifelse(test_set$height > x, "Male", "Female") |> factor(levels = c("Male", "Female")) list(method = "Height cutoff", @@ -433,51 +438,49 @@ bind_rows(tmp_1, tmp_2) |> facet_wrap(~ Positive) ``` -From this plot we immediately see that the precision of guessing is not high. This is because the prevalence is low. We also see that if we change positives to mean Male instead of Female, the ROC curve remains the same, but the precision recall plot changes. +From this plot we immediately see that the precision of guessing is not high. This is because the prevalence is low. We also see that if we change positives to mean male instead of female, the ROC curve remains the same, but the precision recall plot changes. -## The loss function {#sec-loss-function} +## Mean Squared Error {#sec-mse} Up to now we have described evaluation metrics that apply exclusively to categorical data. -Specifically, for binary outcomes, we have described how sensitivity, specificity, accuracy, and $F_1$ can be used as quantification. However, these metrics are not useful for continuous outcomes. In this section, we describe how the general approach to defining "best" in machine learning is to define a _loss function_, which can be applied to both categorical and continuous data. +Specifically, for binary outcomes, we have described how sensitivity, specificity, accuracy, and $F_1$ can be used as quantification. However, these metrics are not useful for continuous outcomes. -The most commonly used loss function is the squared loss function. If $\hat{y}$ is our predictor and $y$ is the observed outcome, the squared loss function is simply: +In this section, we describe how the general approach to defining "best" in machine learning is to define a _loss function_, which can be applied to both categorical and continuous data. 
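+As a quick numerical illustration of this flexibility (a toy sketch with made-up numbers, separate from the heights analysis), the same idea applies whether the outcome is continuous or coded as 0/1; here we use squared loss, defined formally next:
+
+```{r}
+#| eval: false
+# continuous outcome: the squared loss grows with the distance to the observed value
+y_cont <- c(68.2, 64.0, 70.5); y_hat_cont <- c(67.0, 65.0, 71.0)
+mean((y_hat_cont - y_cont)^2)
+# binary outcome coded as 0/1: the squared loss is 0 when correct and 1 when wrong
+y_bin <- c(1, 0, 1); y_hat_bin <- c(1, 1, 1)
+mean((y_hat_bin - y_bin)^2)  # equals the proportion of mistakes
+```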
-$$
-(\hat{y} - y)^2
-$$
+The most commonly used loss function is the squared loss function. If $\hat{y}$ is our predictor and $y$ is the observed outcome, the squared loss function is simply: $(\hat{y} - y)^2$.

-Because we often have a test set with many observations, say $N$, we use the mean squared error (MSE):
+Because we often model $y$ as the outcome of a random process, theoretically, it does not make sense to compare algorithms based on $(\hat{y} - y)^2$ as the minimum can change from sample to sample. For this reason, we minimize the mean squared error (MSE):

$$
-\mbox{MSE} = \frac{1}{N} \mbox{RSS} = \frac{1}{N}\sum_{i=1}^N (\hat{y}_i - y_i)^2
+\text{MSE} \equiv \mbox{E}\{(\hat{Y} - Y)^2 \}
$$

-In practice, we often report the root mean squared error (RMSE), which is $\sqrt{\mbox{MSE}}$, because it is in the same units as the outcomes. But doing the math is often easier with the MSE and it is therefore more commonly used in textbooks, since these usually describe theoretical properties of algorithms.
-If the outcomes are binary, both RMSE and MSE are equivalent to one minus accuracy, since $(\hat{y} - y)^2$ is 0 if the prediction was correct and 1 otherwise. In general, our goal is to build an algorithm that minimizes the loss so it is as close to 0 as possible.
+Note that if the outcomes are binary, the MSE is equivalent to one minus expected accuracy, since $(\hat{y} - y)^2$ is 0 if the prediction was correct and 1 otherwise.
+
+Different algorithms will result in different predictions $\hat{Y}$, and therefore a different MSE. In general, our goal is to build an algorithm that minimizes the loss so it is as close to 0 as possible.

-Because our data is usually a random sample, we can think of the MSE as a random variable and the observed MSE can be thought of as an estimate of the expected MSE, which in mathematical notation we write like this:
+However, note that the MSE is a theoretical quantity. How do we estimate it? Because in practice we have test sets with many, say $N$, independent observations, a commonly used observable estimate of the MSE is:

$$
-\mbox{E}\left\{ \frac{1}{N}\sum_{i=1}^N (\hat{Y}_i - Y_i)^2 \right\}
+\hat{\mbox{MSE}} = \frac{1}{N}\sum_{i=1}^N (\hat{y}_i - y_i)^2
$$

-This is a theoretical concept because in practice we only have one dataset to work with. But in theory, we think of having a very large number of random samples (call it $B$), apply our algorithm to each, obtain an MSE for each random sample, and think of the expected MSE as:
+with the $\hat{y}_i$ generated completely independently from the $y_i$.

-$$
-\frac{1}{B} \sum_{b=1}^B \frac{1}{N}\sum_{i=1}^N \left(\hat{y}_i^b - y_i^b\right)^2
-$$
+:::{.callout-note}
+In practice, we often report the root mean squared error (RMSE), which is simply $\sqrt{\mbox{MSE}}$, because it is in the same units as the outcomes.
+:::

-with $y_{i}^b$ denoting the $i$th observation in the $b$th random sample and $\hat{y}_i^b$ the resulting prediction obtained from applying the exact same algorithm to the $b$th random sample. Again, in practice we only observe one random sample, so the expected MSE is only theoretical. However, in Chapter @sec-cross-validation we describe an approach to estimating the MSE that tries to mimic this theoretical quantity.
+However, the estimate $\hat{\text{MSE}}$ is a random variable. In fact, $\text{MSE}$ and $\hat{\text{MSE}}$ are often referred to as the true error and apparent error, respectively.
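+To see concretely that the apparent error is a random variable, here is a toy simulation (a hedged sketch with simulated data, not part of the case study) in which the prediction rule is held fixed while the test set changes:
+
+```{r}
+#| eval: false
+set.seed(1)
+# a fixed rule: always predict 0 for a standard normal outcome
+apparent_errors <- replicate(1000, {
+  y <- rnorm(100)          # a new random test set of N = 100 outcomes
+  y_hat <- rep(0, 100)     # predictions from the fixed rule
+  mean((y_hat - y)^2)      # apparent error for this particular test set
+})
+sd(apparent_errors)        # the apparent error varies from test set to test set
+```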
+Due to the complexity of some machine learning algorithms, it is difficult to derive the statistical properties of how well the apparent error estimates the true error. In @sec-cross-validation we introduce cross validation, an approach to estimating the MSE that gets closer to the true MSE.

-Note that there are loss functions other than the squared loss. For example, the _Mean Absolute Error_ uses absolute values, $|\hat{Y}_i - Y_i|$ instead of squaring the errors
+We end the chapter by pointing out that there are loss functions other than the squared loss. For example, the _Mean Absolute Error_ uses absolute values, $|\hat{Y}_i - Y_i|$ instead of squaring the errors
$(\hat{Y}_i - Y_i)^2$. However, in this book we focus on minimizing square loss since it is the most widely used.

-
## Exercises
-
The `reported_height` and `height` datasets were collected from three classes taught in the Departments of Computer Science and Biostatistics, as well as remotely through the Extension School. The biostatistics class was taught in 2016 along with an online version offered by the Extension School. On 2016-01-25 at 8:15 AM, during one of the lectures, the instructors asked students to fill in the sex and height questionnaire that populated the `reported_height` dataset. The online students filled the survey during the next few days, after the lecture was posted online. We can use this insight to define a variable, call it `type`, to denote the type of student: `inclass` or `online`:

```{r, eval=FALSE}
diff --git a/ml/intro-ml.qmd b/ml/intro-ml.qmd
index dde69a6..7be5586 100644
--- a/ml/intro-ml.qmd
+++ b/ml/intro-ml.qmd
@@ -1,3 +1,13 @@
# Machine Learning {.unnumbered}

-Machine learning has achieved remarkable successes, ranging from the postal service's handwritten zip code readers to voice recognition systems like Apple's Siri. These advances also include movie recommendation systems, spam and malware detection, housing price prediction algorithms, and the development of driverless cars. Although _Artificial Intelligence (AI)_ and _Machine Learning_ are terms frequently used interchangeably today, here we distinguish between them. Traditional AI systems, exemplified by chess-playing machines, employed decision-making based on preset rules stemming from theories or fundamental principles. In contrast, machine learning makes decisions using algorithms trained with data. Furthermore, while AI typically refers to tools complete with user interfaces and ready for real-world application, the term _Machine Learning_ is often reserved for the underlying ideas, concepts, and methodologies, regardless of whether a tangible tool has been developed. In this part of the book we focus on these ideas, concepts, and methodologies, but also demonstrate their application to handwritten digits.
\ No newline at end of file
+
+Machine learning has achieved remarkable successes in a variety of applications. These range from the postal service's use of machine learning for reading handwritten zip codes to the development of voice recognition systems like Apple's Siri. Other significant advances include movie recommendation systems, spam and malware detection, housing price prediction algorithms, and the ongoing development of autonomous vehicles.
+
+The field of _Artificial Intelligence (AI)_ has been evolving for several decades. Traditional AI systems, including some chess-playing machines, often relied on decision-making based on preset rules and knowledge representation.
However, with the increased availability of data, machine learning has gained prominence. It focuses on decision-making through algorithms trained with data. In recent years, the terms AI and Machine Learning have been used interchangeably in many contexts, though they have distinct meanings. AI broadly refers to systems or applications that exhibit intelligent behavior, encompassing both rule-based approaches and machine learning. Machine Learning specifically involves learning from data to make decisions or predictions.
+
+In this part of the book, we will delve into the concepts, ideas, and methodologies of machine learning. We will also demonstrate their practical application, using the example of recognizing handwritten digits, a classic problem that exemplifies the power and utility of machine learning techniques.
+
+
+
+
diff --git a/ml/ml-in-practice.qmd b/ml/ml-in-practice.qmd
index ed0bf7a..798d926 100644
--- a/ml/ml-in-practice.qmd
+++ b/ml/ml-in-practice.qmd
@@ -1,9 +1,15 @@
-# Machine learning in practice
+# Machine learning in practice {#sec-ml-in-practice}

-Now that we have learned several methods and explored them with simple, yet illustrative, examples, we will try them out on a real example: the MNIST digits.
+Now that we have learned several methods and explored them with simple examples, we will try them out on a real example: the MNIST digits. We can load the data using the **dslabs** package:

+```{r}
+#| cache: false
+#| echo: false
+library(dslabs)
+```
+
```{r, message=FALSE, warning=FALSE}
library(dslabs)
mnist <- read_mnist()
```
@@ -42,35 +48,78 @@ y_test <- factor(mnist$test$labels[index])

## The caret package {#sec-caret}

-We have already learned about several machine learning algorithms. Many of these algorithms are implemented in R. However, they are distributed via different packages, developed by different authors, and often use different syntax. The __caret__ package tries to consolidate these differences and provide consistency. It currently includes over 200 different methods which are summarized in the __caret__ package manual^[https://topepo.github.io/caret/available-models.html]. Keep in mind that __caret__ does not include the needed packages and, to implement a package through __caret__, you still need to install the library. The required packages for each method are described in the package manual. The __caret__ package also provides a function that performs cross validation for us. Here we provide some examples showing how we use this incredibly helpful package. We will use the 2 or 7 example to illustrate and in later sections we use use the package to run algorithms on the larger MNIST datset.
+We have already learned about several machine learning algorithms. Many of these algorithms are implemented in R. However, they are distributed via different packages, developed by different authors, and often use different syntax. The __caret__ package tries to consolidate these differences and provide consistency. It currently includes over 200 different methods which are summarized in the __caret__ package manual^[https://topepo.github.io/caret/available-models.html].
Keep in mind that __caret__ does not include the packages needed to run each possible algorithm. To apply a machine learning method through __caret__ you still need to install the library that implements the method. The required packages for each method are described in the package manual.
+
+The __caret__ package also provides a function that performs cross validation for us. Here we provide some examples showing how we use this helpful package. We will first use the 2 or 7 example to illustrate and in later sections we use the package to run algorithms on the larger MNIST dataset.

### The `train` function

+The R functions that fit machine learning algorithms are all slightly different. Functions such as `lm`, `glm`, `qda`, `lda`, `knn3`, `rpart` and `randomForest` use different syntax, have different argument names and produce objects of different types.
+
The __caret__ `train` function lets us train different algorithms using similar syntax. So, for example, we can type:

```{r}
-#| cache: true
-
+#| cache: false
library(caret)
train_glm <- train(y ~ ., method = "glm", data = mnist_27$train)
+train_qda <- train(y ~ ., method = "qda", data = mnist_27$train)
train_knn <- train(y ~ ., method = "knn", data = mnist_27$train)
```

-To make predictions, we can use the output of this function directly without needing to look at the specifics of `predict.glm` and `predict.knn`. Instead, we can learn how to obtain predictions from `predict.train`.
+As we explain in more detail in @sec-caret-cv, the `train` function selects parameters for you using cross validation.
+
+### The `predict` function
+
+The `predict` function is very useful for machine learning applications. This function takes an object from a fitting function and a data frame with features $\mathbf{x}$ for which to predict, and returns predictions for these features.
+
+Here is an example with logistic regression:
+
+```{r}
+fit <- glm(y ~ ., data = mnist_27$train, family = "binomial")
+p_hat <- predict(fit, mnist_27$test, type = "response")
+```
+
+In this case, the function is simply computing
+
+$$
+\hat{p}(\mathbf{x}) = g^{-1}\left(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 \right) \text{ with } g(p) = \log\frac{p}{1-p} \implies g^{-1}(\mu) = \frac{1}{1+e^{-\mu}}
+$$
+
+for the `x_1` and `x_2` in the test set `mnist_27$test`. With these estimates in place we can make our predictions and compute our accuracy:
+
+```{r}
+y_hat <- factor(ifelse(p_hat > 0.5, 7, 2))
+```
+
+However, note that `predict` does not always return objects of the same type; it depends on what type of object it is applied to.
+To learn about the specifics, you need to look at the help file specific for the type of fit object that is being used. `predict` is actually a special type of function in R (called a _generic function_) that calls other functions depending on what kind of object it receives. So if `predict` receives an object coming out of the `glm` function, it will call `predict.glm`. If it receives an object coming out of `qda`, it calls `predict.qda`; if the fit is from `knn3`, it calls `predict.knn3`, and so on. These functions are similar but not exactly the same. You can learn more about the differences by reading the help files:
+
+```{r, eval = FALSE}
+?predict.glm
+?predict.qda
+?predict.knn3
+```
+
+There are many other versions of `predict` and many machine learning algorithms have a `predict` function.
+
+As with `train`, the **caret** package unifies the use of `predict` with the function `predict.train`.
This function takes the output of `train` and produces prediction of categories or estimates of $p(\mathbf{x}). + +The code looks the same for all methods: -The code looks the same for both methods: ```{r} y_hat_glm <- predict(train_glm, mnist_27$test, type = "raw") +y_hat_qda <- predict(train_qda, mnist_27$test, type = "raw") y_hat_knn <- predict(train_knn, mnist_27$test, type = "raw") ``` This permits us to quickly compare the algorithms. For example, we can compare the accuracy like this: ```{r} -fits <- list(glm = y_hat_glm, knn = y_hat_knn) +fits <- list(glm = y_hat_glm, qda = y_hat_qda, knn = y_hat_knn) sapply(fits, function(fit) confusionMatrix(fit, mnist_27$test$y)$overall[["Accuracy"]]) ``` + ### Cross validation {#sec-caret-cv} When an algorithm includes a tuning parameter, `train` automatically uses cross validation to decide among a few default values. To find out what parameter or parameters are optimized, you can read the manual ^[http://topepo.github.io/caret/available-models.html] or study the output of: @@ -97,7 +146,7 @@ you can quickly see the results of the cross validation using the `ggplot` funct ggplot(train_knn, highlight = TRUE) ``` -By default, the cross validation is performed by taking 25 bootstrap samples comprised of 25% of the observations. For the `kNN` method, the default is to try $k=5,7,9$. We change this using the `tuneGrid` parameter. The grid of values must be supplied by a data frame with the parameter names as specified in the `modelLookup` output. +By default, the cross validation is performed by taking 25 bootstrap samples comprised of 25% of the observations. For the `kNN` method, the default is to try $k=5,7,9$. We change this using the `tuneGrid` argument. The grid of values must be supplied by a data frame with the parameter names as specified in the `modelLookup` output. Here, we present an example where we try out 30 values between 9 and 67. To do this with __caret__, we need to define a column named `k`, so we use this: `data.frame(k = seq(9, 67, 2))`. Note that when running this code, we are fitting 30 versions of kNN to 25 bootstrapped samples. Since we are fitting $30 \times 25 = 750$ kNN models, running this code will take several seconds. We set the seed because cross validation is a random procedure and we want to make sure the result here is reproducible. @@ -147,15 +196,15 @@ Note that `results` component of the `train` output includes several summary sta names(train_knn$results) ``` -We have only covered the basics. To **caret** package manual ^[https://topepo.github.io/caret/available-models.html] includes many more details. +You can learn many more details about the **caret** package, from the manual ^[https://topepo.github.io/caret/available-models.html]. ## Preprocessing -In machine learning, we often transform predictors before running the machine algorithm. We also remove predictors that are clearly not useful. We call these steps *preprocessing*. +We often transform predictors before running the machine algorithm. We also remove predictors that are clearly not useful. We call these steps *preprocessing*. Examples of preprocessing include standardizing the predictors, taking the log transform of some predictors, removing predictors that are highly correlated with others, and removing predictors with very few non-unique values or close to zero variation. We show an example below. -We can run the `nearZero` function from the **caret** package to see that several features do not vary much from observation to observation. 
We can see that there is a large number of features with 0 variability:
+For example, we can run the `nearZeroVar` function from the **caret** package to see that several features do not vary much from observation to observation. In fact, there is a large number of features with 0 variability:

```{r pixel-sds, message=FALSE, warning=FALSE, cache=FALSE}
library(matrixStats)
@@ -198,7 +247,9 @@ colnames(x_test) <- colnames(x)

## k-nearest neighbors

-Before starting this section, **be warned** that the first two calls to the `train` function in the code below can take several hours to run. This is common challenge when training machine learning algorithms since we have to run the algorithm for each cross validation split and each set of tuning parameter being considered. In the next section, we will provide some suggestion on how to predict the duration of the process and ways to reduce.
+:::{.callout-warning}
+Before starting this section, note that the first two calls to the `train` function in the code below can take several hours to run. This is a common challenge when training machine learning algorithms since we have to run the algorithm for each cross validation split and each set of tuning parameters being considered. In the next section, we will provide some suggestions on how to predict the duration of the process and ways to reduce it.
+:::

The first step is to optimize for $k$.

@@ -225,6 +276,7 @@ y_hat_knn <- predict(fit_knn, x_test[, col_index], type = "class")
confusionMatrix(y_hat_knn, factor(y_test))$overall["Accuracy"]
```

+### Dimension reduction with PCA

An alternative to removing low variance columns directly, is to use dimension reduction on the feature matrix before applying the algorithms. It is important that we not use the test set when finding the PCs nor any summary of the data, as this could result in overtraining. So we start by applying `prcomp` to the training data:

```{r}
col_means <- colMeans(x)
pca <- prcomp(sweep(x, 2, col_means), center = FALSE)
```

-Next, and run knn on just a small number of dimensions. We try 36 dimensions since this explains about 80% of the data.
+Next, we run knn on just a small number of dimensions. We try 36 dimensions since this explains about 80% of the data.

```{r}
#| eval: false
k <- 36
x_train <- pca$x[,1:k]
train_knn <- train(x_train, y,
 method = "knn",
 tuneGrid = data.frame(k = seq(3, 13, 2)))
-fit <- knn3(x_train, y, k = train_knn$bestTune)
+fit_knn_pca <- knn3(x_train, y, k = train_knn$bestTune)
```

```{r}
#| echo: false
k <- 36
-fit <- knn3(pca$x[,1:k], y, k = 7)
+fit_knn_pca <- knn3(pca$x[,1:k], y, k = 7)
```

Now we apply the transformation we learned with the training data to the test data, reduce the dimension, and then run predict. Note that we used the rotation and column means estimated from the training data.

```{r}
-y_hat <- predict(fit, sweep(x_test, 2, col_means) %*% pca$rotation[,1:k], type = "class")
-confusionMatrix(y_hat, factor(y_test))$overall["Accuracy"]
+newdata <- sweep(x_test, 2, col_means) %*% pca$rotation[,1:k]
+y_hat_knn_pca <- predict(fit_knn_pca, newdata, type = "class")
+confusionMatrix(y_hat_knn_pca, factor(y_test))$overall["Accuracy"]
```

We obtain an improvement in accuracy, while using only 36 dimensions.

+:::{.callout-note}
+Remember that the entire algorithm needs to be developed on the training data.
## Random Forest

With the random forest algorithm several parameters can be optimized, but the main one is `mtry`, the number of predictors that are randomly selected for each tree. This is also the only tuning parameter that the **caret** function `train` permits when using the default implementation from the **randomForest** package.

@@ -283,7 +341,7 @@ fit_rf <- randomForest(x[, col_index], y, mtry = train_rf$bestTune$mtry)

```{r}
#| echo: false
#| cache: false
-
+#| message: false
library(randomForest)
```

@@ -307,7 +365,7 @@ By optimizing some of the other algorithm parameters we can achieve even higher

The default method for estimating accuracy used by the `train` function is to test prediction on 25 bootstrap samples. This can result in long compute times. For example, if we are considering several values, say 10, of the tuning parameters, we will fit the algorithm 250 times. We can use the `system.time` function to estimate how long it takes to run the algorithm once:

```{r}
-system.time(fit_rf <- randomForest(x[, col_index], y, mtry = 9))
+system.time({fit_rf <- randomForest(x[, col_index], y, mtry = 9)})
```

and use this to estimate the total time for the 250 iterations. In this case it will be several hours.

@@ -327,14 +385,15 @@ train_rf <- train(x[, col_index], y,
                  trControl = control)
```

-For random forest we can also speed up the training step by running less trees per fit. After running the algorithm once, we can use the plot function to see how the error rate changes as the number of trees grows. Here we
+For random forest we can also speed up the training step by running fewer trees per fit. After running the algorithm once, we can use the plot function to see how the error rate changes as the number of trees grows.
+Here we can see that the error rate stabilizes after about 200 trees.

```{r}
plot(fit_rf)
```

-We can see that error rate stabilizes after about 200 trees. We can use this finding to speed up the cross validation procedure. Specifically, because the default is 500, by adding the argument `ntree = 200` to the call to `train` above, the procedure will finish 2.5 times faster.
+We can use this finding to speed up the cross validation procedure. Specifically, because the default is 500, by adding the argument `ntree = 200` to the call to `train` above, the procedure will finish 2.5 times faster.

## Variable importance

@@ -352,14 +411,14 @@ mat[col_index] <- imp
image(matrix(mat, 28, 28))
```

-```{r importance-image, fig.width = 4, fig.height = 4, out.width="50%"}
+```{r importance-image, echo=FALSE, fig.width = 4, fig.height = 4, out.width="50%"}
rafalib::mypar()
mat <- rep(0, ncol(x))
mat[col_index] <- imp
image(matrix(mat, 28, 28))
```

-## Visual assessments
+## Diagnostics

An important part of data analysis is visualizing results to determine why we are failing. How we do this depends on the application. Below we show the images of digits for which we made an incorrect prediction. Here are some errors for the random forest:
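To make this kind of diagnostic concrete, here is a minimal sketch of how a few misclassified test digits could be displayed. It assumes the objects from this chapter (`fit_rf`, `x_test`, `y_test`, `col_index`) are in memory; the object `y_hat_rf` holding the random forest predictions is introduced here only for the illustration and is not part of the chapter's code.

```{r}
#| eval: false
# Sketch: show a few test digits that the random forest misclassified.
# Assumes fit_rf, x_test, y_test, and col_index are defined as above;
# y_hat_rf is created here only for this illustration.
y_hat_rf <- predict(fit_rf, x_test[, col_index])
ind <- which(y_hat_rf != y_test)
rafalib::mypar(1, 4)
for (i in ind[1:4]) {
  image(matrix(x_test[i, ], 28, 28)[, 28:1],
        main = paste0("Predicted ", y_hat_rf[i], ", actually ", y_test[i]),
        xaxt = "n", yaxt = "n")
}
```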
@@ -387,26 +446,115 @@ The idea of an ensemble is similar to the idea of combining data from different

In machine learning, one can usually greatly improve the final results by combining the results of different algorithms.

-Here is a simple example where we compute new class probabilities by taking the average of random forest and kNN. We can see that the accuracy improves to 0.96:
+Here is a simple example where we compute new class probabilities by taking the average of random forest and kNN. We can see that the accuracy improves:

```{r}
p_rf <- predict(fit_rf, x_test[,col_index], type = "prob")
p_rf <- p_rf / rowSums(p_rf)
-p_knn <- predict(fit_knn, x_test[,col_index])
-p <- (p_rf + p_knn)/2
+p_knn_pca <- predict(fit_knn_pca, newdata)
+p <- (p_rf + p_knn_pca)/2
y_pred <- factor(apply(p, 1, which.max) - 1)
confusionMatrix(y_pred, y_test)$overall["Accuracy"]
```

-In the exercises we are going to build several machine learning models for the `mnist_27` dataset and then build an ensemble.
+We have just built an ensemble with just two algorithms. By combining more similarly performing, but uncorrelated, algorithms we can improve accuracy further.

## Exercises

-1\. Previously in the book, we have compared conditional probability give two predictors $p(x_1,x_2)$ to the fit $\hat{p}(x_1,x_2)$ obtained with a machine learning algorithm by making image plots. The following code can be used to make these images and include a curve at the values of $x_1$ and $x_2$ for which the function is $0.5$.
+1\. In the exercises in @sec-example-alogirhms we saw that changing `maxnodes` or `nodesize` in the `randomForest` function improved our estimate. Let's use the `train` function to help us pick these values. From the **caret** manual we see that we can't tune the `maxnodes` parameter or the `nodesize` argument with `randomForest`, so we will use the **Rborist** package and tune the `minNode` argument. Use the `train` function to try values `minNode <- seq(5, 250, 25)`. See which value minimizes the estimated RMSE.
+
+2\. The **dslabs** package includes the `tissue_gene_expression` dataset, which contains a matrix `x`:
+
+```{r, eval = FALSE}
+library(dslabs)
+dim(tissue_gene_expression$x)
+```
+
+with the gene expression measured on 500 genes for 189 biological samples representing seven different tissues. The tissue type is stored in `y`:
+
+```{r, eval = FALSE}
+table(tissue_gene_expression$y)
+```
+
+Split the data into training and test sets, then use kNN to predict tissue type and see what accuracy you obtain. Try it for $k = 1, 3, \dots, 11$.
+
+3\. We are going to apply LDA and QDA to the `tissue_gene_expression` dataset. We will start with simple examples based on this dataset and then develop a realistic example.
+
+Create a dataset with just the classes `cerebellum` and `hippocampus` (two parts of the brain) and a predictor matrix with 10 randomly selected columns. Estimate the accuracy of LDA.
+
+```{r, eval = FALSE}
+set.seed(1993)
+tissues <- c("cerebellum", "hippocampus")
+ind <- which(tissue_gene_expression$y %in% tissues)
+y <- droplevels(tissue_gene_expression$y[ind])
+x <- tissue_gene_expression$x[ind, ]
+x <- x[, sample(ncol(x), 10)]
+```
+
+4\. In this case, LDA fits two 10-dimensional normal distributions. Look at the fitted model in the `finalModel` component of the result of `train`. Notice there is a component called `means` that includes the estimated means of both distributions. Plot the mean vectors against each other and determine which predictors (genes) appear to be driving the algorithm.
+
+5\. Repeat exercise 3 with QDA. Does it have a higher accuracy than LDA?
+
+6\. Are the same predictors (genes) driving the algorithm? Make a plot as in exercise 4.
+
+7\.
One thing we see in the previous plot is that the value of predictors correlate in both groups: some predictors are low in both groups while others are high in both groups. The mean value of each predictor, `colMeans(x)`, is not informative or useful for prediction, and often for interpretation purposes it is useful to center or scale each column. This can be achieved with the `preProcessing` argument in `train`. Re-run LDA with `preProcessing = "scale"`. Note that accuracy does not change but see how it is easier to identify the predictors that differ more between groups in the plot made in exercise 4. + +8\. In the previous exercises we saw that both approaches worked well. Plot the predictor values for the two genes with the largest differences between the two groups in a scatterplot to see how they appear to follow a bivariate distribution as assumed by the LDA and QDA approaches. Color the points by the outcome. + +9\. Now we are going to increase the complexity of the challenge slightly: we will consider all the tissue types. + +```{r, eval = FALSE} +set.seed(1993) +y <- tissue_gene_expression$y +x <- tissue_gene_expression$x +x <- x[, sample(ncol(x), 10)] +``` + +What accuracy do you get with LDA? + + +10\. We see that the results are slightly worse. Use the `confusionMatrix` function to learn what type of errors we are making. + + +11\. Plot an image of the centers of the seven 10-dimensional normal distributions. + +12\. Make a scatterplot along with the prediction from the best fitted model. + + +13\. Use the `rpart` function to fit a classification tree to the `tissue_gene_expression` dataset. Use the `train` function to estimate the accuracy. Try out `cp` values of `seq(0, 0.05, 0.01)`. Plot the accuracy to report the results of the best model. + +14\. Study the confusion matrix for the best fitting classification tree. What do you observe happening for placenta? + +15\. Notice that placentas are called endometrium more often than placenta. Note also that the number of placentas is just six, and that, by default, `rpart` requires 20 observations before splitting a node. Thus it is not possible with these parameters to have a node in which placentas are the majority. Rerun the above analysis but this time permit `rpart` to split any node by using the argument `control = rpart.control(minsplit = 0)`. Does the accuracy increase? Look at the confusion matrix again. + +16\. Plot the tree from the best fitting model obtained in exercise 11. + +17\. We can see that with just six genes, we are able to predict the tissue type. Now let's see if we can do even better with a random forest. Use the `train` function and the `rf` method to train a random forest. Try out values of `mtry` ranging from, at least, `seq(50, 200, 25)`. What `mtry` value maximizes accuracy? To permit small `nodesize` to grow as we did with the classification trees, use the following argument: `nodesize = 1`. This will take several seconds to run. If you want to test it out, try using smaller values with `ntree`. Set the seed to 1990. + +18\. Use the function `varImp` on the output of `train` and save it to an object called `imp`. + +19\. The `rpart` model we ran above produced a tree that used just six predictors. Extracting the predictor names is not straightforward, but can be done. 
If the output of the call to train was `fit_rpart`, we can extract the names like this: + +```{r, eval = FALSE} +ind <- !(fit_rpart$finalModel$frame$var == "") +tree_terms <- + fit_rpart$finalModel$frame$var[ind] |> + unique() |> + as.character() +tree_terms +``` + +What is the variable importance in the random forest call for these predictors? Where do they rank? + +20\. Extract the top 50 predictors based on importance, take a subset of `x` with just these predictors and apply the function `heatmap` to see how these genes behave across the tissues. We will introduce the `heatmap` function in Chapter @sec-clustering. + + + +21\. Previously we have compared conditional probability give two predictors $p(x_1,x_2)$ to the fit $\hat{p}(x_1,x_2)$ obtained with a machine learning algorithm by making image plots. The following code can be used to make these images and include a curve at the values of $x_1$ and $x_2$ for which the function is $0.5$. ```{r} #| eval: false - plot_cond_prob <- function(x_1, x_2, p){ data.frame(x_1 = x_1, x_2 = x_2, p = p) |> ggplot(aes(x_1, x_2)) + @@ -424,10 +572,9 @@ with(mnist_27$true_p, plot_cond_prob(x_1, x_2, p)) ``` -Fit a kNN model and make this plot for the estimated conditional probability. Hint: Use the argument `newdata=mnist27$train` to obtain predictions for a grid points. +Fit a kNN model and make this plot for the estimated conditional probability. Hint: Use the argument `newdata = mnist27$train` to obtain predictions for a grid points. -2\. Notice that, in the plot made in exercise 1, the boundary is somewhat wiggly. This is because kNN, like the basic bin smoother, does not use a kernel. To improve this we could try loess. By reading through the available models part of the manual^[https://topepo.github.io/caret/available-models.html] we see that we can use the `gamLoess` method. -In the manual^[https://topepo.github.io/caret/train-models-by-tag.html] we also see that we need to install the __gam__ package if we have not done so already. We see that we have two parameters to optimize: +22\. Notice that, in the plot made in exercise 1, the boundary is somewhat wiggly. This is because kNN, like the basic bin smoother, does not use a kernel. To improve this we could try loess. By reading through the available models part of the **caret** manual we see that we can use the `gamLoess` method. We see that we need to install the __gam__ package, if we have not done so already. We see that we have two parameters to optimize: ```{r} modelLookup("gamLoess") @@ -436,10 +583,9 @@ modelLookup("gamLoess") Use cross-validation to pick a span between 0.15 and 0.75. Keep `degree = 1`. What span does cross-validation select? -3\. Show an image plot of the estimate $\hat{p}(x,y)$ resulting from the model fit in the exercise 2. How does the accuracy compare to that of kNN? Comment on the difference between the estimate obtained with kNN. - +23\. Show an image plot of the estimate $\hat{p}(x,y)$ resulting from the model fit in the exercise 2. How does the accuracy compare to that of kNN? Comment on the difference between the estimate obtained with kNN. -4\. Use the `mnist_27` training set to build a model with several of the models available from the **caret** package. For example, you can try these: +24\. Use the `mnist_27` training set to build a model with several of the models available from the **caret** package. 
For example, you can try these: ```{r, eval = FALSE} models <- c("glm", "lda", "naive_bayes", "svmLinear", "gamboost", @@ -450,25 +596,25 @@ models <- c("glm", "lda", "naive_bayes", "svmLinear", "gamboost", We have not explained many of these, but apply them anyway using `train` with all the default parameters. Keep the results in a list. You might need to install some packages. Keep in mind that you will likely get some warnings. -5\. Now that you have all the trained models in a list, use `sapply` or `map` to create a matrix of predictions for the test set. You should end up with a matrix with `length(mnist_27$test$y)` rows and `length(models)` columns. +25\. Now that you have all the trained models in a list, use `sapply` or `map` to create a matrix of predictions for the test set. You should end up with a matrix with `length(mnist_27$test$y)` rows and `length(models)` columns. -6\. Now compute accuracy for each model on the test set. +26\. Compute accuracy for each model on the test set. -7\. Now build an ensemble prediction by majority vote and compute its accuracy. +27\. Build an ensemble prediction by majority vote and compute its accuracy. -8\. Earlier we computed the accuracy of each method on the training set and noticed they varied. Which individual methods do better than the ensemble? +28\. Earlier we computed the accuracy of each method on the training set and noticed they varied. Which individual methods do better than the ensemble? -9\. It is tempting to remove the methods that do not perform well and re-do the ensemble. The problem with this approach is that we are using the test data to make a decision. However, we could use the accuracy estimates obtained from cross validation with the training data. Obtain these estimates and save them in an object. +29\. It is tempting to remove the methods that do not perform well and re-do the ensemble. The problem with this approach is that we are using the test data to make a decision. However, we could use the accuracy estimates obtained from cross validation with the training data. Obtain these estimates and save them in an object. -10\. Now let's only consider the methods with an estimated accuracy of 0.8 when constructing the ensemble. What is the accuracy now? +30\. Now let's only consider the methods with an estimated accuracy of 0.8 when constructing the ensemble. What is the accuracy now? -11\. **Advanced**: If two methods give results that are the same, ensembling them will not change the results at all. For each pair of metrics compare the percent of time they call the same thing. Then use the `heatmap` function to visualize the results. Hint: use the `method = "binary"` argument in the `dist` function. +31\. If two methods give results that are the same, ensembling them will not change the results at all. For each pair of metrics compare the percent of time they call the same thing. Then use the `heatmap` function to visualize the results. Hint: use the `method = "binary"` argument in the `dist` function. -12\. **Advanced**: Note that each method can also produce an estimated conditional probability. Instead of majority vote we can take the average of these estimated conditional probabilities. For most methods, we can the use the `type = "prob"` in the train function. However, some of the methods require you to use the argument `trControl=trainControl(classProbs=TRUE)` when calling train. Also these methods do not work if classes have numbers as names. Hint: change the levels like this: +32\. 
Note that each method can also produce an estimated conditional probability. Instead of majority vote we can take the average of these estimated conditional probabilities. For most methods, we can the use the `type = "prob"` in the train function. Note that some of the methods require you to use the argument `trControl=trainControl(classProbs=TRUE)` when calling train. Also these methods do not work if classes have numbers as names. Hint: change the levels like this: ```{r, eval = FALSE} dat$train$y <- recode_factor(dat$train$y, "2"="two", "7"="seven") dat$test$y <- recode_factor(dat$test$y, "2"="two", "7"="seven") ``` -13\. In this chapter, we illustrated a couple of machine learning algorithms on a subset of the MNIST dataset. Try fitting a model to the entire dataset. +33\. In this chapter, we illustrated a couple of machine learning algorithms on a subset of the MNIST dataset. Try fitting a model to the entire dataset. diff --git a/ml/smoothing.qmd b/ml/smoothing.qmd index 13ac1a9..6ab4943 100644 --- a/ml/smoothing.qmd +++ b/ml/smoothing.qmd @@ -6,7 +6,7 @@ img_path <- "img" Before continuing learning about machine learning algorithms, we introduce the important concept of *smoothing*. Smoothing is a very powerful technique used all across data analysis. Other names given to this technique are *curve fitting* and *low pass filtering*. It is designed to detect trends in the presence of noisy data in cases in which the shape of the trend is unknown. The *smoothing* name comes from the fact that to accomplish this feat, we assume that the trend is *smooth*, as in a smooth surface. In contrast, the noise, or deviation from the trend, is unpredictably wobbly: -```{r signal-plus-noise-example, message=FALSE, warning=FALSE, fig.height=6, echo=FALSE} +```{r signal-plus-noise-example, message=FALSE, warning=FALSE, fig.height=6, echo=FALSE, cache=FALSE} library(tidyverse) set.seed(1) n <- 100 @@ -20,22 +20,21 @@ gridExtra::grid.arrange(p1, p2, p3) Part of what we explain in this section are the assumptions that permit us to extract the trend from the noise. +## Example: Is it a 2 or a 7? {#sec-two-or-seven} +To motivate the need for smoothing and make the connection with machine learning, we will construct a simplified version of the MNIST dataset with just two classes for the outcome and two predictors. Specifically, we define the challenge as building an algorithm that can determine if a digit is a 2 or 7 from the proportion of dark pixels in the upper left quadrant ($X_1$) and the lower right quadrant ($X_2$). We also selected a random sample of 1,000 digits, 500 in the training set and 500 in the test set. We provide this dataset in the `dslabs` package: -## Simplified MNIST: Is it a 2 or a 7? {#sec-two-or-seven} - -To motivate the need for smoothing and make the connection with machine learning, we will construct a simplifyed version of the MNIST dataset with just two classes for the outcome and two predictors. Specifically, we define the challenge as building an algorithm that can determine if a digit is a 2 or 7 from the proportion of dark pixels in the upper left quadrant ($X_1$) and the lower right quadrant ($X_2$). We also selected a random sample of 1,000 digits, 500 in the training set and 500 in the test set. 
We provide this dataset in the `dslabs` package:
+So for the training data we have $n=500$ observed outcomes $y_1,\dots,y_n$, with $Y$ defined as $1$ if the digit is 7 and $0$ if it is a 2, and $n=500$ features $\mathbf{x}_1, \dots, \mathbf{x}_n$, with each feature a two-dimensional point $\mathbf{x}_i = (x_{i,1}, x_{i,2})^\top$. Here is a plot of the $x_2$s versus the $x_1$s with color indicating whether $y$ is 1 (blue) or 0 (red).

```{r two-or-seven-scatter, warning=FALSE, message=FALSE, cache=FALSE}
-library(tidyverse)
library(caret)
library(dslabs)
mnist_27$train |> ggplot(aes(x_1, x_2, color = y)) + geom_point()
```

-We can immediately see some patterns. For example, if $X_1$ (the upper left panel) is very large, then the digit is probably a 7. Also, for smaller values of $X_1$, the 2s appear to be in the mid range values of $X_2$.
+We can immediately see some patterns. For example, if $x_1$ (the upper left panel) is very large, then the digit is probably a 7. Also, for smaller values of $x_1$, the 2s appear to be in the mid range values of $x_2$.

-To illustrate how to interpret $X_1$ and $X_2$, we include four example images. On the left are the original images of the two digits with the largest and smallest values for $X_1$ and on the right we have the images corresponding to the largest and smallest values of $X_2$:
+To illustrate how to interpret $x_1$ and $x_2$, we include four example images. On the left are the original images of the two digits with the largest and smallest values for $x_1$ and on the right we have the images corresponding to the largest and smallest values of $x_2$:

```{r two-or-seven-images-large-x1, echo=FALSE, out.width="100%", fig.height=3, fig.width=6.5}
if (!exists("mnist")) mnist <- read_mnist()
@@ -81,30 +80,26 @@ We can start getting a sense for why these predictors are useful, but also why t

We haven't really learned any algorithms yet, so let's try building an algorithm using multivariable regression. The model is simply:

$$
-p(x_1, x_2) = \mbox{Pr}(Y=1 \mid X_1=x_1 , X_2 = x_2) =
+p(\mathbf{x}) = \mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x}) = \mbox{Pr}(Y=1 \mid X_1=x_1 , X_2 = x_2) =
\beta_0 + \beta_1 x_1 + \beta_2 x_2
$$

-We fit it like this:
-
-```{r}
-fit <- mnist_27$train |> mutate(y = ifelse(y == 7, 1, 0)) |> lm(y ~ x_1 + x_2, data = _)
-```
+We can fit this model using least squares and obtain an estimate $\hat{p}(\mathbf{x})$ by plugging in the least squares estimates $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$.
+We define a decision rule by predicting $\hat{y}=1$ if $\hat{p}(\mathbf{x})>0.5$ and 0 otherwise.

-We can now build a decision rule based on the estimate of $\hat{p}(x_1, x_2)$:

-```{r}
+```{r, echo=FALSE}
+fit <- mnist_27$train |> mutate(y = ifelse(y == 7, 1, 0)) |> lm(y ~ x_1 + x_2, data = _)
p_hat <- predict(fit, newdata = mnist_27$test)
y_hat <- factor(ifelse(p_hat > 0.5, 7, 2))
-confusionMatrix(y_hat, mnist_27$test$y)$overall[["Accuracy"]]
```

-We get an accuracy well above 50%. Not bad for our first try. But can we do better?
+We get an accuracy of `r confusionMatrix(y_hat, mnist_27$test$y)$overall[["Accuracy"]]`, well above 50%. Not bad for our first try. But can we do better?

-Because we constructed the `mnist_27` example and we had at our disposal 60,000 digits in just the MNIST dataset, we used this to build the _true_ conditional distribution $p(x_1, x_2)$.
Keep in mind that this is something we don't have access to in practice, but we include it in this example because it permits the comparison of $\hat{p}(x_1, x_2)$ to the true $p(x_1, x_2)$. This comparison teaches us the limitations of different algorithms. Let's do that here. We have stored the true $p(x_1,x_2)$ in the `mnist_27` object and can plot the image using the __ggplot2__ function `geom_raster()`. -We choose better colors and use the `stat_contour` function to draw a curve that separates pairs $(x_1,x_2)$ for which $p(x_1,x_2) > 0.5$ and pairs for which $p(x_1,x_2) < 0.5$: +Because we constructed the `mnist_27` example and we had at our disposal 60,000 digits in just the MNIST dataset, we used this to build the _true_ conditional distribution $p(\mathbf{x})$. Keep in mind that in practice we don't have access to the true conditional distribution. We include it in this educational example because it permits the comparison of $\hat{p}(\mathbf{x})$ to the true $p(\mathbf{x})$. This comparison teaches us the limitations of different algorithms. We have stored the true $p(\mathbf{x})$ in the `mnist_27` and can plot it as an image. +We draw a curve that separates pairs $(\mathbf{x})$ for which $p(\mathbf{x}) > 0.5$ and pairs for which $p(\mathbf{x}) < 0.5$: -```{r true-p-better-colors} +```{r true-p-better-colors, echo=FALSE} mnist_27$true_p |> ggplot(aes(x_1, x_2, z = p)) + geom_raster(aes(fill = p)) + @@ -112,8 +107,8 @@ mnist_27$true_p |> stat_contour(breaks = c(0.5), color = "black") ``` -Above you see a plot of the true $p(x_1, x_2)$. To start understanding the limitations of regression here, first note that with regression $\hat{p}(x_1,x_2)$ has to be a plane, and as a result the boundary defined by the decision rule is given by: -$\hat{p}(x_1,x_2) = 0.5$: +To start understanding the limitations of regression, first note that with regression $\hat{p}(\mathbf{x})$ has to be a plane, and as a result the boundary defined by the decision rule is given by: +$\hat{p}(\mathbf{x}) = 0.5$: $$ \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 = 0.5 \implies @@ -121,7 +116,7 @@ $$ x_2 = (0.5-\hat{\beta}_0)/\hat{\beta}_2 -\hat{\beta}_1/\hat{\beta}_2 x_1 $$ -Note that for this boundary, $x_2$ is a linear function of $x_1$. This implies that our regression approach has no chance of capturing the non-linear nature of the true $p(x_1,x_2)$. Below is a visual representation of $\hat{p}(x_1, x_2)$. Regression can't catch this. +This implies that for the boundary $x_2$ is a linear function of $x_1$. This implies that our regression approach has no chance of capturing the non-linear nature of the true $p(\mathbf{x})$. Below is a visual representation of $\hat{p}(\mathbf{x})$ which clearly shows how it fails to capture the shape of $p(\mathbf{x})$. ```{r regression-p-hat, echo=FALSE, out.width="100%", fig.height=3, fig.width=7} p_hat <- predict(fit, newdata = mnist_27$true_p) @@ -149,8 +144,6 @@ To explain these concepts, we will focus first on a problem with just one predic ```{r polls-2008-data, warning=FALSE, message=FALSE, cache=FALSE} -library(tidyverse) -library(dslabs) polls_2008 |> ggplot(aes(day, margin)) + geom_point() ``` @@ -174,11 +167,11 @@ polls_2008 |> geom_point(aes(color = resid), size = 3) ``` -The line we see does not appear to describe the trend very well. For example, on September 4 (day -62), the Republican Convention was held and the data suggest that it gave John McCain a boost in the polls. However, the regression line does not capture this potential trend. 
To see the *lack of fit* more clearly, we note that points above the fitted line (blue) and those below (red) are not evenly distributed across days. We therefore need an alternative, more flexible approach. +The fitted regression line does not appear to describe the trend very well. For example, on September 4 (day -62), the Republican Convention was held and the data suggest that it gave John McCain a boost in the polls. However, the regression line does not capture this potential trend. To see the *lack of fit* more clearly, we note that points above the fitted line (blue) and those below (red) are not evenly distributed across days. We therefore need an alternative, more flexible approach. ## Bin smoothing -The general idea of smoothing is to group data points into strata in which the value of $f(x)$ can be assumed to be constant. We can make this assumption because we think $f(x)$ changes slowly and, as a result, $f(x)$ is almost constant in small windows of time. An example of this idea for the `poll_2008` data is to assume that public opinion remained approximately the same within a week's time. With this assumption in place, we have several data points with the same expected value. +The general idea of smoothing is to group data points into strata in which the value of $f(x)$ can be assumed to be constant. We can make this assumption when we think $f(x)$ changes slowly and, as a result, $f(x)$ is almost constant in small windows of $x$. An example of this idea for the `poll_2008` data is to assume that public opinion remained approximately the same within a week's time. With this assumption in place, we have several data points with the same expected value. If we fix a day to be in the center of our week, call it $x_0$, then for any other day $x$ such that $|x - x_0| \leq 3.5$, we assume $f(x)$ is a constant $f(x) = \mu$. This assumption implies that: @@ -188,13 +181,13 @@ $$ In smoothing, we call the size of the interval satisfying $|x_i - x_0| \leq 3.5$ the *window size*, *bandwidth* or *span*. Later we will see that we try to optimize this parameter. -This assumption implies that a good estimate for $f(x)$ is the average of the $Y_i$ values in the window. If we define $A_0$ as the set of indexes $i$ such that $|x_i - x_0| \leq 3.5$ and $N_0$ as the number of indexes in $A_0$, then our estimate is: +This assumption implies that a good estimate for $f(x_0)$ is the average of the $Y_i$ values in the window. If we define $A_0$ as the set of indexes $i$ such that $|x_i - x_0| \leq 3.5$ and $N_0$ as the number of indexes in $A_0$, then our estimate is: $$ \hat{f}(x_0) = \frac{1}{N_0} \sum_{i \in A_0} Y_i $$ -The idea behind *bin smoothing* is to make this calculation with each value of $x$ as the center. In the poll example, for each day, we would compute the average of the values within a week with that day in the center. Here are two examples: $x_0 = -125$ and $x_0 = -55$. The blue segment represents the resulting average. +We make this calculation with each value of $x$ as the center. In the poll example, for each day, we would compute the average of the values within a week with that day in the center. Here are two examples: $x_0 = -125$ and $x_0 = -55$. The blue segment represents the resulting average. 
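In code, the estimate defined above is just a windowed average. Here is a minimal sketch of computing it directly, assuming `polls_2008` from the **dslabs** package is loaded; the figures that follow and the `ksmooth` call later in the chapter implement the same idea.

```{r}
#| eval: false
# Sketch: for each center x0, average the margins observed within the
# window |x_i - x0| <= 3.5 days, i.e., the estimate f_hat(x0) defined above.
# Assumes polls_2008 from the dslabs package is available.
span <- 3.5
f_hat <- sapply(polls_2008$day, function(x0) {
  in_window <- abs(polls_2008$day - x0) <= span
  mean(polls_2008$margin[in_window])
})
o <- order(polls_2008$day)
plot(polls_2008$day, polls_2008$margin, col = "grey", xlab = "day", ylab = "margin")
lines(polls_2008$day[o], f_hat[o], col = "red")
```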
```{r binsmoother-expained, echo=FALSE} span <- 3.5 @@ -257,10 +250,10 @@ The final code and resulting estimate look like this: span <- 7 fit <- with(polls_2008, ksmooth(day, margin, kernel = "box", bandwidth = span)) -polls_2008 |> mutate(smooth = fit$y) |> - ggplot(aes(day, margin)) + - geom_point(size = 3, alpha = .5, color = "grey") + - geom_line(aes(day, smooth), color = "red") +polls_2008 |> mutate(fit = fit$y) |> + ggplot(aes(x = day)) + + geom_point(aes(y = margin), size = 3, alpha = .5, color = "grey") + + geom_line(aes(y = fit), color = "red") ``` ## Kernels @@ -294,52 +287,6 @@ p2 <- data.frame(x = tmp$x, w_0 = tmp$y) |> gridExtra::grid.arrange(p1, p2, ncol = 2) ``` - - The final code and resulting plot for the normal kernel look like this: ```{r final-ksmooth-normal-kernel} @@ -352,7 +299,7 @@ polls_2008 |> mutate(smooth = fit$y) |> geom_line(aes(day, smooth), color = "red") ``` -Notice that the final estimate now looks smoother. +Notice that this version looks smoother. There are several functions in R that implement bin smoothers. One example is `ksmooth`, shown above. In practice, however, we typically prefer methods that use slightly more complex models than fitting a constant. The final result above, for example, is still somewhat wiggly in parts we don't expect it to be (between -125 and -75, for example). Methods such as `loess`, which we explain next, improve on this. @@ -360,9 +307,7 @@ There are several functions in R that implement bin smoothers. One example is `k A limitation of the bin smoother approach just described is that we need small windows for the approximately constant assumptions to hold. As a result, we end up with a small number of data points to average and obtain imprecise estimates $\hat{f}(x)$. Here we describe how *local weighted regression* (loess) permits us to consider larger window sizes. To do this, we will use a mathematical result, referred to as Taylor's theorem, which tells us that if you look closely enough at any smooth function $f(x)$, it will look like a line. To see why this makes sense, consider the curved edges gardeners make using straight-edged spades: -```{r, echo=FALSE} -knitr::include_graphics(file.path(img_path, "garden.png")) -``` +![](img/garden.png) ("Downing Street garden path edge"[^smoothing-1] by Flckr user Number 10[^smoothing-2]. CC-BY 2.0 license[^smoothing-3].) @@ -557,7 +502,7 @@ $$ E[Y_i | X_i = x_i ] = \beta_0 + \beta_1 (x_i-x_0) + \beta_2 (x_i-x_0)^2 \mbox{ if } |x_i - x_0| \leq h $$ -This is actually the default procedure of the function `loess`. You may have noticed that when we showed the code for using loess, we set `degree = 1`. This tells loess to fit polynomials of degree 1, a fancy name for lines. If you read the help page for loess, you will see that the argument `degree` defaults to 2. By default, loess fits parabolas not lines. Here is a comparison of the fitting lines (red dashed) and fitting parabolas (orange solid): +You may have noticed that when we showed the code for using loess, we set `degree = 1`. This tells loess to fit polynomials of degree 1, a fancy name for lines. If you read the help page for loess, you will see that the argument `degree` defaults to 2. By default, loess fits parabolas not lines. 
Here is a comparison of the fitting lines (red dashed) and fitting parabolas (orange solid): ```{r polls-2008-parabola-line-loess} total_days <- diff(range(polls_2008$day)) @@ -572,7 +517,7 @@ polls_2008 |> mutate(smooth_1 = fit_1$fitted, smooth_2 = fit_2$fitted) |> geom_line(aes(day, smooth_2), color = "orange", lty = 1) ``` -The `degree = 2` gives us more wiggly results. We actually prefer `degree = 1` as it is less prone to this kind of noise. +The `degree = 2` gives us more wiggly results. In general, we actually prefer `degree = 1` as it is less prone to this kind of noise. ### Beware of default smoothing parameters @@ -597,10 +542,10 @@ polls_2008 |> ggplot(aes(day, margin)) + To see how smoothing relates to machine learning with a concrete example, consider again our @sec-two-or-seven example. If we define the outcome $Y = 1$ for digits that are seven and $Y=0$ for digits that are 2, then we are interested in estimating the conditional probability: $$ -p(x_1, x_2) = \mbox{Pr}(Y=1 \mid X_1=x_1 , X_2 = x_2). +p(\mathbf{x}) = \mbox{Pr}(Y=1 \mid X_1=x_1 , X_2 = x_2). $$ -with $X_1$ and $X_2$ the two predictors defined in Section @sec-two-or-seven). In this example, the 0s and 1s we observe are "noisy" because for some regions the probabilities $p(x_1, x_2)$ are not that close to 0 or 1. So we need to estimate $p(x_1, x_2)$. Smoothing is an alternative to accomplishing this. In Section \@ref(two-or-seven) we saw that linear regression was not flexible enough to capture the non-linear nature of $p(x_1, x_2)$, thus smoothing approaches provide an improvement. In the next chapter we describe a popular machine learning algorithm, k-nearest neighbors, which is based on bin smoothing. +with $x_1$ and $x_2$ the two predictors defined in Section @sec-two-or-seven. In this example, the 0s and 1s we observe are "noisy" because for some regions the probabilities $p(\mathbf{x})$ are not that close to 0 or 1. So we need to estimate $p(\mathbf{x})$. Smoothing is an alternative to accomplishing this. In @sec-two-or-seven we saw that linear regression was not flexible enough to capture the non-linear nature of $p(\mathbf{x})$, thus smoothing approaches provide an improvement. In @sec-knn-cv-intro we describe a popular machine learning algorithm, k-nearest neighbors, which is based on the concept of smoothing. ## Exercises