diff --git a/.gitignore b/.gitignore
index 3013a7b..62d2a82 100644
--- a/.gitignore
+++ b/.gitignore
@@ -3,7 +3,12 @@
 .RData
 .Ruserdata
 *_cache
-*_files
+highdim/*_files
+inference/*_files
+linear-models/*_files
+ml/*_files
+prob/*_files
+summaries/*_files
 .DS_Store
 copy-qmds.R
 crossref.sh
diff --git a/_quarto.yml b/_quarto.yml
index 086101e..560e94e 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -71,7 +71,7 @@ book:
     - ml/evaluation-metrics.qmd
     - ml/conditionals.qmd
     - ml/smoothing.qmd
-    - ml/cross-validation.qmd
+    - ml/resampling-methods.qmd
     - ml/algorithms.qmd
     - ml/ml-in-practice.qmd
     - ml/clustering.qmd
diff --git a/docs/highdim/dimension-reduction.html b/docs/highdim/dimension-reduction.html
index 419a92a..ca2c60b 100644
--- a/docs/highdim/dimension-reduction.html
+++ b/docs/highdim/dimension-reduction.html
@@ -373,7 +373,7 @@
@@ -512,10 +512,10 @@

z_1 = r \cos(\phi+ \theta), \,\, z_2 = r \sin(\phi + \theta) \]

@@ -528,10 +528,10 @@

\end{aligned} \]

Now we can rotate each point in the dataset by simply applying the formula above to each pair \((x_{i,1}, x_{i,2})^\top\). Here is what the twin standardized heights look like after rotating each point by \(-45\) degrees:
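
The rendered figure is not reproduced in this excerpt. As a minimal sketch of the computation behind it, assuming x is the two-column matrix of standardized twin heights used above:

theta <- 2*pi*-45/360  # -45 degrees in radians
z_1 <- x[,1]*cos(theta) - x[,2]*sin(theta)  # first rotated coordinate
z_2 <- x[,1]*sin(theta) + x[,2]*cos(theta)  # second rotated coordinate
plot(z_1, z_2)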

@@ -591,12 +591,12 @@

. \]

If we define:

theta <- 2*pi*-45/360 #convert to radians
 A <- matrix(c(cos(theta), -sin(theta), sin(theta), cos(theta)), 2, 2)

We can write code implementing a rotation by any angle \(\theta\) using linear algebra:

rotate <- function(x, theta){
   theta <- 2*pi*theta/360
   A <- matrix(c(cos(theta), -sin(theta), sin(theta), cos(theta)), 2, 2)
@@ -623,7 +623,7 @@ 

\mathbf{Z} \mathbf{A}^\top = \mathbf{X} \mathbf{A}\mathbf{A}^\top = \mathbf{X} \]

and therefore that \(\mathbf{A}^\top\) is the inverse of \(\mathbf{A}\). This also implies that all the information in \(\mathbf{X}\) is included in the rotation \(\mathbf{Z}\), and that it can be retrieved via a linear transformation. A consequence is that rotations preserve the distances between points. Here is an example for a 30-degree rotation, although it works for any angle:

all.equal(as.matrix(dist(rotate(x, 30))), as.matrix(dist(x)))
 #> [1] TRUE
@@ -654,7 +654,7 @@

Note that if \(\mathbf{A}\mathbf{A}^\top = \mathbf{I}\), then the distance between the \(h\)th and \(i\)th rows is the same for the original and transformed data.

We refer to transformations with the property \(\mathbf{A} \mathbf{A}^\top = \mathbf{I}\) as orthogonal transformations. These are guaranteed to preserve the distance between any two points.
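
By contrast, a transformation that is not orthogonal will in general distort distances. As a quick sketch, reusing the x from above and a made-up scaling matrix B (not part of the original text):

B <- matrix(c(2, 0, 0, 1), 2, 2)  # stretches the first coordinate, so B %*% t(B) is not the identity
all.equal(as.matrix(dist(x %*% B)), as.matrix(dist(x)))  # no longer TRUE; reports the discrepancy instead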

We previously demonstrated that our rotation has this property. We can confirm this using R:

A %*% t(A)
 #>          [,1]     [,2]
 #> [1,] 1.00e+00 1.01e-17
@@ -665,7 +665,7 @@ 

\sum_{i=1}^n ||\mathbf{z}_i||^2 = \sum_{i=1}^n ||\mathbf{A}^\top\mathbf{x}_i||^2 = \sum_{i=1}^n \mathbf{x}_i^\top \mathbf{A}\mathbf{A}^\top \mathbf{x}_i = \sum_{i=1}^n \mathbf{x}_i^\top\mathbf{x}_i = \sum_{i=1}^n||\mathbf{x}_i||^2 \]

We can confirm using R:

theta <- -45
 z <- rotate(x, theta) # works for any theta
 sum(x^2)
@@ -675,12 +675,12 @@ 

This can be interpreted as a consequence of the fact that an orthogonal transformation guarantees that all the information is preserved.

However, although the total is preserved, the sum of squares for the individual columns changes. Here we compute the proportion of the total sum of squares (TSS) attributed to each column, referred to as the variance explained or variance captured by each column, for \(\mathbf{X}\):

colSums(x^2)/sum(x^2)
 #> [1] 0.5 0.5

and \(\mathbf{Z}\):

colSums(z^2)/sum(z^2)
 #> [1] 0.9848 0.0152
@@ -697,14 +697,14 @@

angles <- seq(0, -90)
 v <- sapply(angles, function(angle) colSums(rotate(x, angle)^2))
 variance_explained <- v[1,]/sum(x^2)
 plot(angles, variance_explained, type = "l")

We find that a -45 degree rotation appears to achieve the maximum, with over 98% of the total variability explained by the first dimension. We denote this rotation matrix with \(\mathbf{V}\):

theta <- 2*pi*-45/360 #convert to radians
 V <- matrix(c(cos(theta), -sin(theta), sin(theta), cos(theta)), 2, 2)
@@ -712,11 +712,11 @@

\[ \mathbf{Z} = \mathbf{X}\mathbf{V} \]

z <- x %*% V

The following animation further illustrates how different rotations affect the variability explained by the dimensions of the rotated data:

@@ -734,15 +734,15 @@

We also notice that the two groups, adults and children, can be observed more clearly with this one-number summary than with either of the two original dimensions.

hist(x[,1], breaks = seq(-4,4,0.5))
 hist(x[,2], breaks = seq(-4,4,0.5))
 hist(z[,1], breaks = seq(-4,4,0.5))
@@ -780,25 +780,25 @@

In R, we can find the principal components of any matrix with the function prcomp:

pca <- prcomp(x, center = FALSE)

Keep in mind that the default behavior is to center the columns of x before computing the PCs, an operation we don't need here because our matrix has already been standardized, so its columns are centered.
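
As a quick check of this assumption (the columns of the standardized x should already have mean 0, so centering changes nothing):

colMeans(x)  # should be essentially 0 for both columns
all.equal(prcomp(x)$sdev, prcomp(x, center = FALSE)$sdev)  # should match if the columns are already centered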

The object pca includes the rotated data \(\mathbf{Z}\) in pca$x and the rotation \(\mathbf{V}\) in pca$rotation.

We can see that the columns of pca$rotation are indeed the rotation obtained with -45 degrees (remember the sign is arbitrary):

pca$rotation
 #>         PC1    PC2
 #> [1,] -0.707  0.707
 #> [2,] -0.707 -0.707

The standard deviations of the columns of the rotated data (the square roots of their variances) are included in the pca$sdev component. This implies we can compute the variance explained by each PC using:

pca$sdev^2/sum(pca$sdev^2)
 #> [1] 0.9848 0.0152

The function summary performs this calculation for us:

summary(pca)
 #> Importance of components:
 #>                          PC1    PC2
@@ -807,7 +807,7 @@ 

#> Cumulative Proportion 0.985 1.0000

We also see that we can rotate x (\(\mathbf{X}\)) and pca$x (\(\mathbf{Z}\)) as explained with the mathematical formulas above:

all.equal(pca$x, x %*% pca$rotation)
 #> [1] TRUE
 all.equal(x, pca$x %*% t(pca$rotation))
@@ -818,14 +818,14 @@ 

22.6.1 Iris example

The iris data is a widely used example in data analysis courses. It includes four botanical measurements related to three flower species:

names(iris)
 #> [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
 #> [5] "Species"

If you print iris$Species, you will see that the data is ordered by the species.
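
A quick way to see this ordering without printing all 150 values is a run-length encoding:

rle(as.character(iris$Species))  # three runs of 50: setosa, then versicolor, then virginica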

If we visualize the distances, we can clearly see the three species with one species very different from the other two:

x <- iris[,1:4] |> as.matrix()
 d <- dist(x)
 image(as.matrix(d), col = rev(RColorBrewer::brewer.pal(9, "RdBu")))
@@ -839,7 +839,7 @@

Our feature matrix has four dimensions, but three of them are highly correlated:

cor(x)
 #>              Sepal.Length Sepal.Width Petal.Length Petal.Width
 #> Sepal.Length        1.000      -0.118        0.872       0.818
@@ -848,7 +848,7 @@ 

#> Petal.Width 0.818 -0.366 0.963 1.000

If we apply PCA, we should be able to approximate this distance with just two dimensions, compressing the highly correlated dimensions. Using the summary function, we can see the variability explained by each PC:

pca <- prcomp(x)
 summary(pca)
 #> Importance of components:
@@ -858,7 +858,7 @@ 

#> Cumulative Proportion 0.925 0.9777 0.9948 1.00000

The first two dimensions account for almost 98% of the variability. Thus, we should be able to approximate the distance very well with two dimensions. We confirm this by computing the distances from the first two dimensions and comparing them to the original distances:

d_approx <- dist(pca$x[, 1:2])
 plot(d, d_approx); abline(0, 1, col = "red")
@@ -898,13 +898,13 @@

22.6.2 MNIST example

The handwritten digits example has 784 features. Is there any room for data reduction? We will use PCA to answer this question.

If not already loaded, let’s begin by loading the data:

library(dslabs)
 if (!exists("mnist")) mnist <- read_mnist()

Because the pixels are so small, we expect pixels close to each other on the grid to be correlated, meaning that dimension reduction should be possible.
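
As a quick, informal check of this claim (the pixel index below is an arbitrary choice near the center of the 28 x 28 grid, where pixels vary across images):

i <- 14*28 + 14  # a pixel near the center of the image
cor(mnist$train$images[, i], mnist$train$images[, i + 1])  # correlation with its right-hand neighbor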

Let’s compute the PCs. This will take a few seconds as it is a rather large matrix:

pca <- prcomp(mnist$train$images)
@@ -915,7 +915,7 @@

plot(pca$sdev^2/sum(pca$sdev^2), xlab = "PC", ylab = "Variance explained")

We can see that the first few PCs already explain a large percentage of the variability.
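
To quantify this, we can also compute the cumulative proportion of variance explained by the leading PCs; the cutoff of 50 used below is an arbitrary choice for illustration:

sum(pca$sdev[1:50]^2)/sum(pca$sdev^2)  # proportion of total variability captured by the first 50 PCs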

@@ -958,7 +958,7 @@

22.7 Exercises

1. We want to explore the tissue_gene_expression predictors by plotting them.

dim(tissue_gene_expression$x)

We hope to get an idea of which observations are close to each other, but the predictors are 500-dimensional so plotting is difficult. Plot the first two principal components with color representing tissue type.

diff --git a/docs/highdim/dimension-reduction_files/figure-html/unnamed-chunk-7-1.png b/docs/highdim/dimension-reduction_files/figure-html/before-after-rotation-1.png
similarity index 100%
rename from docs/highdim/dimension-reduction_files/figure-html/unnamed-chunk-7-1.png
rename to docs/highdim/dimension-reduction_files/figure-html/before-after-rotation-1.png
diff --git a/docs/highdim/dimension-reduction_files/figure-html/digit-pc-boxplot-1.png b/docs/highdim/dimension-reduction_files/figure-html/digit-pc-boxplot-1.png
new file mode 100644
index 0000000..56a99c3
Binary files /dev/null and b/docs/highdim/dimension-reduction_files/figure-html/digit-pc-boxplot-1.png differ
diff --git a/docs/highdim/dimension-reduction_files/figure-html/histograms-of-dimensions-1.png b/docs/highdim/dimension-reduction_files/figure-html/histograms-of-dimensions-1.png
new file mode 100644
index 0000000..f6a961c
Binary files /dev/null and b/docs/highdim/dimension-reduction_files/figure-html/histograms-of-dimensions-1.png differ
diff --git a/docs/highdim/dimension-reduction_files/figure-html/rotation-diagram-1.png b/docs/highdim/dimension-reduction_files/figure-html/rotation-diagram-1.png
new file mode 100644
index 0000000..be2e591
Binary files /dev/null and b/docs/highdim/dimension-reduction_files/figure-html/rotation-diagram-1.png differ
diff --git a/docs/highdim/intro-highdim.html b/docs/highdim/intro-highdim.html
index 6837695..b3993de 100644
--- a/docs/highdim/intro-highdim.html
+++ b/docs/highdim/intro-highdim.html
@@ -355,7 +355,7 @@
diff --git a/docs/highdim/linear-algebra.html b/docs/highdim/linear-algebra.html
index 08dc08c..0ffd727 100644
--- a/docs/highdim/linear-algebra.html
+++ b/docs/highdim/linear-algebra.html
@@ -373,7 +373,7 @@
@@ -576,10 +576,10 @@

21.3 Distance

Many of the analyses we perform with high-dimensional data relate directly or indirectly to distance. For example, most machine learning techniques rely on being able to define distances between observations, using features or predictors. Clustering algorithms, for instance, search for observations that are similar. But what does this mean mathematically?

To define distance, we introduce another linear algebra concept: the norm. Recall that a point in two dimensions can be represented in polar coordinates as:


@@ -593,10 +593,10 @@

\(\mathbf{x}_1\) and \(\mathbf{x}_2\). We can define how similar they are by simply using Euclidean distance:


@@ -614,35 +614,35 @@

x_1 <- x[6,]
 x_2 <- x[17,]
 x_3 <- x[16,]

We can compute the distances between each pair using the definitions we just learned:

c(sum((x_1 - x_2)^2), sum((x_1 - x_3)^2), sum((x_2 - x_3)^2)) |> sqrt()
 #> [1] 2320 2331 2519

In R, the function crossprod(x) is convenient for computing norms. It multiplies t(x) by x:

c(crossprod(x_1 - x_2), crossprod(x_1 - x_3), crossprod(x_2 - x_3)) |> sqrt()
 #> [1] 2320 2331 2519

Note that crossprod takes a matrix as its first argument. As a result, the vectors used here are being coerced into single-column matrices. Also, note that crossprod(x, y) multiplies t(x) by y.
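
As a quick check that the two computations agree (crossprod returns a 1 x 1 matrix, so we coerce it to a number before comparing):

all.equal(as.numeric(crossprod(x_1 - x_2)), sum((x_1 - x_2)^2))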

We can see that the distance is smaller between the first two. This agrees with the fact that the first two are 2s and the third is a 7.

y[c(6, 17, 16)]
 #> [1] 2 2 7

We can also compute all the distances at once relatively quickly using the function dist, which computes the distance between each pair of rows and produces an object of class dist:

d <- dist(x[c(6,17,16),])
 class(d)
 #> [1] "dist"

There are several machine learning functions in R that take objects of class dist as input. To access the entries using row and column indices, we need to coerce it into a matrix. We can see the distances we calculated above like this:

d
 #>      1    2
 #> 2 2320     
@@ -654,7 +654,7 @@ 

image(as.matrix(d))

If we order this distance matrix by the labels, we can see yellowish squares near the diagonal. This is because observations of the same digit tend to be closer to each other than to observations of different digits:

image(as.matrix(d)[order(y[1:300]), order(y[1:300])])
@@ -678,7 +678,7 @@

21.5 Exercises

1. Generate two matrices, A and B, containing randomly generated, normally distributed numbers. The dimensions of these two matrices should be \(4 \times 3\) and \(3 \times 6\), respectively. Confirm that C <- A %*% B produces the same results as:

m <- nrow(A)
 p <- ncol(B)
 C <- matrix(0, m, p)
@@ -698,13 +698,13 @@ 

mnist <- read_mnist()
 x <- mnist$train$images[1:300,] 
 y <- mnist$train$labels[1:300]

and compute the distance matrix:

d <- dist(x)
 class(d)
diff --git a/docs/highdim/linear-algebra_files/figure-html/unnamed-chunk-7-1.png b/docs/highdim/linear-algebra_files/figure-html/euclidean-dist-diagram-1.png
similarity index 100%
rename from docs/highdim/linear-algebra_files/figure-html/unnamed-chunk-7-1.png
rename to docs/highdim/linear-algebra_files/figure-html/euclidean-dist-diagram-1.png
diff --git a/docs/highdim/linear-algebra_files/figure-html/polar-coords-1.png b/docs/highdim/linear-algebra_files/figure-html/polar-coords-1.png
new file mode 100644
index 0000000..d3dd184
Binary files /dev/null and b/docs/highdim/linear-algebra_files/figure-html/polar-coords-1.png differ
diff --git a/docs/highdim/matrices-in-R.html b/docs/highdim/matrices-in-R.html
index a9474cb..5f7a33b 100644
--- a/docs/highdim/matrices-in-R.html
+++ b/docs/highdim/matrices-in-R.html
@@ -373,7 +373,7 @@
diff --git a/docs/highdim/matrix-factorization.html b/docs/highdim/matrix-factorization.html
index 4195486..1345e8d 100644
--- a/docs/highdim/matrix-factorization.html
+++ b/docs/highdim/matrix-factorization.html
@@ -373,7 +373,7 @@
@@ -503,28 +503,28 @@

The histogram below shows there are three types of users: those that love mob movies and hate romance movies, those that don't care, and those that love romance movies and hate mob movies.

hist(p, breaks = seq(-2,2,0.1))

To see that we can approximate \(\varepsilon_{i,j}\) with \(p_i q_j\), we convert the vectors to matrices and use linear algebra:

p <- matrix(p); q <- matrix(q)
 plot(p %*% t(q), e)

However, after removing this mob/romance effect, we still see structure in the correlation:

cor(e - p %*% t(q))
 #>                      Godfather Godfather 2 Goodfellas Scent of a Woman
 #> Godfather                1.000       0.185     -0.545            0.557
@@ -542,21 +542,21 @@ 

#> Sleepless in Seattle 0.353 1.000

This structure seems to be driven by Al Pacino being in the movie or not. This implies we could add another factor:

q <- cbind(c(-1, -1, -1, 1, 1, 1),
            c(1, 1, -1, 1, -1, 1))

We can then obtain estimates for each user:

p <- t(apply(e, 1, function(y) lm(y~q-1)$coefficient))

Note that we use the transpose t because apply binds results into columns and we want a row for each user.

Our approximation based on two factors does an even better job of predicting how our residuals deviate from 0:

plot(p %*% t(q), e)

@@ -576,16 +576,16 @@

\]

with \(\mathbf{Z}\) the matrix of principal components.

Let’s perform PCA and examine the results:

pca <- prcomp(e, center = FALSE)

First, notice that the first two PCs explain about 87% of the variability:

pca$sdev^2/sum(pca$sdev^2)
 #> [1] 0.6939 0.1790 0.0402 0.0313 0.0303 0.0253

Next, notice that the first column of \(\mathbf{V}\):

pca$rotation[,1]
 #>            Godfather          Godfather 2           Goodfellas 
 #>                0.306                0.261                0.581 
@@ -594,7 +594,7 @@ 

is assigning positive values to the mob movies and negative values to the romance movies.

The second column:

pca$rotation[,2]
 #>            Godfather          Godfather 2           Goodfellas 
 #>               -0.354               -0.377                0.382 
@@ -619,7 +619,7 @@ 

24.3 Case study: movie recommendations

Note that if we look at the correlation structure of the movies for which we simulated data in the previous sections, we see structure as well:
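
The code that produced the correlations shown below is not included in this excerpt. One way to compute such a matrix in the presence of missing ratings, as a sketch assuming y is the user-by-movie ratings matrix from the previous chapter and cols is a hypothetical index for these movies, is:

cor(y[, cols], use = "pairwise.complete.obs")  # use only users who rated both movies in each pair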

#>                         Godfather, The Godfather: Part II, The
 #> Godfather, The                   1.000                   0.842
 #> Godfather: Part II, The          0.842                   1.000
@@ -649,20 +649,20 @@ 

\]

Unfortunately, we can’t fit this model with prcomp due to the missing values. We introduce the missMDA package that provides an approach to fit such models when matrix entries are missing, a very common occurrence in movie recommendations, through the function imputePCA. Also, because there are small sample sizes for several movie pairs, it is useful to regularize the \(p\)s. The imputePCA function also permits regularization.

We use the estimates for \(\mu\), the \(\alpha\)s, and the \(\beta\)s from the previous chapter, and estimate two factors (ncp = 2). We fit the model to movies rated at least 25 times, but also include Scent of a Woman, which does not meet this criterion, because we previously used it as an example. Finally, we use regularization by setting the parameter coeff.ridge to the same value used to estimate the \(\beta\)s.

library(missMDA)
 ind <- colSums(!is.na(y)) >= 25 | colnames(y) == "3252"
 imputed <- imputePCA(r[,ind], ncp = 2, coeff.ridge = lambda)

To see how much we improve our previous prediction, we construct a matrix with the ratings in the test set:

y_test <- select(test_set, movieId, userId, rating) |>
   pivot_wider(names_from = movieId, values_from = rating) |>
   column_to_rownames("userId") |>
   as.matrix()

and construct our predictor obtained with imputePCA. We start by constructing the predictor from the previous chapter:

pred <- matrix(0, nrow(y), ncol(y))
 rownames(pred) <- rownames(y); colnames(pred) <- colnames(y)
 pred <- clamp(sweep(pred + mu + a, 2, b_reg, FUN = "+"))
@@ -670,11 +670,11 @@ 

#> [1] 0.889

Then we adjust the prediction to include the imputed residuals for the test set:

pred[,ind] <- clamp(pred[,ind] + imputed$completeObs)

We see that our prediction improves:

rmse(y_test - pred[rownames(y_test), colnames(y_test)])
 #> [1] 0.875
@@ -682,11 +682,11 @@

24.3.1 Visualizing factors

We can compute the first two principal components used for the prediction using prcomp.

pca <- prcomp(imputed$completeObs, center = FALSE, rank. = 3)

By adding the movie names to the rotation matrix:

v <- pca$rotation
 rownames(v) <- with(movie_map, title[match(colnames(r[,ind]), movieId)])
@@ -700,7 +700,7 @@

By looking at the highest and lowest values for the first principal component, we see a meaningful pattern. The first PC shows the difference between Hollywood blockbusters on one side:
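
The lists that follow were extracted from v. A sketch of one way to obtain them (which extreme corresponds to which list depends on the arbitrary sign of the PC):

rownames(v)[order(v[, 1], decreasing = TRUE)[1:10]]  # movies with the largest first-PC values
rownames(v)[order(v[, 1])[1:10]]                     # movies with the smallest first-PC values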

#>  [1] "Armageddon"                    "Pearl Harbor"                 
 #>  [3] "X2: X-Men United"              "Dark Knight Rises, The"       
 #>  [5] "X-Men"                         "Independence Day (a.k.a. ID4)"
@@ -708,7 +708,7 @@ 

#> [9] "World Is Not Enough, The" "Spider-Man 2"

and critically acclaimed movies on the other:

#>  [1] "2001: A Space Odyssey"          "American Psycho"               
 #>  [3] "Royal Tenenbaums, The"          "Harold and Maude"              
 #>  [5] "Apocalypse Now"                 "Fear and Loathing in Las Vegas"
@@ -725,7 +725,7 @@ 

\mathbf{U}^\top\mathbf{D}\mathbf{U} = \mathbf{D}^2 \]

In R, we can obtain the SVD using the function svd. To see the connection to PCA, notice that:

x <- matrix(rnorm(1000), 100, 10)
 pca <- prcomp(x, center = FALSE)
 s <- svd(x)
@@ -742,7 +742,7 @@ 

24.5 Exercises

In this exercise set, we use the singular value decomposition (SVD) to estimate factors in an example related to the first application of factor analysis: finding factors related to student performance in school.

We construct a dataset that represents grade scores for 100 students in 24 different subjects. The overall average has been removed so this data represents the percentage points each student received above or below the average test score. So a 0 represents an average grade (C), a 25 is a high grade (A+), and a -25 represents a low grade (F). You can simulate the data like this:

set.seed(1987)
 n <- 100
 k <- 8
@@ -757,7 +757,7 @@ 

Our goal is to describe the student performances as succinctly as possible. For example, we want to know if these test results are all simply random independent numbers. Are all students just about as good? Does being good in one subject imply one will be good in another? How does the SVD help with all this? We will go step by step to show that with just three relatively small pairs of vectors, we can explain much of the variability in this \(100 \times 24\) dataset.

You can visualize the 24 test scores for the 100 students by plotting an image:

my_image <- function(x, zlim = range(x), ...){
   colors = rev(RColorBrewer::brewer.pal(9, "RdBu"))
   cols <- 1:ncol(x)
@@ -778,7 +778,7 @@ 

  • The students that are good at math are not good at humanities.
    2. You can examine the correlation between the test scores directly like this:

    my_image(cor(y), zlim = c(-1,1))
     range(cor(y))
     axis(side = 2, 1:ncol(y), rev(colnames(y)), las = 2)
    @@ -794,12 +794,12 @@

    \[ \mathbf{Y V} = \mathbf{U D} \mbox{ or } \mathbf{U}^{\top}\mathbf{Y} = \mathbf{D V}^{\top}\]

    We can think of \(\mathbf{YV}\) and \(\mathbf{U}^{\top}\mathbf{Y}\) as two transformations of \(\mathbf{Y}\) that preserve the total variability.

    Use the function svd to compute the SVD of y. This function will return \(U\), \(V\) and the diagonal entries of \(D\).

    s <- svd(y)
     names(s)

    You can check that the SVD works by typing:

    y_svd <- sweep(s$u, 2, s$d, FUN = "*") %*% t(s$v)  # scale each column of U by the corresponding entry of d
     max(abs(y - y_svd))
    @@ -813,7 +813,7 @@

    \]

    Use the sweep function to compute \(UD\) without constructing diag(s$d) and without using matrix multiplication.

    8. We know that \(\mathbf{u}_1 d_{1,1}\), the first column of \(\mathbf{UD}\), has the most variability of all the columns of \(\mathbf{UD}\). Earlier we saw an image of \(Y\):

    my_image(y)

    in which we can see that the student-to-student variability is quite large and that students who are good in one subject appear to be good in all. This implies that the average (across all subjects) for each student should explain a lot of the variability. Compute the average score for each student, plot it against \(\mathbf{u}_1 d_{1,1}\), and describe what you find.

    @@ -832,7 +832,7 @@

    \[ \mathbf{Y} \approx d_{1,1} \mathbf{u}_1 \mathbf{v}_1^{\top} \] We know it explains s$d[1]^2/sum(s$d^2) * 100 percent of the total variability. Our approximation only explains the observation that good students tend to be good in all subjects. But another aspect of the original data that our approximation does not explain is the higher similarity we observed within subjects. We can see this by computing the difference between our approximation and the original data and then examining the correlations. You can do this by running the following code:

    resid <- y - with(s,(u[,1, drop=FALSE]*d[1]) %*% t(v[,1, drop=FALSE]))
     my_image(cor(resid), zlim = c(-1,1))
     axis(side = 2, 1:ncol(y), rev(colnames(y)), las = 2)
    @@ -843,11 +843,11 @@

    \mathbf{Y} \approx d_{1,1} \mathbf{u}_1 \mathbf{v}_1^{\top} + d_{2,2} \mathbf{u}_2 \mathbf{v}_2^{\top} \]

    We know it will explain:

    sum(s$d[1:2]^2)/sum(s$d^2) * 100

    percent of the total variability. We can compute new residuals like this:

    resid <- y - with(s,sweep(u[,1:2], 2, d[1:2], FUN="*") %*% t(v[,1:2]))
     my_image(cor(resid), zlim = c(-1,1))
     axis(side = 2, 1:ncol(y), rev(colnames(y)), las = 2)
    @@ -858,11 +858,11 @@

    \mathbf{Y} \approx d_{1,1} \mathbf{u}_1 \mathbf{v}_1^{\top} + d_{2,2} \mathbf{u}_2 \mathbf{v}_2^{\top} + d_{3,3} \mathbf{u}_3 \mathbf{v}_3^{\top} \]

    We know it will explain:

    sum(s$d[1:3]^2)/sum(s$d^2) * 100

    percent of the total variability. We can compute new residuals like this:

    resid <- y - with(s,sweep(u[,1:3], 2, d[1:3], FUN="*") %*% t(v[,1:3]))
     my_image(cor(resid), zlim = c(-1,1))
     axis(side = 2, 1:ncol(y), rev(colnames(y)), las = 2)
diff --git a/docs/highdim/matrix-factorization_files/figure-html/e-vs-pq-1.png b/docs/highdim/matrix-factorization_files/figure-html/e-vs-pq-1.png
new file mode 100644
index 0000000..6f638ac
Binary files /dev/null and b/docs/highdim/matrix-factorization_files/figure-html/e-vs-pq-1.png differ
diff --git a/docs/highdim/matrix-factorization_files/figure-html/e-vs-pq-2-1.png b/docs/highdim/matrix-factorization_files/figure-html/e-vs-pq-2-1.png
new file mode 100644
index 0000000..e943ba8
Binary files /dev/null and b/docs/highdim/matrix-factorization_files/figure-html/e-vs-pq-2-1.png differ
diff --git a/docs/highdim/regularization.html b/docs/highdim/regularization.html
index 6e542c8..963e28b 100644
--- a/docs/highdim/regularization.html
+++ b/docs/highdim/regularization.html
@@ -374,7 +374,7 @@
@@ -599,7 +599,7 @@

    In Chapter 29, we provide a formal discussion of the mean squared error.

    @@ -735,7 +735,7 @@

    This approach will have our desired effect: when our sample size \(n_j\) is very large, we obtain a stable estimate and the penalty \(\lambda\) is effectively ignored since \(n_j+\lambda \approx n_j\). Yet when \(n_j\) is small, the estimate \(\hat{\beta}_j(\lambda)\) is shrunken towards 0. The larger the \(\lambda\), the more we shrink.
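
    A toy illustration of the shrinkage effect follows; the numbers and \(\lambda = 3\) are made up purely for illustration, with each estimate dividing a sum of residuals by the group size, with and without the penalty:

    lambda <- 3
    sums_large <- 500; n_large <- 1000  # large group: penalty barely changes the estimate
    sums_small <- 4;   n_small <- 2     # small group: estimate is pulled toward 0
    c(sums_large/n_large, sums_large/(n_large + lambda))
    c(sums_small/n_small, sums_small/(n_small + lambda))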

    But how do we select \(\lambda\)? In Chapter 29, we describe an approach to do this. Here we will simply compute the RMSE for different values of \(\lambda\) to illustrate the effect:

    n <- colSums(!is.na(y))
     sums <- colSums(y - mu - a, na.rm = TRUE)
    diff --git a/docs/index.html b/docs/index.html
    index 264f4af..8df4982 100644
    --- a/docs/index.html
    +++ b/docs/index.html
    @@ -352,7 +352,7 @@
     
               
    diff --git a/docs/inference/bayes.html b/docs/inference/bayes.html
    index bad6233..6d8823a 100644
    --- a/docs/inference/bayes.html
    +++ b/docs/inference/bayes.html
    @@ -373,7 +373,7 @@
     
               
    diff --git a/docs/inference/bootstrap.html b/docs/inference/bootstrap.html
    index a72c8c8..18bfc5d 100644
    --- a/docs/inference/bootstrap.html
    +++ b/docs/inference/bootstrap.html
    @@ -373,7 +373,7 @@
     
               
    diff --git a/docs/inference/bootstrap_files/figure-html/income-distribution-1.png b/docs/inference/bootstrap_files/figure-html/income-distribution-1.png
    new file mode 100644
    index 0000000..e12b04d
    Binary files /dev/null and b/docs/inference/bootstrap_files/figure-html/income-distribution-1.png differ
    diff --git a/docs/inference/bootstrap_files/figure-html/median-is-normal-1.png b/docs/inference/bootstrap_files/figure-html/median-is-normal-1.png
    new file mode 100644
    index 0000000..e017081
    Binary files /dev/null and b/docs/inference/bootstrap_files/figure-html/median-is-normal-1.png differ
    diff --git a/docs/inference/clt.html b/docs/inference/clt.html
    index 83ebf65..d9b1f7c 100644
    --- a/docs/inference/clt.html
    +++ b/docs/inference/clt.html
    @@ -373,7 +373,7 @@
     
               
    diff --git a/docs/inference/confidence-intervals.html b/docs/inference/confidence-intervals.html
    index 3fe8f79..d596a23 100644
    --- a/docs/inference/confidence-intervals.html
    +++ b/docs/inference/confidence-intervals.html
    @@ -373,7 +373,7 @@
     
               
    diff --git a/docs/inference/hierarchical-models.html b/docs/inference/hierarchical-models.html
    index 5ee5f1a..1208d75 100644
    --- a/docs/inference/hierarchical-models.html
    +++ b/docs/inference/hierarchical-models.html
    @@ -374,7 +374,7 @@
     
               
    diff --git a/docs/inference/hypothesis-testing.html b/docs/inference/hypothesis-testing.html
    index 3fd264f..f4a7199 100644
    --- a/docs/inference/hypothesis-testing.html
    +++ b/docs/inference/hypothesis-testing.html
    @@ -373,7 +373,7 @@
     
               
    diff --git a/docs/inference/intro-inference.html b/docs/inference/intro-inference.html
    index 34e18ab..84e9633 100644
    --- a/docs/inference/intro-inference.html
    +++ b/docs/inference/intro-inference.html
    @@ -353,7 +353,7 @@
     
               
    diff --git a/docs/inference/models.html b/docs/inference/models.html
    index 208c34a..8f584c2 100644
    --- a/docs/inference/models.html
    +++ b/docs/inference/models.html
    @@ -373,7 +373,7 @@
     
               
    diff --git a/docs/inference/parameters-estimates.html b/docs/inference/parameters-estimates.html
    index 1019efc..dfc408a 100644
    --- a/docs/inference/parameters-estimates.html
    +++ b/docs/inference/parameters-estimates.html
    @@ -374,7 +374,7 @@
     
               
    diff --git a/docs/intro.html b/docs/intro.html
    index 760bf14..64561d7 100644
    --- a/docs/intro.html
    +++ b/docs/intro.html
    @@ -353,7 +353,7 @@
     
               
    diff --git a/docs/linear-models/association-not-causation.html b/docs/linear-models/association-not-causation.html
    index 146bf26..219bd1d 100644
    --- a/docs/linear-models/association-not-causation.html
    +++ b/docs/linear-models/association-not-causation.html
    @@ -373,7 +373,7 @@
     
               
    diff --git a/docs/linear-models/association-tests.html b/docs/linear-models/association-tests.html
    index 560ca07..e7b30dd 100644
    --- a/docs/linear-models/association-tests.html
    +++ b/docs/linear-models/association-tests.html
    @@ -374,7 +374,7 @@
     
               
    diff --git a/docs/linear-models/intro-to-linear-models.html b/docs/linear-models/intro-to-linear-models.html
    index 9c7fc87..e567cc6 100644
    --- a/docs/linear-models/intro-to-linear-models.html
    +++ b/docs/linear-models/intro-to-linear-models.html
    @@ -353,7 +353,7 @@
     
               
    diff --git a/docs/linear-models/measurement-error-models.html b/docs/linear-models/measurement-error-models.html
    index aa2e53a..f2284b2 100644
    --- a/docs/linear-models/measurement-error-models.html
    +++ b/docs/linear-models/measurement-error-models.html
    @@ -373,7 +373,7 @@
     
               
    diff --git a/docs/linear-models/multivariate-regression.html b/docs/linear-models/multivariate-regression.html
    index c8873de..c4d4490 100644
    --- a/docs/linear-models/multivariate-regression.html
    +++ b/docs/linear-models/multivariate-regression.html
    @@ -374,7 +374,7 @@
     
               
    @@ -446,14 +446,14 @@