z_1 = r \cos(\phi+ \theta), \,\,
z_2 = r \sin(\phi + \theta)
\]
Using the trigonometric addition formulas, and writing \(x_1 = r\cos\phi\) and \(x_2 = r\sin\phi\), we can express the rotated coordinates in terms of the original ones:
\[
\begin{aligned}
z_1 &= r \cos\phi\cos\theta - r\sin\phi\sin\theta = x_1 \cos(\theta) - x_2 \sin(\theta)\\
z_2 &= r \cos\phi\sin\theta + r\sin\phi\cos\theta = x_1 \sin(\theta) + x_2 \cos(\theta)
\end{aligned}
\]
Now we can rotate each point in the dataset by simply applying the formula above to each pair \((x_{i,1}, x_{i,2})^\top\). Here is what the twin standardized heights look like after rotating each point by \(-45\) degrees:
If we define:
theta <- 2*pi*-45/360 # convert to radians
A <- matrix(c(cos(theta), -sin(theta), sin(theta), cos(theta)), 2, 2)
We can write code implementing a rotation by any angle \(\theta\) using linear algebra:
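The rotate function is used in the chunks below but its definition is not shown here; a minimal sketch, assuming the data matrix has two columns and the angle is given in degrees (not necessarily the book's exact code), is:
rotate <- function(x, theta){
  theta <- 2*pi*theta/360 # convert degrees to radians
  A <- matrix(c(cos(theta), -sin(theta), sin(theta), cos(theta)), 2, 2)
  x %*% A # apply the rotation to each row of x
}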
Because \(\cos^2\theta + \sin^2\theta = 1\), multiplying out the entries shows that \(\mathbf{A} \mathbf{A}^\top = \mathbf{I}\), and therefore that \(\mathbf{A}^\top\) is the inverse of \(\mathbf{A}\). This also implies that all the information in \(\mathbf{X}\) is included in the rotation \(\mathbf{Z}\), and it can be retrieved via a linear transformation. A consequence is that for any rotation the distances are preserved. Here is an example for a 30 degree rotation, although it works for any angle:
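A sketch of such a check, assuming the two-column matrix x and the rotate function defined above:
d <- dist(x) # pairwise distances between rows of the original data
d_rotated <- dist(rotate(x, 30)) # pairwise distances after a 30 degree rotation
max(abs(d - d_rotated)) # should be essentially 0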
Note that if \(\mathbf{A} \mathbf{A} ^\top= \mathbf{I}\), then the distance between the \(h\)th and \(i\)th rows is the same for the original and transformed data.
We refer to transformations with the property \(\mathbf{A} \mathbf{A}^\top = \mathbf{I}\) as orthogonal transformations. These are guaranteed to preserve the distance between any two points.
We previously demonstrated our rotation has this property. We can confirm using R:
theta <- -45
z <- rotate(x, theta) # works for any theta
sum(x^2)
sum(z^2) # the total sum of squares is preserved
This can be interpreted as a consequence of the fact that an orthogonal transformation guarantees that all the information is preserved.
However, although the total is preserved, the sum of squares for the individual columns changes. Here we compute the proportion of TSS attributed to each column, referred to as the variance explained or variance captured by each column, for \(\mathbf{X}\):
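As a sketch, assuming the two-column matrix x from above, this proportion can be computed with:
colSums(x^2)/sum(x^2) # proportion of the total sum of squares in each column
The next chunk repeats this calculation for rotations across a range of angles, to find the angle that concentrates the most variability in the first column: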
angles <- seq(0, -90)
v <- sapply(angles, function(angle) colSums(rotate(x, angle)^2))
variance_explained <- v[1,]/sum(x^2)
plot(angles, variance_explained, type = "l")
We find that a -45 degree rotation appears to achieve the maximum, with over 98% of the total variability explained by the first dimension. We denote this rotation matrix with \(\mathbf{V}\):
theta <- 2*pi*-45/360 # convert to radians
V <- matrix(c(cos(theta), -sin(theta), sin(theta), cos(theta)), 2, 2)
The following animation further illustrates how different rotations affect the variability explained by the dimensions of the rotated data:
We also notice that the two groups, adults and children, can be clearly observed with the one-number summary, better than with either of the two original dimensions.
The square root of the variance of each column of the rotated data is stored in the pca$sdev component. This implies we can compute the variance explained by each PC using:
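A sketch of that computation, assuming a fitted pca object returned by prcomp:
pca$sdev^2/sum(pca$sdev^2) # proportion of variance explained by each PC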
If we apply PCA, we should be able to approximate this distance with just two dimensions, compressing the highly correlated dimensions. Using the summary function, we can see the variability explained by each PC:
pca <- prcomp(x)
summary(pca)
#> Importance of components:
The first two dimensions account for almost 98% of the variability. Thus, we should be able to approximate the distance very well with two dimensions. We confirm this by computing the distance from the first two dimensions and comparing it to the original:
d_approx <- dist(pca$x[, 1:2])
plot(d, d_approx)
abline(0, 1, col = "red")
22.6.2 MNIST example
The written digits example has 784 features. Is there any room for data reduction? We will use PCA to answer this.
If not already loaded, let’s begin by loading the data:
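A sketch of the loading step, assuming the data come from the dslabs package via read_mnist:
library(dslabs)
if (!exists("mnist")) mnist <- read_mnist() # large object; download once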
Because the pixels are so small, we expect pixels close to each other on the grid to be correlated, meaning that dimension reduction should be possible.
Let’s compute the PCs. This will take a few seconds as it is a rather large matrix:
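For example, something along these lines (a sketch; it assumes the predictors are stored in mnist$train$images):
pca <- prcomp(mnist$train$images) # 784 columns, so this takes a few seconds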
We hope to get an idea of which observations are close to each other, but the predictors are 500-dimensional so plotting is difficult. Plot the first two principal components with color representing tissue type.
Many of the analyses we perform with high-dimensional data relate directly or indirectly to distance. For example, most machine learning techniques rely on being able to define distances between observations, using features or predictors. Clustering algorithms, for instance, search for observations that are similar. But what does this mean mathematically?
To define distance, we introduce another linear algebra concept: the norm. Recall that a point in two dimensions can be represented in polar coordinates as:
\[
x_1 = r \cos\phi, \,\,
x_2 = r \sin\phi
\]
where \(r\), the distance from the point to the origin, is what we call the norm.
Consider two observations, \(\mathbf{x}_1\) and \(\mathbf{x}_2\). We can define how similar they are by simply using the Euclidean distance:
\[
||\mathbf{x}_1 - \mathbf{x}_2|| = \sqrt{\sum_{j=1}^{p} (x_{1,j} - x_{2,j})^2}
\]
with \(p\) the number of features. For example, we can take three observations from the predictor matrix:
x_1 <- x[6,]
x_2 <- x[17,]
x_3 <- x[16,]
We can compute the distances between each pair using the definitions we just learned:
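A sketch of these computations, written with crossprod as discussed next (the square root of \(\mathbf{a}^\top\mathbf{a}\) is the norm of \(\mathbf{a}\)):
sqrt(crossprod(x_1 - x_2))
sqrt(crossprod(x_1 - x_3))
sqrt(crossprod(x_2 - x_3))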
Note that crossprod takes a matrix as the first argument. As a result, the vectors used here are being coerced into single-column matrices. Also, note that crossprod(x, y) multiplies t(x) by y.
We can see that the distance is smaller between the first two. This agrees with the fact that the first two are 2s and the third is a 7.
We can also compute all the distances at once relatively quickly using the function dist, which computes the distance between each row and produces an object of class dist:
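For example (a sketch assuming the predictor matrix x from above):
d <- dist(x)
class(d)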
There are several machine learning related functions in R that take objects of class dist as input. To access the entries using row and column indices, we need to coerce it into a matrix. We can see the distance we calculated above like this:
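A sketch, using the row indices 6, 17, and 16 from above:
as.matrix(d)[c(6, 17, 16), c(6, 17, 16)]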
If we order this distance matrix by the labels, we can see yellowish squares near the diagonal. This is because observations from the same digit tend to be closer to each other than to observations from different digits:
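One way to produce such an image, assuming the labels are stored in a vector y aligned with the rows of x:
image(as.matrix(d)[order(y), order(y)])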
1. Generate two matrices, A and B, containing randomly generated and normally distributed numbers. The dimensions of these two matrices should be \(4 \times 3\) and \(3 \times 6\), respectively. Confirm that C <- A %*% B produces the same results as:
m <- nrow(A)
p <- ncol(B)
C <- matrix(0, m, p)
for (i in 1:m) for (j in 1:p) C[i, j] <- sum(A[i, ] * B[, j]) # entry-by-entry computation
The histogram below shows there are three types of users: those that love mob movies and hate romance movies, those that don’t care, and those that love romance movies and hate mob movies.
Note that if we look at the correlation structure of the movies for which we simulated data in the previous sections, we see structure as well:
#> Godfather, The Godfather: Part II, The
#> Godfather, The 1.000 0.842
#> Godfather: Part II, The 0.842 1.000
Unfortunately, we can’t fit this model with prcomp due to the missing values. We introduce the missMDA package that provides an approach to fit such models when matrix entries are missing, a very common occurrence in movie recommendations, through the function imputePCA. Also, because there are small sample sizes for several movie pairs, it is useful to regularize the \(p\)s. The imputePCA function also permits regularization.
We use the estimates for \(\mu\), the \(\alpha\)s and \(\beta\)s from the previous chapter, and estimate two factors (ncp = 2). We fit the model to movies rated more than 25 times, but include Scent of a Woman, which does not meet this criterion, because we previously used it as an example. Finally, we use regularization by setting the parameter coeff.ridge to the same value used to estimate the \(\beta\)s.
By looking at the highest and lowest values for the first principal component, we see a meaningful pattern. The first PC shows the difference between Hollywood blockbusters on one side and critically acclaimed films on the other. Here are the films at one extreme:
#> [1] "2001: A Space Odyssey" "American Psycho"
#> [3] "Royal Tenenbaums, The" "Harold and Maude"
#> [5] "Apocalypse Now" "Fear and Loathing in Las Vegas"
In R, we can obtain the SVD using the function svd. To see the connection to PCA, notice that:
x <- matrix(rnorm(1000), 100, 10)
pca <- prcomp(x, center = FALSE)
s <- svd(x)
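A sketch of the comparison one might run (sign flips between the two decompositions are possible, hence the absolute values):
all.equal(abs(pca$rotation), abs(s$v), check.attributes = FALSE) # V matches the PCA rotation
all.equal(abs(pca$x), abs(s$u %*% diag(s$d)), check.attributes = FALSE) # UD matches the principal components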
24.5 Exercises
In this exercise set, we use the singular value decomposition (SVD) to estimate factors in an example related to the first application of factor analysis: finding factors related to student performance in school.
We construct a dataset that represents grade scores for 100 students in 24 different subjects. The overall average has been removed so this data represents the percentage points each student received above or below the average test score. So a 0 represents an average grade (C), a 25 is a high grade (A+), and a -25 represents a low grade (F). You can simulate the data like this:
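A sketch of one way to simulate data with this structure (three correlated subject areas with eight tests each; the exact constants are illustrative, not necessarily the book's):
set.seed(1987)
n <- 100
k <- 8
Sigma <- 64 * matrix(c(1, .75, .5, .75, 1, .5, .5, .5, 1), 3, 3) # correlated subject-area effects
m <- MASS::mvrnorm(n, rep(0, 3), Sigma)
m <- m[order(rowMeans(m), decreasing = TRUE), ]
y <- m %x% matrix(rep(1, k), nrow = 1) + matrix(rnorm(n * k * 3), n, k * 3)
colnames(y) <- c(paste(rep("Math", k), 1:k, sep = "_"),
                 paste(rep("Science", k), 1:k, sep = "_"),
                 paste(rep("Arts", k), 1:k, sep = "_"))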
Our goal is to describe the student performances as succinctly as possible. For example, we want to know if these test results are all simply random independent numbers. Are all students just about as good? Does being good in one subject imply one will be good in another? How does the SVD help with all this? We will go step by step to show that with just three relatively small pairs of vectors, we can explain much of the variability in this \(100 \times 24\) dataset.
You can visualize the 24 test scores for the 100 students by plotting an image:
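The exercises below use a plotting helper called my_image; a minimal stand-in based on base R's image function (an assumption, not necessarily the book's exact helper) is:
my_image <- function(x, zlim = range(x), ...){
  image(1:ncol(x), 1:nrow(x), t(x[nrow(x):1, , drop = FALSE]),
        xlab = "", ylab = "", zlim = zlim, ...)
}
my_image(y)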
Use the sweep function to compute \(UD\) without constructing diag(s$d) and without using matrix multiplication.
8. We know that \(\mathbf{u}_1 d_{1,1}\), the first column of \(\mathbf{UD}\), has the most variability of all the columns of \(\mathbf{UD}\). Earlier we saw an image of \(Y\):
my_image(y)
in which we can see that the student to student variability is quite large and that it appears that students that are good in one subject are good in all. This implies that the average (across all subjects) for each student should explain a lot of the variability. Compute the average score for each student and plot it against \(\mathbf{u}_1 d_{1,1}\), and describe what you find.
\[
\mathbf{Y} \approx d_{1,1} \mathbf{u}_1 \mathbf{v}_1^{\top}
\]
We know it explains s$d[1]^2/sum(s$d^2) * 100 percent of the total variability. Our approximation only explains the observation that good students tend to be good in all subjects. Another aspect of the original data that our approximation does not explain is the higher similarity we observed within subjects. We can see this by computing the difference between our approximation and the original data and then computing the correlations. You can see this by running this code:
resid <- y - with(s, (u[, 1, drop = FALSE] * d[1]) %*% t(v[, 1, drop = FALSE]))
my_image(cor(resid), zlim = c(-1, 1))
axis(side = 2, 1:ncol(y), rev(colnames(y)), las = 2)
In Chapter 29, we provide a formal discussion of the mean squared error.
This approach will have our desired effect: when our sample size \(n_i\) is very large, we obtain a stable estimate and the penalty \(\lambda\) is effectively ignored since \(n_i+\lambda \approx n_i\). Yet when \(n_i\) is small, the estimate \(\hat{\beta}_i(\lambda)\) is shrunken towards 0. The larger the \(\lambda\), the more we shrink.
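As a concrete sketch, if \(\hat{\beta}_i\) would otherwise be estimated as an average of \(n_i\) residuals \(r_{u,i}\) (an assumption about the setup, stated here only to illustrate the shrinkage), the penalized estimate takes the form:
\[
\hat{\beta}_i(\lambda) = \frac{1}{n_i + \lambda} \sum_{u=1}^{n_i} r_{u,i} = \frac{n_i}{n_i + \lambda} \, \hat{\beta}_i,
\]
so the usual estimate is multiplied by a factor that approaches 1 when \(n_i\) is large and shrinks toward 0 when \(n_i\) is small.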
But how do we select \(\lambda\)? In Chapter 29, we describe an approach to do this. Here we will simply compute the RMSE for different values of \(\lambda\) to illustrate the effect:
Moneyball: The Art of Winning an Unfair Game by Michael Lewis focuses on the Oakland Athletics (A’s) baseball team and its general manager, Billy Beane, the person tasked with building the team.
Traditionally, baseball teams use scouts to help them decide what players to hire. These scouts evaluate players by observing them perform, tending to favor athletic players with observable physical abilities. For this reason, scouts generally agree on who the best players are and, as a result, these players are often in high demand. This in turn drives up their salaries.
From 1989 to 1991, the A’s had one of the highest payrolls in baseball. They were able to buy the best players and, during that time, were one of the best teams. However, in 1995, the A’s team owner changed and the new management cut the budget drastically, leaving then general manager, Sandy Alderson, with one of the lowest payrolls in baseball. He could no longer afford the most sought-after players. As a result, Alderson began using a statistical approach to find inefficiencies in the market. Alderson was a mentor to Billy Beane, who succeeded him in 1998 and fully embraced data science, as opposed to relying exclusively on scouts, as a method for finding low-cost players that data predicted would help the team win. Today, this strategy has been adapted by most baseball teams. As we will see, regression plays a significant role in this approach.
As motivation for this part of the book, let’s imagine it is 2002, and attempt to build a baseball team with a limited budget, just like the A’s had to do. To appreciate what you are up against, note that in 2002 the Yankees’ payroll of $125,928,583 more than tripled the Oakland A’s payroll of $39,679,746:
Statistics have been used in baseball since its beginnings. The dataset we will be using, included in the Lahman library, goes back to the 19th century. For example, a summary statistic we will describe soon, the batting average, has been used for decades to summarize a batter’s success. Other statistics1, such as home runs (HR), runs batted in (RBI), and stolen bases (SB), are reported for each player in the game summaries included in the sports section of newspapers, with players rewarded for high numbers. Although summary statistics such as these were widely used in baseball, data analysis per se was not. These statistics were arbitrarily chosen without much thought as to whether they actually predicted anything or were related to helping a team win.
This changed with Bill James2. In the late 1970s, this aspiring writer and baseball fan started publishing articles describing more in-depth analysis of baseball data. He named the approach of using data to determine which outcomes best predicted if a team would win sabermetrics3. Yet until Billy Beane made sabermetrics the center of his baseball operation, Bill James’ work was mostly ignored by the baseball world. Currently, sabermetrics’ popularity is no longer limited to just baseball, with other sports also adopting this approach.
To simplify the exercise, we will focus on scoring runs and ignore the two other important aspects of the game: pitching and fielding. We will see how regression analysis can help develop strategies to build a competitive baseball team with a constrained budget. The approach can be divided into two separate data analyses. In the first, we determine which recorded player-specific statistics predict runs. In the second, we examine if players were undervalued based on the predictions from our first analysis.
15.1.1 Baseball basics
To understand how regression helps us find undervalued players, we don’t need to delve into all the details of the game of baseball, which has over 100 rules. Here, we distill the sport to the basic knowledge necessary for effectively addressing the data science problem.
The goal of a baseball game is to score more runs (points) than the other team. Each team has 9 batters that have an opportunity to hit a ball with a bat in a predetermined order. After the 9th batter has had their turn, the first batter bats again, then the second, and so on. Each time a batter has an opportunity to bat, we call it a plate appearance (PA). At each PA, the other team’s pitcher throws the ball and the batter tries to hit it. The PA ends with a binary outcome: the batter either makes an out (failure) and returns to the bench, or the batter doesn’t (success) and can run around the bases, potentially scoring a run (reaching all 4 bases). Each team gets nine tries, referred to as innings, to score runs, and each inning ends after three outs (three failures).
Here is a video showing a success: https://www.youtube.com/watch?v=HL-XjMCPfio. And here is one showing a failure: https://www.youtube.com/watch?v=NeloljCx-1g. In these videos, we see how luck is involved in the process. When at bat, the batter wants to hit the ball hard. If the batter hits it hard enough, it is a HR, the best possible outcome as the batter gets at least one automatic run. But sometimes, due to chance, the batter hits the ball very hard and a defender catches it, resulting in an out. In contrast, sometimes the batter hits the ball softly, but it lands in just the right place. The fact that there is chance involved hints at why probability models will be involved.
Now, there are several ways to succeed. Understanding this distinction will be important for our analysis. When the batter hits the ball, the batter wants to pass as many bases as possible. There are four bases, with the fourth one called home plate. Home plate is where batters start by trying to hit, so the bases form a cycle.
A batter who goes around the bases and arrives home, scores a run.
We are simplifying a bit, but there are five ways a batter can succeed, meaning not make an out:
Bases on balls (BB): The pitcher fails to throw the ball through a predefined area considered to be hittable (the strike zone), so the batter is permitted to go to first base.
Single: The batter hits the ball and gets to first base.
Double (2B): The batter hits the ball and gets to second base.
Triple (3B): The batter hits the ball and gets to third base.
Home Run (HR): The batter hits the ball and goes all the way home and scores a run.
Here is an example of a HR: https://www.youtube.com/watch?v=xYxSZJ9GZ-w. If a batter reaches a base, the batter still has a chance of reaching home and scoring a run if the next batter succeeds with a hit. While the batter is on base, the batter can also try to steal a base (SB). If a batter runs fast enough, the batter can try to advance from one base to the next without the other team tagging the runner. Here is an example of a stolen base: https://www.youtube.com/watch?v=JSE5kfxkzfk.
All these events are tracked throughout the season and are available to us through the Lahman package. Now, we can begin discussing how data analysis can help us determine how to use these statistics to evaluate players.
15.1.2 No awards for BB
Historically, the batting average has been considered the most important offensive statistic. To define this average, we define a hit (H) and an at bat (AB). Singles, doubles, triples, and home runs are hits. The fifth way to be successful, BB, is not a hit. An AB is the number of times in which you either get a hit or make an out; BBs are excluded. The batting average is simply H/AB and is considered the main measure of a success rate. Today, this success rate ranges from 20% to 38%. We refer to the batting average in thousands so, for example, if your success rate is 28%, we call it batting 280.
One of Bill James’ first important insights is that the batting average ignores BB, but a BB is a success. Instead of batting average, James proposed the use of the on-base percentage (OBP), which he defined as (H+BB)/(AB+BB), or simply the proportion of plate appearances that don’t result in an out, a very intuitive measure. He noted that a player that accumulates many more BB than the average player might go unrecognized if the batter does not excel in batting average. But is this player not helping produce runs? No award is given to the player with the most BB. However, bad habits are hard to break and baseball did not immediately adopt OBP as an important statistic. In contrast, total stolen bases were considered important and an award8 was given to the player with the most. But players with high totals of SB also made more outs as they did not always succeed. Does a player with a high SB total help produce runs? Can we use data science to determine if it’s better to pay for players with high BB or SB?
15.1.3 Base on balls or stolen bases?
One of the challenges in this analysis is that it is not obvious how to determine if a player produces runs because so much depends on his teammates. Although we keep track of the number of runs scored by a player, remember that if player X bats right before someone who hits many HRs, batter X will score many runs. Note that these runs don’t necessarily happen if we hire player X but not his HR-hitting teammate.
15.1.4 Regression applied to baseball statistics
Can we use regression with these data? First, notice that the HR and Run data, shown above, appear to be bivariate normal. Specifically, the qqplots confirm that the normal approximation for each HR strata is useful here:
The ratio \(\frac{\mbox{AB}}{\mbox{PA}}\) will change from team to team. To assess its variability, compute and plot this quantity for each team for each year since 1962. Then plot it again, but instead of computing it for every team, compute and plot the ratio for the entire year. Then, once you are convinced that there is not much of a time or team trend, report the overall average.
17. So now we know that the formula for OPS is proportional to \(0.91 \times \mbox{BB} + \mbox{singles} + 2 \times \mbox{doubles} + 3 \times \mbox{triples} + 4 \times \mbox{HR}\). Let’s see how these coefficients compare to those obtained with regression. Fit a regression model to the data after 1962, as done earlier, using per-game statistics for each year for each team. After fitting this model, report the coefficients as weights relative to the coefficient for singles.
18. We see that our linear regression model coefficients follow the same general trend as those used by OPS, but with slightly less weight for metrics other than singles. For each team in years after 1962, compute the OPS and the predicted runs from the regression model, then compute the correlation between the two, as well as the correlation of each with runs per game.
Because we assume the standard deviation of the errors is constant, if we plot the absolute value of the residuals, it should appear constant.
We prefer plots rather than summaries based on, for example, correlation because, as noted in Section @ascombe, correlation is not always the best summary of association. The function plot applied to an lm object automatically plots these.
This function can produce six different plots, and the argument which lets you specify which you want to see. You can learn more by reading the plot.lm help file. However, some of the plots are based on more advanced concepts beyond the scope of this book. To learn more, we recommend an advanced book on regression analysis.
In fact, the proportion of players that have a lower batting average during their sophomore year is 0.6981132.
So is it “jitters” or “jinx”? To answer this question, let’s turn our attention to all the players that played the 2013 and 2014 seasons and batted more than 130 times (minimum to win Rookie of the Year).
The same pattern arises when we look at the top performers: batting averages go down for most of the top performers.
But these are not rookies! Also, look at what happens to the worst performers of 2013:
We introduced the kNN algorithm in Section 29.1. In Section 29.7.1, we noted that \(k=31\) provided the highest accuracy in the test set. Using \(k=31\), we obtain an accuracy of 0.825, an improvement over regression. A plot of the estimated conditional probability shows that the kNN estimate is flexible enough and does indeed capture the shape of the true conditional probability.
12. We can see what the data looks like if we add 1s to our 2 or 7 examples using this code:
Fit QDA using the qda function in the MASS package, then create a confusion matrix for predictions on the test set. Which of the following best describes the confusion matrix:
a. It is a two-by-two table.
b. Because we have three classes, it is a two-by-three table.
c. Because we have three classes, it is a three-by-three table.
d. Confusion matrices only make sense when the outcomes are binary.
The tissue_gene_expression dataset includes a matrix, tissue_gene_expression$x, with the gene expression measured on 500 genes for 189 biological samples representing seven different tissues. The tissue type is stored in tissue_gene_expression$y.
Fit a random forest using the randomForest function in the package randomForest. Then use the varImp function to see which are the top 10 most predictive genes. Make a histogram of the reported importance to get an idea of the distribution of the importance values.
Here is a plot of precision as a function of prevalence when both TPR and TNR are 95%:
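The plot itself is not reproduced here; a sketch of the calculation behind it, using Bayes’ theorem for the positive predictive value, is:
prevalence <- seq(0.01, 0.99, length.out = 100)
tpr <- 0.95 # sensitivity
tnr <- 0.95 # specificity
precision <- tpr * prevalence / (tpr * prevalence + (1 - tnr) * (1 - prevalence))
plot(prevalence, precision, type = "l")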
26.7 ROC and precision-recall curves
When comparing the two methods (guessing versus using a height cutoff), we looked at accuracy and \(F_1\). The second method clearly outperformed the first. However, while we considered several cutoffs for the second method, for the first we only considered one approach: guessing with equal probability. Be aware that guessing Male with higher probability would give us higher accuracy due to the bias in the sample:
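A sketch of such a guess (the objects test_set and its sex column are assumed from the earlier examples in this chapter):
n <- nrow(test_set)
y_hat <- sample(c("Male", "Female"), n, replace = TRUE, prob = c(0.9, 0.1)) |>
  factor(levels = levels(test_set$sex))
mean(y_hat == test_set$sex) # higher accuracy simply because most observations are Male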
However, the estimate \(\hat{\text{MSE}}\) is a random variable. In fact, \(\text{MSE}\) and \(\hat{\text{MSE}}\) are often referred to as the true error and apparent error, respectively. Due to the complexity of some machine learning algorithms, it is difficult to derive the statistical properties of how well the apparent error estimates the true error. In Chapter 29, we introduce cross-validation, an approach to estimating the MSE.
We end this chapter by pointing out that there are loss functions other than the squared loss. For example, the Mean Absolute Error uses absolute values, \(|\hat{Y}_i - Y_i|\), instead of squared errors, \((\hat{Y}_i - Y_i)^2\). However, in this book we focus on minimizing squared loss since it is the most widely used.
26.9 Exercises
The reported_heights and heights datasets were collected from three classes taught in the Departments of Computer Science and Biostatistics, as well as remotely through the Extension School. The Biostatistics class was taught in 2016 along with an online version offered by the Extension School. On 2016-01-25 at 8:15 AM, during one of the lectures, the instructors asked students to fill in the sex and height questionnaire that populated the reported_heights dataset. The online students filled in the survey during the next few days, after the lecture was posted online. We can use this insight to define a variable, call it type, to denote the type of student: inclass or online:
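A sketch of one way to define this variable (the column time_stamp comes from dslabs::reported_heights; the exact date and time filtering is an assumption for illustration):
library(dslabs)
library(dplyr)
library(lubridate)
data("reported_heights")
dat <- reported_heights |>
  mutate(date_time = ymd_hms(time_stamp),
         type = ifelse(year(date_time) == 2016 & month(date_time) == 1 & day(date_time) == 25 &
                         hour(date_time) == 8, "inclass", "online"))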
p_rf <- predict(fit_rf, x_test[, col_index], type = "prob")
p_rf <- p_rf / rowSums(p_rf)
p_knn_pca <- predict(fit_knn_pca, newdata)
31.9 Exercises
1. In the exercises in Chapter 30 we saw that changing maxnodes or nodesize in the randomForest function improved our estimate. Let’s use the train function to help us pick these values. From the caret manual we see that we can’t tune the maxnodes parameter or the nodesize argument with randomForest, so we will use the Rborist package and tune the minNode argument. Use the train function to try values minNode <- seq(5, 250, 25). See which value minimizes the estimated RMSE.
Split the data into training and test sets, then use kNN to predict tissue type and see what accuracy you obtain. Try it for \(k = 1, 3, \dots, 11\).
3. We are going to apply LDA and QDA to the tissue_gene_expression dataset. We will start with simple examples based on this dataset and then develop a realistic example.
Create a dataset with just the classes cerebellum and hippocampus (two parts of the brain) and a predictor matrix with 10 randomly selected columns. Estimate the accuracy of LDA.
Note that the accuracy does not change, but see how it is easier to identify the predictors that differ more between groups in the plot made in exercise 4.
8. In the previous exercises, we saw that both approaches worked well. Plot the predictor values for the two genes with the largest differences between the two groups in a scatterplot to see how they appear to follow a bivariate distribution as assumed by the LDA and QDA approaches. Color the points by the outcome.
9. Now we are going to increase the complexity of the challenge slightly: we will consider all the tissue types.
21. Previously, we compared the conditional probability \(p(\mathbf{x})\) given two predictors \(\mathbf{x} = (x_1, x_2)^\top\) to the fit \(\hat{p}(\mathbf{x})\) obtained with a machine learning algorithm by making image plots. The following code can be used to make these images and include a curve at the values of \(x_1\) and \(x_2\) for which the function is \(0.5\):
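The referenced code is not shown above; a sketch of such a plotting helper, assuming the mnist_27$true_p data frame from dslabs with columns x_1, x_2, and p, is:
library(tidyverse)
plot_cond_prob <- function(p_hat = NULL){
  tmp <- mnist_27$true_p
  if (!is.null(p_hat)) tmp <- mutate(tmp, p = p_hat) # replace the true p with an estimate
  tmp |>
    ggplot(aes(x_1, x_2, z = p, fill = p)) +
    geom_raster(show.legend = FALSE) +
    scale_fill_gradientn(colors = c("#F8766D", "white", "#00BFC4")) +
    stat_contour(breaks = 0.5, color = "black") # curve where the function equals 0.5
}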
Fit a kNN model and make this plot for the estimated conditional probability. Hint: Use the argument newdata = mnist_27$train to obtain predictions for a grid of points.
22. Notice that, in the plot made in exercise 1, the boundary is somewhat wiggly. This is because kNN, like the basic bin smoother, does not use a kernel. To improve this we could try loess. By reading through the available models part of the caret manual, we see that we can use the gamLoess method. We need to install the gam package, if we have not done so already. We see that we have two parameters to optimize:
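A sketch of what the tuning might look like (the span grid is an assumption; gamLoess exposes span and degree as tuning parameters in caret):
library(caret)
grid <- expand.grid(span = seq(0.15, 0.65, len = 10), degree = 1)
fit_loess <- train(y ~ ., method = "gamLoess", tuneGrid = grid, data = mnist_27$train)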
Plot the estimate \(\hat{p}(x,y)\) resulting from the model fit in exercise 2. How does the accuracy compare to that of kNN? Comment on the difference between this estimate and the one obtained with kNN.
24. Use the mnist_27 training set to build a model with several of the models available from the caret package. For example, you can try these:
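For instance, an illustrative set of caret method names (the book’s exact list may differ):
models <- c("glm", "lda", "naive_bayes", "knn", "gamLoess", "qda", "rf")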
32. Note that each method can also produce an estimated conditional probability. Instead of majority vote, we can take the average of these estimated conditional probabilities. For most methods, we can use type = "prob" when predicting. Note that some of the methods require you to use the argument trControl = trainControl(classProbs = TRUE) when calling train. Also, these methods do not work if classes have numbers as names. Hint: change the levels like this:
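A sketch of that relabeling (object names assumed to match the mnist_27 example used above):
levels(mnist_27$train$y) <- c("two", "seven")
levels(mnist_27$test$y) <- c("two", "seven")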