## Exercise 6.8.10
## Summer 2017
# 10. We have seen that as the number of features used in a model increases,
# the training error will necessarily decrease, but the test error may not.
# We will now explore this in a simulated data set.
# (a) Generate a data set with p = 20 features, n = 1,000 observations, and an associated quantitative response vector generated
# according to the model Y = Xβ + ε, where β has some elements that are exactly equal to zero.
# (b) Split your data set into a training set containing 100 observations
# and a test set containing 900 observations.
# (c) Perform best subset selection on the training set, and plot the
# training set MSE associated with the best model of each size.
# (d) Plot the test set MSE associated with the best model of each
# size.
# (e) For which model size does the test set MSE take on its minimum
# value? Comment on your results. If it takes on its minimum value
# for a model containing only an intercept or a model containing
# all of the features, then play around with the way that you are
# generating the data in (a) until you come up with a scenario in
# which the test set MSE is minimized for an intermediate model
# size.
# (f) How does the model at which the test set MSE is minimized
# compare to the true model used to generate the data? Comment
# on the coefficient values.
# (g) Create a plot displaying Σ_{j=1}^{p} (β_j − β̂_j^r)^2 for a range of values
# of r, where β̂_j^r is the jth coefficient estimate for the best model
# containing r coefficients. Comment on what you observe. How
# does this compare to the test MSE plot from (d)?
# (a) Generate a data set with p = 20 features, n = 1,000 observations, and an associated quantitative response vector generated
# according to the model Y = Xβ + ε, where β has some elements that are exactly equal to zero.
set.seed(88)
p = 20
n = 1000
x = matrix(rnorm(p*n), n, p)
B = rnorm(p)
## Setting some elements in B to zero.
B[4] = 0
B[5] = 0
B[13] = 0
B[16] = 0
B[20] = 0
## Create random err vector.
err = rnorm(n)
## Create response vector based on linear model Y = Xβ + ε.
y = x %*% B + err
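## Quick sanity check on the simulated data (our own addition, not required by the exercise):
## confirm which true coefficients are zero and the dimensions of X and Y.
which(B == 0)   # positions 4, 5, 13, 16, 20
dim(x)          # 1000 x 20
length(y)       # 1000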
# (b) Split your data set into a training set containing 100 observations
# and a test set containing 900 observations.
train = sample(seq(n), 100, replace = FALSE)
y.train = y[train, ]
y.test = y[-train, ]
x.train = x[train, ]
x.test = x[-train, ]
# (c) Perform best subset selection on the training set, and plot the
# training set MSE associated with the best model of each size.
library(leaps)
regfit = regsubsets(y ~ ., data = data.frame(x = x.train, y = y.train), nvmax = p)
train.matrix = model.matrix(y ~ ., data = data.frame(x = x.train, y = y.train))
train.errors = rep(NA, p)
## Compute the training MSE for the best model of each size.
for (i in 1:p) {
  coefi = coef(regfit, id = i)
  pred = train.matrix[, names(coefi)] %*% coefi
  train.errors[i] = mean((y.train - pred)^2)
}
plot(train.errors, ylab = "Training MSE", xlab="Number of Predictors", pch = 20, type = "b", col="purple")
# (d) Plot the test set MSE associated with the best model of each
# size.
test.matrix = model.matrix(y ~ ., data = data.frame(x = x.test, y = y.test))
test.errors = rep(NA, p)
## Compute the test MSE for the best model of each size.
for (i in 1:p) {
  coefi = coef(regfit, id = i)
  pred = test.matrix[, names(coefi)] %*% coefi
  test.errors[i] = mean((y.test - pred)^2)
}
plot(test.errors, ylab = "Test MSE", xlab="Number of Predictors", pch = 20, type = "b", col="blue")
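## The loops above build the model matrices by hand because leaps provides no predict() method for
## regsubsets objects. An equivalent helper is sketched below; the function is our own convention
## (naming it predict.regsubsets lets predict() dispatch to it), not part of the leaps package.
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])    # recover the formula passed to regsubsets()
  mat = model.matrix(form, newdata)      # design matrix for the new data
  coefi = coef(object, id = id)          # coefficients of the best model of size id
  mat[, names(coefi)] %*% coefi          # predictions
}
## Example usage, equivalent to the manual computation for a single model size:
## pred15 = predict(regfit, data.frame(x = x.test, y = y.test), id = 15)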
# (e) For which model size does the test set MSE take on its minimum
# value? Comment on your results. If it takes on its minimum value
# for a model containing only an intercept or a model containing
# all of the features, then play around with the way that you are
# generating the data in (a) until you come up with a scenario in
# which the test set MSE is minimized for an intermediate model
# size.
points(which.min(test.errors), test.errors[which.min(test.errors)], pch = 20, col="red")
sprintf("The best model size with the lowest MSE is: %d", which.min(test.errors))
## The test MSE plot shows its minimum at a predictor size of 18, marked by the red dot and
## reported in the console output. We obtained this data set by setting five coefficients
## (β4, β5, β13, β16, β20) to zero. We experimented with different combinations of zeroed
## coefficients; although some produced an optimal test MSE, the corresponding number of
## predictors was either too low or too high, which is undesirable. We then continued
## experimenting, repeating the train/test split and steps (a) through (e), to find the model
## with the lowest test MSE at an intermediate size. Across all of these experiments, the
## model with 15 predictors yielded the lowest test MSE and the best performance.
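## The experimentation described above can be automated. The helper below is a sketch with a
## hypothetical name (test_mse_min_size); it regenerates the data with a given set of zeroed
## coefficients, repeats steps (a) through (e), and reports the model size with the lowest test MSE.
test_mse_min_size = function(zero_idx, n = 1000, p = 20, seed = 88) {
  set.seed(seed)
  x = matrix(rnorm(p * n), n, p)
  B = rnorm(p)
  B[zero_idx] = 0
  y = x %*% B + rnorm(n)
  train = sample(seq(n), 100, replace = FALSE)
  fit = regsubsets(y ~ ., data = data.frame(x = x[train, ], y = y[train]), nvmax = p)
  test.mat = model.matrix(y ~ ., data = data.frame(x = x[-train, ], y = y[-train]))
  errs = sapply(1:p, function(i) {
    ci = coef(fit, id = i)
    mean((y[-train] - test.mat[, names(ci)] %*% ci)^2)
  })
  which.min(errs)
}
## Example: test_mse_min_size(c(4, 5, 13, 16, 20))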
# (f) How does the model at which the test set MSE is minimized
# compare to the true model used to generate the data? Comment
# on the coefficient values.
summary(regfit)
coef(regfit, which.min(test.errors))
## The model with 15 predictors gives the lowest test MSE, lower than that of the full
## 20-variable model, and it excludes every predictor whose true coefficient is zero.
## Only the larger models (sizes 16 through 20) are forced to include some of the
## zero-coefficient predictors. The selected model therefore matches the true model's
## support and achieves the best performance at an intermediate model size.
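## A side-by-side comparison of the selected model's coefficients with the true β (a small sketch,
## not part of the original solution) makes the comment above concrete: the predictors dropped by
## the best model should be exactly the ones whose true coefficients are zero.
best = which.min(test.errors)
coef.best = coef(regfit, id = best)
true.B = setNames(B, paste0("x.", 1:p))                        # name true coefficients like the fitted ones
comparison = data.frame(true = true.B,
                        estimated = coef.best[names(true.B)])  # NA where the predictor was dropped
round(comparison, 3)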
# (g) Create a plot displaying Σ_{j=1}^{p} (β_j − β̂_j^r)^2 for a range of values
# of r, where β̂_j^r is the jth coefficient estimate for the best model
# containing r coefficients. Comment on what you observe. How
# does this compare to the test MSE plot from (d)?
a = rep(NA, p)   # number of coefficients (excluding the intercept) in the best model of each size
b = rep(NA, p)   # distance between the estimated and the true coefficient vector
x_cols = colnames(x, do.NULL = FALSE, prefix = "x.")   # column names matching the regsubsets coefficient names
for (i in 1:p) {
  coefi = coef(regfit, id = i)
  a[i] = length(coefi) - 1
  ## Squared differences for the included predictors, plus the squared true coefficients
  ## of the excluded predictors (whose estimates are implicitly zero).
  b[i] = sqrt(sum((B[x_cols %in% names(coefi)] - coefi[names(coefi) %in% x_cols])^2) +
                sum(B[!(x_cols %in% names(coefi))]^2))
}
plot(x = a, y = b, xlab = "Number of Coefficients", ylab = "Estimated vs True Coefficients Error", pch = 20, col="red")
sprintf("The minimal difference between estimated and true coefficient is with coeffient number: %d", which.min(b))
## The console output and the plot above show that the model with 15 coefficients (excluding the
## intercept) minimizes the error between the estimated and the true coefficients. This agrees with
## the test MSE plot from (d): the 15-predictor model that minimizes the test MSE is also the one
## whose coefficient estimates are closest to the true coefficients.
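## To see the comparison with (d) at a glance, the two curves can be drawn side by side
## (a sketch reusing test.errors, a, and b computed above).
par(mfrow = c(1, 2))
plot(test.errors, xlab = "Number of Predictors", ylab = "Test MSE",
     pch = 20, type = "b", col = "blue")
plot(a, b, xlab = "Number of Coefficients",
     ylab = "Coefficient Estimation Error", pch = 20, col = "red")
par(mfrow = c(1, 1))
## Per the results above, both curves reach their minimum at the same intermediate model size,
## so the model that predicts best is also the one whose coefficients are closest to the truth.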