-
Notifications
You must be signed in to change notification settings - Fork 0
/
HW_2.qmd
235 lines (169 loc) · 12.6 KB
/
HW_2.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
---
title: "Homework 2"
subtitle: <center> <h1>Simple Linear Regression Model Assumptions</h1> </center>
author: <center> Madison Wozniak <center>
output: html_document
---
<style type="text/css">
h1.title {
font-size: 40px;
text-align: center;
}
</style>
```{r setup, include=FALSE}
# load any necessary packages here
library(tidyverse)
library(ggfortify) # plot lm objects using ggplot instead of base R
library(car) # Box-Cox transformation
library(onewaytests) # bf.test
```
## Data and Description
One key component of determining appropriate speed limits is the amount of distance that is required to stop at a given speed. For example, in residential neighborhoods, when pedestrians are commonly in the roadways, it is important to be able to stop in a very short distance to ensure pedestrian safety. The speed of vehicles may be useful for determining the distance required to stop at that given speed, which can aid public officials in determining speed limits.
The Stopping Distance data set compares the **distance (column 2)** (in feet) required for a car to stop on a certain rural road against the **speed (column 1)** (MPH) of the car. Download the StoppingDistance.txt file from Canvas, and put it in the same folder as this quarto file.
#### 0. Replace the text "< PUT YOUR NAME HERE >" (above next to "author:") with your full name.
#### 1. Read in the data set, and call the data frame "stop".
```{r}
stop_data <- read_table("StoppingDistance.txt")
```
#### 2. Create a scatterplot of the data with variables on the appropriate axes (think about which variable makes the most sense to be the response). Make your plot look professional (make sure the axis labels are descriptive).
```{r, fig.align='center'}
base_plot <- ggplot(data = stop_data, aes(x = Speed, y = Distance))+
geom_point() +
ggtitle("Speed vs Stopping Distance") +
xlab("Speed (MPH)") +
ylab("Distance (feet)")
base_plot
```
#### 3. Briefly describe the relationship between Speed and Distance. (Hint: you should use 2 or 3 key words.)
There is a strong, positive, linear relationship between Speed and Distance.
#### 4. Add the OLS regression line to the scatterplot you created in question 2 (note: if you receive a warning about rows with missing values, you may need to adjust an axis limit using `scale_y_continuous(limits = c(###, ###))`).
```{r, fig.align='center'}
base_plot +
geom_smooth(mapping = aes(x = Speed, y = Distance),
method = "lm",
se = FALSE)
```
#### 5. (a) Apply linear regression to the data (no transformations). (b) Print out a summary of the results from the `lm` function. (c) Save the residuals and fitted values to the `stop` dataframe.
```{r}
stop.lm <- lm(Speed~Distance, data = stop_data)
summary(stop.lm)
stop_data$residuals <- stop.lm$residuals
stop_data$fitted_values <- stop.lm$fitted.values
```
#### 6. Mathematically write out the **fitted** simple linear regression model for this data set using the coefficients you found above. Do not use "x" and "y" in your model - use variable names that are fairly descriptive.
$\text{Distance(ft)}_i$ = $\beta_0$ + $\beta_1$ $\times$ $\text{Speed(mph)}_i$ + $\epsilon_i$
### Questions 7-11 involve using diagnostics to determine if the linear regression assumptions are met. For each assumption, (1) perform appropriate diagnostics to determine if the assumption is violated, and (2) explain whether or not you think the assumption is violated and why you think that.
#### 7. (L) X vs Y is linear
```{r, fig.align='center'}
autoplot(stop.lm, which = 1, ncol = 1, nrow = 1) +
theme(aspect.ratio = 1)
```
Because the blue line doesn't hover around zero for all the data, and dips far below zero toward the end of the graph, this shows that it does not pass the linearity assumption.
#### 8. (I) The residuals are independent (no diagnostic tools required in this particular instance - just think about how the data was collected and briefly write your thoughts)
This assumption is met becuase there is no time component in the data collection process, so we can assume the errors are independent.
#### 9. (N) The residuals are normally distributed. Use at least two diagnostic tools.
```{r, fig.align='center'}
ggplot(data = stop_data, mapping = aes(y = residuals)) +
geom_boxplot() +
scale_x_discrete() +
labs(title = "Residual Boxplot")
ggplot(data = stop_data, mapping = aes(x = residuals)) +
geom_histogram(mapping = aes(y = ..density..), binwidth = 0.5) +
stat_function(fun = dnorm, color = "red", size = 2,
args = list(mean = mean(stop_data$residuals),
sd = sd(stop_data$residuals))) +
labs(x = "Residuals", y = "Density", title = "Residual Histogram")
```
The boxplot median is very close to zero, and is not left or right skewed. The histogram generated also shows that out residuals are centered around zero and follows a bell-shaped curve. Both graphs suggest the errors form a normal distribution. The Shapiro-Wilk test also returns a p-value of 0.5 so we can fail to reject the null hypothesis and conclude normality.
#### 10. (E) The residuals have equal/constant variance across all values of X.
```{r, fig.align='center'}
autoplot(stop.lm, which = 3, ncol = 1, nrow = 1)
```
The blue line is mostly horizontal and linear, except for a gradual increase toward the right of the graph.
#### 11. Check if there are any influential points deserving of attention.
```{r, fig.align='center'}
autoplot(stop.lm, which = 4, ncol = 1, nrow = 1)
theme(aspect.ratio = 1)
cd_cont_pos <- function(leverage, level, model) {sqrt(level*length(coef(model))*(1-leverage)/leverage)}
cd_cont_neg <- function(leverage, level, model) {-cd_cont_pos(leverage, level, model)}
cd_threshold <- 0.5
autoplot(stop.lm, which = 5) +
stat_function(fun = cd_cont_pos,
args = list(level = cd_threshold, model = stop.lm),
xlim = c(0, 0.25), lty = 2, colour = "red") +
stat_function(fun = cd_cont_neg,
args = list(level = cd_threshold, model = stop.lm),
xlim = c(0, 0.25), lty = 2, colour = "red") +
scale_y_continuous(limits = c(-4, 4))
```
There is one potential influential point labelled '60' on this graph.
#### 12. Based on your analysis of the diagnostic measures, briefly discuss why this simple linear regression model on the raw data (not transformed) is not appropriate.
This simple linear regression model would not be appropriate because the linearity assumption is not met, constant variance is also not met, and there is one influential point that influences the trend of our model.
#### 13. Fix the model by making any necessary transformations. Justify the transformation(s) you chose in words. (Note: if boxCox(mod) throws an error, replace mod with the formula for the linear model, y ~ x.) (Note: you will most likely need to repeat questions 13 and 14 until you are satisfied with the transformation(s) you chose. Only then should you fill out this section - I only want to see the model you end up choosing, not all of your attempted models.)
For each individual transformation, I started by checking the linearity assumption for each one. Transforming both x and y provided me with the best horizontal straight line on the linearity graph so I decided that would be the most effective modelto use.
```{r, fig.align='center'}
invTranPlot(Distance ~ Speed, data = stop_data, lambda = c(-.5, 0, .5), optimal = TRUE)
boxCox(stop_data$Speed~stop_data$Distance)
stop_transxy_lm <- lm(log(Distance)~log(Speed), data = stop_data)
stop_data$residuals_logxy <- stop_transxy_lm$residuals
stop_data$fitted_logxy <- stop_transxy_lm$fitted.values
```
#### 14. Now, re-check your transformed model and verify that the assumptions (the assumptions that were addressed in questions 7 to 11 above) are met. Provide a brief discussion about how each of the previously violated assumptions are now satisfied. Also, provide the code you used to assess adherence to the assumptions. (Note that transforming will not change your responses about (I) the residuals being independent, so you can skip that assumption here.
#####linearity
```{r, fig.align='center'}
autoplot(stop_transxy_lm, which = 1, ncol = 1, nrow = 1) +
theme(aspect.ratio = 1)
```
The graph now shows a much more horizontal line than before transformation happened, so we can conclude it is more linear than prior.
#####normality
```{r}
ggplot(data = stop_data, mapping = aes(y = residuals_logxy)) +
geom_boxplot() +
scale_x_discrete() +
labs(title = "Residual Boxplot")
ggplot(data = stop_data, mapping = aes(x = residuals_logxy)) +
geom_histogram(mapping = aes(y = ..density..), binwidth = 0.5) +
stat_function(fun = dnorm, color = "red", size = 2,
args = list(mean = mean(stop_data$residuals_logxy),
sd = sd(stop_data$residuals_logxy))) +
labs(x = "Residuals", y = "Density", title = "Residual Histogram")
```
The boxplot and histogram are centered at 0 with no left or right skew, so it satisfies the normality assumption.
#####constant variance
```{r}
autoplot(stop_transxy_lm, which = 3, ncol = 1, nrow = 1)
```
The blue line on the graph is linear through the center of the graph so we can conclude that the variance of the errors is constant.
#####influential points
```{r}
autoplot(stop_transxy_lm, which = 4, ncol = 1, nrow = 1)
theme(aspect.ratio = 1)
cd_cont_pos <- function(leverage, level, model) {sqrt(level*length(coef(model))*(1-leverage)/leverage)}
cd_cont_neg <- function(leverage, level, model) {-cd_cont_pos(leverage, level, model)}
cd_threshold <- 0.5
autoplot(stop_transxy_lm, which = 5) +
stat_function(fun = cd_cont_pos,
args = list(level = cd_threshold, model = stop_transxy_lm),
xlim = c(0, 0.25), lty = 2, colour = "red") +
stat_function(fun = cd_cont_neg,
args = list(level = cd_threshold, model = stop_transxy_lm),
xlim = c(0, 0.25), lty = 2, colour = "red") +
scale_y_continuous(limits = c(-4, 4))
```
The condition for points to be influential is having a distance greater than 0.4, and all points in our model are below this threshold so there are no influential points.
#### 15. Mathematically write out the fitted simple linear regression model for this data set using the coefficients you found above from your transformed model. Do not use "x" and "y" in your model - use variable names that are fairly descriptive.
$\text{log(Distance(ft)}_i)$ = $\beta_0$ + $\beta_1$ $\times$ $\text{log(Speed(mph)}_i)$ + $\epsilon_i$
#### 16. Plot your new fitted *curve* on the scatterplot of the original data (on the original scale - not the transformed scale). Do you think this model fits the data better than the original model?
```{r}
# Tip: To get the fitted curve for your transformed model on the original scale, you have to invert the transformation on the response. Consider the following example. Suppose the transformed response is Dlog = log(Distance) (this is not necessarily the transformations you should use, it is just an example). If 'fit' is a variable containing the fitted values on the transformed scale (i.e., they are fitted values for Dlog), then the fitted values for Distance would be efit = exp(fit), since exp() is the inverse of log().
ggplot(stop_data, aes(x = Speed, y = Distance)) +
geom_point() +
geom_smooth(method = "lm",
formula = exp(stop_transxy_lm$fitted.values) ~ x) +
labs(title= "Speed vs Stopping Distance",
x = "Speed (mph)",
y = "Distance (ft)")
```
This model fits the data better, there is less distance between our transformed model and data points.
#### 17. Briefly summarize (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a non-statistician (do not include any numbers or software output).
The purpose of this dataset is to understand and find the relationship between stopping distance and speed. Stopping distance is a dependent (response) variable that changes based on the speed the car is travelling at. From my analysis I learned that there is a strong positive relationship between the two variables. The faster a car is going, the longer the stopping distance needs to be to avoid collisions. A log transformed model is the best model to use to explain our data because the original simple linear regression model failed to meet certain assumptions necessary to allow us to use the model to make inferences and conclusions about what is going on. After transformations were made and assumptions were met, we can use our model to make predictions on data within the range of our model's data.