---
title: "Homework 6"
subtitle: <center> <h1>Multiple Linear Regression Additional Variable Types</h1> </center>
author: <center> Madison Wozniak </center>
output: html_document
---
<style type="text/css">
h1.title {
font-size: 40px;
text-align: center;
}
</style>
<span style="color:red"> **GRADING RUBRIC (46 possible points)** </span>
* <span style="color:red"> **45 possible points for correctly answering questions** </span>
* <span style="color:red"> **1 possible point for correctly formatting/submitting the assignment.** </span>
```{r setup, include=FALSE}
# load packages here
library(multcomp)
library(tidyverse)
```
## Data and Description
**Note that for the sake of length for this homework assignment, I am not having you check the model assumptions. You certainly can, if you would like, and in "real life" you would definitely need to do this prior to any statistical inference.**
Macroeconomists often speculate that life expectancy is linked with the economic well-being of a country. Macroeconomists also hypothesize that Organisation for Economic Co-operation and Development (OECD) (an international think tank charged with promoting policies that will improve global social and economic well-being) members will have longer life expectancy. To test these hypotheses, the LifeExpectancy.txt data set (found on Canvas) contains the following information:
Variable | Description
-------- | -------------
LifeExp | Average life expectancy in years
Country | Country name
Group | Is the country a member of OECD, Africa, or other?
PPGDP | Per person GDP (on the log scale)
The Group variable indicates whether the country is a member of the OECD, is on the African continent, or belongs to neither group (other). Note that the Country variable is just for your reference - you will not use this variable in your model.
Download LifeExpectancy.txt, and put it in the same folder as this .qmd file.
#### 0. Replace the text "< PUT YOUR NAME HERE >" (above next to "author:") with your full name.
#### 1. Read in the data set, call it "life", remove the "Row" column, and change the class of any categorical variables to factor variables. Print a summary of the data and make sure the data makes sense.
```{r, fig.align='center'}
life <- read_table("LifeExpectancy.txt")
life <- life[-1]
life <- life|>
mutate(Group = as.factor(Group))
summary(life)
```
#### 2. Show a scatterplot with the response on the $y$-axis and the other continuous variable on the $x$-axis. Comment on the relationship between these two variables.
```{r, fig.align='center'}
ggplot(life, aes(x = PPGDP, y = LifeExp))+
geom_point()
```
There is a strong, positive linear relationship between life expectancy and PPGDP (log scale). There also appears to be a jump in life expectancy that PPGDP alone does not explain, which suggests that a variable not shown in the plot shifts life expectancy for some countries.
#### 3. Create and print a boxplot with the response on the $y$-axis and the categorical variable on the $x$-axis. Comment on the relationship between these two variables.
```{r, fig.align='center'}
ggplot(life, aes(x = Group, y = LifeExp))+
geom_boxplot()
```
`africa` appears to have the lowest life expectancy and `oecd` the highest. `oecd` also has the smallest range of life expectancy.
#### 4. Create and print a color-coded scatterplot using all of the variables that will be in your model. Hint: plot the response on the $y$-axis, the other continuous variable on the $x$-axis, and color the points by the categorical variable.
```{r, fig.align='center'}
ggplot(life, aes(x = PPGDP, y = LifeExp, color = Group))+
geom_point()
```
#### 5. Write out the theoretical model (using Greek letters/parameters) that includes main effects for Per Person GDP and the group of the country (you will not write out the fitted model using coefficients, because you have not fit a model yet;)). DO NOT include interactions at this step. Remember, you will need to use dummy variables for Group. **USE "other" AS THE BASELINE CATEGORY**. Use variable names that are descriptive (not $y$, $x_1$, etc.).
$\text{LifeExp}_i = \beta_0 + \beta_1 \, I(\text{Group}_i = \text{africa}) + \beta_2 \, I(\text{Group}_i = \text{oecd}) + \beta_3 \, \text{PPGDP}_i + \epsilon_i$
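Written out group by group, the main-effects model above implies three parallel lines. They share the PPGDP slope $\beta_3$ and differ only in their intercepts:

$$
\begin{aligned}
\text{other:} \quad & E(\text{LifeExp}_i) = \beta_0 + \beta_3 \, \text{PPGDP}_i \\
\text{africa:} \quad & E(\text{LifeExp}_i) = (\beta_0 + \beta_1) + \beta_3 \, \text{PPGDP}_i \\
\text{oecd:} \quad & E(\text{LifeExp}_i) = (\beta_0 + \beta_2) + \beta_3 \, \text{PPGDP}_i
\end{aligned}
$$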
#### 6. Fit the multiple linear regression model from question 5 to the data (no transformations, no interactions, etc.) **using dummy variables that you create manually**. *USE "other" AS THE BASELINE CATEGORY FOR GROUP*. Print a summary of the results.
```{r, fig.align='center'}
levels(life$Group) # order of levels originally
life$Group <- factor(life$Group, levels = c("africa", "oecd", "other"))
life$GroupAfrica <- ifelse(life$Group == "africa", 1, 0)
life$GroupOECD <- ifelse(life$Group == "oecd", 1, 0)
life.lm_no_int <- lm(LifeExp ~ PPGDP + GroupAfrica + GroupOECD, data = life)
summary(life.lm_no_int)
```
#### 7. Fit the multiple linear regression model from question 5 again, but this time let R create the dummy variables for you in the lm function. As before, *USE "other" AS THE BASELINE CATEGORY FOR GROUP*. Print a summary of the results and make sure they are identical to the results from question 6.
```{r, fig.align='center'}
life <- life |>
mutate(Group = fct_relevel(Group, "other"))
life.lm_r <- lm(LifeExp ~ PPGDP + Group, data = life)
summary(life.lm_r)
```
The results are identical to question 6.
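One way to confirm that a manual-dummy fit and a factor fit agree is to compare their coefficients programmatically rather than by eye. The sketch below uses only base R and a small simulated data frame. The `toy` data, its values, and the seed are made up for illustration and are not the LifeExpectancy data.

```r
# Simulated stand-in for the homework data (hypothetical values)
set.seed(1)
toy <- data.frame(
  PPGDP = runif(30, 5, 11),
  Group = factor(rep(c("africa", "oecd", "other"), each = 10))
)
toy$LifeExp <- 50 + 3 * toy$PPGDP - 12 * (toy$Group == "africa") +
  2 * (toy$Group == "oecd") + rnorm(30)

# Manual dummy variables, with "other" as the baseline
toy$GroupAfrica <- as.numeric(toy$Group == "africa")
toy$GroupOECD <- as.numeric(toy$Group == "oecd")
fit_manual <- lm(LifeExp ~ PPGDP + GroupAfrica + GroupOECD, data = toy)

# Let R build the dummies after making "other" the reference level
toy$Group <- relevel(toy$Group, ref = "other")
fit_factor <- lm(LifeExp ~ PPGDP + Group, data = toy)

# Coefficient names differ (GroupAfrica vs. Groupafrica), so compare values only
same_fit <- isTRUE(all.equal(unname(coef(fit_manual)), unname(coef(fit_factor))))
same_fit
```

The comparison relies on both models listing their coefficients in the same order (intercept, PPGDP, africa, oecd), which holds here because "other" is the reference level.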
#### 8. Briefly interpret the intercept (like we did in class). **Note that you will need to use the word "average" twice since you are predicting an average already (i.e. the response variable is a country's average life expectancy).** You will need to do this here and with the questions following, when interpreting.
On average, the average life expectancy for a country in the `other` group with a per person GDP (log scale) of 0 is 50.96 years.
#### 9. Briefly interpret the coefficient for PPGDP. You do not need to un-transform anything or interpret this in the percentage change framework - you can just write something like "for every one unit increase in per person GDP (log scale)" in your response.
Holding group constant, for every one unit increase in per person GDP (log scale), the average life expectancy of a country increases by 2.87 years, on average.
#### 10. For equal per person GDP (log scale), how does life expectancy change for countries that are members of the OECD compared to countries that are on the African continent? Show how you obtained this number, and briefly interpret this number (like we did in class).
For countries in the OECD, on average, the average life expectancy is 13.8241 years longer than for countries on the African continent with equal per person GDP (log scale). This number was obtained by subtracting the coefficient for `Groupafrica` from the coefficient for `Groupoecd`.
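Using the parameter labels from question 5 ($\beta_1$ for africa, $\beta_2$ for oecd, $\beta_3$ for PPGDP), the subtraction works because the intercept and the PPGDP term cancel when both countries have the same PPGDP:

$$
(\beta_0 + \beta_2 + \beta_3 \, \text{PPGDP}) - (\beta_0 + \beta_1 + \beta_3 \, \text{PPGDP}) = \beta_2 - \beta_1
$$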
#### 11. Create 95% confidence intervals for all coefficients (use the `confint` function). You do not need to interpret them in this question.
```{r, fig.align='center'}
confint(life.lm_no_int, level = 0.95)
```
#### 12. Briefly interpret the 95% confidence interval for I(Group=Africa).
We are 95% confident that, for equal PPGDP (log scale), a country in Africa will have, on average, an average life expectancy between 11.7865 and 12.802 years lower than a country in the `other` group.
#### 13. Use the `anova` function to conduct a hypothesis test that tests a reduced model compared to the full model. Specifically, test if Group has a significant effect on LifeExp. What do you conclude from the result of the test? Hint: you will need to create another linear model and compare it with the one you made previously.
```{r, fig.align='center'}
# The reduced model drops Group, so the F-test assesses Group's contribution
life.lm_small <- lm(LifeExp ~ PPGDP, data = life)
anova(life.lm_small, life.lm_no_int)
```
Since the p-value from the F-test is less than 0.05, we conclude that `Group` has a significant effect on `LifeExp`.
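In terms of the parameters from question 5, a test of whether Group matters compares a reduced model without the two Group dummy variables to the full model. A sketch of the hypotheses and the test statistic, where SSE is the residual sum of squares and df is the residual degrees of freedom of each model:

$$
H_0: \beta_1 = \beta_2 = 0 \qquad \text{vs.} \qquad H_a: \text{at least one of } \beta_1, \beta_2 \text{ is not } 0
$$

$$
F = \frac{(SSE_{reduced} - SSE_{full}) / (df_{reduced} - df_{full})}{SSE_{full} / df_{full}}
$$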
#### 14. Create a 95% prediction interval for the life expectancy of a country in the OECD with an average per person GDP (log scale) of 9.5. Print the result, and briefly interpret this interval (like we did in class). (Use the `predict` function.)
```{r, fig.align='center'}
predict(life.lm_no_int,
newdata = data.frame(
PPGDP = 9.5,
GroupOECD = 1,
GroupAfrica = 0),
interval = "prediction",
level = 0.95)
```
We are 95% confident that the average life expectancy of a single country in the OECD with a PPGDP (log scale) of 9.5 will be between 77.654 and 81.982 years.
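A prediction interval is about a single new observation, whereas a confidence interval is about the mean response. The sketch below contrasts the two with `predict` on simulated data. The variables `x` and `y`, their values, and the seed are hypothetical and are not taken from the LifeExpectancy data.

```r
# Simulated data (hypothetical), one predictor for simplicity
set.seed(2)
x <- runif(40, 5, 11)
y <- 50 + 3 * x + rnorm(40, sd = 2)
fit <- lm(y ~ x)

new_x <- data.frame(x = 9.5)
conf_int <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)
pred_int <- predict(fit, newdata = new_x, interval = "prediction", level = 0.95)

# Same point estimate, but the prediction interval is wider because it
# also accounts for the variability of one new observation
rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```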
#### 15. Plot the fitted model on the scatterplot with the two continuous variables on the axes, colored by the categorical variable. Hint: you should have 3 different lines on your plot, and you will *not* need to have different line types or point shapes (you *will* need to have different colors).
```{r, fig.align='center'}
ggplot(life, aes(x = PPGDP, y = LifeExp, color = Group)) +
geom_point() +
geom_line(aes(x = PPGDP, y = predict(life.lm_no_int), color = Group))
```
#### 16. Fit a multiple linear regression model to the data, where this time you **include an interaction term** between PPGDP and Group. *USE "other" AS THE BASELINE CATEGORY FOR GROUP*. Print a summary of the results.
```{r, fig.align='center'}
life.lm_int <- lm(LifeExp ~ PPGDP + Group + Group:PPGDP, data = life)
summary(life.lm_int)
```
#### 17. Write out the fitted model (using coefficients values from above) for a model with PPGDP, Group, and an interaction between PPGDP and Group. Remember, you will need to use dummy variables for Group. **USE "other" AS THE BASELINE CATEGORY**. Use variable names that are descriptive (not $y$, $x_1$, etc.).
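$\widehat{\text{LifeExp}}_i = 50.42403 - 11.89511 \times I(\text{Group}_i = \text{africa}) + 11.29201 \times I(\text{Group}_i = \text{oecd}) + 2.93882 \times \text{PPGDP}_i - 0.04128 \times I(\text{Group}_i = \text{africa}) \times \text{PPGDP}_i - 0.95268 \times I(\text{Group}_i = \text{oecd}) \times \text{PPGDP}_i$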
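Plugging each group's indicator values into the fitted model above gives one line per group. The intercepts and slopes below are simple sums of the coefficients already shown:

$$
\begin{aligned}
\text{other:} \quad & \widehat{\text{LifeExp}} = 50.42403 + 2.93882 \, \text{PPGDP} \\
\text{africa:} \quad & \widehat{\text{LifeExp}} = (50.42403 - 11.89511) + (2.93882 - 0.04128) \, \text{PPGDP} = 38.52892 + 2.89754 \, \text{PPGDP} \\
\text{oecd:} \quad & \widehat{\text{LifeExp}} = (50.42403 + 11.29201) + (2.93882 - 0.95268) \, \text{PPGDP} = 61.71604 + 1.98614 \, \text{PPGDP}
\end{aligned}
$$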
#### 18. Use the `anova` function to test if the overall interaction between PPGDP and Group is significant. Print the result. What do you conclude (full sentence)?
```{r, fig.align='center'}
# Reduced (no interaction) model first, then the full (interaction) model
anova(life.lm_no_int, life.lm_int)
```
With a p-value less than 0.05, we can conclude that the interaction between `Group` and `PPGDP` is significant and should be included in our final model.
#### 19. Plot the fitted model (with the interaction included) on the scatterplot with the two continuous variables on the axes, colored by the categorical variable. Hint: you should have 3 different lines on your plot, and you will *not* need to have different line types or point shapes (you *will* need to have different colors).
```{r, fig.align='center'}
ggplot(life, aes(x = PPGDP, y = LifeExp, color = Group)) +
geom_point() +
geom_line(aes(x = PPGDP, y = predict(life.lm_int), color = Group))
```
#### 20. How did the fitted lines change when you included an interaction term compared with when you did not include an interaction term?
The biggest change occurred in the `oecd` group: the right end of its fitted line was pulled down toward the `other` line. The fitted lines are no longer parallel, which suggests that the interaction term is important.
#### 21. What is the estimated effect of PPGDP on LifeExp for countries in a country other than those in the OECD or Africa (i.e. in the "other" category)? You should report this number in a complete sentence (as done in class toward the end of the notes). Since this is a continuous-categorical interaction, and since we are focusing on the effect of the continuous variable, you should use the "one unit increase" terminology in your response.
For every one unit increase in `PPGDP` (log scale), the average life expectancy of a country in the `other` group increases by 2.93882 years, on average.
#### 22. What is the p-value for the test of whether the effect of PPGDP on LifeExp is different between countries in the OECD group and countries in the "other" group?
p = 9.88e-09
#### 23. What is the p-value for the test of whether the effect of PPGDP on LifeExp is different between countries in the OECD group and countries in the African continent? (use the glht() function from the multcomp package to get an answer to this question.)
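```{r, fig.align='center'}
# Coefficient order in life.lm_int: intercept, PPGDP, the africa and oecd
# group terms, then the africa and oecd interaction terms with PPGDP.
# The contrast is (oecd interaction) - (africa interaction).
slope_contrast <- matrix(c(0, 0, 0, 0, -1, 1), nrow = 1)
summary(glht(life.lm_int, linfct = slope_contrast))
```
The p-value for this test is the `Pr(>|t|)` value reported in the `glht` summary above.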
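Comparing the PPGDP slopes of two non-baseline groups amounts to testing a linear combination of the two interaction coefficients. The sketch below computes that contrast by hand in base R on simulated data, so the pieces of the test are visible. The `toy` data, its slopes, and the seed are hypothetical. Passing the same contrast row to `glht` from multcomp should give the same estimate and standard error.

```r
# Simulated data (hypothetical values, not the LifeExpectancy data)
set.seed(3)
toy <- data.frame(
  PPGDP = runif(60, 5, 11),
  Group = factor(rep(c("other", "africa", "oecd"), each = 20),
                 levels = c("other", "africa", "oecd"))
)
true_slope <- unname(c(other = 3, africa = 2.9, oecd = 2)[as.character(toy$Group)])
toy$LifeExp <- 50 + true_slope * toy$PPGDP + rnorm(60)

# PPGDP * Group expands to PPGDP + Group + PPGDP:Group
fit <- lm(LifeExp ~ PPGDP * Group, data = toy)
names(coef(fit))  # check the coefficient order before building the contrast

# Contrast: (oecd slope adjustment) - (africa slope adjustment)
k <- setNames(rep(0, length(coef(fit))), names(coef(fit)))
k["PPGDP:Groupoecd"] <- 1
k["PPGDP:Groupafrica"] <- -1

estimate <- sum(k * coef(fit))
std_err <- sqrt(drop(t(k) %*% vcov(fit) %*% k))
t_stat <- estimate / std_err
p_value <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)
c(estimate = estimate, std_err = std_err, t = t_stat, p = p_value)
```

Building the contrast by coefficient name rather than by position guards against placing the 1 and -1 in the wrong columns.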
#### 24. What is the effect of PPGDP on LifeExp for countries in the OECD (relative to the reference group)? You should report a number in a complete sentence (as done in class toward the end of the notes). Since this is a continuous-categorical interaction, and since we are focusing on the effect of the continuous variable, you should use the "one unit increase" terminology in your response.
```{r, fig.align='center'}
# OECD slope = PPGDP coefficient + OECD-by-PPGDP interaction coefficient
2.93882 - 0.95268
```
For every one unit increase in `PPGDP` (log scale), the average life expectancy of a country in the `oecd` group increases by 1.98614 years, on average.
#### 25. Conditional on having a PPGDP of 9, what is the estimated effect of belonging to the OECD relative to being in the "other" country group? You should report a number in a complete sentence (as done in class toward the end of the notes).
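```{r, fig.align='center'}
# OECD vs. other at PPGDP = 9: OECD coefficient + 9 * interaction coefficient
# (the PPGDP main effect is the same for both groups, so it cancels)
11.29201 - 0.95268 * 9
```
At a `PPGDP` (log scale) of 9, the average life expectancy of a country in the `oecd` group is, on average, 2.71789 years longer than that of a country in the `other` group with the same `PPGDP`.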
#### 26. Briefly summarize what you learned from this analysis *to a non-statistician*. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.
This analysis looked at the relationship between a country's economic well-being, measured by its per person GDP, and its life expectancy. The goal was to understand how per person GDP and country group (Africa, OECD, or other) relate to life expectancy. The key findings were that countries with higher per person GDP tend to have longer life expectancy, and that OECD countries tend to have longer life expectancy than countries in Africa or in the other group. The benefit associated with higher per person GDP also differs by group, so the same economic gain does not correspond to the same gain in life expectancy everywhere. Overall, the analysis provides insight into how economics and region relate to life expectancy.