---
title: "Stats 1770 in R"
output:
html_notebook:
toc: yes
theme: spacelab
---
<style type="text/css">
a {color: #3d0066; font-size: 11px;}
p {color: #000000; font-family: "monaco"; font-size: 13px;}
h1 {color: #3d0066 }
h2 {color: #6b00b3 }
h3 {color: #9900ff }
h4 {color: #cc0099 }
ul {list-style-type: circle; color: #6b00b3 }
ol {color: #cc0099; font-size: 11px; }
p1 {color: #004d4d;}
li {color: #004d4d;}
</style>
![](RStudio-Logo 1.png)
# The basics
Distribution shapes: symmetric, left-skewed, or right-skewed
- Population size = N
- Central tendency is where the data gathers
- **mean** the average
- **median** the middle value
- **mode** the most frequently occurring data item
- **range** largest data item minus smallest data item
- **variance** average squared distance from the mean, $\sigma^2$
- **standard deviation** square root of the variance, $\sigma$
- **proportion (p)** the number of items in a category divided by the total
- **InterQuartile Range (IQR)** 3rd quartile minus the 1st quartile
- **sample variance** $s^2$
- **sample mean** $\bar{x}$, x with a bar on top
## Simple stats
```{r}
sample.pop <- 1:20   # a toy "population": the integers 1 through 20
mean(sample.pop)     # arithmetic mean
var(sample.pop)      # sample variance
sd(sample.pop)       # sample standard deviation
```
**Central Limit Theorem** = for samples of size n drawn from a population with mean $\mu$ and variance $\sigma^2$, the sampling distribution of the sample mean is approximately Normal with mean $\mu$ and variance $\sigma^2 / n$ once n is large (a common rule of thumb is n > 30)
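A quick simulation sketch of the CLT with a made-up, skewed (exponential) population; the names (`sample.means`, the sample size, the number of replicates) are all just illustrative choices:
```{r}
# Draw many samples of size n from a skewed (exponential) population
# and look at the distribution of the sample means.
set.seed(42)
n <- 30                                  # sample size
sample.means <- replicate(5000, mean(rexp(n, rate = 1)))
hist(sample.means, breaks = 40,
     main = "Sampling distribution of the mean (n = 30)")
mean(sample.means)   # close to the population mean, 1
var(sample.means)    # close to sigma^2 / n = 1 / 30
```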
**Confidence Interval**
- choose a confidence level (90%, 95%, 98%)
- define the alpha value: $\alpha$ = 1 - confidence level (e.g., 1 - 0.95 gives $\alpha$ = 0.05)
- find the critical z value for $\alpha$ / 2
IF n > 30, use the z-table <br>
- if the population is normal and the std is known, use the z-table
- if the population is normal and the std is not known, use the t-table
IF n < 30, use **Degrees of Freedom** and the t-table
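In R, the table lookups can be replaced by quantile functions; a small sketch (the levels and df are just example values):
```{r}
# Critical values without a printed table
alpha <- 0.05
qnorm(1 - alpha / 2)        # z value for a 95% interval, ~1.96
qt(1 - 0.02 / 2, df = 24)   # t value for a 98% interval with df = 24, ~2.492
```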
<hr>
Ex. weights of Canadian adults in kg, normal distribution. Sample size is 25, sample mean is 80.4, sample variance is 20.25 (so s = 4.5). Find the 98% confidence interval for the population mean.
sample size 25 < 30, use DoF: df = n - 1 = 24
alpha/2 = .02/2 = .010
t-table value = 2.492
margin of error = t $\times$ s / $\sqrt{n}$ = 2.492 $\times$ 4.5 / 5 $\approx$ 2.243
(80.4 - 2.243) < $\mu$ < (80.4 + 2.243), i.e. 78.16 < $\mu$ < 82.64
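The same interval computed in R, using the numbers from the example above:
```{r}
# 98% t confidence interval for the mean weight
n <- 25; x.bar <- 80.4; s <- sqrt(20.25)   # s = 4.5
t.crit <- qt(1 - 0.02 / 2, df = n - 1)     # 2.492
margin <- t.crit * s / sqrt(n)             # ~2.243
c(x.bar - margin, x.bar + margin)          # ~ (78.16, 82.64)
```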
<hr>
<br>
A sample of 4 bags of chips labeled 220 g, with sample mean 217.25 and sample variance 6.25 (so s = 2.5). Normal distribution. Alpha level is 0.05.
1. Null: $\mu$ = 220. Alternative: $\mu$ < 220.
2. Significance level: $\alpha$ = 0.05
3. sample size: 4 < 30, use the t-table
4. df = 4 - 1 = 3
5. t-table value (df and alpha): 2.353
- interested in 1 tail, left side (fewer chips), so the critical value is t = -2.353
6. reject the null if the test statistic falls in the rejection region
7. t = ($\bar{x}$ - $\mu$) / (s / $\sqrt{n}$) = (217.25 - 220) / (2.5 / $\sqrt{4}$) = -2.2
8. -2.2 > -2.353, so t does not fall in the rejection region: fail to reject the null
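The same test in R; only summary statistics are given (no raw data), so the statistic is computed by hand rather than with `t.test()`:
```{r}
# One-sample, one-tailed t test for the chips example
n <- 4; x.bar <- 217.25; mu0 <- 220; s <- sqrt(6.25)   # s = 2.5
t.stat <- (x.bar - mu0) / (s / sqrt(n))                # -2.2
t.crit <- qt(0.05, df = n - 1)                         # -2.353
t.stat < t.crit          # FALSE: not in the rejection region
pt(t.stat, df = n - 1)   # one-tailed p-value, ~0.058 > 0.05
```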
# Descriptive Stats
Descriptive statistics summarize the data at hand:
- central tendency (mean, median, mode)
- variability (range, std)
# Inferential Stats
There are 2 main uses
- significance testing: determining whether a difference or relationship between groups is statistically significant (i.e., the occurrence is not just pure chance)
- estimating population parameters from sample stats
![decision](A4 - 1.png)
Between Groups: select a sample and assign each person to only 1 level of the independent variable
Within Groups: each person receives all levels of the independent variable (the same group gets the different levels), or the groups are matched on a variable related to the dependent variable
# Hypothesis Testing
The p-value is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. A common choice is $\alpha$ = 0.05 (95% confidence level, 5% chance of making a type 1 error)
IF p-value <= $\alpha$ THEN reject the null, the results are significant <br>
IF p-value > $\alpha$ THEN fail to reject the null
## Errors
- type 1 is rejecting the null when it is actually true (false positive)
- type 2 is failing to reject the null when it is actually false (false negative)
## Effect size
Effect size indicates how strong the relationship between variables is, or how large the difference between groups is
Interpretation of strength of a relationship:
- for r and Phi
- very strong >= .70
- large = .50
- medium = .36
- small = .10
- for eta
- very strong >= .45
- large = .37
- medium = .24
- small = .10
## Correlation tests of coefficients
Pearson r and bivariate regression (parametric tests)
Correlation coefficients range from -1 to +1; the closer to -1 or +1, the stronger the relationship
-1 **strong** | -0.7 | -0.5 | -0.3 | -0.1 | weak near 0 | 0.1 | 0.3 | 0.5 | 0.7 | **strong** +1
# Chi square test
This non-parametric test can be used for difference questions with nominal or dichotomous variables (e.g., Class: 1 or 2; Condition: on/off)
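A minimal sketch with made-up counts for a 2$\times$2 Class-by-Condition table:
```{r}
# Chi-square test of independence on a hypothetical 2x2 table
counts <- matrix(c(30, 10,
                   20, 25), nrow = 2, byrow = TRUE,
                 dimnames = list(Class = c("1", "2"),
                                 Condition = c("on", "off")))
chisq.test(counts)
```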
# Phi and Cramer's V
These are tests of association between 2 nominal or dichotomous variables.
- Phi is used when both variables are dichotomous
- Cramer's V is used when at least 1 variable has more than 2 levels/groupings
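Base R has no built-in Phi or Cramer's V, so here is a sketch computing Cramer's V by hand from the chi-square statistic, reusing the hypothetical `counts` table above (for a 2$\times$2 table this value equals Phi):
```{r}
# Cramer's V = sqrt( X^2 / (n * (k - 1)) ), k = min(rows, cols)
x2 <- chisq.test(counts, correct = FALSE)$statistic  # no continuity correction
n <- sum(counts)        # total number of observations
k <- min(dim(counts))   # smaller of row/column count
sqrt(x2 / (n * (k - 1)))
```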
# Pearson r
This test measures the strength of association between 2 scale variables, e.g., GPA vs. hours of study
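A sketch with made-up GPA and study-hours data:
```{r}
# Pearson correlation between two scale variables (hypothetical data)
hours <- c(5, 8, 10, 12, 15, 18, 20, 22)
gpa   <- c(2.1, 2.4, 2.9, 3.0, 3.2, 3.5, 3.6, 3.9)
cor.test(hours, gpa, method = "pearson")   # r, p-value, and CI for r
```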
# Wilcoxon and McNemar to compare groups
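Minimal sketches with made-up paired data, assuming the related-samples forms of both tests (Wilcoxon signed-rank for ordinal/scale scores, McNemar for dichotomous outcomes):
```{r}
# Wilcoxon signed-rank test on hypothetical paired scores
before <- c(10, 12, 9, 14, 11, 13)
after  <- c(12, 15, 10, 18, 16, 19)
wilcox.test(before, after, paired = TRUE)

# McNemar test on hypothetical paired yes/no outcomes
outcomes <- matrix(c(20, 5,
                     12, 18), nrow = 2,
                   dimnames = list(Before = c("yes", "no"),
                                   After  = c("yes", "no")))
mcnemar.test(outcomes)
```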
# ANOVA 1-way, F-score
Analysis of variance is used to compare 3+ unrelated groups (levels of an independent variable) on a dependent variable; it determines whether the variance between groups is significant relative to the variance within groups. The test statistic is the **F-value**
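A sketch with 3 made-up groups:
```{r}
# One-way ANOVA on simulated data: 3 groups, one scale outcome
set.seed(1)
scores <- data.frame(
  group = rep(c("A", "B", "C"), each = 10),
  value = c(rnorm(10, 50, 5), rnorm(10, 55, 5), rnorm(10, 60, 5))
)
fit <- aov(value ~ group, data = scores)
summary(fit)   # F value and its p-value
```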
# Post-hoc tests
The ANOVA just tells you that a significant difference exists somewhere; post-hoc tests identify which groups differ
- do a Tukey's HSD test if the homogeneity-of-variance test (e.g., Levene's) is not statistically significant (equal variances assumed)
- do a Games-Howell test if it is statistically significant (unequal variances)
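Tukey's HSD is built into base R and runs on a fitted `aov` object (here the `fit` from the ANOVA sketch above); Games-Howell needs an add-on package such as `rstatix`:
```{r}
# Pairwise group comparisons after a significant one-way ANOVA
TukeyHSD(fit)   # uses the `fit` object from the sketch above
```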
# ANOVA 2-way
The independent variables each have 2+ levels; the samples must be independent, each participant must receive only 1 level of each independent variable, and the group variances must be equal.
There are 3 null hypotheses: no difference between the levels of the 1st independent variable (no main effect), no difference between the levels of the 2nd independent variable, and no interaction between the independent variables
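A sketch with 2 made-up independent variables (2 levels each) in a between-subjects design; all names and values are illustrative:
```{r}
# Two-way ANOVA on simulated data: two factors plus their interaction
set.seed(2)
d <- expand.grid(dose = c("low", "high"), drug = c("X", "Y"))
d <- d[rep(1:4, each = 8), ]                 # 8 participants per cell
d$response <- rnorm(nrow(d), mean = 10, sd = 2)
summary(aov(response ~ dose * drug, data = d))   # 2 main effects + interaction
```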