---
title: "**All Things Stats**"
author: " *Intro Stats, 4th Ed. (DeVaux, Velleman & Bock)* "
output:
  html_notebook:
    toc: yes
    theme: spacelab
  pdf_document:
    toc: yes
---
<br>
<h4> This R notebook is made by Zane Dax </h4>
<p> **theme**: spacelab, **font**: monaco </p>
<br>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(plotly)
library(dplyr)
```
<style type="text/css">
a {color: #3d0066; font-size: 11px;}
p {color: #000000; font-family: "monaco"; font-size: 13px;}
h1 {color: #3d0066 }
h2 {color: #6b00b3 }
h3 {color: #9900ff }
h4 {color: #cc0099 }
ul {list-style-type: circle; color: #6b00b3 }
ol {color: #cc0099; font-size: 11px; }
p1 {color: #004d4d;}
li {color: #004d4d;}
</style>
![](RStudio-Logo 1.png)
# Chapter 1 - Exploring & Understanding Data
## Categorical variables
These are also called Nominal variables because they name categories. Numerical values can also be categorical when they act as labels rather than measurements, such as area codes.
## Quantitative variables
Measured numerical values, in units.
## Ordinal variables
These are categories with a natural order, such as the rating scale of a survey (a Likert scale), e.g. 1 = 'ok', 2 = 'good', ...
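A minimal R sketch of how these variable types can be represented (the variable names here are made up for illustration):
```{r}
# categorical (nominal): labels with no order
area_code = factor(c("403", "587", "825", "403"))

# ordinal: categories with a natural order
rating = factor(c("ok", "good", "good", "ok"),
                levels = c("ok", "good"), ordered = TRUE)

# quantitative: measured numerical values, in units
height_cm = c(172, 165, 180, 158)

str(area_code)
str(rating)
str(height_cm)
```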
# Chapter 2 - Describe Categorical data
The three rules of data analysis: make a picture, make a picture, make a picture. A picture helps you tell, show, and share what the data say.
Use a frequency table (counts) or a relative frequency table (proportions of the data). Bar charts display the distribution of a categorical variable and make the counts easy to compare.
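A small sketch of these summaries in base R, using a made-up vector of categories:
```{r}
# made-up categorical data
pet = c("dog", "cat", "dog", "fish", "dog", "cat")

# frequency table (counts) and relative frequency table (proportions)
counts = table(pet)
counts
prop.table(counts)

# bar chart of the counts
barplot(counts, main = "Distribution of pet", ylab = "count")
```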
## Relationship between 2 categorical variables
```{r import dataset}
#library(readr)
#library(dplyr)
#df = read_csv('heart.csv')
#df2 = read_csv('titanic_dataset.csv')
#head(df,10)
#head(df2,10)
```
## Contingency Table
Medical researchers followed 6272 Swedish men for 30 years to see whether there was any association between the amount of fish in their diet and prostate cancer. The study results are in the table.
```{r}
fish.consumption = c("fish.seldom", "fish.small_amnt", "fish.moderate", "fish.large")
no.cancer  = c(110, 2420, 2769, 507)
yes.cancer = c(14, 201, 209, 42)
row_totals = no.cancer + yes.cancer
no_sums  = sum(no.cancer)
yes_sums = sum(yes.cancer)

fish_df = data.frame(fish.consumption, no.cancer, yes.cancer, row_totals)
print(fish_df)
```
```{r}
# cancer counts by fish-consumption level
# (each level is a single row, so the group "mean" is just that level's count)
avg_fish.and.cancer = fish_df %>%
  group_by(fish.consumption) %>%
  summarise(avg_fish.And.cancer = mean(yes.cancer))
avg_fish.and.cancer

# no-cancer counts by fish-consumption level
avg_fish.and.NOcancer = fish_df %>%
  group_by(fish.consumption) %>%
  summarise(avg_fish.And.NOcancer = mean(no.cancer))
avg_fish.and.NOcancer
```
```{r}
avg_fishDiet_noCancer = mean( fish_df$no.cancer)
avg_fishDiet_Cancer = mean( fish_df$yes.cancer)
paste("average no cancer and fish diet:", avg_fishDiet_noCancer)
paste("average cancer and fish diet:", avg_fishDiet_Cancer)
```
### The research question
* The problem we want to address: *is there an association between fish consumption and prostate cancer?*
* The study: n = 6272 Swedish men, followed for 30 years
* The data are categorical counts
```{r}
population_total= c(no_sums+yes_sums)
no.cancer.proportion = no_sums / population_total
yes.cancer.proportion = yes_sums/population_total
paste('Participants with no cancer: ', round(no.cancer.proportion*100),'%' )
paste('Participants with cancer: ', round(yes.cancer.proportion*100),'%' )
```
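A sketch of how the same question can be approached with row proportions (conditional distributions) on the contingency table; the matrix built here just re-uses the vectors defined above:
```{r}
# contingency table as a matrix, then row proportions:
# within each fish-consumption level, what fraction developed cancer?
fish_table = cbind(no.cancer, yes.cancer)
rownames(fish_table) = fish.consumption
round(prop.table(fish_table, margin = 1), 3)
```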
# Chapter 3 - Display Quantitative data
* Histograms
* stem-and-leaf histogram
* dot plots
## Shape
It is important to check that the values are quantitative. Categorical data does not work with histograms.
Three things to consider:
1. *does the histogram have a single central hump, several humps, or none?* Unimodal, bimodal/multimodal, or uniform.
2. *is the histogram symmetric?* A longer tail on one side indicates skew in that direction.
3. *do any unusual features stick out?* Look for outliers.
## Median - middle value
* if *n* is **odd**, the median is the middle value, at position ``(n + 1) / 2``
* if *n* is **even**, the median is the average of the two middle values, at positions ``n/2`` and ``n/2 + 1``
## Spread
How much the data vary around the center.
## Range
The difference between the max and min values; a single extreme value can inflate the range.
* range = max - min
## Interquartile Range
The difference between the quartiles shows how much of the data's range the middle half covers (see the R sketch below).
* Quartiles: Q1 is the 25th percentile, Q3 is the 75th percentile
* IQR = Q3 - Q1
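A minimal sketch of these center and spread summaries in base R, on a made-up vector:
```{r}
x = c(3, 7, 8, 5, 12, 14, 21, 13, 18)

median(x)                 # middle value
range(x)                  # min and max
diff(range(x))            # range = max - min
quantile(x, c(.25, .75))  # Q1 and Q3
IQR(x)                    # Q3 - Q1
```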
## Boxplots & summaries
The central box shows the middle half of the data, between the quartiles. If the median is not centered in the box, the distribution is skewed; whiskers of unequal length also suggest skewness. Outliers are plotted individually so they get attention.
## The mean or average
Add up all the values for the variable and divide that sum by the number of data values.
* $\bar{x} = \dfrac{\sum x}{n}$
## Standard Deviation
The standard deviation accounts for how far each value is from the mean. Each difference from the mean is called a *deviation*. The deviations are squared (so they are all positive) and then averaged using *n* - 1; that average of squared deviations is the **variance**. The standard deviation is its square root. In R, ``var()`` gives $s^2$ and ``sd()`` (equivalently ``sqrt(var())``) gives $s$.
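A quick sketch checking the formula against R's built-in functions:
```{r}
x = c(3, 7, 8, 5, 12, 14, 21, 13, 18)

deviations = x - mean(x)          # how far each value is from the mean
manual_var = sum(deviations^2) / (length(x) - 1)
manual_var
var(x)                            # same value
sqrt(manual_var)
sd(x)                             # same value
```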
# Chapter 5 - Standard Deviation (std) & Normal Model
To express the distance from the mean in standard deviations *standardizes* the value. To **standardize** a value, subtract the mean and then divide this difference by the std. These standardized values are called ***z-scores***.
A z-score measures the distance of a value from the mean in standard deviations. A z-score of 2 means the data value is 2 standard deviations above the mean (z-scores have no units). Negative z-scores indicate values below the mean. The farther a value is from the mean, in either direction, the more unusual it is.
A z-score comparison: z_2 = -2.26 is more extreme than z_1 = -1.81, because -2.26 is farther from the mean.
======================================
Example: Standardizing Ski times
2010 Olympic Games, Men's skiing event, 2 races: downhill and slalom.
Skier with lowest total times wins.
- mean slalom time: 52.67 seconds
- standard deviation : 1.614 seconds
- mean downhill time: 116.26 seconds
- standard deviation : 1.914 seconds
Bode Miller won gold with combined time 164.92 seconds
* slalom total time 51.01 seconds
* downhill total time 113.91 seconds
Which race was better compared to the competition?
* z_slalom = ( (slalom total time) - (mean slalom time) ) / std = -1.03
* z_downhill = ( (downhill total time) - (mean downhill time) ) / std = -1.23
Faster times are the goal, so the downhill z-score of -1.23 (farther below the mean) is better than the slalom z-score of -1.03; the downhill was his stronger race relative to the competition.
========================================
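A sketch of the same comparison in R, using the times and summary statistics quoted above:
```{r}
# standardize each race time: z = (value - mean) / sd
z_slalom   = (51.01 - 52.67) / 1.614
z_downhill = (113.91 - 116.26) / 1.914
round(c(slalom = z_slalom, downhill = z_downhill), 2)
# the more negative z-score (downhill) is the better race relative to the field
```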
## Shifting & Scaling
Finding a z-score involves two steps: the data are *shifted* by subtracting the mean, then *rescaled* by dividing by the standard deviation.
<p1> CDC population study, N = 11,000 people. A subgroup of men, n = 80, aged 19-24, average height between 5'8" and 5'11", average weight 82.36 kg. The NIH maximum healthy weight is 74 kg. To compare the weights to this maximum, subtract 74 kg from each weight (a shift). </p1>
<li> average_weight.kg - 74 = 8.36 kg over the healthy maximum </li>
<br>
**Rescaling** means changing the measurement units, e.g. kg to lbs: multiply each value in kg by 2.2 to get pounds, so the average weight becomes 82.36 × 2.2 ≈ 181.19 lbs.
Standardizing the z-scores changes the center by making the mean 0, it also changes the spread making the std = 1, but does not change the shape of the distribution.
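A sketch showing that standardizing moves the center to 0 and the spread to 1 without changing the shape; the weights here are simulated, not the CDC data:
```{r}
set.seed(1)
weight_kg = rnorm(80, mean = 82.36, sd = 10)         # simulated weights

z = (weight_kg - mean(weight_kg)) / sd(weight_kg)    # standardize
round(c(mean = mean(z), sd = sd(z)), 6)              # mean 0, sd 1

weight_lbs = weight_kg * 2.2                         # rescaling only changes units
mean(weight_lbs)
```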
## Normal models (bell-shaped curve)
It is common for at least half of the data to have z-scores between -1 and +1, while a z-score of ±3 or beyond **is rare**. A Normal model uses the **mean** and the **std** as its parameters.
A *standard normal distribution* is one that has a mean of 0 and std of 1.
A normal model shows how extreme a value is by how far from the mean it is.
* 68% of the values fall within 1 std of the mean
* 95% of the values fall within 2 std of the mean
* 99.7% of the values fall within 3 std of the mean
![68-95-99.7 rule](68-95-99_rule.png)
![Normal Distribution](normal-distribution.png)
## Finding Normal Percentiles
SAT test scores have an overall **mean** of 500 and **std** of 100. Suppose your score is 600. What is your percentile?
* z-score = (600 - 500) / 100 = 1.0, so your score is 1 std above the mean ($\mu + 1\sigma$)
* if you scored 680 (same mean and std), z-score = (680 - 500) / 100 = 1.8
* in a **z-score table**, find 1.8 on the left side and .00 along the top to get the percentile value 0.9641
This means that **96.4% of the z-scores are less than 1.80**: only 3.6% of people scored better than 680 on the test.
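Instead of a printed z-table, R's Normal functions give the same percentiles; a quick sketch:
```{r}
pnorm(1.8)                          # P(Z < 1.8) = 0.9641
pnorm(680, mean = 500, sd = 100)    # same, without standardizing by hand
1 - pnorm(1.8)                      # fraction scoring better than 680
qnorm(0.9641)                       # goes the other way: percentile -> z-score
```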
Caution:
- don't use the Normal model on data that are not unimodal and symmetric
- don't use the mean and std when outliers are present
- don't round results in the middle of a calculation; keep at least 4 decimal places for accuracy
# Chapter 6 - Scatterplot, association & correlation
Look for patterns in the scatterplot
- from top left to bottom right = negative
- from bottom left to top right = positive
- look at the form of dots, are they linear?
- the strength of the plot relationship: cohesive stream or a clustered cloud?
- look for outliers
Two important variables are the **response variable** and the **predictor variable**. The predictor (explanatory) variable goes on the x-axis.
## Correlation
The numeric value that measures how strong the linear association is between two variables. Subtracting the mean from each variable just moves the means to zero and makes it easier to see the strength of the association.
Standardize each variable and work with z-scores: sum the products of the paired z-scores, then divide that sum by *n* - 1. The result is the **correlation coefficient**, $r = \dfrac{\sum z_x z_y}{n-1}$. The correlation coefficient lies between -1 and +1 and has no units, because z-scores have no units.
Correlation measures the strength of the *linear* association between 2 quantitative variables. Before relying on it, check the conditions: both variables are quantitative (correlation does not apply to categorical variables), the scatterplot looks reasonably straight, and there are no extreme outliers.
Remember: **correlation != causation**. Be mindful of lurking variables that can explain a correlation.
In R, use ``cor(x, y)`` to find correlations
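A sketch comparing the z-score formula with R's built-in ``cor()``, on made-up data:
```{r}
x = c(2, 4, 6, 8, 10)
y = c(1.9, 4.4, 5.9, 8.4, 9.4)

zx = (x - mean(x)) / sd(x)
zy = (y - mean(y)) / sd(y)
sum(zx * zy) / (length(x) - 1)   # correlation from z-scores
cor(x, y)                        # same value
```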
# Chapter 7 - Linear Regression
## Line of "best fit"
A linear model that gives an equation of a straight line through the data.
The predicted value is $\hat{y}$ (y-hat); the difference between the observed y and $\hat{y}$ is called the **residual**, which indicates how far off the model's prediction is at *that data point*. A variable with a hat usually indicates a predicted value.
- Residual = observed value - predicted value
- Ex. an item has 31g of protein; the regression model predicts 36.6g of fat, but the actual fat is 22g. ``Residual = 22 - 36.6 = -14.6g of fat``. The actual value is less than the model's prediction: items with negative residuals have less fat than predicted, items with positive residuals have more.
## Linear Model
If the model is good, the data values will be scattered closely around the line of best fit.
- $\hat{y} = b_0 + b_1 x$, where $b_0$ is the intercept (the predicted value when x = 0) and $b_1$ is the slope (the change in $\hat{y}$ per unit change in x).
- Ex. line of best fit for the fat/protein data (8.4g of fat predicted when protein = 0):
- $\widehat{fat} = 8.4 + 0.91 \times protein$ (both in grams)
- *This means each additional gram of protein is associated with, on average, 0.91g more fat.*
## Least Squares Line
To find the slope and intercept of the least squares line you need the **correlation, the stds, and the means**: slope $b_1 = r \dfrac{s_y}{s_x}$ and intercept $b_0 = \bar{y} - b_1 \bar{x}$. Correlations have no units but slopes do; changing the units of x or y changes the stds and therefore the slope. The slope is always in units of **y per unit of x**.
To predict the y-value for a given x-value, plug x into the line:
- ex. an item has 31g of protein; predict the fat (g)
- $\widehat{fat} = 8.4 + 0.91(31) = 36.6$ g of fat
## Examine the Residuals
Residual = observed value - predicted value. The standard deviation of the residuals measures how much the points spread around the regression line.
- no need to subtract the mean: the mean of the residuals is 0. Divide by *n* - 2
- residual std: $s_e = \sqrt{\dfrac{\sum (residual)^2}{n-2}}$
- ex. if the residual std is 10.6g of fat, then a residual of -14.6g is within about 2 standard deviations of 0 (2 × 10.6g = 21.2g), so it is not unusual; a histogram of the residuals shows their spread
## R squared variation
The variation in the residuals is the key to assessing how well the model fits the data. $R^2$ gives the fraction of the data's variation accounted for by the model; $1 - R^2$ is the fraction left in the residuals.
- ex. r = 0.76, so $r^2 = 0.76^2 \approx 0.58$ (58%); 1 - 0.58 = 0.42, so 42% of the variability in total fat is left in the residuals
- *this means that 58% of the variability in the fat content is accounted for by variation in the protein content*
> Correlations: A = 0.80 gives $r^2 = 0.80^2$ = 64%; B = 0.40 gives $r^2 = 0.40^2$ = 16%. *A accounts for 4 times as much of the variation as B, even though its correlation is only twice as large.*
In some sciences $r^2$ values of 80% to 90% or more are typical; observational studies usually have much lower values.
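A sketch of fitting a least squares line in R with ``lm()``, on made-up protein/fat data (not the textbook's dataset), and recovering the slope from $r \cdot s_y / s_x$:
```{r}
set.seed(2)
protein = runif(30, 5, 40)                       # made-up data
fat = 8 + 0.9 * protein + rnorm(30, sd = 10)

fit = lm(fat ~ protein)
coef(fit)                                        # intercept and slope
cor(protein, fat) * sd(fat) / sd(protein)        # slope from r * sy/sx
summary(fit)$r.squared                           # fraction of variation explained
head(residuals(fit))                             # observed - predicted
predict(fit, newdata = data.frame(protein = 31)) # predicted fat at 31g protein
```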
# Chapter 10 - Sample Surveys
Out of a population we draw a sample for a survey; samples often suffer from bias, over- or under-representing some groups in the population.
- **Randomizing** protects against the influence of all features of our population by making the sample look, on average, like the rest of the population.
- The sample size is the number of individuals in the sample
Mistakes to avoid:
- voluntary response samples (volunteer bias)
- convenience samples
- a bad sampling frame; the frame should cover the whole population
- non-response bias: missing responses make the data biased
- response bias: anything in the survey that influences how people respond
# Chapter 11- Experiments
Experiments study the relationship between two or more variables: the experimenter must identify at least one explanatory variable (a **factor**) to manipulate and at least one response variable to measure. The experimenter sets the levels of the factors and assigns subjects to treatments at random. Factors have levels; a treatment is a combination of factor levels.
## The 4 Experimental Design Principles
1. Control - control the sources of variation other than the factors being tested by making conditions similar for all groups.
2. Randomize - randomization equalizes the effects of unknown or uncontrollable sources of variation across the treatment groups
3. Replicate - apply each treatment to more than one subject, so that results can be generalized rather than resting on a single case
4. Block - group similar individuals together and then randomize within each block; this removes variation among the blocks from the comparisons. Blocking is not required in an experimental design.
Experiment design steps:
1. state what you want to know
2. specify the response variable
3. specify the factors, levels, & treatments
4. specify the experimental units
5. run the experiment: control, randomize, replicate
6. visualize the data
**Blinding** of study participants and researchers guards against the **placebo** effect and experimenter bias.
The best studies are randomized, comparative, double-blind, and placebo-controlled.
Pairing participants who are similar in ways not under study is **matching**.
**Confounding** is when the levels of one factor are associated with the levels of another factor, so their effects cannot be separated.
A lurking variable creates an association between two other variables; it is associated with both x and y and can make it look as if x causes y. Both confounding and lurking variables are outside our control and make it hard to understand the relationship in the model.
# Chapter 15 - Sampling Distribution Models
A Normal model has 2 parameters, mean and standard deviation.
Sampling distribution model shows how a statistic from a sample varies from sample to sample, which allows for quantification of that variation.
Formula: the sampling distribution of a proportion is approximately ``N(p, sqrt(p*q/n))``, where q = 1 - p
Do not sample more than **10%** of the population, so that individuals stay (approximately) independent of each other
<hr>
The CDC reports that 22% of 18-year-old women in the US have a BMI of at least 25. In a random sample of 200, 31 women had BMI values of 25 or more.
- sample proportion = 31/200 = 0.155
- expected mean = p = 0.22
- q = 1 - 0.22 = 0.78
- std = sqrt( (0.22)(0.78) / 200 ) = 0.029
- z = (0.155 - 0.22) / 0.029 = -2.24
<hr>
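The same calculation as a quick R sketch, including the Normal-model probability of seeing a sample proportion this low:
```{r}
p = 0.22;  q = 1 - p;  n = 200
p.hat = 31 / 200

sd.phat = sqrt(p * q / n)       # std of the sampling distribution
z = (p.hat - p) / sd.phat
c(sd = sd.phat, z = z)
pnorm(z)                        # chance of a sample proportion this low or lower
```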
## Central Limit Theorem
The sampling distribution of any mean becomes more normal as the sample size grows.
<hr>
The CDC reports the mean weight for men in the US is 190 lbs, with a std of 59 lbs. How unusual would a sample mean of 250 lbs be for a sample of 10 men?
- sample size n = 10
- mean = 190
- std = 59
- std of the sample mean = 59 / sqrt(10) = 18.66
- z = (250 - 190) / 18.66 = 3.21
- *a sample average of 250 lbs is more than 3 standard deviations above the mean, so it would be very unusual*
<hr>
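A brief sketch of the same z-score for a sample mean, using the quoted CDC figures:
```{r}
mu = 190;  sigma = 59;  n = 10
se.mean = sigma / sqrt(n)        # std of the sampling distribution of the mean
z = (250 - mu) / se.mean
c(se = se.mean, z = z)
1 - pnorm(z)                     # chance of a sample mean of 250 lbs or more
```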
# Chapter 16 - Confidence Interval for Proportions
From a sample of n = 156, 48 individuals have the characteristic of interest, so the sample proportion is p^ = 48/156 = 30.8%. The std of the sampling distribution of a proportion is sqrt(p × q / n).
When that standard deviation is estimated from the sample (using p^), the estimate is called the **standard error**.
```{r}
p.hat = 0.308
q.hat = 1 - p.hat
n.sample = 156

# standard error of the sample proportion, in percentage points
standard_error = sqrt((p.hat * q.hat) / n.sample) * 100
standard_error

# 95% interval: estimate +/- 2 standard errors
ci.low  = (p.hat * 100) - 2 * standard_error
ci.high = (p.hat * 100) + 2 * standard_error
ci.low
ci.high
```
If the sampling model is Normal, about 68% of all samples of 156 will have p^ within 1 standard error (0.037) of the true p, and 95% of samples will be within ±2 standard errors. The true value of p remains unknown; the interval only gives a range of plausible values. This is known as a **one-proportion z-interval**.
p^ -2 SE <-------> p^ <------------> p^ +2 SE
Claims:
- "30.8% of *all* population aged 18-22 do X". **NO**, the Sample and population proportions are not the same
- "it is probably true that 30.8% of all population aged 18-22 do X". The true proportion isn't 30.8%
- what is known: within interval of 30.8% +/- 2 x 3.7% = 23.4% to 38.2%
- True: "We are 95% confident that between 23.4% and 38.2% of population between 18 and 22 do X"
## Margin of error: certainty vs precision
The **margin of error** describes the uncertainty in estimating the population value: it is the number of standard errors extended on each side of the estimate (about 2 SE for 95% confidence). *Estimate ± Margin of Error*. The higher the confidence level, the larger the margin of error: more certainty costs precision.
```{r}
# poll of 1,010 people asked about views on topic
n.sample = 1010
reported_margin.error = 0.03 # 3%

# with no estimate of p, use p = 0.5: it gives the largest (most conservative) margin of error
p = 0.5
q = 1 - p

# margin of error (about 2 SE for 95% confidence), in percentage points
standard_error = sqrt( (p*q) /n.sample)
standard_error
margin_error = (2 * standard_error)*100
margin_error
```
## Critical values
```{r}
# same poll of 1,010 people asked about views on topic
n.sample = 1010
reported_margin.error = 0.03 # 3%
p = 0.5
q = 1 - p

# out of this sample, 40% said X; build a 90% confidence interval
p.hat = 0.40 # 40%
q.hat = 1 - p.hat
standard_error = sqrt( (p.hat * q.hat) / n.sample)
standard_error

# critical value z* for 90% confidence from the z-table = 1.645
z.score = 1.645
margin_error = z.score * standard_error
margin_error
```
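Critical values can also be computed directly instead of read from a table; a quick sketch:
```{r}
qnorm(0.95)    # z* for 90% confidence (5% in each tail) ~ 1.645
qnorm(0.975)   # z* for 95% confidence ~ 1.96
qnorm(0.995)   # z* for 99% confidence ~ 2.576
```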
# Chapter 17 - Proportion Hypotheses
Hypotheses are models that we test against the data.
- The null hypothesis assumes there is no change/difference
- The alternative hypothesis claims there is a change/difference; there must be enough evidence against the null in order to reject it
```{r}
n = 550
p = 0.517      # hypothesized proportion
q = 1 - p
p

# null hypothesis:        true proportion = p
# alternative hypothesis: true proportion != p

# std of p^ under the null hypothesis
p.hat.std = sqrt( (p * q) / n)
p.hat.std

# observed sample proportion p^ out of n = 550
p.hat = 0.569

# find the z-score
z = (p.hat - p) / p.hat.std
z

# z-table value for z = 2.44
z.table.value = 0.9927
p_value = 1 - z.table.value
# 2-tailed test: multiply by 2
two_tail_test.p_value = p_value*2
two_tail_test.p_value

# decide: compare the p-value to the significance level alpha
alpha = 0.05
two_tail_test.p_value < alpha   # TRUE -> reject the null
```
The sample proportion lies 2.44 standard deviations above the hypothesized proportion. The p-value says that if the true proportion were 51.7% (p = 0.517), an observed proportion as extreme as 56.9% (p.hat) would occur by chance only about 15 times in 1000 (two-tailed p-value ≈ 0.015). Because this p-value is smaller than the significance level of 0.05, we reject the null hypothesis.
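Base R's ``prop.test()`` runs a comparable one-proportion test (it reports a chi-square statistic, which is the square of the z statistic when the continuity correction is turned off). The count of 313 below is just the reported 56.9% of 550, rounded:
```{r}
prop.test(x = 313, n = 550, p = 0.517,
          alternative = "two.sided", correct = FALSE)
```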
# Chapter 18 - t-tests, inferences about means
When the sampling distribution model is bell-shaped: with a large sample the model is nearly Normal, but with a small sample the tails of the model are fatter than the Normal's.
**Degrees of freedom** are the number of independent quantities left over after the parameters have been estimated.
Compare the t-value with the critical value from a t-table:
- if the |t| value is < critical value, don't reject the null
- if the |t| value is > critical value, reject the null
```{r}
# two-sample (independent groups) t-test, n = 16 per group
t.value = 2.3
alpha = 0.05
n = 16

# degrees of freedom for two independent groups: n1 + n2 - 2
df = n*2 - 2
df

# critical value from the t-table for alpha = 0.05 (two-tailed) and df = 30
critical.value = 2.04
t.value > critical.value   # TRUE -> reject the null
```
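A sketch of the same kind of test with R's ``t.test()`` on simulated data for two independent groups of 16 (the data are made up):
```{r}
set.seed(3)
group1 = rnorm(16, mean = 10, sd = 2)   # simulated measurements
group2 = rnorm(16, mean = 12, sd = 2)

# two-sample t-test; var.equal = TRUE gives df = n1 + n2 - 2 = 30
t.test(group1, group2, var.equal = TRUE)
```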