- Add this Chrome Extension so you can see math equations rendered in GitHub
- Why Statistics?
- Measures of Centrality
- Sampling
- Continuous vs Discrete Data
- Data Science is an extension of statistics, just as it is an abstraction of all the Sciences
- All of our models and practices in data science are either rooted in or validated by probability and statistics
- Experimental design and reporting are performed through the lens of statistics
- Linear Regression, Logistic Regression
- Sampling/Resampling Methods (Bootstrap)
- Confidence intervals
- Expectation, deviance, etc.
- Measures of Central Tendency
- Mean
- Median
- Mode
- Measures of variance or “spread”
- Variance
- Standard Deviation
- Mean: the sum of the numeric elements divided by the number of elements, expressed as:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Symbol | Meaning |
---|---|
$\mu$ | lowercase Greek mu refers to the population mean |
$\bar{x}$ | "x-bar" refers to the sample mean |
$\bar{X}$ | capital "X-bar" refers to the sample mean where |
- Trimmed mean: can be used to combat large deviations and outliers
- “Trim” some percentage off the maximum and minimum ends of the sorted data list
- Advantage: can help combat outliers that might unduly influence the mean
- Disadvantage: we are removing portions of our data that might be very important
- Code the `mean()` function
- BONUS: Include a `trim` parameter that removes the greatest and least `n` values
- Test with this data:
a = [1, 5, 7, 10, 15, 23, 35, 67, 220, 2000]
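def mean(lst, trim_by=0):
    # Work on a sorted copy so the caller's list is left untouched
    lst_ = sorted(lst)
    if trim_by > 0:
        # Drop the trim_by smallest and trim_by largest values
        lst_ = lst_[trim_by:-trim_by]
    return sum(lst_) / len(lst_)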
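As a quick check of how much trimming changes the result on the test data, here is a minimal sketch. The helper name `trimmed_mean` and its parameter `n` are hypothetical, chosen for illustration; it leans on the standard library's `statistics.mean` rather than the hand-written function.

```python
from statistics import mean as stats_mean

a = [1, 5, 7, 10, 15, 23, 35, 67, 220, 2000]

def trimmed_mean(values, n=0):
    """Mean after dropping the n smallest and n largest values (hypothetical helper)."""
    ordered = sorted(values)
    # Slicing with an explicit end index also works when n == 0
    kept = ordered[n:len(ordered) - n]
    if not kept:
        raise ValueError("n trims away every value")
    return stats_mean(kept)

plain = trimmed_mean(a)          # all ten values, including 2000
trimmed = trimmed_mean(a, n=1)   # drops 1 and 2000
print(f'mean: {plain}, trimmed mean: {trimmed}')
```

Dropping a single value from each end moves the result from 238.3 to 47.75, which shows how strongly the one value of 2000 influences the untrimmed mean.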
- An article published in the journal Indoor Air considered two different air samples to test for endotoxin concentrations, the first in urban households and the second in farmhouses.
- Urban: 6.0, 5.0, 11.0, 33.0, 4.0, 5.0, 80.0, 18.0, 35.0, 17.0, 23.0
- Farmhouse: 4.0, 14.0, 11.0, 9.0, 9.0, 8.0, 4.0, 20.0, 5.0, 8.9, 21.0, 9.2, 3.0, 2.0, 0.3
- A. Determine the sample mean for each group.
- B. Determine the trimmed mean for each group by trimming the smallest and largest value from each group.
'''
A. Determine the sample mean for each group.
'''
urban = [6.0, 5.0, 11.0, 33.0, 4.0, 5.0, 80.0, 18.0, 35.0, 17.0, 23.0]
farmhouse = [4.0, 14.0, 11.0, 9.0, 9.0, 8.0, 4.0, 20.0, 5.0, 8.9, 21.0, 9.2, 3.0, 2.0, 0.3]
print(f'Sample Mean Urban: {round(mean(urban), 1)}')
print(f'Sample Mean Farmhouse: {round(mean(farmhouse), 1)}')
'''
B. Determine the trimmed mean for each group by trimming the smallest and largest value
from each group.
'''
print(f'Sample Trimmed Mean Urban: {round(mean(urban, trim_by=1), 1)}')
print(f'Sample Trimmed Mean Farmhouse: {round(mean(farmhouse, trim_by=1), 1)}')
- Median: the middle value of a sorted numeric data set
Symbol | Meaning |
---|---|
$\operatorname{median}(A)$ | where $A$ is the collection on which to take the median |
$\tilde{x}$ | "x-tilde" is also used to represent the median |
- Sort the list
- Find the middle value(s)
  - if the list has an odd number of elements, select the middle value
  - if the list has an even number of elements, take the mean of the two middle values
- example:
odd_list = [13, 18, 13, 14, 13, 16, 14, 21, 13]
even_list = [15, 14, 10, 8, 12, 8, 16, 13]
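To see both cases worked through on these two lists, the sketch below uses the standard library's `statistics.median` as a reference (an assumption made for illustration; the exercise that follows asks for a hand-written version).

```python
from statistics import median as stats_median

odd_list = [13, 18, 13, 14, 13, 16, 14, 21, 13]
even_list = [15, 14, 10, 8, 12, 8, 16, 13]

# Odd length (9 values): the median is the 5th value of the sorted list
print(sorted(odd_list))
odd_median = stats_median(odd_list)

# Even length (8 values): the median is the mean of the 4th and 5th sorted values
print(sorted(even_list))
even_median = stats_median(even_list)

print(f'odd: {odd_median}, even: {even_median}')
```

The odd-length list gives 14, and the even-length list gives 12.5, the mean of 12 and 13.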
- Code the `median()` function. Make sure to account for even- and odd-length lists.
def median(lst):
    lst_sorted = sorted(lst)
    mid = len(lst_sorted) // 2
    # Odd length: return the single middle value
    if len(lst_sorted) % 2 == 1:
        return lst_sorted[mid]
    # Even length: return the mean of the two middle values
    return mean([lst_sorted[mid - 1], lst_sorted[mid]])
An article published in the journal, Indoor Air, considered two different air samples to test for endotoxin concentrations, the first in urban households, and the second in farmhouses.
- Urban: 6.0, 5.0, 11.0, 33.0, 4.0, 5.0, 80.0, 18.0, 35.0, 17.0, 23.0
- Farmhouse: 4.0, 14.0, 11.0, 9.0, 9.0, 8.0, 4.0, 20.0, 5.0, 8.9, 21.0, 9.2, 3.0, 2.0
Calculate the Median of both groups
urban = [6.0, 5.0, 11.0, 33.0, 4.0, 5.0, 80.0, 18.0, 35.0, 17.0, 23.0]
farmhouse = [4.0, 14.0, 11.0, 9.0, 9.0, 8.0, 4.0, 20.0, 5.0, 8.9, 21.0, 9.2, 3.0, 2.0]
# print(sorted(urban))
# print(f'Median Urban: {round(median(urban), 1)}')
# print(sorted(farmhouse))
# print(f'Median Farmhouse: {round(median(farmhouse), 1)}')
An issue of a recent magazine reported the following home sale amounts for a sample of homes in Alameda, CA, all of which were sold in the previous month (1000s of $):
{ 590, 615, 575, 608, 350, 1285, 408, 540, 555, 679 }
a. Find the mean value of the homes sold in April
b. Find the median value of the homes sold in April
Do you think mean or median is a “better” measure of center for this data? why?
a. Mean value of the homes sold in April: 620.5
b. Median value of the homes sold in April: 582.5
c. Is the mean or the median a “better” measure of center for this data? The median is likely the more representative summary statistic here because it is resistant to outliers: the single 1285 sale pulls the mean above eight of the ten sale amounts.
- The mean is not resistant to outliers: it can be a misleading summary statistic when a small number of outliers greatly affect it.
- The median is resistant to outliers: a small number of outliers barely changes it, so it is usually the better measure of center when outliers are present.
- The choice between mean and median should be informed by the particular dataset being analyzed.
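The home sale data gives a concrete illustration of outlier resistance. The sketch below treats the 1285 sale as the outlier (an assumption made for this example) and compares both statistics with and without it, using the standard library's `statistics` module.

```python
from statistics import mean as stats_mean, median as stats_median

sales = [590, 615, 575, 608, 350, 1285, 408, 540, 555, 679]  # 1000s of $
# Assumption for illustration: treat the 1285 sale as the outlier
without_outlier = [s for s in sales if s != 1285]

mean_all, median_all = stats_mean(sales), stats_median(sales)
mean_wo, median_wo = stats_mean(without_outlier), stats_median(without_outlier)

# How far each statistic moves when the outlier is removed
mean_shift = mean_all - mean_wo
median_shift = median_all - median_wo

print(f'mean:   {mean_all} -> {round(mean_wo, 1)} (shift {round(mean_shift, 1)})')
print(f'median: {median_all} -> {median_wo} (shift {median_shift})')
```

Removing one value moves the mean by roughly 74 but the median by only 7.5, which is what "resistant to outliers" means in practice.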
- Unlike the median and mean, the mode can be applied to categorical data.
- Special Case: The median can be applied to categorical data which is ordinal in nature
- The mode is not very useful for continuous data, where exact values rarely repeat.
- Rather than trying to represent the “center” of a dataset or distribution, the mode seeks to find the element with the greatest frequency.
- Can consider first mode, second mode, and so on in describing a data set
Code the mode() function.
'''
Mode
'''
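def mode(lst):
    # Start with the first item as the current candidate
    most_occurring = lst[0]
    for item in lst[1:]:
        # Strictly greater, so the earliest item wins when counts are tied
        if lst.count(item) > lst.count(most_occurring):
            most_occurring = item
    return most_occurring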
mode_lst = ['kangaroo', 'muskrat', 'platypus', 'muskrat', 'squid', 'squirrel', 'muskrat']
# print(mode(mode_lst))
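The idea of a first mode, second mode, and so on can be sketched with `collections.Counter` from the standard library, which ranks items by frequency. The `animals` list here is a hypothetical sample, chosen so that no two counts are tied.

```python
from collections import Counter

# Hypothetical sample, chosen so that the counts have no ties
animals = ['muskrat', 'squid', 'muskrat', 'platypus', 'squid', 'muskrat']

# most_common() returns (item, count) pairs, highest count first
ranked = Counter(animals).most_common()
first_mode, first_count = ranked[0]
second_mode, second_count = ranked[1]

print(f'first mode: {first_mode} ({first_count})')
print(f'second mode: {second_mode} ({second_count})')
```

Counting every item once is also cheaper than calling `lst.count()` inside a loop, which rescans the whole list for each element.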
- A sample consists of items selected from a population
- It's important to compose a sample that is representative
- We are often attempting to approach a population parameter by way of measuring sample statistics
- Notice the semantics here
- the mean of a population is considered a parameter
- the mean of a sample is considered a statistic
- What can you do to better approach the population parameter?
from random import choice

def get_samps(sample_range, num_samples=5):
    # Draw num_samples values uniformly at random, with replacement
    samples = []
    for _ in range(num_samples):
        samples.append(choice(sample_range))
    return samples
num_samples = 5
sample_range = list(range(0, 99+1))
print(f'mu: {mean(sample_range)}')
# print(f'x_bar: {mean(get_samps(sample_range, num_samples=5))}')
means = []
for _ in range(100000):
    means.append(mean(get_samps(sample_range, num_samples)))
print(f'mean of means: {mean(means)}')
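One possible answer to the question of how to better approach the population parameter is to take larger samples. The sketch below is an illustration under stated assumptions: the helper `sample_mean_spread`, the fixed seed, and the trial count are all hypothetical choices. It measures how much the sample mean varies from sample to sample for two sample sizes.

```python
import random
from statistics import mean as stats_mean, pstdev

random.seed(0)  # fixed seed so the run is repeatable
population = list(range(100))  # same 0..99 range as above

def sample_mean_spread(sample_size, trials=2000):
    """Standard deviation of `trials` sample means (hypothetical helper)."""
    sample_means = [
        # random.choices samples with replacement, like repeated choice()
        stats_mean(random.choices(population, k=sample_size))
        for _ in range(trials)
    ]
    return pstdev(sample_means)

spread_small = sample_mean_spread(5)
spread_large = sample_mean_spread(50)
print(f'spread of x_bar with n=5:  {spread_small:.2f}')
print(f'spread of x_bar with n=50: {spread_large:.2f}')
```

With the larger sample size the sample means cluster more tightly around the population mean of 49.5, so any single sample statistic tends to land closer to the parameter.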
- Continuous data
- “Measurable to an infinite precision”
- “Not countable”
- Can take any value in a given range
- The set of possible values is always infinite
- Examples:
- Distance (or any magnitude)
- Weight
- Time
- Discrete data
- “Countable”
- Can only take specific values in a given range
- Examples:
- Number of students in a class
- Steps on the way to class
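The distinction can be illustrated by simulation. In this sketch (variable names and the fixed seed are assumptions for illustration), a discrete variable repeats the same few values, while draws from a continuous range almost never repeat. Note that Python floats only approximate truly continuous values.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# Discrete: only the integers 0..9 are possible
discrete = [random.randint(0, 9) for _ in range(1000)]
# Continuous (approximated by floats): any value between 0 and 9
continuous = [random.uniform(0, 9) for _ in range(1000)]

distinct_discrete = len(set(discrete))
distinct_continuous = len(set(continuous))

print(f'distinct discrete values:   {distinct_discrete}')
print(f'distinct continuous values: {distinct_continuous}')
```

Because repeated values are so rare in the continuous draws, counting the most frequent value says little about them, which is why the mode suits discrete and categorical data better.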