Statistics: Measures of Centrality

  • Why Statistics?
  • Measures of Centrality
  • Sampling
  • Continuous vs Discrete Data











Why are probability and statistics important?

  • Data Science is an extension of statistics, just as it is an abstraction of all the Sciences
  • All of our models and practices in data science are either rooted in or validated by probability and statistics
    • Experimental design and reporting are performed through the lens of statistics
    • Linear Regression, Logistic Regression
    • Sampling/Resampling Methods (Bootstrap)
    • Confidence intervals
    • Expectation, deviance, etc.











Statistics vs Machine Learning vs Artificial Intelligence

[Image: ds comic]











Basic Summary Statistics Overview

  • Measures of Central Tendency
    • Mean
    • Median
    • Mode
  • Measures of variance or “spread”
    • Variance
    • Standard Deviation











Mean

  • sum of the numeric elements, divided by the number of elements, expressed as:

$$ \frac{1}{n} \sum_{i=1}^n a_i $$

| Symbol | Meaning |
|---|---|
| $\mu$ | Lowercase Greek mu; refers to the population mean |
| $\overline x$ | "x-bar"; refers to the sample mean |
| $\overline X$ | Capital "X-bar"; refers to the sample mean where $\bold X$ is a random variable |
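
As a quick worked example of the formula, take the hypothetical values $a = (2, 4, 6, 8)$ (chosen for illustration, not taken from the lecture data), so $n = 4$:

$$ \frac{1}{4} \sum_{i=1}^4 a_i = \frac{2 + 4 + 6 + 8}{4} = 5 $$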











Trimmed Mean

  • Can be used to combat large deviations and outliers
  • “Trim” some percentage of the values off the maximum and minimum ends of the sorted data
  • Advantage - can help combat outliers that might influence our mean
  • Disadvantage - we are removing portions of our data which might be very important











BREAKOUT (4 minutes)

  • Code the mean() function

  • BONUS: Include a trim parameter that removes the greatest and least n values

  • Test with this data:

a = [1, 5, 7, 10, 15, 23, 35, 67, 220, 2000]











BREAKOUT SOLUTION

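def mean(lst, trim_by=0):
    # sorted() returns a new list, so the caller's list is not modified
    lst_ = sorted(lst)

    if trim_by > 0:
        # drop the trim_by smallest and the trim_by largest values
        lst_ = lst_[trim_by:-trim_by]

    return sum(lst_) / len(lst_)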
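
The solution above trims by a count of values, while the Trimmed Mean slide describes trimming by a percentage. A minimal sketch of the percentage variant might look like the following. The name `trimmed_mean` and the `trim_fraction` parameter are assumptions made for illustration, and rounding the trim count down is one possible convention among several.

```python
def trimmed_mean(values, trim_fraction=0.1):
    """Mean after dropping trim_fraction of the values from each end.

    Hypothetical helper for illustration; not part of the lecture code.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")

    ordered = sorted(values)
    # Number of values to drop from each end, rounded down.
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


a = [1, 5, 7, 10, 15, 23, 35, 67, 220, 2000]
plain = trimmed_mean(a, trim_fraction=0.0)    # no trimming: the ordinary mean
trimmed = trimmed_mean(a, trim_fraction=0.1)  # drops 1 and 2000
print(plain, trimmed)
```

On the test data, trimming 10% from each end removes only the values 1 and 2000, yet the result drops from 238.3 to 47.75, which shows how strongly a single extreme value can pull the mean.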











BREAKOUT (4 minutes)

  • An article published in the journal Indoor Air considered two different air samples to test for endotoxin concentrations, the first in urban households and the second in farmhouses.

    • Urban: 6.0, 5.0, 11.0, 33.0, 4.0, 5.0, 80.0, 18.0, 35.0, 17.0, 23.0
    • Farmhouse: 4.0, 14.0, 11.0, 9.0, 9.0, 8.0, 4.0, 20.0, 5.0, 8.9, 21.0, 9.2, 3.0, 2.0, 0.3
  • A. Determine the sample mean for each group.

  • B. Determine the trimmed mean for each group by trimming the smallest and largest value from each group.











BREAKOUT SOLUTION

'''
A. Determine the sample mean for each group.
'''
urban = [6.0, 5.0, 11.0, 33.0, 4.0, 5.0, 80.0, 18.0, 35.0, 17.0, 23.0]
farmhouse = [4.0, 14.0, 11.0, 9.0, 9.0, 8.0, 4.0, 20.0, 5.0, 8.9, 21.0, 9.2, 3.0, 2.0, 0.3]

print(f'Sample Mean Urban: {round(mean(urban), 1)}')
print(f'Sample Mean Farmhouse: {round(mean(farmhouse), 1)}')

'''
B. Determine the trimmed mean for each group by trimming the smallest and largest value 
from each group.
'''
print(f'Sample Trimmed Mean Urban: {round(mean(urban, trim_by=1), 1)}')
print(f'Sample Trimmed Mean Farmhouse: {round(mean(farmhouse, trim_by=1), 1)}')











Median

  • The middle value of a sorted numeric data set

| Symbol | Meaning |
|---|---|
| $med(A)$ | Where A is the collection on which to take the median |
| $\tilde x$ | "x-tilde" is also used to represent the median |











Calculating Median

  1. Sort the List
  2. Find the middle value(s)
    • if the list has an odd number of elements, select the middle value
    • if the list has an even number of elements, take the mean of the two middle values
  • example:
odd_list = [13, 18, 13, 14, 13, 16, 14, 21, 13]
even_list = [15, 14, 10, 8, 12, 8, 16, 13]











BREAKOUT (6 minutes)

  • Code the median() function. Make sure to account for even and odd length lists.











BREAKOUT SOLUTION

def median(lst):
    lst_sorted = sorted(lst)
    # integer division: the middle index (odd) or upper-middle index (even)
    mid = len(lst_sorted) // 2

    # odd length: a single middle value
    if len(lst_sorted) % 2 == 1:
        return lst_sorted[mid]

    # even length: mean of the two middle values
    return mean([lst_sorted[mid - 1], lst_sorted[mid]])
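
One way to sanity-check a hand-written `median()` is to compare it with `statistics.median` from the Python standard library. The sketch below runs the library function on the two example lists from the Calculating Median slide; it stands alone and does not call the solution above.

```python
import statistics

odd_list = [13, 18, 13, 14, 13, 16, 14, 21, 13]
even_list = [15, 14, 10, 8, 12, 8, 16, 13]

# Odd length (9 values): the single middle value of the sorted list.
odd_median = statistics.median(odd_list)
# Even length (8 values): the mean of the two middle values, 12 and 13.
even_median = statistics.median(even_list)

print(odd_median)   # 14
print(even_median)  # 12.5
```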











BREAKOUT (3 minutes)

An article published in the journal Indoor Air considered two different air samples to test for endotoxin concentrations, the first in urban households and the second in farmhouses.

  • Urban: 6.0, 5.0, 11.0, 33.0, 4.0, 5.0, 80.0, 18.0, 35.0, 17.0, 23.0
  • Farmhouse: 4.0, 14.0, 11.0, 9.0, 9.0, 8.0, 4.0, 20.0, 5.0, 8.9, 21.0, 9.2, 3.0, 2.0

Calculate the Median of both groups











BREAKOUT SOLUTION

urban = [6.0, 5.0, 11.0, 33.0, 4.0, 5.0, 80.0, 18.0, 35.0, 17.0, 23.0]
farmhouse = [4.0, 14.0, 11.0, 9.0, 9.0, 8.0, 4.0, 20.0, 5.0, 8.9, 21.0, 9.2, 3.0, 2.0]

# print the sorted data alongside each median so the result can be checked by eye
print(sorted(urban))
print(f'Median Urban: {round(median(urban), 1)}')
print(sorted(farmhouse))
print(f'Median Farmhouse: {round(median(farmhouse), 1)}')











BREAKOUT (4 minutes)

An issue of a recent magazine reported the following home sale amounts for a sample of homes in Alameda, CA, all of which were sold in the previous month (1000s of $):

{ 590, 615, 575, 608, 350, 1285, 408, 540, 555, 679 }

a. Find the mean value of the homes sold in April

b. Find the median value of the homes sold in April

c. Do you think mean or median is a “better” measure of center for this data? Why?











BREAKOUT SOLUTION

An issue of a recent magazine reported the following home sale amounts for a sample of homes in Alameda, CA, all of which were sold in the previous month (1000s of $):

{ 590, 615, 575, 608, 350, 1285, 408, 540, 555, 679 }

a. Find the mean value of the homes sold in April

  • 620.5

b. Find the median value of the homes sold in April

  • 582.5

c. Do you think mean or median is a “better” measure of center for this data? Why?

  • For this set of data, the median is likely the more representative summary statistic. The single large sale (1285) pulls the mean up to 620.5, which is above eight of the ten sale prices, while the median is resistant to such outliers.











Outlier Resistance

  • The mean is not resistant to outliers; that is, the mean can be a misleading summary statistic if it is being greatly affected by a small number of outliers.

  • The median is resistant to outliers; that is, the median is barely affected by a small number of outliers, and it is usually the better measure of center when outliers are present.

  • The choice between mean and median should be informed by the particular dataset being analyzed: check it for outliers and skew before picking one.
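
To make outlier resistance concrete, the sketch below reuses the Alameda home sale figures from the previous breakout and measures how far each statistic moves when the largest sale is dropped. Treating 1285 as "the outlier" is an assumption made for illustration, not a formal outlier test.

```python
import statistics

# Home sale amounts from the breakout (1000s of $).
home_sales = [590, 615, 575, 608, 350, 1285, 408, 540, 555, 679]
# Assumption for illustration: treat the 1285 sale as the outlier.
without_outlier = [x for x in home_sales if x != 1285]

mean_shift = statistics.mean(home_sales) - statistics.mean(without_outlier)
median_shift = statistics.median(home_sales) - statistics.median(without_outlier)

print(f'mean moves by:   {mean_shift:.1f}')
print(f'median moves by: {median_shift:.1f}')
```

Removing one value out of ten moves the mean by roughly 74 (thousand dollars) but the median by only 7.5.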











Mode

  • Unlike the median and mean, the mode can be applied to categorical data.
    • Special Case: The median can be applied to categorical data which is ordinal in nature
  • The mode is not very useful for continuous data, because exact values rarely repeat.
  • Rather than trying to represent the “center” of a dataset or distribution, the mode seeks to find the element with the greatest frequency.
  • We can consider the first mode, second mode, and so on when describing a data set











BREAKOUT (5 minutes)

Code the mode() function.











BREAKOUT SOLUTION

'''
Mode
'''
def mode(lst):
    most_occurring = lst[0]

    for item in lst[1:]:
        if lst.count(item) > lst.count(most_occurring):
            most_occurring = item

    return most_occurring


mode_lst = ['kangaroo', 'muskrat', 'platypus', 'muskrat', 'squid', 'squirrel', 'muskrat']

# print(mode(mode_lst))
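
The solution above calls `lst.count()` inside a loop, which rescans the list for every element. As an alternative sketch, `collections.Counter` from the standard library tallies every value in a single pass and also yields the "first mode, second mode, and so on" ordering mentioned on the Mode slide. The helper name `ranked_modes` is hypothetical.

```python
from collections import Counter


def ranked_modes(values):
    """Return (value, count) pairs, most frequent first.

    Hypothetical helper for illustration; values with equal counts
    keep the order in which they were first encountered.
    """
    return Counter(values).most_common()


animals = ['kangaroo', 'muskrat', 'platypus', 'muskrat', 'squid', 'squirrel', 'muskrat']
ranking = ranked_modes(animals)
first_mode, first_count = ranking[0]
print(first_mode, first_count)  # muskrat 3
```

Because `Counter` works on any hashable values, the same function applies to categorical data such as the animal names here.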











A quick note on sampling

  • A sample consists of items selected from a population
  • It's important to compose a sample that is representative
  • We are often attempting to approach a population parameter by way of measuring sample statistics
    • Notice the semantics here
      • the mean of a population is considered a parameter
      • the mean of a sample is considered a statistic











A sampling example using Python

  • What can you do to better approach the population parameter?
from random import choice  # needed by get_samps() below

def get_samps(sample_range, num_samples=5):
    
    samples = []

    for _ in range(num_samples):
        samples.append(choice(sample_range))
    
    return samples

num_samples = 5
sample_range = list(range(0, 99+1))
print(f'mu: {mean(sample_range)}')
# print(f'x_bar: {mean(get_samps(sample_range, num_samples=5))}')

means = []
for _ in range(100000):
    means.append(mean(get_samps(sample_range, num_samples)))

print(f'mean of means: {mean(means)}')
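
One possible answer to the question above is to draw larger samples. The sketch below is a self-contained variation on the example, not the lecture's own code: it compares how widely the sample mean varies for two sample sizes. The function names, the seed, and the repeat count are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable
population = list(range(100))


def sample_mean(sample_size):
    # Sample with replacement, as repeated calls to choice() do.
    return statistics.fmean(random.choices(population, k=sample_size))


def spread_of_sample_means(sample_size, repeats=2000):
    # Standard deviation of many sample means for one sample size.
    means = [sample_mean(sample_size) for _ in range(repeats)]
    return statistics.pstdev(means)


spread_small = spread_of_sample_means(5)
spread_large = spread_of_sample_means(50)
print(f'mu: {statistics.fmean(population)}')
print(f'spread of x_bar with n=5:  {spread_small:.2f}')
print(f'spread of x_bar with n=50: {spread_large:.2f}')
```

The spread for n=50 comes out noticeably smaller than for n=5, so any single sample mean from the larger sample tends to land closer to the population mean of 49.5.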











Continuous vs. Discrete

  • Continuous data
    • “Measurable to an infinite precision”
    • “Not countable”
    • Can take any value in a given range
    • Always has infinitely many possible values
    • Examples:
      • Distance (or any magnitude)
      • Weight
      • Time
  • Discrete data
    • “Countable”
    • Can only take specific values in a given range
    • Examples:
      • Number of students in a class
      • Steps on the way to class
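
The distinction has practical consequences for the summary statistics above, for example why the mode is rarely informative for continuous data. The sketch below simulates both kinds of data with the standard `random` module. The die rolls and the unit-interval measurements are hypothetical stand-ins, and Python floats only approximate truly continuous values.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Discrete: a die roll can only take the values 1 through 6.
die_rolls = [random.randint(1, 6) for _ in range(1000)]
# Continuous (approximated by floats): a measurement anywhere in [0, 1).
measurements = [random.random() for _ in range(1000)]

distinct_rolls = len(set(die_rolls))
distinct_measurements = len(set(measurements))
print(f'distinct die values:         {distinct_rolls}')
print(f'distinct measurement values: {distinct_measurements}')
```

The die rolls repeat constantly, so counting frequencies is meaningful. The measurements almost never repeat exactly, so nearly every value ties for "most frequent".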