title | duration | creator |
---|---|---|
Statistics Fundamentals | 3 hr | |

DS | Lesson 4
After this lesson, you will be able to:
- Explain the difference between causation and correlation
- Test a hypothesis within a sample case study
- Validate your findings using statistical analysis (p-values, confidence intervals)
Before this lesson, you should already be able to:
- Explain the difference between variance and bias
- Use descriptive stats to understand your data
TIMING | TYPE | TOPIC |
---|---|---|
5 min | Opening | Lesson Objectives |
30 min | Introduction | Confidence Intervals |
30 min | Introduction | Hypothesis Testing |
30 min | Demo | Hypothesis Testing: Case Study |
5 min | Introduction | Validate your findings |
20 min | Demo | P-values, CI: Case Study |
35 min | Independent Practice | Practice with p-values and CI |
15 min | Wrap-up | Review Guided Practice |
- Review any questions from last session
- Discuss Current Lesson Objectives
- Review prior exit tickets
Today we will use advertising data from an example in An Introduction to Statistical Learning by Gareth James et al.
You'll remember from last time that we worked on descriptive statistics. How would we tell if there is a difference between our groups? How would we know if this difference was real or if our finding is simply due to chance?
These are the questions we often tackle when we are building out our models in the Refine & Build steps of our data science workflow.
For example, if we are working on sales data, how would we know if there was a difference between the buying patterns of men and women at Acme Inc? Hypothesis testing!
Generally speaking, you start with a null hypothesis and an alternative hypothesis, which is the opposite of the null. Then, you check whether the data supports rejecting the null hypothesis or whether you fail to reject it.
Note that "failing to reject" the null is not the same as "accepting" the null hypothesis. Your alternative hypothesis may indeed be true, but you don't necessarily have enough data to show that yet.
This distinction is important to help you avoid overstating your findings. You should only state what your data and analysis can truly represent.
Here is an example of a conventional hypothesis test:
- Null hypothesis: There is no relationship between Gender and Sales.
- Alternative hypothesis: There is a relationship between Gender and Sales.
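
The test above can be sketched as a permutation test. This is one of several possible approaches and not necessarily the one used in the demo. The purchase amounts and the names `permutation_test`, `sales_men`, and `sales_women` are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns the fraction of random relabelings whose absolute difference
    in means is at least as large as the observed one. That fraction is
    an estimate of the p-value under the null hypothesis of no relationship.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # under the null, group labels are interchangeable
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations


# Hypothetical purchase amounts, NOT real Acme Inc data.
sales_men = [20.0, 22.5, 19.0, 24.0, 21.5, 23.0, 20.5, 22.0]
sales_women = [25.0, 27.5, 24.0, 26.5, 28.0, 25.5, 27.0, 26.0]

p_value = permutation_test(sales_men, sales_women)
reject_null = p_value < 0.05  # conventional 5% cutoff

print(f"estimated p-value: {p_value:.4f}")
print("reject the null" if reject_null else "fail to reject the null")
```

If `reject_null` is `False`, the conclusion is that we failed to reject the null, not that the null is true.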
Let's dive into this more with the demo.
Check: What is the null hypothesis? Why is it important?
How do we tell if the association we observed is statistically significant?
Statistical significance means that an observed result or relationship is unlikely to have arisen from random chance alone. Statistical hypothesis testing is the traditional way to determine whether a result is statistically significant.
Typically, we use a cutoff of 5%. In other words, we say that a finding is statistically significant if, assuming the null hypothesis is true, there is a less than 5% chance of seeing a result at least this extreme from chance alone.
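
As a minimal sketch, the 5% cutoff can be written as a decision rule on a p-value. The example assumes the test statistic is a z-score that follows a standard normal distribution under the null hypothesis. That assumption, and the names `two_sided_p_value` and `is_significant`, are illustrative and not taken from the lesson's case study.

```python
from statistics import NormalDist

ALPHA = 0.05  # the conventional 5% cutoff


def two_sided_p_value(z):
    """Probability of a result at least as extreme as z, in either
    direction, assuming the null hypothesis is true (standard normal)."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def is_significant(p_value, alpha=ALPHA):
    """A finding is called statistically significant when p < alpha."""
    return p_value < alpha


# Hypothetical z-scores, for illustration only.
for z in (1.0, 2.5):
    p = two_sided_p_value(z)
    print(f"z = {z}: p-value = {p:.4f}, significant = {is_significant(p)}")
```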
When data scientists present results and say they found a significant result, they are almost always using this criterion. Let's dive in further to understand p-values and confidence intervals.
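
A confidence interval for a sample mean can be sketched as follows. This version uses a normal approximation to stay within the Python standard library. For small samples a t-distribution is usually more appropriate, so treat it as an illustration and not a recommended method. The sample values and the name `mean_confidence_interval` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean of `sample`.

    Uses the normal approximation: mean +/- z * (s / sqrt(n)),
    where s is the sample standard deviation.
    """
    n = len(sample)
    center = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    # For 95% confidence this is the 97.5th percentile, roughly 1.96.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return center - margin, center + margin


# Hypothetical sales figures, for illustration only.
sample_sales = [21.0, 23.5, 19.5, 22.0, 24.5, 20.0, 22.5, 23.0]

low, high = mean_confidence_interval(sample_sales)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

A higher confidence level gives a wider interval, and a larger sample gives a narrower one.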
Check: What does a 95% confidence interval indicate?
For this exercise, you will look through a variety of analyses and interpret the findings.
You will be presented with a series of outputs (similar to the ones we will generate once we start regression) and tables from a published analysis.
For this lab you will be asked to read these outputs and tables and determine whether the findings are statistically significant.
You will also get practice looking at the output and understanding how the model was built (e.g. identifying predictor/exposure vs. outcome).
Any questions?
UPCOMING PROJECTS | Unit Project 2 |