## Lecture 1 Summary
- We talked about different roles of Data Scientists
- T-Shaped Data Scientists
- Data Science Workflow
- Continuous, Discrete and Qualitative Data
- Supervised vs Unsupervised Learning
- Set up GitHub accounts
- Set up IPython Notebook
- Introduced NumPy
- Classification vs Clustering and Regression vs Dimensionality Reduction
- Flexibility vs Interpretability
- Different types of data (Cross-Sectional, Time-Series, Panel Data)
- Walkthrough: acquiring and parsing data with Pandas
- HW 1 assigned - Due date Feb 8th at 6:30PM
- Measures of central tendency (Mean, Median, Mode, Quartiles, Percentiles)
- Measures of Variability (IQR, Standard Deviation, Variance)
- Skewness Coefficient
- Kurtosis Coefficient
- Boxplots
- Bias vs Variance
- Central Limit Theorem – Standard Error of Mean
- Class/Dummy Variables
- Walkthrough describing and visualizing data in Pandas
- Linear Regression lines
- Single Variable and Multi-Variable Regression Lines
- Capture non-linearity using Linear Regression lines.
- Interpreting regression coefficients
- Dealing with dummy variables in regression lines
- Intro to the sklearn and seaborn libraries
- HW 2 assigned - Due date Feb 17th 2016 at 6:30PM
- Hypothesis test - test of significance on regression coefficients
- p-value
- Capture non-linearity using Linear Regression lines.
- Different types of errors and R-squared
- Interaction Effects
- Bias-Variance Trade off
- Validation (Test vs Train set)
- Cross-Validation
- Ridge and Lasso Regression
- (Optional) Backward Selection, Forward Selection, All Subset Selection. (If you want to use these methods you need to use R)
- Types of missing data (MCAR, MAR, NMAR)
- Single imputation and their limitations
- Imputation using regression lines and error
- Hot deck imputation
- Multiple imputation
- Classification Problems
- Misclassification Error
- KNN algorithm for Classification
- Cross-Validation for KNN Algorithm
- Limitations of KNN Algorithm
- KNN algorithm for Regression
- Intro to Logistic Regression
- Odds vs Probability
- Using Logistic Regression to Make predictions
- Interpreting the coefficients of a Logistic Regression model
- Strengths and weaknesses of the Logistic Regression model
- Unbalanced observations and Logistic Regression
- FP/FN/TP/TN/FPR/TPR
- The effect of changing the threshold
- ROC Curves
- Area Under Curve
- How to compare classification algorithms
- Decision Tree for Regression
- Greedy Approach
- Decision Tree for Classification
- Gini Index and Entropy index
- Limitations of a Simple Decision Tree
- Bagging
- Random Forest
- Boosting
- Tuning parameters for boosting and Random Forest
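
The Gini index and entropy named in the decision tree topics above can be sketched in a few lines. This is a minimal illustration, not the course's own code: the function names and the toy labels are hypothetical, and the greedy split search itself is left out. It only shows how a candidate split is scored by the size-weighted impurity of its two children.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return the fraction of observations in each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini_index(labels):
    """Gini index: 1 - sum(p_k^2). Zero for a pure node."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: sum(-p_k * log2(p_k)). Zero for a pure node."""
    return sum(-p * log2(p) for p in class_proportions(labels))


def weighted_impurity(left, right, measure):
    """Impurity of a candidate split, weighting each child by its size."""
    n = len(left) + len(right)
    return (len(left) / n) * measure(left) + (len(right) / n) * measure(right)


# Hypothetical labels for illustration only.
parent = ["yes", "yes", "yes", "no", "no", "no"]
pure_split = (["yes", "yes", "yes"], ["no", "no", "no"])
mixed_split = (["yes", "yes", "no"], ["yes", "no", "no"])

print(gini_index(parent))
print(entropy(parent))
print(weighted_impurity(*pure_split, gini_index))
print(weighted_impurity(*mixed_split, gini_index))
```

A greedy tree builder would evaluate many candidate splits this way and keep the one with the lowest weighted impurity, which is why the pure split scores better than the mixed one here.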

Additional Resources

- Decision Tree - Video - Part 1
- Decision Tree - Video - Part 2
- Decision Tree - Video - Part 3
- BootStrap - Video
- Definition of Natural Language Processing
- NLP applications
- Basic NLP practice
- Stop words, bag-of-words, TF-IDF
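
As a rough illustration of the last topic above, the sketch below removes stop words, builds a bag-of-words count, and weights terms by TF-IDF using only the standard library. The stop-word list, the function names, and the three-sentence corpus are all hypothetical, and the weighting (term frequency times ln(N / document frequency)) is just one common textbook variant.

```python
from collections import Counter
from math import log

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"the", "a", "is", "on", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def bag_of_words(text):
    """Map each remaining token to its count in the document."""
    return Counter(tokenize(text))


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Uses tf = count / document length and idf = ln(N / document frequency).
    """
    bags = [bag_of_words(d) for d in docs]
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        length = sum(bag.values())
        weights.append({
            term: (count / length) * log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights


# Hypothetical corpus for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

print(bag_of_words(docs[0]))
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Terms that appear in only one document ("mat", "log", "chased") receive the highest weights. scikit-learn's `TfidfVectorizer` applies a smoothed idf and normalizes each row by default, so its numbers will differ from this sketch.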

Additional Resources

- If you want to learn a lot more NLP, check out the excellent video lectures and slides from this Coursera course (which is no longer being offered).
- Natural Language Processing with Python is the most popular book for going in-depth with the Natural Language Toolkit (NLTK).
- A Smattering of NLP in Python provides a nice overview of NLTK, as does this notebook from DAT5.
- spaCy is a newer Python library for text processing that is focused on performance (unlike NLTK).
- If you want to get serious about NLP, Stanford CoreNLP is a suite of tools (written in Java) that is highly regarded.
- When working with a large text corpus in scikit-learn, HashingVectorizer is a useful alternative to CountVectorizer.
- Automatically Categorizing Yelp Businesses discusses how Yelp uses NLP and scikit-learn to solve the problem of uncategorized businesses.
- Modern Methods for Sentiment Analysis shows how "word vectors" can be used for more accurate sentiment analysis.
- Identifying Humorous Cartoon Captions is a readable paper about identifying funny captions submitted to the New Yorker Caption Contest.
- Principal Component Analysis
- Computation of PCAs
- Geometry of PCAs
- Proportion of Variance Explained
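
The proportion of variance explained can be worked through by hand for the two-variable case. The sketch below is an assumption-laden illustration rather than the lecture's method: it uses the closed-form eigenvalues of a 2x2 sample covariance matrix, skips standardizing the variables, and runs on made-up data. With more variables one would normally use NumPy or scikit-learn instead.

```python
from math import sqrt


def covariance_matrix_2d(xs, ys):
    """Sample covariance entries (var_x, cov_xy, var_y) for two variables."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return var_x, cov_xy, var_y


def explained_variance_ratio_2d(xs, ys):
    """Proportion of variance explained by each of the two principal components."""
    var_x, cov_xy, var_y = covariance_matrix_2d(xs, ys)
    # Eigenvalues of the symmetric 2x2 matrix [[var_x, cov_xy], [cov_xy, var_y]].
    centre = (var_x + var_y) / 2
    radius = sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    eigenvalues = [centre + radius, centre - radius]
    total = sum(eigenvalues)  # equals var_x + var_y, the total variance
    return [value / total for value in eigenvalues]


# Hypothetical data for illustration only: y is roughly 2 * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

pve = explained_variance_ratio_2d(xs, ys)
print(pve)
```

Because the two variables are almost perfectly correlated, nearly all of the variance falls on the first component. The sketch does not guard against constant inputs, where the total variance is zero.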

Additional Resources

- This tutorial on Principal Components Analysis (PCA) includes good refreshers on covariance and linear algebra.
- To go deeper on Singular Value Decomposition, read Kirk Baker's excellent tutorial.
- Chapter 10 of An Introduction to Statistical Learning with Applications in R.