This repository is a curated collection of my work and solutions to case studies, mini-projects, and a guided capstone that are part of the Springboard Data Science program.
This README contains summaries (objectives, methods, and takeaways) of the following curriculum categories:
- Case Studies: Industry case studies that cover a wide range of topics and techniques in data science
- Statistical Foundations: Case studies that explore and apply fundamental statistical techniques, without necessarily building predictive models
- Mini Projects: Projects using industry-standard tools and libraries for data manipulation and analysis
- Guided Capstone: An end-to-end data science project experience
Not included in this repo are my two independent capstones, which have their own repositories. For more of my project work, or to see those independent capstones, please visit the pinned repositories on my GitHub or my website: https://www.dennisvdang.com.
- Case Studies
  - Customer Segmentation (Clustering Case Study)
  - Customer Purchasing Prediction (Decision Tree Case Study)
  - COVID-19 Prediction (Random Forest Case Study)
  - Diabetes Prediction (Grid Search KNN Case Study)
  - London Housing Price Forecasting (Time Series Analysis/EDA Case Study)
  - Wine Quality Prediction (Linear Regression Case Study)
  - Heart Disease Prediction (Logistic Regression Case Study)
  - Passenger Survival Prediction (Gradient Boosting Case Study)
  - Flight Delay Prediction (Bayesian Optimization Case Study)
  - Sales Forecasting (Time Series Forecasting with ARIMA Case Study)
- Statistical Foundations
- Mini Projects
- Guided Capstone
Case Studies
Below are summaries of case studies that cover a wide range of topics and demonstrate the application of various data science techniques and algorithms.
Customer Segmentation (Clustering Case Study)
Objective
- Segment customers based on their purchasing behavior and response to marketing campaign offers, and provide insights into the characteristics of each customer segment to enable targeted marketing strategies
Methodology
- Cleaned and transformed data to create a matrix of customer responses to different offers
- Applied k-means clustering and determined the optimal number of clusters using the Elbow, Silhouette, and Gap statistic methods
- Visualized the customer segments in a 2D space using Principal Component Analysis (PCA)
- Analyzed the distribution of key features across the identified clusters and revealed the characteristics of each customer segment
- Libraries: Pandas, Scikit-learn, Matplotlib, Seaborn
Key Takeaways
- Identified distinct customer segments with varying offer preferences and purchasing behaviors
- Demonstrated the effectiveness of k-means clustering for customer segmentation
- Highlighted the importance of feature selection and dimensionality reduction for effective clustering and visualization
Link to Customer Segmentation Case Study notebook
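A minimal sketch of the clustering workflow described above, using synthetic stand-in data rather than the actual offer-response matrix; the matrix shape and cluster range are illustrative assumptions, not values from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Stand-in for the customer-by-offer response matrix (1 = responded to the offer)
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.integers(0, 2, size=(100, 32)))

# Score candidate cluster counts with inertia (elbow) and silhouette
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))

# Fit the chosen model and project customers to 2-D with PCA for plotting
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
coords = PCA(n_components=2).fit_transform(X)
print(coords.shape, np.bincount(labels))
```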
Customer Purchasing Prediction (Decision Tree Case Study)
Objective
- Build a predictive model to estimate the likelihood of RR Diner Coffee's loyal customers purchasing the Hidden Farm coffee
Methodology
- Analyzed customer data such as age, gender, salary, online purchases, distance from the flagship store, and spending habits to understand key drivers of customer purchasing behavior
- Implemented four decision tree models with varying parameters to understand the impact of different parameters on model performance:
- Default Decision Tree Model
- Decision Tree with Max Depth = 3
- Decision Tree with Min Samples Split = 10
- Decision Tree with Min Samples Leaf = 5
- Evaluated model performance using accuracy, balanced accuracy, precision, and recall metrics
- Compared the performance of a single decision tree model to a Random Forest ensemble model
- Libraries: Pandas, Scikit-learn, Matplotlib
Key Takeaways
- Identified key factors influencing customer purchasing decisions
- Demonstrated the effectiveness of decision trees for predictive modeling
- Highlighted the importance of model simplicity and interpretability for specific datasets
Link to Customer Purchasing Prediction Case Study notebook
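A minimal sketch of the model comparison described above, with synthetic data standing in for the RR Diner Coffee customers and a generic binary purchase label.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the customer features and purchase decision
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# The four tree variants from the case study, plus a random forest for comparison
models = {
    "default tree": DecisionTreeClassifier(random_state=0),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=0),
    "min_samples_split=10": DecisionTreeClassifier(min_samples_split=10, random_state=0),
    "min_samples_leaf=5": DecisionTreeClassifier(min_samples_leaf=5, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    print(name,
          round(accuracy_score(y_test, pred), 3),
          round(balanced_accuracy_score(y_test, pred), 3),
          round(precision_score(y_test, pred), 3),
          round(recall_score(y_test, pred), 3))
```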
COVID-19 Prediction (Random Forest Case Study)
Objective
- Predict the severity of COVID-19 cases using patient characteristics and symptoms
Methodology
- Preprocessed and explored the dataset to understand feature-target relationships
- Built and evaluated random forest models with different hyperparameters and feature subsets
- Interpreted feature importances to identify key risk factors for severe COVID-19
- Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
Key Takeaways
- Demonstrated the application of random forests for predicting disease severity
- Highlighted the importance of feature selection and hyperparameter tuning
Link to COVID-19 Random Forest Case Study notebook
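A minimal sketch of the random forest and feature-importance step; the column names, hyperparameters, and synthetic data are assumptions for illustration, not values from the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for patient characteristics and a binary severity label
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, max_depth=6, random_state=0).fit(X_train, y_train)
print("test accuracy:", round(rf.score(X_test, y_test), 3))

# Rank features by impurity-based importance to surface candidate risk factors
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head())
```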
Diabetes Prediction (Grid Search KNN Case Study)
Objective
- Optimize the hyperparameters of a K-Nearest Neighbors (KNN) classifier using grid search on the Pima Indians Diabetes dataset.
Methodology
- Loaded and preprocessed the Pima Indians Diabetes dataset and standardized the features.
- Implemented a loop to train KNN models with a range of neighbor values (1-9) and evaluated the training and test scores.
- Identified the optimal number of neighbors based on the maximum test score.
- Fit the KNN model with the optimal number of neighbors and evaluated its performance on the training and test sets.
- Plotted the confusion matrix and printed the classification report for the optimal model.
- Applied the grid search method to find the optimal number of estimators in a Random Forest model.
- Libraries: Pandas, Scikit-learn, Matplotlib
Key Takeaways
- Grid search is an effective technique for optimizing hyperparameters of machine learning models, such as the number of neighbors in KNN and the number of estimators in Random Forest.
- Identifying the optimal hyperparameters can significantly improve the model's performance on both the training and test datasets.
- Visualizing the model's performance, such as through confusion matrices and classification reports, provides valuable insights into the model's strengths and weaknesses.
Link to Diabetes Prediction Case Study notebook
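A minimal sketch of the scaling-plus-grid-search step using scikit-learn's GridSearchCV; the data here is a synthetic stand-in for the Pima dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the eight Pima Indians Diabetes features and outcome
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Standardize features, then search neighbor counts 1-9 with cross-validation
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe, {"kneighborsclassifier__n_neighbors": list(range(1, 10))}, cv=5)
grid.fit(X_train, y_train)

print("best parameters:", grid.best_params_)
print(confusion_matrix(y_test, grid.predict(X_test)))
print(classification_report(y_test, grid.predict(X_test)))
```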
London Housing Price Forecasting (Time Series Analysis/EDA Case Study)
Objective
- Analyze trends in London's housing market and forecast future prices
Methodology
- Conducted exploratory data analysis (EDA) to understand the impact of the 2008 financial crisis on London's housing market
- Applied statistical tests to explore the relationship between pre-crisis prices and crisis recovery time
- Employed time series models (ARIMA, SARIMA) and Random Forest to forecast property values
- Evaluated model performance using MSE and MAE
- Libraries: Pandas, NumPy, Statsmodels, Matplotlib, Seaborn
Key Takeaways
- Identified trends and patterns in London's housing market over time
- Demonstrated the application of time series modeling for forecasting
Link to London Housing Case Study notebook
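A minimal sketch of fitting a seasonal ARIMA model and producing a forecast with Statsmodels; the synthetic monthly series and the (p, d, q) orders shown are stand-ins, not the values used in the notebook.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic monthly price series standing in for a borough's average house prices
idx = pd.date_range("1995-01-01", periods=240, freq="MS")
rng = np.random.default_rng(0)
prices = pd.Series(100_000 + 800 * np.arange(240) + rng.normal(0, 5_000, 240), index=idx)

# Fit a SARIMA model and forecast the next two years
model = SARIMAX(prices, order=(1, 1, 1), seasonal_order=(1, 0, 1, 12)).fit(disp=False)
forecast = model.get_forecast(steps=24).predicted_mean
print("AIC:", round(model.aic, 1))
print(forecast.tail())
```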
Wine Quality Prediction (Linear Regression Case Study)
Objective
- Predict the alcohol content of wine using regression analysis and iterate towards an accurate model
Methodology
- Performed univariate and multivariate analysis to iterate towards an accurate model
- Utilized EDA to visualize correlations and inform model selection
- Developed multiple linear regression models, iterating to improve accuracy and reduce redundancy
- Evaluated models based on R-squared, AIC, and BIC metrics
- Libraries: Pandas, NumPy, Matplotlib, Seaborn, Statsmodels, Scikit-learn
Key Takeaways
- Importance of iterative model development and feature selection
- Demonstrated the application of linear regression in predicting continuous variables
- Identified the most economical model (rModel4), which used only a few predictors effectively
Link to Linear Regression Case Study notebook
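A minimal sketch of comparing OLS models on R-squared, AIC, and BIC with Statsmodels; the DataFrame columns are made-up stand-ins for the wine chemistry features.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for wine chemistry data with an alcohol target
rng = np.random.default_rng(0)
wine = pd.DataFrame({
    "density": rng.normal(0.996, 0.002, 500),
    "residual_sugar": rng.normal(6.0, 2.0, 500),
})
wine["alcohol"] = 10 + 400 * (0.997 - wine["density"]) + rng.normal(0, 0.3, 500)

# Fit nested models and compare fit statistics to pick a parsimonious one
m1 = smf.ols("alcohol ~ density", data=wine).fit()
m2 = smf.ols("alcohol ~ density + residual_sugar", data=wine).fit()
for name, m in (("density only", m1), ("density + sugar", m2)):
    print(name, round(m.rsquared, 3), round(m.aic, 1), round(m.bic, 1))
```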
Heart Disease Prediction (Logistic Regression Case Study)
Objective
- Build a logistic regression model to predict the presence of heart disease using patient health data
Methodology
- Introduced logistic regression as a fundamental classification algorithm
- Covered key concepts like classification, model evaluation, discriminative vs generative classifiers
- Performed data preprocessing (handling missing values, encoding categorical variables)
- Split the data into training and testing sets
- Built and evaluated logistic regression models using scikit-learn
- Explored model tuning and evaluation metrics (accuracy, precision, recall, ROC curves)
- Visualized decision boundaries and interpreted model coefficients
- Libraries: Pandas, Scikit-learn, Matplotlib, NumPy
Key Takeaways
- Logistic regression is a simple yet effective algorithm for binary classification problems
- Importance of proper data preprocessing and model evaluation for classification tasks
- Logistic regression provides interpretable models and discriminative decision boundaries
- Need to consider multiple evaluation metrics beyond accuracy for imbalanced datasets
Link to Logistic Regression Case Study notebook
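A minimal sketch of the preprocessing, fitting, and evaluation flow for logistic regression; synthetic data replaces the patient records, and the class imbalance is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for patient features and a heart-disease label
X, y = make_classification(n_samples=600, n_features=12, weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
print("ROC AUC:", round(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]), 3))

# Coefficients are interpretable: sign and size show each feature's direction and strength
print(clf.named_steps["logisticregression"].coef_.round(2))
```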
Passenger Survival Prediction (Gradient Boosting Case Study)
Objective
- Explore the concept of gradient boosting using decision trees as base predictors to predict passenger survival in the Titanic dataset
Methodology
- Utilized the Titanic dataset to demonstrate the application of gradient boosting for binary classification
- Preprocessed the dataset, handling missing values and encoding categorical variables
- Demonstrated the process of fitting a series of decision trees on residual errors to improve predictions
- Evaluated model performance with multiple learning rates and calculated the ROC curve
- Libraries: Pandas, NumPy, Scikit-learn, Matplotlib
Key Takeaways
- Gradient boosting can significantly improve model performance in predicting passenger survival
- Importance of tuning hyperparameters, such as learning rate and number of estimators, to optimize model performance
- Demonstrated the effectiveness of gradient boosting in handling complex datasets with a mix of numerical and categorical features
Link to Gradient Boosting Case Study notebook
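A minimal sketch of varying the learning rate in a gradient boosting classifier and scoring with ROC AUC; the data is a synthetic stand-in for the Titanic passengers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features and survival label
X, y = make_classification(n_samples=891, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Each tree in the ensemble fits the residual errors of the trees before it;
# the learning rate scales each tree's contribution
for lr in (0.05, 0.1, 0.5, 1.0):
    gb = GradientBoostingClassifier(n_estimators=200, learning_rate=lr, random_state=0)
    gb.fit(X_train, y_train)
    auc = roc_auc_score(y_test, gb.predict_proba(X_test)[:, 1])
    print(f"learning_rate={lr}: ROC AUC={auc:.3f}")
```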
Flight Delay Prediction (Bayesian Optimization Case Study)
Objective
- Use Bayesian Optimization to tune hyperparameters of a LightGBM model for flight delay prediction
Methodology
- Loaded and preprocessed the flight delay dataset, handling missing values and encoding categorical features.
- Evaluated the LightGBM model with given hyperparameters using cross-validation.
- Set up the Bayesian Optimization process using the BayesianOptimization library, specifying the hyperparameter search space.
- Performed Bayesian Optimization to find the optimal hyperparameters that maximize the AUC (Area Under the Receiver Operating Characteristic Curve) metric.
- Trained the final LightGBM model with the optimal hyperparameters obtained from Bayesian Optimization.
- Evaluated the final model's performance on the test set and compared it with a majority class baseline.
- Libraries: Pandas, NumPy, LightGBM, Scikit-learn, Bayesian Optimization, Matplotlib
Key Takeaways
- Bayesian Optimization is an effective technique for optimizing hyperparameters of complex machine learning models, such as LightGBM.
- Visualizing the model's performance using metrics like AUC, confusion matrix, and accuracy provides insights into the model's strengths and weaknesses.
Link to Bayesian Optimization Case Study notebook
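A minimal sketch of Bayesian optimization over two LightGBM hyperparameters, assuming the bayesian-optimization package (imported as bayes_opt); the search bounds, iteration counts, and synthetic data are illustrative assumptions.

```python
import lightgbm as lgb
from bayes_opt import BayesianOptimization
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the flight features and delayed/on-time label
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

def cv_auc(num_leaves, learning_rate):
    """Cross-validated AUC for one hyperparameter combination."""
    model = lgb.LGBMClassifier(num_leaves=int(num_leaves), learning_rate=learning_rate,
                               n_estimators=200, random_state=0)
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()

# The optimizer proposes promising hyperparameters based on previous evaluations
optimizer = BayesianOptimization(
    f=cv_auc,
    pbounds={"num_leaves": (16, 128), "learning_rate": (0.01, 0.3)},
    random_state=0,
)
optimizer.maximize(init_points=3, n_iter=10)
print(optimizer.max)  # best AUC found and the hyperparameters that produced it
```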
Sales Forecasting (Time Series Forecasting with ARIMA Case Study)
Objective
- Analyze the sales data of Cowboy Cigarettes from 1949 to 1960 and predict future sales using time series forecasting with the ARIMA model.
Methodology
- Loaded and explored the historical sales data of Cowboy Cigarettes.
- Performed data cleaning and transformation, including log transformation to stabilize the variance.
- Conducted exploratory data analysis (EDA) to understand the trend, seasonality, and noise in the sales data.
- Utilized the Augmented Dickey-Fuller test to check for stationarity and applied differencing to achieve a stationary time series.
- Employed the ARIMA model for time series forecasting, optimizing the model parameters (p, d, q) based on the Akaike Information Criterion (AIC).
- Forecasted future sales for the next two years and visualized the original data along with the forecasts.
- Libraries: Pandas, NumPy, Matplotlib, Seaborn, Statsmodels
Key Takeaways
- The ARIMA model effectively captured the underlying trend and seasonality in Cowboy Cigarettes' sales data, providing a close fit to the historical data.
- Forecasting revealed an increasing trend in sales, suggesting a growing demand for Cowboy Cigarettes.
- The case study highlighted the importance of time series decomposition and the capability of ARIMA models in forecasting future trends based on historical data.
Link to Time Series Analysis and Forecasting with ARIMA Case Study notebook
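A minimal sketch of the stationarity check and a small grid search over (p, d, q) by AIC, followed by a 24-month forecast; the synthetic monthly series stands in for the Cowboy Cigarettes data.

```python
import itertools
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

# Synthetic monthly sales series with trend and seasonality (log-transformed)
idx = pd.date_range("1949-01-01", periods=144, freq="MS")
rng = np.random.default_rng(0)
raw = 100 + 2.0 * np.arange(144) + 10 * np.sin(np.arange(144) * 2 * np.pi / 12) + rng.normal(0, 3, 144)
sales = pd.Series(np.log(raw), index=idx)

# Augmented Dickey-Fuller test on the differenced series to check stationarity
print("ADF p-value after differencing:", round(adfuller(sales.diff().dropna())[1], 4))

# Choose (p, d, q) by AIC from a small grid, then forecast two years ahead
aics = {order: ARIMA(sales, order=order).fit().aic
        for order in itertools.product(range(3), [1], range(3))}
best_order = min(aics, key=aics.get)
best = ARIMA(sales, order=best_order).fit()
print("best order:", best_order, "AIC:", round(best.aic, 1))
print(best.forecast(steps=24).tail())
```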
Statistical Foundations
Summaries of case studies that explore and apply fundamental statistical techniques and mathematical principles, without necessarily building predictive models.
Cosine Similarity Case Study
Objective
- Use cosine similarity to compare numeric data and text datasets
Methodology
- Calculated similarity measures for sentences/paragraphs
- Libraries: NumPy, Pandas, Matplotlib, SciPy, Scikit-learn
Key Takeaways
- Cosine similarity is effective for comparing both numeric and text data
- Applicable in NLP and recommendation systems
Link to Cosine Similarity Case Study notebook
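A minimal sketch of cosine similarity on numeric vectors and on TF-IDF representations of short texts; the example sentences are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Numeric vectors: cosine similarity depends on direction, not magnitude
a, b = np.array([[1, 2, 3]]), np.array([[2, 4, 6]])
print(cosine_similarity(a, b))  # [[1.0]]

# Text: vectorize with TF-IDF, then compare documents pairwise
docs = ["the coffee was strong", "the coffee was weak", "shipping was fast"]
tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf).round(2))
```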
Euclidean and Manhattan Distance Case Study
Objective
- Demonstrate calculation and comparison of Euclidean and Manhattan distances
Methodology
- Visualized distribution of distances through histograms
- Highlighted applications in data science, where these distance measures are foundational for techniques such as PCA
- Libraries: Pandas, NumPy, Matplotlib
Key Takeaways
- Euclidean and Manhattan distances behave differently: Euclidean measures straight-line distance, while Manhattan sums the absolute differences along each dimension
Link to Euclidean and Manhattan Distance Case Study notebook
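A minimal sketch contrasting the two distance calculations on one pair of points, using NumPy only.

```python
import numpy as np

p, q = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 3.0])
print("Euclidean:", np.sqrt(np.sum((p - q) ** 2)))  # sqrt(3^2 + 4^2 + 0^2) = 5.0
print("Manhattan:", np.sum(np.abs(p - q)))          # |3| + |4| + |0| = 7.0
```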
Frequentist Inference Case Study
Objective
- Apply Frequentist inference to real-world data, addressing practical business questions for a hospital
Methodology
- Emphasized statistical concepts like z-statistic, t-statistic, Central Limit Theorem, and confidence intervals
- Explored the Central Limit Theorem and its implications
- Estimated population mean and standard deviation from a sample
- Calculated confidence intervals
- Libraries: Pandas, NumPy, Matplotlib
Key Takeaways
- Demonstrated the application of Frequentist inference in making data-driven decisions
- Highlighted the importance of statistical concepts in real-world scenarios
Link to Frequentist Inference Case Study notebook (Part 1) (Part 2)
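A minimal sketch of a one-sample t-based confidence interval of the kind used in the hospital analysis; the charge amounts are synthetic, and SciPy is assumed for the critical value.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed stand-in for hospital charge amounts
rng = np.random.default_rng(0)
charges = rng.exponential(scale=13_000, size=1338)

n = len(charges)
mean = charges.mean()
sem = charges.std(ddof=1) / np.sqrt(n)   # standard error of the mean
t_crit = stats.t.ppf(0.975, n - 1)       # two-sided 95% critical value
print(f"sample mean = {mean:.0f}")
print(f"95% CI = ({mean - t_crit * sem:.0f}, {mean + t_crit * sem:.0f})")
```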
Mini Projects
API Data Collection Mini Project
Objective
- Explore the use of APIs for data collection and integration in data analysis workflows
Methodology
- Demonstrated making API requests, handling authentication, and parsing JSON responses
- Collected data from the GitHub API and performed basic analysis
- Libraries: Requests, Pandas, Matplotlib
Key Takeaways
- APIs enable access to vast amounts of data for analysis
- Python libraries simplify API interactions and data manipulation
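A minimal sketch of an unauthenticated request to a public GitHub REST endpoint, flattening the JSON response into a DataFrame; the repository queried is just an example, and unauthenticated calls are rate-limited.

```python
import pandas as pd
import requests

# Fetch recent open issues from a public repository (example endpoint)
resp = requests.get(
    "https://api.github.com/repos/pandas-dev/pandas/issues",
    params={"state": "open", "per_page": 20},
    headers={"Accept": "application/vnd.github+json"},
    timeout=10,
)
resp.raise_for_status()

# Flatten the JSON payload into a table for analysis
issues = pd.json_normalize(resp.json())
print(issues[["number", "title", "created_at"]].head())
```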
SQL Mini Project
Objective
- Demonstrate the use of SQL for data manipulation, aggregation, and analysis
Methodology
- Performed various SQL queries on a sample database (filtering, joining, grouping, sorting)
- Showcased SQL's power in handling complex data queries and transformations
- Libraries/Tools: SQLite, Pandas, MySQL, phpMyAdmin
Key Takeaways
- SQL is a powerful tool for data manipulation and analysis
- Proficiency in SQL and SQL libraries in Python is essential for working with relational databases
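A minimal sketch of running a filtered, grouped SQL query from Python against an in-memory SQLite database; the table and column names are made up for illustration.

```python
import sqlite3
import pandas as pd

# Build a tiny in-memory database to query
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookings (facility TEXT, member TEXT, slots INTEGER);
    INSERT INTO bookings VALUES
        ('Tennis Court', 'Smith', 3), ('Tennis Court', 'Jones', 2),
        ('Pool Table', 'Smith', 1), ('Pool Table', 'GUEST', 6);
""")

# Filter, aggregate, group, and sort in SQL, then read the result into pandas
query = """
    SELECT facility, SUM(slots) AS total_slots
    FROM bookings
    WHERE member != 'GUEST'
    GROUP BY facility
    ORDER BY total_slots DESC;
"""
print(pd.read_sql_query(query, conn))
conn.close()
```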
Guided Capstone
Objective
- Complete a comprehensive data science project covering all stages of the data science lifecycle
Methodology
- Collaborated with a team to define project scope, objectives, and deliverables
- Conducted extensive EDA to gain insights and inform modeling
- Built and evaluated multiple machine learning models (regression, classification, clustering)
- Communicated findings through visualizations, reports, and presentations
- Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
Key Takeaways
- Hands-on experience with the complete data science lifecycle
- Importance of collaboration, communication, and project management in data science