
STATS 141XP Final Project (Identifying MLB Draft Biases)


pddiii/MLB-Draft-Biases


MLB-Draft-Biases

This project analyzes and identifies draft biases in the Major League Baseball (MLB) Amateur Draft. The draft has taken place since 1965 and has in the past featured up to 100 rounds (today's version has 20 rounds).

Beyond identifying these draft biases, we sought to determine whether a player's level of success in MLB could be predicted. Our metric of success was FanGraphs Wins Above Replacement (fWAR). Our predictors consisted entirely of player demographics and physical characteristics.

Contributors

  • Peter D. DePaul III
    • Data Collection
    • Data Cleaning
    • EDA and Visualizations
    • Model Creation
    • Final Report
  • Anish Ravilla
    • Final Report
  • Robin Lee
    • EDA and Visualizations
    • Final Report
  • Alan Wong
  • Kevin Kim
  • Hongye Zhang

Data Dictionary

To find our dictionary of variables, click below:

Data Collection and Data Cleaning

Our data collection was performed using baseballr (R) and pybaseball (Python). These processes can be found below:

Our data cleaning was performed in R using several packages (primarily those in the tidyverse).
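As a rough illustration of the collection step, the sketch below loops over draft years and rounds and stores one result per (year, round) pair. It assumes pybaseball's `amateur_draft(year, draft_round)` helper as the default fetcher; the README does not say which pybaseball functions were actually used, and `collect_draft_rounds` and its `fetch` parameter are hypothetical names introduced here.

```python
from typing import Any, Callable, Dict, Iterable, Optional, Tuple


def collect_draft_rounds(
    years: Iterable[int],
    rounds: Iterable[int],
    fetch: Optional[Callable[[int, int], Any]] = None,
) -> Dict[Tuple[int, int], Any]:
    """Fetch one draft table per (year, round) pair.

    `fetch` is a hypothetical hook: any callable taking (year, round).
    When omitted, pybaseball's amateur_draft is assumed and imported lazily,
    so the function can be exercised without a network connection.
    """
    if fetch is None:
        # Assumption: pybaseball is installed and exposes amateur_draft.
        from pybaseball import amateur_draft
        fetch = amateur_draft

    rounds = list(rounds)  # allow re-iteration for every year
    results: Dict[Tuple[int, int], Any] = {}
    for year in years:
        for draft_round in rounds:
            results[(year, draft_round)] = fetch(year, draft_round)
    return results
```

With pybaseball installed, `collect_draft_rounds(range(2015, 2020), range(1, 21))` would request each round of each draft in turn, and the resulting tables could then be combined (for example with `pandas.concat`) before cleaning. Keeping the fetcher injectable makes it easy to swap in a cached or offline source.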

Data Files

The data files we used to build our models and the raw data we collected are all stored in the file linked below:

Report

To read the report on our findings, click the link below:

Libraries and Resources Utilized

  • bookdown
    • Used to generate the report with the bookdown syntax.
  • ggplot2
    • Used to create the EDA visualizations in the report.
  • gridExtra
  • corrplot
  • data.table
    • Used to reduce the memory footprint of our data objects and cut processing time.
  • tidyverse
    • Used for the data cleaning process.
  • tidymodels
    • Used to create the boosted decision tree prediction model.
  • xgboost
    • The xgboost engine was used for the boosted decision tree prediction model.
  • doParallel
    • Used for parallel processing while tuning the model hyperparameters.
  • vip
    • Used to create the variable importance plot for the model.
  • caret
    • Used for the model training process.
  • Boruta
    • Used to confirm feature importance during feature selection.
  • kableExtra
    • Used to create LaTeX-formatted tables within the report.
  • maps
  • mapdata
  • mapproj
  • reshape2
