ML model that predicts run value of pitches based on pitch sequencing features

Jensen-holm/Sequence-Plus

Sequence+

The main goal of Sequence+ is to create a model, like Stuff+, Location+, and Pitcher+, that measures the run value of a pitch sequence. Sequence+ will be built from features related to tunneling, plus a mix of features typically included in Location+ and Stuff+.

General Approach

I have a run expectancy matrix for 2024 that describes roughly how many runs each event is worth in each possible situation. These values are averaged over each count, which helps isolate the run expectancy value from baserunning situations. Using this and pitch-by-pitch advanced data from 2020-2023, I am aiming to build a model that predicts the cumulative run expectancy for a given sequence of two pitches. This target variable is referred to as seq_delta_run_exp throughout the documentation and code.
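As a rough illustration of the target, here is a minimal sketch that sums the run value of each pitch with the run value of the pitch thrown immediately before it in the same at-bat. It assumes each pitch already carries a per-pitch run value from the count-averaged matrix. The function name and the dictionary keys (game_pk, at_bat_number, pitch_number, delta_run_exp) are illustrative assumptions, not necessarily what the notebook uses.

```python
from collections import defaultdict


def build_two_pitch_sequences(pitches):
    """Return one record per pair of consecutive pitches in the same at-bat.

    `pitches` is a list of dicts with assumed keys: game_pk, at_bat_number,
    pitch_number, and delta_run_exp (the per-pitch run value).
    """
    # Group pitches by at-bat so sequences never span two plate appearances.
    by_at_bat = defaultdict(list)
    for pitch in pitches:
        by_at_bat[(pitch["game_pk"], pitch["at_bat_number"])].append(pitch)

    sequences = []
    for (game_pk, at_bat_number), group in by_at_bat.items():
        group.sort(key=lambda p: p["pitch_number"])
        # Pair every pitch with the one thrown immediately before it.
        for prev, curr in zip(group, group[1:]):
            sequences.append({
                "game_pk": game_pk,
                "at_bat_number": at_bat_number,
                "pitch_number": curr["pitch_number"],
                "seq_delta_run_exp": prev["delta_run_exp"] + curr["delta_run_exp"],
            })
    return sequences


# Made-up values, chosen only to show the mechanics.
example_pitches = [
    {"game_pk": 1, "at_bat_number": 1, "pitch_number": 2, "delta_run_exp": -0.5},
    {"game_pk": 1, "at_bat_number": 1, "pitch_number": 1, "delta_run_exp": 0.25},
    {"game_pk": 1, "at_bat_number": 1, "pitch_number": 3, "delta_run_exp": 0.125},
    {"game_pk": 1, "at_bat_number": 2, "pitch_number": 1, "delta_run_exp": 0.5},
]
example_sequences = build_two_pitch_sequences(example_pitches)
```

The first pitch of an at-bat has no previous pitch, so it produces no sequence in this sketch. A one-pitch at-bat therefore contributes nothing to the training data.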

Features

To capture pitch tunneling in this model, I use the kinematic equations to estimate the location of the baseball in 3D space at the commit point, decision point, release, and over the plate.

Yu Darvish Sample

See 3D_pitch_location_estimation.md or the feature engineering section of sequence+.ipynb for details on how I am doing this.
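For readers who want the general idea without opening the notebook, here is a minimal sketch of constant-acceleration kinematics applied to a pitch. It is not a copy of the repository's implementation. The dictionary keys (x0, y0, z0, vx0, vy0, vz0, ax, ay, az), the function names, and the sample numbers are assumptions for illustration. Units are assumed to be feet and seconds, with y measured as distance from home plate.

```python
import math


def time_to_reach_y(y_start, y_target, vy0, ay):
    """Smallest positive time at which y(t) = y_target under constant acceleration.

    Solves 0.5*ay*t**2 + vy0*t + (y_start - y_target) = 0.
    Raises ValueError if the ball never reaches y_target.
    """
    c = y_start - y_target
    if ay == 0:
        return -c / vy0
    root = math.sqrt(vy0 ** 2 - 2 * ay * c)
    candidates = ((-vy0 - root) / ay, (-vy0 + root) / ay)
    return min(t for t in candidates if t > 0)


def position_at(pitch, t):
    """(x, y, z) after t seconds: p(t) = p0 + v0*t + 0.5*a*t**2 on each axis."""
    return tuple(
        pitch[f"{axis}0"] + pitch[f"v{axis}0"] * t + 0.5 * pitch[f"a{axis}"] * t ** 2
        for axis in "xyz"
    )


def separation(pitch_a, pitch_b, t):
    """3D distance between two pitches at the same elapsed time t."""
    return math.dist(position_at(pitch_a, t), position_at(pitch_b, t))


# Made-up pitches that start from the same point; the values are not real data.
fastball = {"x0": -2.0, "y0": 50.0, "z0": 6.0,
            "vx0": 5.0, "vy0": -130.0, "vz0": -5.0,
            "ax": -10.0, "ay": 25.0, "az": -20.0}
curveball = {"x0": -2.0, "y0": 50.0, "z0": 6.0,
             "vx0": 4.0, "vy0": -115.0, "vz0": -2.0,
             "ax": 5.0, "ay": 22.0, "az": -35.0}

t_plate = time_to_reach_y(fastball["y0"], 0.0, fastball["vy0"], fastball["ay"])
plate_location = position_at(fastball, t_plate)
```

The same `position_at` call can be evaluated at whatever times define the commit and decision points. Those times are left as inputs here, since their exact definitions live in 3D_pitch_location_estimation.md.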

Some more features related to pitch sequences and pitch quality were added in the hope that they would help explain variance in seq_delta_run_exp. Ultimately, the features below were the ones chosen.

Feature Importances

The reason 9 features were chosen specifically is this plot:

RFECV Results

Recursive feature selection was done with a step of 1, and as you can see there is a clear 'elbow' in the plot at 9 features. Those 9 features were ultimately selected for the final model.

Model

The model I ended up using is a LightGBM regressor with a RobustScaler. RobustScaler() was used instead of StandardScaler() because it handles outliers better: it centers each feature on its median and scales by its interquartile range, rather than using the mean and standard deviation.
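To make the difference between the two scalers concrete, here is a small standard-library sketch of the arithmetic behind each one, applied to a single made-up feature with one extreme value. The function names are illustrative. In the actual model the scaling is done by scikit-learn's RobustScaler, not by this code.

```python
import statistics


def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


def standard_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Made-up feature values with a single outlier (100).
feature = [1, 2, 3, 4, 100]
robust = robust_scale(feature)
standard = standard_scale(feature)

# Spread of the four typical values after each kind of scaling.
robust_spread = robust[3] - robust[0]
standard_spread = standard[3] - standard[0]
```

In this example the outlier inflates the mean and standard deviation, so standard scaling squeezes the four typical values into a narrow band. The median and IQR are barely affected by the outlier, so robust scaling keeps the typical values well separated.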

I used optuna to tune the hyperparameters, which gave a slightly better model than I would have gotten by guessing them on my own.

Evaluation

Road Map

  • Feature Engineering
  • Model Building
  • Evaluation
  • Deploy in HuggingFace Dashboard

References

  • statcast-era-pitches: Used this huggingface dataset to efficiently load a dataset of pitches thrown from 2017-present.
  • Carter Kessinger & Johnny Davis: These guys sparked the idea for using kinematic equations for 3D distances at commit & decision points for a better TunnelScore.
  • TJ Nestico: He posts lots of great content on X, and I am using a pitch-by-pitch run expectancy matrix found in his project tjStuff+.

Contact

Feel free to reach out to me with any questions or feedback.

Email: jensenh87@gmail.com
X: @_holmj_
