The main goal of Sequence+ is to create a model like Stuff+, Location+, and Pitcher+ that aims to measure the run value of a pitch sequence. Sequence+ will be made using features related to tunneling, and a mix of things that are typically included in Location+ & Stuff+.
I have a run expectancy matrix for 2024 that describes roughly how many runs each event is worth in each possible situation. These values are averaged for each count, which helps isolate the run expectancy value from baserunning situations. Using this matrix and advanced pitch-by-pitch data from 2020-2023, I am aiming to build a model that can predict the cumulative run expectancy for a given sequence of two pitches. This target variable is referred to as seq_delta_run_exp throughout the documentation and code.
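As a rough illustration of the target, the sketch below pairs each pitch with the next pitch in the same plate appearance and sums their run values. It is only a sketch under assumptions: the field names (game_pk, at_bat_number, pitch_number, delta_run_exp), the use of overlapping pairs, and the sample numbers are all illustrative and may not match what sequence+.ipynb actually does.

```python
from itertools import groupby


def two_pitch_sequences(pitches):
    """Sum the run value of each pair of consecutive pitches in a plate appearance.

    `pitches` is a list of dicts. The keys used here are assumed names,
    not necessarily the ones used in the project.
    """
    ordered = sorted(
        pitches,
        key=lambda p: (p["game_pk"], p["at_bat_number"], p["pitch_number"]),
    )
    sequences = []
    # Group by plate appearance so pairs never span two different batters.
    for (game_pk, at_bat), group in groupby(
        ordered, key=lambda p: (p["game_pk"], p["at_bat_number"])
    ):
        group = list(group)
        for first, second in zip(group, group[1:]):
            sequences.append(
                {
                    "game_pk": game_pk,
                    "at_bat_number": at_bat,
                    "first_pitch": first["pitch_number"],
                    "second_pitch": second["pitch_number"],
                    "seq_delta_run_exp": first["delta_run_exp"]
                    + second["delta_run_exp"],
                }
            )
    return sequences


# Made-up run values, for illustration only.
example_pitches = [
    {"game_pk": 1, "at_bat_number": 1, "pitch_number": 1, "delta_run_exp": -0.04},
    {"game_pk": 1, "at_bat_number": 1, "pitch_number": 2, "delta_run_exp": 0.03},
    {"game_pk": 1, "at_bat_number": 1, "pitch_number": 3, "delta_run_exp": -0.25},
    {"game_pk": 1, "at_bat_number": 2, "pitch_number": 1, "delta_run_exp": 0.10},
]
sequences = two_pitch_sequences(example_pitches)
```

A one-pitch plate appearance produces no sequence, which is why the last pitch in the example is dropped.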
To capture pitch tunneling in this model, I use kinematic equations to estimate the location of the baseball in 3D space at four points: release, the commit point, the decision point, and over the plate.
See 3D_pitch_location_estimation.md or the feature engineering section of sequence+.ipynb for details on how I am doing this.
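For readers who want the general idea before opening those files, here is a minimal sketch of the constant-acceleration kinematics involved. It is not the project's implementation: the PitchState class, the function names, and the example numbers are hypothetical, and it assumes a coordinate system where y is the distance from the plate (so the ball's y-velocity is negative). The distances used for the commit and decision points are not assumed here and would need to be supplied.

```python
import math
from dataclasses import dataclass


@dataclass
class PitchState:
    """Position (ft), velocity (ft/s), and acceleration (ft/s^2) at a reference point."""

    x0: float
    y0: float
    z0: float
    vx0: float
    vy0: float
    vz0: float
    ax: float
    ay: float
    az: float


def time_to_y(p, y_target):
    """Solve y0 + vy0*t + 0.5*ay*t^2 = y_target for t (assumes vy0 < 0)."""
    if p.ay == 0:
        return (y_target - p.y0) / p.vy0
    disc = p.vy0**2 - 2 * p.ay * (p.y0 - y_target)
    if disc < 0:
        raise ValueError("pitch never reaches y_target under this model")
    # This root is the crossing on the way to the plate; it is negative
    # when y_target lies behind the reference point (e.g. the release point).
    return (-p.vy0 - math.sqrt(disc)) / p.ay


def location_at_y(p, y_target):
    """Return (t, x, z) when the ball is y_target feet from the plate."""
    t = time_to_y(p, y_target)
    x = p.x0 + p.vx0 * t + 0.5 * p.ax * t**2
    z = p.z0 + p.vz0 * t + 0.5 * p.az * t**2
    return t, x, z


def tunnel_distance(a, b, y_target):
    """Distance in the x-z plane between two pitches at the same y."""
    _, xa, za = location_at_y(a, y_target)
    _, xb, zb = location_at_y(b, y_target)
    return math.dist((xa, za), (xb, zb))


# Round, made-up numbers chosen so the arithmetic is easy to check by hand.
pitch_a = PitchState(-2.0, 50.0, 6.0, 5.0, -130.0, -4.0, -10.0, 25.0, -20.0)
pitch_b = PitchState(-2.0, 50.0, 6.0, 6.0, -130.0, -5.0, -10.0, 25.0, -20.0)
t_a, x_a, z_a = location_at_y(pitch_a, 0.0)
separation = tunnel_distance(pitch_a, pitch_b, 0.0)
```

Evaluating tunnel_distance at several y values (one per checkpoint) gives the kind of per-checkpoint separation features that a tunneling metric can be built from.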
Some more features related to pitch sequences and pitch quality were added in the hope that they would help explain variance in seq_delta_run_exp. Ultimately, the features below were the ones chosen.
Nine features were chosen specifically because of this plot:
Recursive feature selection was done with a step of 1, and as you can see, there is a clear 'elbow' in the plot at 9 features. Those 9 features were ultimately selected for the final model.
The model I ended up using is a LightGBM regressor paired with a RobustScaler. RobustScaler() improved performance because it handles outliers better than StandardScaler().
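To make the outlier point concrete, the sketch below re-implements the two scaling formulas in plain Python: median and interquartile range for robust scaling (the defaults that scikit-learn's RobustScaler uses), and mean and standard deviation for standard scaling. The helper names and the sample values are made up for illustration; the project itself uses the scikit-learn classes.

```python
import statistics


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero; cannot scale")
    return [(v - median) / iqr for v in values]


def standard_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("standard deviation is zero; cannot scale")
    return [(v - mean) / std for v in values]


# Four ordinary values and one extreme outlier (made-up data).
raw = [1.0, 2.0, 3.0, 4.0, 100.0]
robust = robust_scale(raw)
standard = standard_scale(raw)

# Spread of the four non-outlier values after each transform.
robust_spread = robust[3] - robust[0]
standard_spread = standard[3] - standard[0]
```

With robust scaling the four ordinary values keep a spread of 1.5, while standard scaling squeezes them into a much narrower band because the single outlier inflates both the mean and the standard deviation.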
I used Optuna to tune the hyperparameters, which produced a slightly better model than I would have gotten by guessing them on my own.
- Feature Engineering
- Model Building
- Evaluation
- Deploy in HuggingFace Dashboard
- statcast-era-pitches: Used this Hugging Face dataset to efficiently load pitches thrown from 2017 to present.
- Carter Kessinger & Johnny Davis: These guys sparked the idea for using kinematic equations for 3D distances at commit & decision points for a better TunnelScore.
- TJ Nestico: He posts lots of great content on X, and I am using a pitch-by-pitch run expectancy matrix found in his project tjStuff+.
Feel free to reach out to me with any questions or feedback.
Email: jensenh87@gmail.com
X: @_holmj_