ML Training

We train in GitHub Actions.
The dataset is fairly small, so we train on the CPU rather than with CUDA + cuDNN.
We train a single LSTM.
The model performs multi-step (multiple time steps ahead), multivariate (multiple features) time series forecasting.

Dataset Input Features (7 years of data)

Station ID	Basin ID	Year (Static)	Day of Year (Static)	Current SWE %	Year Max SWE %	Flow Day 0 (Min)	Flow Day 0 (Max)	Temp Day 0 (Min)	Temp Day 0 (Max)
ST001	B1	2023	45	45%	80%	140 cfs	160 cfs	66.2°F	68.0°F
ST002	B1	2020	120	50%	85%	150 cfs	170 cfs	69.8°F	71.6°F
ST003	B2	2024	300	55%	90%	160 cfs	180 cfs	73.4°F	75.2°F

Flow Lag -1 Day	Flow Lag -3 Day	Flow Lag -7 Day	Day of Year (Sin)	Day of Year (Cos)	Temp Day +1 (Min)	Temp Day +1 (Max)	...	Temp Day +14 (Min)	Temp Day +14 (Max)
+140 cfs	-130 cfs	+120 cfs	0.6995	0.7147	66.2°F	69.8°F	...	69.8°F	73.4°F
+150 cfs	+140 cfs	+130 cfs	0.8827	-0.4700	68.0°F	71.6°F	...	71.6°F	75.2°F
-160 cfs	-150 cfs	-140 cfs	-0.9057	0.4239	71.6°F	75.2°F	...	75.2°F	78.8°F

Historical flow lag trend (1, 3, and 7 days)
Current flow
Site ID (site-specific characteristics)
Current and seasonal maximum SWE for the sub-watershed (HUC8; geolocation/region characteristics)
Day of year (seasonal)
Current daily temperature (min/max)
Future temperature forecast for 14+ days (min/max)
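
The Day of Year (Sin) and Day of Year (Cos) columns can be reproduced with a cyclical encoding. The following is a minimal sketch, assuming the angle is 2*pi * day / (days in that year), with 366 days for leap years. That assumption matches the three sample rows in the table. The function name is hypothetical.

```python
import calendar
import math


def encode_day_of_year(year: int, day_of_year: int) -> tuple[float, float]:
    """Map a day of year onto the unit circle so Dec 31 sits next to Jan 1."""
    # Assumption: leap years use 366 days, all other years 365.
    days_in_year = 366 if calendar.isleap(year) else 365
    angle = 2.0 * math.pi * day_of_year / days_in_year
    return math.sin(angle), math.cos(angle)


sin_doy, cos_doy = encode_day_of_year(2023, 45)
print(round(sin_doy, 4), round(cos_doy, 4))  # 0.6995 0.7147
```

Using both sine and cosine keeps every day distinguishable, since a single sine value is shared by two different days of the year.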

hidden layers: 2
dense layer units: 5-10
Dropout: ~20%
Dropout layer: on every layer
Weight Initialization: Glorot uniform initialization
weight decay: 0.97
activation functions: tanh for the cell (candidate and output) activation, sigmoid for the gate activations, linear for the output layer
Learning rate: 0.00001-0.1
momentum: 0.5-0.9
Epochs: determined by early stopping rather than fixed in advance
batch size: 32-512
Output units: 28 (14+ days of flow predictions, min + max values)
Regularization: consider L1 regularization ONLY if/when overfitting is an issue
Adaptive learning rate: Adam
Feature importance: SHAP's DeepExplainer (following SHAP's IMDB Sentiment Classification example)
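
The early stopping rule for choosing the number of epochs can be sketched in plain Python. This shows the logic only and is not the project's training loop: the function name, `patience`, and `min_delta` are assumptions, and the list of validation losses stands in for real per-epoch evaluation. In Keras the same behavior is usually obtained with the `EarlyStopping` callback.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stopped_epoch), both 1-based.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    stopped_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        stopped_epoch = epoch
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, stopped_epoch


# Illustrative losses only; the final 0.5 is never reached.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]
print(early_stopping_epoch(losses, patience=3))  # (3, 6)
```

The weights from `best_epoch` are the ones worth keeping, so a real loop would checkpoint the model whenever the best loss improves.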

Using a hierarchical structure, we embed the broader watershed (basin) data first, then specialize with an individual Station ID embedding.
- Basin fallback mechanism: during training, we simulate unseen Station IDs by holding out some stations from batches. This forces the model to rely on the watershed/basin input alone. We record the trained stations in stations.txt. At inference, we look up the given station in stations.txt; if it is not found, the model falls back to the basin input.
- 'mask_zero=True' is set on the station_embedding layer, which helps omit the station embedding gracefully when no station is available
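
The fallback described above can be sketched as follows. Several details are assumptions and do not come from the project code: stations.txt is taken to hold one Station ID per line, index 0 is reserved for "unknown station" (to line up with 'mask_zero=True'), and hold-out is applied per sample with a hypothetical `holdout_rate`. All function names are illustrative.

```python
import random

UNKNOWN_STATION_INDEX = 0  # assumption: reserved to match mask_zero=True


def load_station_index(lines):
    """Build {station_id: index} from stations.txt lines, starting at 1."""
    station_ids = [line.strip() for line in lines if line.strip()]
    return {station_id: i for i, station_id in enumerate(station_ids, start=1)}


def station_to_index(station_id, station_index):
    """Inference-time lookup: unseen stations fall back to index 0."""
    return station_index.get(station_id, UNKNOWN_STATION_INDEX)


def hold_out_stations(batch_indices, holdout_rate, rng):
    """Training-time simulation: replace some station indices with 0."""
    return [
        UNKNOWN_STATION_INDEX if rng.random() < holdout_rate else idx
        for idx in batch_indices
    ]


# In practice the lines would come from open("stations.txt").
station_index = load_station_index(["ST001\n", "ST002\n", "ST003\n"])
print(station_to_index("ST002", station_index))  # 2
print(station_to_index("ST999", station_index))  # 0

rng = random.Random(42)
masked_batch = hold_out_stations([1, 2, 3, 1, 2], holdout_rate=0.2, rng=rng)
```

Because the same index 0 is used for held-out stations in training and unknown stations at inference, the model sees the "basin only" case before it has to handle it in production.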

Understanding LSTM Networks
- LSTMs are RNNs that avoid the long-term dependency problem
  - "The long term dependency problem is that, when you have larger network through time, the gradient decays quickly during back propagation. So training a RNN having long unfolding in time becomes impossible. But LSTM avoids this decay of gradient problem by allowing you to make a super highway (cell states) through time, these highways allow the gradient to freely flow backward in time."
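
The "highway" in the quote is the additive cell-state update, c_t = f_t * c_{t-1} + i_t * g_t. Along that direct path, the derivative of c_t with respect to c_{t-1} is just the forget gate f_t, so the gradient over many steps is a product of forget gates. The sketch below illustrates this with scalars. The gate values and the plain-RNN per-step factor of 0.5 are illustrative assumptions, not measurements from our model, and the indirect paths through the hidden state are ignored.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_cell_state(c_prev, forget_gate, input_gate, candidate):
    """Additive cell-state update: c_t = f_t * c_{t-1} + i_t * g_t."""
    return forget_gate * c_prev + input_gate * candidate


def direct_path_gradient(forget_gates):
    """d c_T / d c_0 along the cell-state path: the product of forget gates."""
    grad = 1.0
    for f in forget_gates:
        grad *= f
    return grad


steps = 50
# LSTM: forget gates close to 1 keep the gradient alive over 50 steps.
lstm_grad = direct_path_gradient([sigmoid(5.0)] * steps)
# Plain RNN stand-in: each step scales the gradient by an assumed factor of 0.5.
rnn_grad = 0.5 ** steps
print(lstm_grad > 0.5, rnn_grad < 1e-12)  # True True
```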

Groundwater and/or soil moisture measurement data (predominant for winter flow forecasting)
Precipitation data (historical + forecasted; more predominant in the Northwest and East Coast regions than in Colorado)
Dam release data (numeric outflow or a binary value; requires identifying the dams upstream of a station, then a lookup)