ML Training

Tyler M edited this page May 12, 2024 · 35 revisions

  • We train in GitHub Actions.
  • Our dataset is fairly small, so we train on the CPU rather than using CUDA + cuDNN.
  • We train a single LSTM.
  • The model performs multi-step (multiple time steps ahead), multivariate (multiple features) time series forecasting.

Dataset Input Features (7 years of data)

  • Historical Daily Flow data (Min + Max in CFS)
  • day of year + year
  • Site Monitoring ID
  • Snow Water Equivalent (SWE) data
  • temperature data (min + max)

Dataset Format

| Station ID | Basin ID | Year (Static) | Day of Year (Static) | Current SWE % | Year Max SWE % | Flow Day 0 (Min) | Flow Day 0 (Max) | Temp Day 0 (Min) | Temp Day 0 (Max) |
|---|---|---|---|---|---|---|---|---|---|
| ST001 | B1 | 2023 | 45 | 45% | 80% | 140 cfs | 160 cfs | 66.2°F | 68.0°F |
| ST002 | B1 | 2020 | 120 | 50% | 85% | 150 cfs | 170 cfs | 69.8°F | 71.6°F |
| ST003 | B2 | 2024 | 300 | 55% | 90% | 160 cfs | 180 cfs | 73.4°F | 75.2°F |

Dynamically Generated Features

  • Flow Lags
  • Future temperature windows
  • Sin/Cos Day of the Year (for seasonal patterns)
    • The calculation handles leap years (366 days instead of 365)
  • Drop Columns: Static data (year, day of year)

| Flow Lag -1 Day | Flow Lag -3 Day | Flow Lag -7 Day | Day of Year (Sin) | Day of Year (Cos) | Temp Day +1 (Min) | Temp Day +1 (Max) | ... | Temp Day +14 (Min) | Temp Day +14 (Max) |
|---|---|---|---|---|---|---|---|---|---|
| +140 cfs | -130 cfs | +120 cfs | 0.6995 | 0.7147 | 66.2°F | 69.8°F | ... | 69.8°F | 73.4°F |
| +150 cfs | +140 cfs | +130 cfs | 0.8827 | -0.4700 | 68.0°F | 71.6°F | ... | 71.6°F | 75.2°F |
| -160 cfs | -150 cfs | -140 cfs | -0.9057 | 0.4239 | 71.6°F | 75.2°F | ... | 75.2°F | 78.8°F |
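
The generated features above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function names, the dictionary keys, and the list-based inputs are all assumptions. The sin/cos encoding divides by 366 in leap years and 365 otherwise; for year 2023, day 45, it gives approximately 0.6995 and 0.7147, which agrees with the first row of the table.

```python
import calendar
import math


def day_of_year_sin_cos(year, day_of_year):
    """Encode the day of year on the unit circle, using 366 days in leap years."""
    days_in_year = 366 if calendar.isleap(year) else 365
    angle = 2.0 * math.pi * day_of_year / days_in_year
    return math.sin(angle), math.cos(angle)


def flow_lags(flows, index, lags=(1, 3, 7)):
    """Return the flow observed `lag` days before `index`, or None if unavailable.

    `flows` is assumed to be a list of daily values ordered oldest to newest.
    """
    return {
        f"flow_lag_{lag}": flows[index - lag] if index - lag >= 0 else None
        for lag in lags
    }


def future_temp_window(temps, index, horizon=14):
    """Return the (min, max) temperature pairs for the `horizon` days after `index`."""
    return temps[index + 1 : index + 1 + horizon]


# Example usage with made-up flow values (cfs).
daily_flows = [120, 125, 130, 128, 135, 138, 140, 150]
sin_doy, cos_doy = day_of_year_sin_cos(2023, 45)
lags_today = flow_lags(daily_flows, index=7)
```

Once the sin/cos pair exists, the static year and day-of-year columns can be dropped, since the pair already carries the seasonal position without the jump between December 31 and January 1.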

Model Input Features

  • Historical Flow lag trend (1, 3, and 7 day)
  • Current Flow
  • Site ID (Site specific characteristics)
  • current + seasonal maximum SWE for the sub-watershed (HUC8; geolocation/region characteristics)
  • Day of year (seasonal)
  • current daily temp (min/max)
  • Future Temperature Forecast for 14+ days (min/max)

LSTM Specs

  • hidden layers: 2
  • dense layer units: 5-10
  • Dropout: ~20%
  • Dropout layer: on every layer
  • Weight Initialization: Glorot uniform initialization
  • weight decay: 0.97
  • Activation functions: recurrent activation: tanh; gate activations: sigmoid; output layer: linear
  • Learning rate: 0.00001-0.1
  • momentum: 0.5-0.9
  • Epochs: use early stopping to determine the number
  • batch size: 32-512
  • Output units: 28 (14 days of flow predictions × 2 values per day, min + max)
  • Regularization: consider L1 regularization ONLY if/when overfitting is an issue
  • Adaptive learning rate: Adam
  • Feature importance: SHAP's DeepExplainer (reference example: IMDB Sentiment Classification)
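
The early stopping rule in the specs can be illustrated with a framework-free sketch. It assumes a patience-based rule on the validation loss; the function name and the `patience` and `min_delta` defaults are hypothetical, not values taken from this project.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops at the first epoch where the validation loss has not
    improved by more than `min_delta` for `patience` consecutive epochs.
    If that never happens, the last epoch is returned as the stop epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best validation loss: remember it and reset the wait.
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Example: the loss bottoms out at epoch 2, so with patience=3 we stop at epoch 5.
stop_epoch, best_epoch = early_stopping_epoch(
    [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.9], patience=3
)
```

If the model is built with Keras, the `EarlyStopping` callback applies the same idea through its `monitor`, `patience`, `min_delta`, and `restore_best_weights` arguments, so the epoch count does not need to be tuned by hand.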

Visualize the LSTM:

  • produce validation curves (train vs validation):
  • plot the loss over time
  • plot accuracy over time
  • visualize the predictions made by the model (actual vs predicted)
  • feature importance with SHAP summary plot
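
Before plotting actual vs predicted values, it can help to break the error down by forecast day. The sketch below assumes the 28 output values are interleaved as (min, max) for day 1, then day 2, and so on; that ordering and the function names are assumptions made for illustration.

```python
def split_min_max(output_vector, horizon=14):
    """Split a flat vector into `horizon` (min, max) pairs, assuming interleaved order."""
    if len(output_vector) != 2 * horizon:
        raise ValueError(f"expected {2 * horizon} values, got {len(output_vector)}")
    return [(output_vector[2 * d], output_vector[2 * d + 1]) for d in range(horizon)]


def mae_per_day(actual, predicted, horizon=14):
    """Mean absolute error for each forecast day, averaged over samples and min/max."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and the same length")
    totals = [0.0] * horizon
    for actual_vec, predicted_vec in zip(actual, predicted):
        actual_pairs = split_min_max(actual_vec, horizon)
        predicted_pairs = split_min_max(predicted_vec, horizon)
        for d in range(horizon):
            min_err = abs(actual_pairs[d][0] - predicted_pairs[d][0])
            max_err = abs(actual_pairs[d][1] - predicted_pairs[d][1])
            totals[d] += (min_err + max_err) / 2.0
    return [total / len(actual) for total in totals]
```

Plotting the returned list against the forecast day shows how quickly the error grows with the horizon, which a single aggregate loss curve hides.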

Hierarchical Embeddings w/ Fallback

  • Using a hierarchical structure, we embed the broader watershed data first, then specialize with an individual Station ID embedding.
    • Basin fallback mechanism: during training, we simulate unseen Station IDs by holding out some stations from batches, which forces the model to rely on the watershed/basin input alone. We record the trained stations in stations.txt. At inference, we look up the given station in stations.txt and fall back to the basin embedding when it is not found.
    • We set 'mask_zero=True' on the station_embedding layer, which helps omit the station embedding gracefully.
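
The fallback described above can be sketched without any ML framework. The sketch assumes stations.txt holds one Station ID per line and that index 0 is reserved for "no station", which is the value a Keras `Embedding` layer masks when `mask_zero=True`. The function names and the `holdout_rate` parameter are hypothetical.

```python
import random

UNKNOWN_STATION = 0  # assumed reserved index; masked when mask_zero=True


def load_station_index(path):
    """Map each Station ID in the file (one per line) to an index starting at 1."""
    with open(path, encoding="utf-8") as handle:
        station_ids = [line.strip() for line in handle if line.strip()]
    return {station_id: i for i, station_id in enumerate(station_ids, start=1)}


def station_to_index(station_id, station_index):
    """Inference-time lookup: unseen stations fall back to the masked index 0."""
    return station_index.get(station_id, UNKNOWN_STATION)


def hold_out_stations(batch_indices, holdout_rate, rng):
    """Training-time simulation: mask every sample of randomly chosen stations.

    Each distinct station in the batch is held out with probability
    `holdout_rate`, so the model only sees the basin input for those samples.
    """
    stations = sorted(set(batch_indices) - {UNKNOWN_STATION})
    held_out = {s for s in stations if rng.random() < holdout_rate}
    return [UNKNOWN_STATION if s in held_out else s for s in batch_indices]


# Example: mask roughly 20% of the stations that appear in one batch.
masked_batch = hold_out_stations([1, 2, 2, 3, 1], holdout_rate=0.2, rng=random.Random(42))
```

Reserving index 0 keeps the training-time holdout and the inference-time fallback on the same code path: in both cases the station input is 0 and only the basin embedding carries location information.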

LSTM Refs

  • Understanding LSTM Networks
    • RNNs that avoid the Long-term Dependency problem
      • "The long term dependency problem is that, when you have larger network through time, the gradient decays quickly during back propagation. So training a RNN having long unfolding in time becomes impossible. But LSTM avoids this decay of gradient problem by allowing you to make a super highway (cell states) through time, these highways allow the gradient to freely flow backward in time."

Future Features to be considered

  • Groundwater and/or soil moisture measurement data (predominant for winter flow forecasting)
  • Precipitation data (historical + forecasted; more predominant in the Northwest and East Coast regions than in Colorado)
  • Dam release data (numeric outflow or a binary value; requires identifying the dams upstream of a station, then a lookup)