PyTorch implementation of Hamiltonian deep neural networks (H-DNNs) as presented in "Hamiltonian Deep Neural Networks Guaranteeing Non-vanishing Gradients by Design" [1].
git clone https://github.com/DecodEPFL/HamiltonianNet.git
cd HamiltonianNet
python setup.py install
2D classification examples:
./examples/run.py --dataset [DATASET] --model [MODEL]
where available values for DATASET are swiss_roll and double_moons.
Distributed training on 2D classification examples:
./examples/run_distributed.py --dataset [DATASET]
where available values for DATASET are swiss_roll and double_circles.
Classification over MNIST dataset:
./examples/run_MNIST.py --model [MODEL]
where available values for MODEL are MS1 and H1.
To reproduce the counterexample of Appendix III:
./examples/gradient_analysis/perturbation_analysis.py
H-DNNs are obtained after the discretization of an ordinary differential equation (ODE) that represents a time-varying Hamiltonian system. The time-varying dynamics of a Hamiltonian system are given by

ẏ(t) = J(y,t) ∂H(y,t)/∂y

where y(t) ∈ ℝⁿ represents the state, H(y,t): ℝⁿ × ℝ → ℝ is the Hamiltonian function, and the n × n matrix J, called the interconnection matrix, satisfies J = -Jᵀ.
After discretization, two architectures are obtained (see [1] and [2] for the corresponding layer equations):
- H1-DNN
- H2-DNN
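As a rough illustration, the snippet below sketches a single forward-Euler layer obtained from the ODE above, assuming the Hamiltonian gradient takes the form Kᵀ tanh(K y + b) used in [1]. The class name, the step size h, and the fixed choice J = [[0, I], [-I, 0]] are illustrative assumptions and do not necessarily match the layers implemented in this repository.

```python
# Hedged sketch of one discretized layer: a forward-Euler step of y' = J dH/dy,
# assuming dH/dy = K^T tanh(K y + b) as in [1]. The class name, step size h and
# the fixed J = [[0, I], [-I, 0]] are illustrative only.
import torch
import torch.nn as nn


class HamiltonianLayerSketch(nn.Module):
    def __init__(self, n: int, h: float = 0.1):
        super().__init__()
        assert n % 2 == 0, "even state dimension assumed so J can be built block-wise"
        self.h = h                                    # discretization step size
        self.K = nn.Parameter(0.1 * torch.randn(n, n))
        self.b = nn.Parameter(torch.zeros(n))
        eye = torch.eye(n // 2)
        zero = torch.zeros(n // 2, n // 2)
        # Skew-symmetric interconnection matrix: J = [[0, I], [-I, 0]], so J = -J^T.
        J = torch.cat([torch.cat([zero, eye], dim=1),
                       torch.cat([-eye, zero], dim=1)], dim=0)
        self.register_buffer("J", J)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y has shape (batch, n); one forward-Euler step y <- y + h * J K^T tanh(K y + b).
        grad_H = torch.tanh(y @ self.K.T + self.b) @ self.K   # rows are K^T tanh(K y + b)
        return y + self.h * grad_H @ self.J.T
```

A full H-DNN would stack several such layers with layer-dependent parameters Kⱼ, bⱼ; the gradient analysis in [1] relies on the Hamiltonian structure, i.e. the skew-symmetry of J.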
We consider two benchmark classification problems: "Swiss roll" and "Double circles", each of them with two categories and two features.
An example of each dataset is shown in the figures above, together with the predictions of a trained 64-layer H1-DNN (colored regions in the background). For these examples, the two input features are augmented, leading to yₖ ∈ ℝ⁴, k = 0, ..., 64.
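One common way to realize this augmentation is to pad the two raw features with zeros up to the network width; the helper below is a minimal sketch of that idea (the function name and the zero-padding choice are assumptions, not necessarily the exact preprocessing used by the example scripts).

```python
# Hedged sketch: lift 2-D inputs to a 4-D initial state y_0 by zero-padding.
# Zero-padding is one common choice; the example scripts may use a different scheme.
import torch

def augment_features(x: torch.Tensor, target_dim: int = 4) -> torch.Tensor:
    """x: (batch, 2) raw features -> (batch, target_dim) initial state y_0."""
    pad = torch.zeros(x.shape[0], target_dim - x.shape[1], dtype=x.dtype, device=x.device)
    return torch.cat([x, pad], dim=1)
```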
The figures below show the hidden feature vectors, i.e. the states yₖ, of all the test data after training. First, a change of basis is performed so that the classification hyperplane is perpendicular to the first basis vector x₁. Then, projections onto the new coordinate planes are shown.
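The change of basis itself can be sketched as follows, assuming the network output is a linear classifier y ↦ sign(wᵀy + c): build an orthogonal matrix whose first column is w/‖w‖ and express the states in that basis, so that the separating hyperplane becomes perpendicular to the first coordinate. The QR-based construction and the function name below are illustrative assumptions.

```python
# Hedged sketch: rotate the hidden states so that the classification hyperplane
# w^T y + c = 0 is perpendicular to the first basis vector of the new coordinates.
import torch

def change_of_basis(states: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """states: (batch, n) hidden states y_k; w: (n,) normal of the separating hyperplane."""
    w = w.detach()                          # visualization only, no gradients needed
    n = w.numel()
    A = torch.eye(n, dtype=w.dtype)
    A[:, 0] = w / w.norm()                  # first column: unit normal of the hyperplane
    Q, _ = torch.linalg.qr(A)               # orthonormal basis with Q[:, 0] = +/- w/||w||
    if torch.dot(Q[:, 0], w) < 0:           # fix a possible sign flip from the QR routine
        Q = -Q
    return states.detach() @ Q              # coordinates of the states in the new basis
```

Plotting pairs of the resulting coordinates gives the projections shown in the figures.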
Previous work conjectured that some classes of H-DNNs avoid exploding gradients when y(t) varies arbitrarily slowly. The following numerical example shows that, unfortunately, this is not the case.
We consider the simple case where the underlying ODE is

ẏ(t) = ε J tanh( y(t) ),

with J = [0, 1; -1, 0] and ε > 0.
We study the evolution of y(t) and yγ(t) for t ∈ [t0, T], with t0 ∈ [0, T], and initial conditions y(t0) = y0 and yγ(t0) = y0 + γβ, where γ = 0.05 and β is one of the canonical unit vectors. The initial condition y0 is set randomly and normalized to have unit norm.
The left figure shows the time evolution of y(t), in blue, and yγ(t), in orange, when a perturbation is applied at time t0 = T-t. The nominal initial condition y(T-t) is indicated with a blue circle and the perturbed one yγ(T-t) with an orange cross. A zoom is presented on the right side, where a green vector indicates the difference between yγ(T) and y(T).
The figure on the right presents the entries (1,1) and (2,2) of the backward sensitivity matrix (BSM). Note that the values coincide in sign and magnitude with the green vector.
This numerical experiment confirms that the entries of the BSM (we only show 2 of the 4 entries) diverge as the depth of the network increases, i.e., as the perturbation is introduced further away from the output.
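As a rough, self-contained companion to the script above, the sketch below integrates ẏ(t) = ε J tanh(y(t)) with forward Euler and estimates, by finite differences, how a perturbation of size γ applied at time t0 propagates to y(T), which is what the BSM entries capture. The values of ε, h, T and the integration scheme are illustrative assumptions; examples/gradient_analysis/perturbation_analysis.py remains the reference implementation.

```python
# Hedged sketch: finite-difference estimate of the sensitivity of y(T) to a
# perturbation applied at time t0, for y' = eps * J * tanh(y) with J = [[0, 1], [-1, 0]].
# eps, h and T are illustrative values, not the ones used in the paper.
import torch

eps, h, T = 1.0, 0.05, 10.0
gamma = 0.05                                    # perturbation size, as in the text
J = torch.tensor([[0.0, 1.0], [-1.0, 0.0]])     # 2x2 skew-symmetric interconnection matrix

def simulate(y0: torch.Tensor, t0: float) -> torch.Tensor:
    """Forward-Euler integration of y' = eps * J * tanh(y) from t0 to T."""
    y = y0.clone()
    for _ in range(int(round((T - t0) / h))):
        y = y + h * eps * (J @ torch.tanh(y))
    return y

def sensitivity_column(y0: torch.Tensor, t0: float, j: int) -> torch.Tensor:
    """Finite-difference estimate of d y(T) / d y_j(t0): perturb y(t0) along the unit vector e_j."""
    e_j = torch.zeros(2)
    e_j[j] = 1.0
    return (simulate(y0 + gamma * e_j, t0) - simulate(y0, t0)) / gamma

torch.manual_seed(0)
y0 = torch.randn(2)
y0 = y0 / y0.norm()                             # random initial condition with unit norm
for t0 in (T - 1.0, T - 5.0, T - 9.0):          # perturbations applied further from the output
    print(f"t0 = {t0:4.1f}   dy(T)/dy_1(t0) ~ {sensitivity_column(y0, t0, j=0).tolist()}")
```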
This work is licensed under a Creative Commons Attribution 4.0 International License.
[1] Clara L. Galimberti, Luca Furieri, Liang Xu and Giancarlo Ferrari Trecate. "Hamiltonian Deep Neural Networks Guaranteeing Non-vanishing Gradients by Design," arXiv:2105.13205, 2021.
[2] Clara L. Galimberti, Liang Xu and Giancarlo Ferrari Trecate. "A unified framework for Hamiltonian deep neural networks," 3rd Annual Learning for Dynamics & Control (L4DC) Conference, 2021. Preprint: arXiv:2104.13166.
[3] Eldad Haber and Lars Ruthotto. "Stable architectures for deep neural networks," Inverse Problems, vol. 34, p. 014004, Dec 2017.
[4] Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert and Elliot Holtham. "Reversible architectures for arbitrarily deep residual neural networks," AAAI Conference on Artificial Intelligence, 2018.