Demonstration of training a small ResNet on CIFAR10 to 94% test accuracy in 79 seconds as described in this blog series.
Instructions to reproduce on an AWS p3.2xlarge instance:
- set up an instance with AMI: Deep Learning AMI (Ubuntu) Version 11.0 (ami-c47c28bc in us-west-2)
- ssh into the instance:
ssh -i $KEY_PAIR ubuntu@$PUBLIC_IP_ADDRESS -L 8901:localhost:8901
- on the remote machine:
source activate pytorch_p36
pip install pydot (optional for network visualisation)
git clone https://github.com/davidcpage/cifar10-fast.git
jupyter notebook --no-browser --port=8901
- open the jupyter notebook url in a browser, open demo.ipynb and run all the cells
In my test, 35 out of 50 runs reached 94% test set accuracy with a median of 94.08%. Runtime for 24 epochs is roughly 79s.
A second notebook, experiments.ipynb, contains code to reproduce the main results from the posts.
NB: demo.ipynb also works on the latest Deep Learning AMI (Ubuntu) Version 16.0, but some examples in experiments.ipynb trigger a core dump when using TensorCores in versions after 11.0.
To reproduce DAWNBench timings, set up the AWS p3.2xlarge instance as above but, instead of launching a jupyter notebook on the remote machine, change directory to cifar10-fast and run python dawn.py from the command line. Timings in DAWNBench format will be saved to logs.tsv.
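To inspect the saved timings afterwards, something like the following sketch may help. It only assumes that logs.tsv is tab-separated with a header row; the helper names (parse_tsv, read_tsv) are hypothetical, and the column names and values in the sample are made up for illustration rather than taken from a real run.

```python
import csv
import io


def parse_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def read_tsv(path):
    """Read a tab-separated file from disk and parse it with parse_tsv."""
    with open(path, newline="") as f:
        return parse_tsv(f.read())


# Illustrative sample only: column names and values are assumptions,
# not real output from dawn.py.
sample = "epoch\thours\ttop1Accuracy\n1\t0.001\t50.0\n2\t0.002\t60.0\n"
rows = parse_tsv(sample)
for row in rows:
    print(row)
```

Calling read_tsv with the path to logs.tsv on the instance would load the real file in the same way. All values come back as strings, so convert them before doing any arithmetic.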
Note that DAWNBench timings do not include validation time, as in this FAQ, but do include initial preprocessing, as indicated here. DAWNBench timing is roughly 74 seconds, which breaks down as 79s (as above) - 7s (validation) + 2s (preprocessing).
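The breakdown above can be checked with a few lines of arithmetic. The variable names are illustrative, and the figures are the rounded values quoted in the text.

```python
# Rounded figures quoted in the text, all in seconds.
total_runtime = 79   # 24 epochs as timed in demo.ipynb, validation included
validation = 7       # excluded from DAWNBench timings
preprocessing = 2    # included in DAWNBench timings

dawnbench_time = total_runtime - validation + preprocessing
print(dawnbench_time)  # 74
```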
- Core functionality has moved to core.py whilst PyTorch-specific stuff is in torch_backend.py to allow easier experimentation with different frameworks.
- Stats (loss/accuracy) are collected on the GPU and bulk transferred to the CPU at the end of each epoch. This speeds up some experiments, so timings in demo.ipynb and experiments.ipynb no longer match the blog posts.
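The idea of collecting stats on the GPU and transferring them in bulk can be sketched roughly as below. This is a minimal illustration, not the code in torch_backend.py: collect_epoch_stats is a hypothetical helper, and the precomputed logits stand in for model outputs. It falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn.functional as F


def collect_epoch_stats(batches, device):
    """Hypothetical helper: keep per-batch stats on `device`, copy to CPU once."""
    losses, correct = [], []
    for logits, targets in batches:
        logits, targets = logits.to(device), targets.to(device)
        # Per-example values stay on the device: no .item() or .cpu() in the loop,
        # so the loop does not force a host/device sync on every batch.
        losses.append(F.cross_entropy(logits, targets, reduction="none").detach())
        correct.append((logits.argmax(dim=1) == targets).detach())
    # One bulk transfer per stat at the end of the epoch.
    losses = torch.cat(losses).cpu()
    correct = torch.cat(correct).cpu()
    return {"loss": losses.mean().item(), "acc": correct.float().mean().item()}


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Toy stand-ins for model outputs: three examples, two classified correctly.
batches = [
    (torch.tensor([[10.0, 0.0], [0.0, 10.0]]), torch.tensor([0, 1])),
    (torch.tensor([[10.0, 0.0]]), torch.tensor([1])),
]
stats = collect_epoch_stats(batches, device)
print(stats)
```

Calling .item() on every batch would instead force the CPU to wait for the GPU each time, which is the overhead this pattern avoids.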