Estimates of memory consumption and FLOP counts for various convolutional neural networks.
The numbers below are given for single-element batches.
model | input size | param mem | feat. mem | flops | src | performance |
---|---|---|---|---|---|---|
alexnet | 227 x 227 | 233 MB | 3 MB | 727 MFLOPs | MCN | 41.80 / 19.20 |
caffenet | 224 x 224 | 233 MB | 3 MB | 724 MFLOPs | MCN | 42.60 / 19.70 |
squeezenet1-0 | 224 x 224 | 5 MB | 30 MB | 837 MFLOPs | PT | 41.90 / 19.58 |
squeezenet1-1 | 224 x 224 | 5 MB | 17 MB | 360 MFLOPs | PT | 41.81 / 19.38 |
vgg-f | 224 x 224 | 232 MB | 4 MB | 727 MFLOPs | MCN | 41.40 / 19.10 |
vgg-m | 224 x 224 | 393 MB | 12 MB | 2 GFLOPs | MCN | 36.90 / 15.50 |
vgg-s | 224 x 224 | 393 MB | 12 MB | 3 GFLOPs | MCN | 37.00 / 15.80 |
vgg-m-2048 | 224 x 224 | 353 MB | 12 MB | 2 GFLOPs | MCN | 37.10 / 15.80 |
vgg-m-1024 | 224 x 224 | 333 MB | 12 MB | 2 GFLOPs | MCN | 37.80 / 16.10 |
vgg-m-128 | 224 x 224 | 315 MB | 12 MB | 2 GFLOPs | MCN | 40.80 / 18.40 |
vgg-vd-16-atrous | 224 x 224 | 82 MB | 58 MB | 16 GFLOPs | N/A | - / - |
vgg-vd-16 | 224 x 224 | 528 MB | 58 MB | 16 GFLOPs | MCN | 28.50 / 9.90 |
vgg-vd-19 | 224 x 224 | 548 MB | 63 MB | 20 GFLOPs | MCN | 28.70 / 9.90 |
googlenet | 224 x 224 | 51 MB | 26 MB | 2 GFLOPs | MCN | 34.20 / 12.90 |
resnet18 | 224 x 224 | 45 MB | 23 MB | 2 GFLOPs | PT | 30.24 / 10.92 |
resnet34 | 224 x 224 | 83 MB | 35 MB | 4 GFLOPs | PT | 26.70 / 8.58 |
resnet-50 | 224 x 224 | 98 MB | 103 MB | 4 GFLOPs | MCN | 24.60 / 7.70 |
resnet-101 | 224 x 224 | 170 MB | 155 MB | 8 GFLOPs | MCN | 23.40 / 7.00 |
resnet-152 | 224 x 224 | 230 MB | 219 MB | 11 GFLOPs | MCN | 23.00 / 6.70 |
resnext-50-32x4d | 224 x 224 | 96 MB | 132 MB | 4 GFLOPs | L1 | 22.60 / 6.49 |
resnext-101-32x4d | 224 x 224 | 169 MB | 197 MB | 8 GFLOPs | L1 | 21.55 / 5.93 |
resnext-101-64x4d | 224 x 224 | 319 MB | 273 MB | 16 GFLOPs | PT | 20.81 / 5.66 |
inception-v3 | 299 x 299 | 91 MB | 89 MB | 6 GFLOPs | PT | 22.55 / 6.44 |
SE-ResNet-50 | 224 x 224 | 107 MB | 103 MB | 4 GFLOPs | SE | 22.37 / 6.36 |
SE-ResNet-101 | 224 x 224 | 189 MB | 155 MB | 8 GFLOPs | SE | 21.75 / 5.72 |
SE-ResNet-152 | 224 x 224 | 255 MB | 220 MB | 11 GFLOPs | SE | 21.34 / 5.54 |
SE-ResNeXt-50-32x4d | 224 x 224 | 105 MB | 132 MB | 4 GFLOPs | SE | 20.97 / 5.54 |
SE-ResNeXt-101-32x4d | 224 x 224 | 187 MB | 197 MB | 8 GFLOPs | SE | 19.81 / 4.96 |
SENet | 224 x 224 | 440 MB | 347 MB | 21 GFLOPs | SE | 18.68 / 4.47 |
SE-BN-Inception | 224 x 224 | 46 MB | 43 MB | 2 GFLOPs | SE | 23.62 / 7.04 |
densenet121 | 224 x 224 | 31 MB | 126 MB | 3 GFLOPs | PT | 25.35 / 7.83 |
densenet161 | 224 x 224 | 110 MB | 235 MB | 8 GFLOPs | PT | 22.35 / 6.20 |
densenet169 | 224 x 224 | 55 MB | 152 MB | 3 GFLOPs | PT | 24.00 / 7.00 |
densenet201 | 224 x 224 | 77 MB | 196 MB | 4 GFLOPs | PT | 22.80 / 6.43 |
mcn-mobilenet | 224 x 224 | 16 MB | 38 MB | 579 MFLOPs | AU | 29.40 / - |
Click on the model name for a more detailed breakdown of feature extraction costs at different input image/batch sizes. The performance numbers are reported as top-1 error/top-5 error on the 2012 ILSVRC validation data. The src column indicates the source of the benchmark scores using the following abbreviations:
- MCN - scores obtained from the matconvnet website.
- PT - scores obtained from the PyTorch torchvision module.
- L1 - evaluated locally (follow link to view benchmark code).
- AU - numbers reported by the paper authors.
These numbers provide an estimate of performance, but note that there may be small differences between the evaluation scripts from different sources.
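As a point of reference for how the top-1 and top-5 error columns are defined, the following is a minimal sketch in plain Python. It is not the evaluation code used by any of the sources above; the function name `topk_error` and the toy scores are made up for illustration.

```python
def topk_error(scores, labels, k=1):
    """Percentage of samples whose true label is NOT among the k
    highest-scoring classes (illustrative helper, not from this repo)."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return 100.0 * misses / len(labels)


# Toy data: 4 samples, 3 classes (hypothetical values).
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
labels = [1, 1, 0, 2]

top1 = topk_error(scores, labels, k=1)
top2 = topk_error(scores, labels, k=2)
print(top1, top2)
```

With 1000 ImageNet classes the same function would be called with `k=1` and `k=5`. Details such as image resizing and cropping differ between evaluation scripts, which is one reason the scores from different sources are not exactly comparable.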
References:
- alexnet - Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.
- squeezenet - Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size." arXiv preprint arXiv:1602.07360 (2016).
- vgg-m - Chatfield, Ken, et al. "Return of the devil in the details: Delving deep into convolutional nets." arXiv preprint arXiv:1405.3531 (2014).
- vgg-vd-16/vgg-vd-19 - Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
- vgg-vd-16-reduced - Liu, Wei, Andrew Rabinovich, and Alexander C. Berg. "Parsenet: Looking wider to see better." arXiv preprint arXiv:1506.04579 (2015).
- googlenet - Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
- inception - Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
- resnet - He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
- resnext - Xie, Saining, et al. "Aggregated residual transformations for deep neural networks." arXiv preprint arXiv:1611.05431 (2016).
- SENets - Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-Excitation Networks." arXiv preprint arXiv:1709.01507 (2017).
- Densenet - Huang, Gao, et al. "Densely connected convolutional networks." CVPR (2017).
model | input size | param memory | feature memory | flops |
---|---|---|---|---|
rfcn-res50-pascal | 600 x 850 | 122 MB | 1 GB | 79 GFLOPs |
rfcn-res101-pascal | 600 x 850 | 194 MB | 2 GB | 117 GFLOPs |
ssd-pascal-vggvd-300 | 300 x 300 | 100 MB | 116 MB | 31 GFLOPs |
ssd-pascal-vggvd-512 | 512 x 512 | 104 MB | 337 MB | 91 GFLOPs |
ssd-pascal-mobilenet-ft | 300 x 300 | 22 MB | 37 MB | 1 GFLOPs |
faster-rcnn-vggvd-pascal | 600 x 850 | 523 MB | 600 MB | 172 GFLOPs |
The input sizes used are "typical" for each of the architectures listed, but can be varied. Anchor/priorbox generation and roi/psroi-pooling are not included in flop estimates. The ssd-pascal-mobilenet-ft detector uses the MobileNet feature extractor (the model used here was imported from the architecture made available by chuanqi305).
References:
- faster-rcnn - Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.
- r-fcn - Li, Yi, Kaiming He, and Jian Sun. "R-FCN: Object detection via region-based fully convolutional networks." Advances in Neural Information Processing Systems. 2016.
- ssd - Liu, Wei, et al. "SSD: Single shot multibox detector." European conference on computer vision. Springer, Cham, 2016.
- mobilenets - Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. "MobileNets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017).
model | input size | param memory | feature memory | flops |
---|---|---|---|---|
pascal-fcn32s | 384 x 384 | 519 MB | 423 MB | 125 GFLOPs |
pascal-fcn16s | 384 x 384 | 514 MB | 424 MB | 125 GFLOPs |
pascal-fcn8s | 384 x 384 | 513 MB | 426 MB | 125 GFLOPs |
deeplab-vggvd-v2 | 513 x 513 | 144 MB | 755 MB | 202 GFLOPs |
deeplab-res101-v2 | 513 x 513 | 505 MB | 4 GB | 346 GFLOPs |
In this case, the input sizes are those which are typically taken as input crops during training. The deeplab-res101-v2 model uses multi-scale input, with scales x1, x0.75, x0.5 (computed relative to the given input size).
References:
- pascal-fcn - Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
- deeplab - Chen, Liang-Chieh, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs." Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
model | input size | param memory | feature memory | flops |
---|---|---|---|---|
multipose-mpi | 368 x 368 | 196 MB | 245 MB | 134 GFLOPs |
multipose-coco | 368 x 368 | 200 MB | 246 MB | 136 GFLOPs |
References:
- multipose - Cao, Zhe, et al. "Realtime multi-person 2d pose estimation using part affinity fields." arXiv preprint arXiv:1611.08050 (2016).
The numbers for each architecture should be reasonably framework agnostic. It is assumed that all weights and activations are stored as floats (with 4 bytes per datum) and that all ReLUs are performed in-place. Feature memory therefore represents an estimate of the total memory consumed by the features computed in a forward pass of the network for a given input, assuming that memory is not re-used (the exception, as noted above, is that ReLUs are performed in-place and do not add to the feature memory total). In practice, many frameworks free features once they are no longer required by the execution path and will therefore require less memory than is noted here. The feature memory statistic is simply a rough guide to "how big" the activations of the network look.
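The accounting described above can be sketched in a few lines of Python. This is an illustration of the stated assumptions (4 bytes per value, in-place ReLUs excluded), not the tool shipped in this repo. The helper names and the toy network are hypothetical, and whether the input image itself is counted as a feature is left out here as a simplifying assumption.

```python
BYTES_PER_FLOAT = 4  # stated assumption: weights and activations are 4-byte floats


def param_memory_bytes(param_counts):
    """Total parameter memory, given the number of parameters per layer."""
    return BYTES_PER_FLOAT * sum(param_counts)


def feature_memory_bytes(outputs):
    """Total feature memory for a single-element batch.

    `outputs` is a list of (num_elements, in_place) pairs, one per layer
    output. In-place layers (e.g. ReLU) reuse their input buffer and are
    therefore skipped.
    """
    return BYTES_PER_FLOAT * sum(
        num_elements for num_elements, in_place in outputs if not in_place
    )


# Hypothetical toy network on a 32 x 32 RGB input:
# conv 3x3 (3 -> 16 channels, padding 1), ReLU, 2x2 max-pool, fully connected (4096 -> 10).
conv_params = 16 * 3 * 3 * 3 + 16   # weights + biases
fc_params = 4096 * 10 + 10          # weights + biases
outputs = [
    (16 * 32 * 32, False),  # conv output
    (16 * 32 * 32, True),   # ReLU, in-place: not counted
    (16 * 16 * 16, False),  # pooling output
    (10, False),            # fully connected output
]

param_bytes = param_memory_bytes([conv_params, fc_params])
feature_bytes = feature_memory_bytes(outputs)
print(param_bytes, feature_bytes)
```

Because no buffer reuse is modelled, the feature total is an upper-bound style figure of the kind reported in the tables rather than a peak-memory measurement.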
Fused multiply-adds are counted as single operations. The numbers should be considered rough approximations: modern hardware makes it very difficult to count operations accurately (and even if you could, pipelining etc. means that the count is not necessarily a good estimate of inference time).
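To make the counting convention concrete, here is a minimal sketch of a FLOP estimate for a single convolutional layer in which each multiply-add counts as one operation. It is not the implementation used by the tool. The function names are hypothetical, and bias additions are ignored as a simplifying assumption.

```python
def conv_out_size(in_size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution along one dimension."""
    return (in_size + 2 * pad - kernel) // stride + 1


def conv_flops(in_h, in_w, in_c, out_c, kernel, stride=1, pad=0, groups=1):
    """Multiply-add count for a square-kernel convolution.

    Each output value needs kernel * kernel * (in_c / groups) multiply-adds,
    and each multiply-add is counted as a single operation.
    """
    out_h = conv_out_size(in_h, kernel, stride, pad)
    out_w = conv_out_size(in_w, kernel, stride, pad)
    macs_per_output = kernel * kernel * (in_c // groups)
    return out_h * out_w * out_c * macs_per_output


# First layer of an AlexNet-style network: 96 filters of size 11 x 11,
# stride 4, applied to a 227 x 227 RGB image (55 x 55 output).
conv1_flops = conv_flops(227, 227, 3, 96, kernel=11, stride=4)
print(conv1_flops)  # 105415200, i.e. roughly 105 MFLOPs
```

Summing such per-layer counts over a whole network gives a total of the kind listed in the flops column. A convention that counts multiplies and adds separately would give roughly twice these figures.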
The tool for computing the estimates is implemented as a module for the autonn wrapper of matconvnet and is included in this repo, so feel free to take a look for extra details. This module can be installed with the vl_contrib package manager (it has two dependencies which can be installed in a similar manner: autonn and mcnExtraLayers). Matconvnet versions of all of the models can be obtained from either here or here.
For further reading on the topic, the 2017 ICLR submission An analysis of deep neural network models for practical applications is interesting. If you find any issues, or would like to add additional models, please open an issue or PR.