Name		Name	Last commit message	Last commit date
parent directory ..
ade20k		ade20k
ade20k32		ade20k32
cocoseg21		cocoseg21
ti-robokit/edgeai-tv		ti-robokit/edgeai-tv
utils/tf-deeplab		utils/tf-deeplab
voc2012/tf1-models		voc2012/tf1-models
readme.md		readme.md

readme.md

Semantic Segmentation Benchmark

Introduction

Semantic Segmentation from Image (or Image Segmentation) is an important problem in Computer Vision. This page describes some of the popular Semantic Segmentation models.

Datasets

COCO dataset: Microsoft COCO: Common Objects in Context, Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár, https://arxiv.org/abs/1405.0312, https://cocodataset.org

COCOSeg21 dataset: For information about 21 class subset of COCO Segmentation, see the description in torchvision models

Cityscapes dataset: Cityscapes is a large scale dataset containing street scenes from 50 different cities with high quality pixel level annotations.

ADE20K dataset: Scene Parsing Dataset Scene Parsing through ADE20K Dataset. Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso and Antonio Torralba. Computer Vision and Pattern Recognition (CVPR), 2017. Semantic Understanding of Scenes through ADE20K Dataset. Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso and Antonio Torralba. International Journal on Computer Vision (IJCV). https://groups.csail.mit.edu/vision/datasets/ADE20K/, http://sceneparsing.csail.mit.edu/

ADE20K32: 32 class dataset subset of ADE20K described in the paper: MLPerf Inference Benchmark, Vijay Janapa Reddi et. al., https://arxiv.org/abs/1911.02549

Models

EdgeAI-Torchvision

Dataset	Model Name	Input Size	GigaMACs	MeanIoU%	Available	Notes
	ADE20K32 dataset models
ADE20K32	MobileNetV2S16+DeepLabV3Lite	512x512	3.28	51.01	Y
ADE20K32	MobileNetV2+UNetLite	512x512	2.427	49.95	Y
ADE20K32	MobileNetV2+FPNLite	512x512	3.481	50.72	Y
ADE20K32	MobileNetV2-1.4+FPNLite	512x512	6.646	52.93	Y
-
ADE20K32	RegNetX400MF+FPNLite	384x384	3.1526	51.03	Y
ADE20K32	RegNetX800MF+FPNLite	512x512	8.0683	53.29	Y
	COCOSeg21 dataset models
COCOSeg21	MobileNetV2S16+DeepLabV3Lite	512x512	3.161	57.77	Y
COCOSeg21	MobileNetV2+UNetLite	512x512	2.009	57.01
COCOSeg21	MobileNetV2+FPNLite	512x512	3.357
COCOSeg21	RegNetX800MF+FPNLite	512x512	7.864	61.15	Y
	Cityscapes dataset models
Cityscapes	MobileNetV2S16+DeepLabV3Lite	768x384	3.54	69.13
Cityscapes	MobileNetV2+UNetLite	768x384	2.20	68.94
Cityscapes	MobileNetV2+FPNLite	768x384	3.84	70.39
Cityscapes	MobileNetV2+FPNLite	1536x768	15.07	74.61
-
Cityscapes	MobileNetV2S16+DeepLabV3Lite-QAT	768x384	3.54	68.77		QAT* model
Cityscapes	MobileNetV2+UNetLite-QAT	768x384	2.20	68.18		QAT* model
Cityscapes	MobileNetV2+FPNLite-QAT	768x384	3.84	69.88		QAT* model
-
Cityscapes	RegNetX800MF+FPNLite	768x384	8.84	72.01
Cityscapes	RegNetX1.6GF+FPNLite	1024x512	24.29	75.84
Cityscapes	RegNetX3.2FF+FPNLite	1536x768	111.16	78.90
Cityscapes	RegNetX400MF+FPNLite	768x384	6.09	68.03
Cityscapes	RegNetX400MF+FPNLite	1536x768	24.37	73.96

*- Quantization Aware Training using 8b precision

Tensorflow DeepLab Models

Dataset	Model Name	Input Size	GigaMACs	MeanIoU%	Available
-	Our modified models
VOC2012	deeplabv3_mnv2_dm05[10]	512x512	2.77	66.94	Y
VOC2012	deeplabv3_mnv2[10]	512x512	8.60	72.66	Y
VOC2012	deeplabv3_xception[10]	512x512	171.77	81.74
-	Original DeepLab models
Cityscapes	MobileNetV2+DeepLab[10]	769x769	21.27	70.71
Cityscapes	MobileNetV3+DeepLab[10]	769x769	15.95	72.41
Cityscapes	Xception65+DeepLab[10]	769x769	418.64	78.79

Torchvision Models

Dataset	Model Name	Input Size	GigaMACs	MeanIoU%
	COCOSeg21 dataset models
COCOSeg21	LR-ASPP MobileNetV3-Large[11]	520		57.9
COCOSeg21	DeepLabV3 MobileNetV3-Large[11]	520		60.3
COCOSeg21	DeepLabV3 ResNet50[11]	520		66.4
	Cityscapes dataset models
Cityscapes	ResNet50+FCN[3,11]	1040x520	285.4	71.6
Cityscapes	ResNet50+DeepLabV3[5,11]	1040x520	337.5	73.5

Other Miscellaneous Models

These are for reference only and are not included in this Model Zoo.

Dataset	Model Name	Input Size	GigaMACs	MeanIoU%	Available	Notes
Cityscapes	ERFNet[8]	1024x512	27.705	69.7
Cityscapes	MobileNetV2+SwiftNet[9]	2048x1024	41.0	75.3

Notes

The suffix 'Lite' in the name of models such as DeepLabV3Lite, FPNLite & UNetLite indicates the use of Depthwise convolutions or Grouped convolutions. If the feature extractor (encoder) uses Depthwise Convolutions, then Depthwise convolutions are used throughout such models - even in the neck and decoder. If the feature extractor (encoder) uses grouped convolutions as in the case of RegNetX, then grouped convolutions (with the same group size as that of the feature extractor) are used even in the neck and decoder.
GigaMACS: Complexity in Giga Multiply-Accumulations (lower is better). This is an important metric to watch out for when selecting models for embedded inference.
MeanIoU%: Original Floating Point MeanIoU Accuracy% obtained after training.

References

[1] The PASCAL Visual Object Classes (VOC) Challenge, Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J. and Zisserman, A. International Journal of Computer Vision, 88(2), 303-338, 2010, http://host.robots.ox.ac.uk/pascal/VOC/

[2] The Cityscapes Dataset for Semantic Urban Scene Understanding, Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, Bernt Schiele, https://arxiv.org/abs/1604.01685

[3] Fully Convolutional Networks for Semantic Segmentation, Jonathan Long, Evan Shelhamer, Trevor Darrell, https://arxiv.org/abs/1411.4038

[4] Feature Pyramid Networks for Object Detection Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie, https://arxiv.org/abs/1612.03144

[5] Rethinking Atrous Convolution for Semantic Image Segmentation, Liang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig Adam, https://arxiv.org/abs/1706.05587

[6] U-Net: Convolutional Networks for Biomedical Image Segmentation, Olaf Ronneberger, Philipp Fischer, Thomas Brox, https://arxiv.org/abs/1505.04597

[7] Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation, Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam, https://arxiv.org/abs/1802.02611

[8] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo. Erfnet: Efficient residual factorized convnet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems, 19(1):263272, 2018.

[9] In Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving Images, Marin Oršić*, Ivan Krešo*, Siniša Šegvić, Petra Bevandić (* denotes equal contribution), CVPR, 2019.

[10] Tensorflow DeepLab ModelZoo, https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/model_zoo.md

[11] Torchvision Models, https://pytorch.org/docs/stable/torchvision/models.html, https://github.com/pytorch/vision/tree/master/torchvision

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

segmentation

segmentation

readme.md

Semantic Segmentation Benchmark

Introduction

Datasets

Models

EdgeAI-Torchvision

Tensorflow DeepLab Models

Torchvision Models

Other Miscellaneous Models

Notes

References

Files

segmentation

Directory actions

More options

Directory actions

More options

Latest commit

History

segmentation

Folders and files

parent directory

readme.md

Semantic Segmentation Benchmark

Introduction

Datasets

Models

EdgeAI-Torchvision

Tensorflow DeepLab Models

Torchvision Models

Other Miscellaneous Models

Notes

References