Our segmentation code is developed on top of MMSegmentation v0.20.2.
For details, see the paper Vision Transformer Adapter for Dense Predictions.
If you use this code for a paper, please cite:
@article{chen2022vitadapter,
title={Vision Transformer Adapter for Dense Predictions},
author={Chen, Zhe and Duan, Yuchen and Wang, Wenhai and He, Junjun and Lu, Tong and Dai, Jifeng and Qiao, Yu},
journal={arXiv preprint arXiv:2205.08534},
year={2022}
}
Install MMSegmentation v0.20.2.
# recommended environment: torch1.9 + cuda11.1
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.4.2 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html
pip install timm==0.4.12
pip install mmdet==2.22.0 # for Mask2Former
pip install mmsegmentation==0.20.2
ln -s ../detection/ops ./
cd ops && sh make.sh # compile deformable attention
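After compilation, a quick sanity check can confirm that the installed versions match the recommended ones; this is an optional one-liner, not part of the official setup:
# optional: verify torch / mmcv / mmsegmentation versions and CUDA availability
python -c "import torch, mmcv, mmseg; print(torch.__version__, torch.cuda.is_available(), mmcv.__version__, mmseg.__version__)"
# expected (roughly): 1.9.0+cu111 True 1.4.2 0.20.2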
Prepare ADE20K, Cityscapes, COCO-Stuff, and Pascal Context according to the guidelines in MMSegmentation.
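For example, the configs and demo commands in this README read ADE20K from data/ade/ADEChallengeData2016, so a downloaded copy can be linked into place (the source path below is a placeholder for wherever you downloaded the data):
# example: link a downloaded ADE20K copy into the layout used by the configs
mkdir -p data/ade
ln -s /path/to/ADEChallengeData2016 data/ade/ADEChallengeData2016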
Name | Year | Type | Data | Repo | Paper |
---|---|---|---|---|---|
DeiT | 2021 | Supervised | ImageNet-1K | repo | paper |
AugReg | 2021 | Supervised | ImageNet-22K | repo | paper |
BEiT | 2021 | MIM | ImageNet-22K | repo | paper |
Uni-Perceiver | 2022 | Supervised | Multi-Modal | repo | paper |
BEiTv2 | 2022 | MIM | ImageNet-22K | repo | paper |
Note that due to the file-size limit of GitHub Releases, some weights are provided as .zip packages. Please unzip them before loading them into the model.
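For example, a zipped checkpoint can be extracted into the released/ directory used by the demo commands below; the file name here is illustrative, so substitute the archive you actually downloaded:
# example: unzip a released checkpoint before loading it (file name is illustrative)
unzip mask2former_beit_adapter_large_896_80k_ade20k.zip -d released/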
ADE20K val
Method | Backbone | Pre-train | Lr schd | Crop Size | mIoU (SS) | mIoU (MS) | #Param | Config | Download |
---|---|---|---|---|---|---|---|---|---|
UperNet | ViT-Adapter-T | DeiT-T | 160k | 512 | 42.6 | 43.6 | 36M | config | model | log |
UperNet | ViT-Adapter-S | DeiT-S | 160k | 512 | 46.2 | 47.1 | 58M | config | model | log |
UperNet | ViT-Adapter-B | DeiT-B | 160k | 512 | 48.8 | 49.7 | 134M | config | model | log |
UperNet | ViT-Adapter-T | AugReg-T | 160k | 512 | 43.9 | 44.8 | 36M | config | model | log |
UperNet | ViT-Adapter-B | AugReg-B | 160k | 512 | 51.9 | 52.5 | 134M | config | model | log |
UperNet | ViT-Adapter-L | AugReg-L | 160k | 512 | 53.4 | 54.4 | 364M | config | model | log |
UperNet | ViT-Adapter-L | Uni-Perceiver-L | 160k | 512 | 55.0 | 55.4 | 364M | config | model | log |
UperNet | ViT-Adapter-L | BEiT-L | 160k | 640 | 58.0 | 58.4 | 451M | config | model | log |
Mask2Former | ViT-Adapter-L | BEiT-L | 160k | 640 | 58.3 | 59.0 | 568M | config | model | log |
Mask2Former | ViT-Adapter-L | BEiT-L+COCO-Stuff | 80k | 896 | 59.4 | 60.5 | 571M | config | model | log |
Mask2Former | ViT-Adapter-L | BEiTv2-L+COCO-Stuff | 80k | 896 | 61.2 | 61.5 | 571M | config | model | log |
Cityscapes val
Method | Backbone | Pre-train | Lr schd | Crop Size | mIoU (SS) | mIoU (MS) | #Param | Config | Download |
---|---|---|---|---|---|---|---|---|---|
Mask2Former | ViT-Adapter-L | Mapillary | 80k | 896 | 84.9 | 85.8 | 571M | config | model | log |
COCO-Stuff-10K
Method | Backbone | Pre-train | Lr schd | Crop Size | mIoU (SS) | mIoU (MS) | #Param | Config | Download |
---|---|---|---|---|---|---|---|---|---|
Mask2Former | ViT-Adapter-B | BEiT-B | 40k | 512 | 50.0 | 50.5 | 120M | config | model | log |
UperNet | ViT-Adapter-L | BEiT-L | 80k | 512 | 51.0 | 51.4 | 451M | config | model | log |
Mask2Former | ViT-Adapter-L | BEiT-L | 40k | 512 | 53.2 | 54.2 | 568M | config | model | log |
COCO-Stuff-164K
Method | Backbone | Pre-train | Lr schd | Crop Size | mIoU (SS) | mIoU (MS) | #Param | Config | Download |
---|---|---|---|---|---|---|---|---|---|
UperNet | ViT-Adapter-L | BEiT-L | 80k | 640 | 50.5 | 50.7 | 451M | config | model | log |
Mask2Former | ViT-Adapter-L | BEiT-L | 80k | 896 | 51.7 | 52.0 | 571M | config | model | log |
Mask2Former | ViT-Adapter-L | BEiTv2-L | 80k | 896 | 52.3 | - | 571M | config | model | log |
Pascal Context
Method | Backbone | Pre-train | Lr schd | Crop Size | mIoU (SS) | mIoU (MS) | #Param | Config | Download |
---|---|---|---|---|---|---|---|---|---|
Mask2Former | ViT-Adapter-B | BEiT-B | 40k | 480 | 64.0 | 64.4 | 120M | config | model | log |
UperNet | ViT-Adapter-L | BEiT-L | 80k | 480 | 67.0 | 67.5 | 451M | config | model | log |
Mask2Former | ViT-Adapter-L | BEiT-L | 40k | 480 | 67.8 | 68.2 | 568M | config | model | log |
To evaluate ViT-Adapter-L + Mask2Former (896) on the ADE20K val set on a single node with 8 GPUs, run:
sh dist_test.sh configs/ade20k/mask2former_beit_adapter_large_896_80k_ade20k_ss.py /path/to/checkpoint_file 8 --eval mIoU
This should give:
Summary:
+-------+-------+-------+
| aAcc | mIoU | mAcc |
+-------+-------+-------+
| 86.61 | 59.43 | 73.55 |
+-------+-------+-------+
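If only one GPU is available, the same evaluation can be run without the distributed launcher; this is a sketch assuming the repository keeps MMSegmentation's standard test.py entry point and arguments:
CUDA_VISIBLE_DEVICES=0 python test.py \
    configs/ade20k/mask2former_beit_adapter_large_896_80k_ade20k_ss.py \
    /path/to/checkpoint_file --eval mIoU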
To train ViT-Adapter-L + UperNet on ADE20K on a single node with 8 GPUs, run:
sh dist_train.sh configs/ade20k/upernet_beit_adapter_large_640_160k_ade20k_ss.py 8
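For debugging, training can also be launched without the distributed wrapper; a minimal sketch, assuming MMSegmentation's standard train.py entry point and its --work-dir option. Note that the reported results use 8 GPUs, so batch size and learning rate may need adjusting on a single GPU:
# single-GPU debug run (a sketch; not the setting used for the reported results)
CUDA_VISIBLE_DEVICES=0 python train.py \
    configs/ade20k/upernet_beit_adapter_large_640_160k_ade20k_ss.py \
    --work-dir work_dirs/upernet_beit_adapter_large_640_160k_ade20k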
To run inference on a single image:
CUDA_VISIBLE_DEVICES=0 python image_demo.py \
configs/ade20k/mask2former_beit_adapter_large_896_80k_ade20k_ss.py \
released/mask2former_beit_adapter_large_896_80k_ade20k.pth.tar \
data/ade/ADEChallengeData2016/images/validation/ADE_val_00000591.jpg \
--palette ade20k
The result will be saved at demo/ADE_val_00000591.jpg.
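The same demo script can be looped over several images from the shell; this simply repeats the command above per file, with the glob pattern as a placeholder to adjust:
# example: run the demo over several validation images (glob is a placeholder)
for img in data/ade/ADEChallengeData2016/images/validation/ADE_val_000005*.jpg; do
  CUDA_VISIBLE_DEVICES=0 python image_demo.py \
    configs/ade20k/mask2former_beit_adapter_large_896_80k_ade20k_ss.py \
    released/mask2former_beit_adapter_large_896_80k_ade20k.pth.tar \
    "$img" --palette ade20k
done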
To run inference on a video:
CUDA_VISIBLE_DEVICES=0 python video_demo.py demo.mp4 \
configs/ade20k/mask2former_beit_adapter_large_896_80k_ade20k_ss.py \
released/mask2former_beit_adapter_large_896_80k_ade20k.pth.tar \
--output-file results.mp4 \
--palette ade20k