GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding (ICCV 2023)
This is the official repository of GeoMIM, a pretraining approach for multi-view camera-based 3D perception. It provides the pretraining and finetuning code, along with pretrained models, to reproduce the results reported in our paper.
The pretraining implementation is based on bevfusion. See the pretrain folder for further details.
After pretraining, we finetune the pretrained Swin Transformer for multi-view camera-based 3D perception, using BEVDet as the finetuning framework. We provide models trained with the different techniques available in BEVDet, including CBGS, 4D, Depth, and Stereo. We also provide models for occupancy prediction, using the implementation in the BEVDet repo. See the bevdet folder for further details.
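As a rough illustration of how a pretrained backbone is commonly wired into a finetuning config, the fragment below follows the MMDetection-style `init_cfg` convention. It is a sketch under assumptions, not the configuration shipped in the bevdet folder: the checkpoint path is a hypothetical placeholder, the `img_backbone` name and the exact argument names may differ in this repository, and the width, depth, and head settings are the standard Swin-Base values.

```python
# Hypothetical local path; replace it with the GeoMIM checkpoint you downloaded.
GEOMIM_CKPT = "path/to/geomim_swin_base.pth"

# MMDetection-style backbone entry (assumed layout, not copied from this repo).
img_backbone = dict(
    type="SwinTransformer",
    embed_dims=128,            # standard Swin-Base channel width
    depths=[2, 2, 18, 2],      # standard Swin-Base stage depths
    num_heads=[4, 8, 16, 32],  # standard Swin-Base attention heads per stage
    # Initialize the backbone from the pretrained weights before finetuning.
    init_cfg=dict(type="Pretrained", checkpoint=GEOMIM_CKPT),
)
```

The configs in the bevdet folder remain the authoritative reference for the actual keys and values.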
We provide the GeoMIM pretrained Swin-Base and Large checkpoints.
Model | Download |
---|---|
Swin-Base | Model |
Swin-Large | Model |
We have achieved strong performance on the nuScenes benchmark with GeoMIM. Here are some quantitative results on 3D detection:
Config | mAP | NDS | Download |
---|---|---|---|
bevdet-swinb-4d-256x704-cbgs | 33.98 | 47.19 | Model |
bevdet-swinb-4d-256x704-cbgs-geomim | 42.25 | 53.1 | Model |
bevdet-swinb-4d-stereo-256x704-cbgs-geomim | 45.33 | 55.1 | Model |
bevdet-swinb-4d-stereo-512x1408-cbgs | 47.2 | 57.6 | Model (#) |
bevdet-swinb-4d-stereo-512x1408-cbgs-geomim | 52.04 | 60.92 | Model |
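To make the comparison in the table easier to read, the short script below computes the gain from GeoMIM pretraining for each configuration that has both a baseline row and a `-geomim` row. The numbers are copied from the table; the helper name `geomim_gains` is only illustrative. The 256x704 stereo configuration is left out because the table lists no baseline for it.

```python
# (mAP, NDS) values copied from the 3D detection table.
results = {
    "bevdet-swinb-4d-256x704-cbgs": (33.98, 47.19),
    "bevdet-swinb-4d-256x704-cbgs-geomim": (42.25, 53.1),
    "bevdet-swinb-4d-stereo-512x1408-cbgs": (47.2, 57.6),
    "bevdet-swinb-4d-stereo-512x1408-cbgs-geomim": (52.04, 60.92),
}


def geomim_gains(results, suffix="-geomim"):
    """Return {baseline_config: (mAP gain, NDS gain)} for paired rows."""
    gains = {}
    for name, (base_map, base_nds) in results.items():
        pretrained = results.get(name + suffix)
        if pretrained is None:
            continue  # no matching GeoMIM row for this entry
        gains[name] = (
            round(pretrained[0] - base_map, 2),
            round(pretrained[1] - base_nds, 2),
        )
    return gains


gains = geomim_gains(results)
for name, (d_map, d_nds) in gains.items():
    print(f"{name}: +{d_map:.2f} mAP, +{d_nds:.2f} NDS")
```

With the values above, this reports +8.27 mAP and +5.91 NDS at 256x704, and +4.84 mAP and +3.32 NDS for the stereo model at 512x1408.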
Here are some quantitative results on occupancy prediction:
Config | mIoU | Download |
---|---|---|
bevdet-occ-swinb-4d-stereo-2x (*) | 42.0 | Model (#) |
bevdet-occ-swinb-4d-stereo-2x-geomim | 45.0 | Model |
bevdet-occ-swinb-4d-stereo-2x-geomim (*) | 45.73 | Model |
bevdet-occ-swinl-4d-stereo-2x-geomim | 46.27 | Model |
(*) Load 3D detection checkpoint. (#) Original BEVDet checkpoint.
- Pretraining on nuScenes dataset
- Finetuning on 3D detection task
- Finetuning on occupancy prediction task
If you find GeoMIM beneficial for your research, kindly consider citing our paper:
@inproceedings{liu2023geomim,
title={GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding},
author={Jihao Liu and Tai Wang and Boxiao Liu and Qihang Zhang and Yu Liu and Hongsheng Li},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year={2023}
}
For any questions or inquiries, please feel free to reach out to the authors: Jihao Liu (email) and Tai Wang (email).