We present FIBER (Fusion In-the-Backbone transformER), a novel Vision and Language architecture that performs deep multi-modal fusion. We also propose a new Vision-Language Pre-training (VLP) strategy that first learns through coarse-grained image-level objectives and then obtains better fine-grained understanding capabilities by training on image-text-box data. While previous work required pseudo-annotating large amounts of image-text data to boost performance on fine-grained reasoning tasks, we show that our two-stage approach can equal and often surpass these results while using 25x less box-annotated data. This opens the door to scaling up fine-grained models efficiently without resorting to high-resolution training using box-annotated data. Our improved architecture also obtains state-of-the-art performance on VQAv2, NLVR2, COCO captioning, and image-text retrieval while being more efficient in training time and memory than existing coarse- and fine-grained models with similar performance.
TL;DR
- What: A new architecture for Vision and Language tasks, plus a new pre-training strategy that benefits both image-level and region-level tasks.
- How: We add cross-modality attention blocks into the image and text backbones and split pre-training into low-resolution and high-resolution stages.
- Outcome: State-of-the-art results on captioning, VQA, NLVR2, and more, plus efficient use of expensive fine-grained data, surpassing the phrase grounding performance of models that use 25x more box-annotated data!
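To make the "How" bullet more concrete, here is a minimal, dependency-free sketch of cross-modality attention added as a residual update to a backbone layer's tokens. It is an illustration, not the code in this repository: the function names, the toy token values, and the scalar gate `alpha` are assumptions of this sketch, and learned projections, multiple heads, self-attention, and feed-forward sublayers are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention.

    Queries come from one modality; keys and values come from the other.
    Learned projection matrices are left out to keep the sketch short.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(dim)]
        )
    return attended


def fused_layer(tokens, other_tokens, alpha):
    """Residual update: tokens + alpha * cross_attention(tokens, other_tokens).

    `alpha` is a hypothetical scalar gate; alpha = 0 leaves the tokens unchanged.
    """
    attended = cross_attention(tokens, other_tokens)
    return [
        [t + alpha * a for t, a in zip(token, att)]
        for token, att in zip(tokens, attended)
    ]


# Toy inputs: 3 image tokens and 2 text tokens, each of dimension 2.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[0.5, 0.5], [1.0, -1.0]]

unfused = fused_layer(image_tokens, text_tokens, alpha=0.0)    # no fusion
fused_img = fused_layer(image_tokens, text_tokens, alpha=0.5)  # image attends to text
fused_txt = fused_layer(text_tokens, image_tokens, alpha=0.5)  # text attends to image
```

With the gate at zero the layer behaves like the uni-modal backbone, and each modality keeps its own token count after fusion, so the update can be dropped into existing layers without changing their output shapes.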
In this repository, we provide code and pre-trained checkpoints for coarse-grained pre-training on image-text data and fine-grained pre-training on image-text-box data. We also provide instructions, code, and checkpoints for fine-tuning FIBER on all the downstream tasks reported in the paper. Please see the respective directories for instructions.
Results on Visual Question Answering, Visual Reasoning, Image-Text Retrieval and Image Captioning
Task | VQAv2 | NLVR2 | F30k Retrieval | COCO Retrieval | COCO Captioning |
---|---|---|---|---|---|
Split | test-std | test-P | test | Karpathy test | Karpathy test |
Metric | VQA Score | Acc. | IR@1/TR@1 | IR@1/TR@1 | CIDEr |
FIBER-Base | 78.46 | 85.52 | 81.44/92.90 (ITC), 84.10/95.10 (ITM) | 58.01/75.38 (ITC), 59.03/75.14 (ITM) | 144.4 |
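As a reading aid for the IR@1/TR@1 columns, the following sketch computes recall at k from a text-by-image similarity matrix. It is a simplified illustration and not the evaluation script in this repository: it assumes exactly one matching image per text (same index), whereas the real benchmarks pair each image with several captions, and the scores in `sim` are made up. The similarity scores could come from either an ITC-style or an ITM-style scorer.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries (rows) whose match (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def transpose(matrix):
    """Swap rows and columns so that images become the queries."""
    return [list(column) for column in zip(*matrix)]


# Hypothetical scores: rows are text queries, columns are candidate images.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.5, 0.7],
]

ir_at_1 = recall_at_k(sim, k=1)             # image retrieval: text -> image
tr_at_1 = recall_at_k(transpose(sim), k=1)  # text retrieval: image -> text
```

On this toy matrix, two of the three text queries retrieve their image first, while only one of the three image queries retrieves its text first, which is why the two directions are reported separately.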
Results on Phrase Grounding and Referring Expression Comprehension
Task | F30k Grounding | RefCOCO | RefCOCO+ | RefCOCOg |
---|---|---|---|---|
Split | test | val/testA/testB | val/testA/testB | val/test |
Metric | R@1/R@5/R@10 | Acc. | Acc. | Acc. |
FIBER-Base | 87.4/96.4/97.6 | 90.68/92.59/87.26 | 85.74/90.13/79.38 | 87.11/87.32 |
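The grounding and referring expression numbers are box-level metrics. A minimal sketch of how such an accuracy is commonly computed: a predicted box counts as correct when its intersection over union (IoU) with the ground-truth box reaches a threshold. The 0.5 threshold, the `(x1, y1, x2, y2)` box format, and the function names are assumptions of this sketch rather than details taken from this repository, and only the top-1 prediction is scored.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def grounding_accuracy(predictions, targets, threshold=0.5):
    """Share of top-1 predicted boxes whose IoU with the target meets the threshold."""
    correct = sum(
        1 for pred, target in zip(predictions, targets)
        if box_iou(pred, target) >= threshold
    )
    return correct / len(targets)


# Hypothetical boxes: an exact match, a partial overlap, and a complete miss.
predicted = [(0, 0, 10, 10), (0, 0, 10, 10), (20, 20, 30, 30)]
ground_truth = [(0, 0, 10, 10), (5, 0, 15, 10), (0, 0, 10, 10)]

accuracy = grounding_accuracy(predicted, ground_truth)
```

Only the exact match clears the threshold here: the partial overlap has an IoU of 1/3 and the miss has an IoU of 0, so one prediction out of three is counted as correct.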
Results on Object Detection on COCO, LVIS and ODinW
Task | COCO Detection | LVIS | ODinW |
---|---|---|---|
Split | Val 2017 | MiniVal | 13 Datasets |
Metric | Zero-shot/Fine-tune AP | Zero-shot/Fine-tune AP | Avg. Zero-shot/Fine-tune AP |
FIBER-Base | 49.3/58.4 | 35.8/56.9 | 47.0/65.9 |
@inproceedings{fiber2022,
title={Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone},
author={Dou, Zi-Yi* and Kamath, Aishwarya* and Gan, Zhe* and Zhang, Pengchuan and Wang, Jianfeng and Li, Linjie and Liu, Zicheng and Liu, Ce and LeCun, Yann and Peng, Nanyun and Gao, Jianfeng and Wang, Lijuan},
booktitle={NeurIPS},
year={2022},
}
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.