Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

microsoft/FIBER

Website | Paper

Introduction

We present FIBER (Fusion In-the-Backbone transformER), a novel vision-and-language architecture that performs deep multi-modal fusion. We also propose a new Vision-Language Pre-training (VLP) strategy that first learns through coarse-grained image-level objectives and then obtains better fine-grained understanding capabilities by training on image-text-box data. While previous work required pseudo-annotating large amounts of image-text data to boost performance on fine-grained reasoning tasks, we show that our two-stage approach can equal and often surpass these results using 25x less box-annotated data. This opens the door to scaling up fine-grained models efficiently, without resorting to high-resolution training using box-annotated data. Our improved architecture also obtains state-of-the-art performance on VQAv2, NLVR2, COCO captioning and image-text retrieval while being more efficient in training time and memory than existing coarse-grained and fine-grained models with similar performance.

TL;DR

  • What: A new architecture for vision-and-language tasks, plus a new pre-training strategy that benefits both image-level and region-level tasks.
  • How: We add cross-modality attention blocks into the image and text backbones and split pre-training into low-resolution and high-resolution stages.
  • Outcome: State-of-the-art results on captioning, VQA, NLVR2, and more, plus efficient use of expensive fine-grained data, surpassing the phrase grounding performance of models that use 25x more box-annotated data.
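To make the "How" bullet concrete, here is a minimal, dependency-free sketch of one modality attending to the other inside a backbone layer. It is not the repository's implementation: the function names, the identity query/key/value projections, the single attention head, and the scalar gate `alpha` are all assumptions chosen to keep the example short.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head attention in which `queries` (one modality) attend to
    `keys_values` (the other modality).

    Assumption: projections are the identity, so keys and values are the
    same vectors. A real model would use learned linear projections.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        out.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)]
        )
    return out


def fused_block(x, y, alpha):
    """Residual update x + alpha * CrossAttn(x, y).

    `alpha` is a hypothetical scalar gate: with alpha = 0 the block leaves
    the uni-modal features untouched.
    """
    attended = cross_attention(x, y)
    return [[xi + alpha * ai for xi, ai in zip(xr, ar)] for xr, ar in zip(x, attended)]


# Toy token features: 3 image tokens and 2 text tokens, each of dimension 2.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[0.5, 0.5], [2.0, -1.0]]

unchanged = fused_block(image_tokens, text_tokens, alpha=0.0)    # gate closed
fused_image = fused_block(image_tokens, text_tokens, alpha=0.5)  # image attends to text
fused_text = fused_block(text_tokens, image_tokens, alpha=0.5)   # text attends to image
```

The update is applied in both directions, so each backbone keeps its own sequence length while mixing in information from the other modality.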

In this repository we provide code and pre-trained checkpoints for coarse-grained pre-training on image-text data and fine-grained pre-training on image-text-box data. We also provide instructions, code and checkpoints for fine-tuning FIBER on all the downstream tasks reported in the paper. Please see the respective directories for instructions.

Model Performance

Using 1st stage pre-training

Results on Visual Question Answering, Visual Reasoning, Image-Text Retrieval and Image Captioning

| Task | VQAv2 | NLVR2 | F30k Retrieval | COCO Retrieval | COCO Captioning |
|---|---|---|---|---|---|
| Split | test-std | test-P | test | Karpathy test | Karpathy test |
| Metric | VQA Score | Acc. | IR@1/TR@1 | IR@1/TR@1 | CIDEr |
| FIBER-Base | 78.46 | 85.52 | 81.44/92.90 (ITC) 84.10/95.10 (ITM) | 58.01/75.38 (ITC) 59.03/75.14 (ITM) | 144.4 |
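For readers unfamiliar with the retrieval columns, IR@1 and TR@1 are Recall@1 for image retrieval (a text query ranks images) and text retrieval (an image query ranks texts). The sketch below shows how such a number can be computed from a similarity matrix. It is an illustration only, not the evaluation code used for the table: the function names and the toy scores are made up, and it assumes exactly one matching text per image, whereas the benchmark datasets pair each image with several captions.

```python
def recall_at_k(similarity, k=1):
    """Percentage of queries whose ground-truth candidate is ranked in the top k.

    similarity[i][j] is the score of query i against candidate j.
    Assumption for this sketch: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


def transpose(matrix):
    """Swap queries and candidates (rows and columns)."""
    return [list(col) for col in zip(*matrix)]


# Toy scores: rows are images, columns are texts.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]

tr_at_1 = recall_at_k(sim, k=1)             # image query, rank texts
ir_at_1 = recall_at_k(transpose(sim), k=1)  # text query, rank images
```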

Using 2nd stage pre-training

Results on Phrase Grounding and Referring Expression Comprehension

| Task | F30k Grounding | RefCOCO | RefCOCO+ | RefCOCOg |
|---|---|---|---|---|
| Split | test | val/testA/testB | val/testA/testB | val/test |
| Metric | R@1/R@5/R@10 | Acc. | Acc. | Acc. |
| FIBER-Base | 87.4/96.4/97.6 | 90.68/92.59/87.26 | 85.74/90.13/79.38 | 87.11/87.32 |

Results on Object Detection on COCO, LVIS and ODinW

| Task | COCO Detection | LVIS | ODinW |
|---|---|---|---|
| Split | Val 2017 | MiniVal | 13 Datasets |
| Metric | Zero-shot/Fine-tune AP | Zero-shot/Fine-tune AP | Avg. Zero-shot/Fine-tune AP |
| FIBER-Base | 49.3/58.4 | 35.8/56.9 | 47.0/65.9 |

Citation

@inproceedings{fiber2022,
  title={Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone},
  author={Dou, Zi-Yi* and Kamath, Aishwarya* and Gan, Zhe* and Zhang, Pengchuan and Wang, Jianfeng and Li, Linjie and Liu, Zicheng and Liu, Ce and LeCun, Yann and Peng, Nanyun and Gao, Jianfeng and Wang, Lijuan},
  booktitle={NeurIPS},
  year={2022},
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
