A curated list of vision-and-language pre-training (VLP) resources. :-)
Please feel free to send me pull requests or email (chihung.chan@outlook.com) to add links.
Survey | Authors |
---|---|
A Survey of Vision-Language Pre-Trained Models | Yifan Du, Zikang Liu, Junyi Li, Wayne Xin Zhao |
VLP: A Survey on Vision-Language Pre-training | Feilong Chen, Duzhen Zhang, Minglun Han, Xiuyi Chen, Jing Shi, Shuang Xu, Bo Xu |
Vision-and-Language Pretrained Models: A Survey | Siqu Long, Feiqi Cao, Soyeon Caren Han, Haiqin Yang |
Vision-and-Language Pretraining | Thong Nguyen, Cong-Duy Nguyen, Xiaobao Wu, Anh Tuan Luu |
Method | Venue | Reference | Authors |
---|---|---|---|
2019 | |||
VisualBERT | Arxiv-2019 | VisualBERT: A Simple and Performant Baseline for Vision and Language | Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang |
ViLBERT | NeurIPS-2019 | ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee |
LXMERT | EMNLP-2019 | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | Hao Tan, Mohit Bansal |
2020 | |||
ImageBERT | Arxiv-2020 | ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data | Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, Arun Sacheti |
InterBERT | Arxiv-2020 | InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining | Junyang Lin, An Yang, Yichang Zhang, Jie Liu, Jingren Zhou, Hongxia Yang |
PixelBERT | Arxiv-2020 | Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers | Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu |
VALUE | ECCV-2020 | Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models | Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, Jingjing Liu |
UNITER | ECCV-2020 | UNITER: UNiversal Image-TExt Representation Learning | Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu |
VisDial-BERT | ECCV-2020 | Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline | Vishvak Murahari, Dhruv Batra, Devi Parikh, Abhishek Das |
OSCAR | ECCV-2020 | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao |
X-LXMERT | EMNLP-2020 | X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers | Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi |
Unicoder-VL | AAAI-2020 | Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training | Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, Ming Zhou |
VLP | AAAI-2020 | Unified Vision-Language Pre-Training for Image Captioning and VQA | Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao |
ERNIE-ViL | AAAI-2021 | ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph | Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang |
VL-BERT | ICLR-2020 | VL-BERT: Pre-training of Generic Visual-Linguistic Representations | Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai |
12-IN-1 | CVPR-2020 | 12-in-1: Multi-Task Vision and Language Representation Learning | Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee |
VILLA | NeurIPS-2020 | Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, Jingjing Liu |
2021 | |||
X-VLM | Arxiv-2021 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | Yan Zeng, Xinsong Zhang, Hang Li |
KD-VLP | Arxiv-2021 | KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation | Yongfei Liu, Chenfei Wu, Shao-yen Tseng, Vasudev Lal, Xuming He, Nan Duan |
VLMO | Arxiv-2021 | VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | Wenhui Wang, Hangbo Bao, Li Dong, Furu Wei |
UNICORN | Arxiv-2021 | Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling | Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, Lijuan Wang |
MANGO | Arxiv-2021 | A Closer Look at the Robustness of Vision-and-Language Pre-trained Models | Linjie Li, Zhe Gan, Jingjing Liu |
XGPT | NLPCC-2021 | XGPT: Cross-modal Generative Pre-Training for Image Captioning | Qiaolin Xia, Haoyang Huang, Nan Duan, Dongdong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, Xin Liu, Ming Zhou |
ROSITA | ACMMM-2021 | ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration | Yuhao Cui, Zhou Yu, Chunqi Wang, Zhongzhou Zhao, Ji Zhang, Meng Wang, Jun Yu |
Analysis | Findings-2021 | Does Vision-and-Language Pretraining Improve Lexical Grounding? | Tian Yun, Chen Sun, Ellie Pavlick |
Analysis | TACL-2021 | Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers | Lisa Anne Hendricks, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, Aida Nematzadeh |
Volta | TACL-2021 | Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs | Emanuele Bugliarello, Ryan Cotterell, Naoaki Okazaki, Desmond Elliott |
VL-T5 | ICML-2021 | Unifying Vision-and-Language Tasks via Text Generation | Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal |
ViLT | ICML-2021 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | Wonjae Kim, Bokyung Son, Ildoo Kim |
Visual Parsing | NeurIPS-2021 | Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training | Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, Jiebo Luo |
ALBEF | NeurIPS-2021 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, Steven Hoi |
E2E-VLP | ACL-2021 | E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning | Haiyang Xu, Ming Yan, Chenliang Li, Bin Bi, Songfang Huang, Wenming Xiao, Fei Huang |
SOHO | CVPR-2021 | Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, Jianlong Fu |
VLN-BERT | CVPR-2021 | A Recurrent Vision-and-Language BERT for Navigation | Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, Stephen Gould |
VinVL | CVPR-2021 | VinVL: Revisiting Visual Representations in Vision-Language Models | Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, Jianfeng Gao |
SimVLM | ICLR-2021 | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, Yuan Cao |
2022 | |||
mPLUG | Arxiv-2022 | mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou |
CoCa | Arxiv-2022 | Contrastive Captioners are Image-Text Foundation Models | Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu |
Flamingo | Arxiv-2022 | Flamingo: a Visual Language Model for Few-Shot Learning | Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan |
BLIP | Arxiv-2022 | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi |
BridgeTower | Arxiv-2022 | BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning | Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Nan Duan |
VLMbench | Arxiv-2022 | VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation | Kaizhi Zheng, Xiaotong Chen, Odest Chadwicke Jenkins, Xin Eric Wang |
MixGen | Arxiv-2022 | MixGen: A New Multi-Modal Data Augmentation | Xiaoshuai Hao, Yi Zhu, Srikar Appalaraju, Aston Zhang, Wanqian Zhang, Bo Li, Mu Li |
DaVinci | Arxiv-2022 | Prefix Language Models are Unified Modal Learners | Shizhe Diao, Wangchunshu Zhou, Xinsong Zhang, Jiawei Wang |
MetaLM | Arxiv-2022 | Language Models are General-Purpose Interface | Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shuming Ma, Furu Wei |
VL-BEIT | Arxiv-2022 | VL-BEIT: Generative Vision-Language Pretraining | Hangbo Bao, Wenhui Wang, Li Dong, Furu Wei |
VLUE | Arxiv-2022 | VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models | Wangchunshu Zhou, Yan Zeng, Shizhe Diao, Xinsong Zhang |
VL-CheckList | Arxiv-2022 | VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations | Tiancheng Zhao, Tianqi Zhang, Mingwei Zhu, Haozhan Shen, Kyusong Lee, Xiaopeng Lu, Jianwei Yin |
Analysis | AAAI-2022 | Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective | Emmanuelle Salin, Badreddine Farah, Stéphane Ayache, Benoit Favre |
CLIP-ViL | ICLR-2022 | How Much Can CLIP Benefit Vision-and-Language Tasks? | Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer |
METER | CVPR-2022 | An Empirical Study of Training End-to-End Vision-and-Language Transformers | Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng Liu, Michael Zeng |
UVLP | CVPR-2022 | Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment | Mingyang Zhou, Licheng Yu, Amanpreet Singh, Mengjiao Wang, Zhou Yu, Ning Zhang |
TCL | CVPR-2022 | Vision-Language Pre-Training with Triple Contrastive Learning | Jinyu Yang, Jiali Duan, Son Tran, Yi Xu, Sampath Chanda, Liqun Chen, Belinda Zeng, Trishul Chilimbi, Junzhou Huang |
OFA | ICML-2022 | Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, Hongxia Yang |
VLMixer | ICML-2022 | VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix | Teng Wang, Wenhao Jiang, Zhichao Lu, Feng Zheng, Ran Cheng, Chengguo Yin, Ping Luo |
Method | Venue | Reference | Authors |
---|---|---|---|
2021 | |||
ALIGN | Arxiv-2021 | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig |
FILIP | Arxiv-2021 | FILIP: Fine-grained Interactive Language-Image Pre-Training | Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, Chunjing Xu |
SLIP | Arxiv-2021 | SLIP: Self-supervision meets Language-Image Pre-training | Norman Mu, Alexander Kirillov, David Wagner, Saining Xie |
CLIP | ICML-2021 | Learning Transferable Visual Models From Natural Language Supervision | Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever |
2022 | |||
Analysis | Arxiv-2022 | Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP) | Alex Fang, Gabriel Ilharco, Mitchell Wortsman, Yuhao Wan, Vaishaal Shankar, Achal Dave, Ludwig Schmidt |
ProtoCLIP | Arxiv-2022 | Prototypical Contrastive Language Image Pretraining | Delong Chen, Zhao Wu, Fan Liu, Zaiquan Yang, Yixiang Huang, Yiping Bao, Erjin Zhou |
Method | Venue | Reference | Authors |
---|---|---|---|
2021 | |||
ViT-BERT | Arxiv-2021 | Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text | Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan Yang, Matthew Brown |
UNIMO | ACL-2021 | UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning | Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, Haifeng Wang |
2022 | |||
SkillNet | Arxiv-2022 | One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code | Yong Dai, Duyu Tang, Liangxin Liu, Minghuan Tan, Cong Zhou, Jingquan Wang, Zhangyin Feng, Fan Zhang, Xueyu Hu, Shuming Shi |
data2vec | Arxiv-2022 | data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language | Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli |
UNIFIED-IO | Arxiv-2022 | Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks | Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi |
Uni-Perceiver | CVPR-2022 | Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks | Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Xiaogang Wang, Hongsheng Li, Xiaohua Wang, Jifeng Dai |
FLAVA | CVPR-2022 | FLAVA: A Foundational Language And Vision Alignment Model | Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela |
Dataset | Images | Image-Text Pairs | Duration (hrs) | Note |
---|---|---|---|---|
SBU | 875k | 875k | - | reference, website |
Flickr30k | 29k | 145k | - | reference, website |
COCO | 113k | 567k | - | reference, website |
COCO/OI Narratives | 849k | 873k | - | reference, website |
VG | 108k | 5.4m | - | reference, website |
VGQA | 108k | 1.8m | - | reference, website |
VQA | 83k | 444k | - | reference, website |
GQA | 82k | 1m | - | reference, website |
CC3M | 3m | 3m | - | reference, website |
CC12M | 12m | 12m | - | reference, website |
YFCC-15M | 15m | 15m | - | reference, website |
WebImageText | 400m | 400m | - | reference |
LAION-400M | 400m | 400m | - | website |
LAION-2B | 2b | 2b | - | website |
RedCaps | 12m | 12m | - | reference, website |
AltText | 1.8b | 1.8b | - | reference |
ImageNet-Captions | 464k | 464k | - | reference, website |
Kinetics | - | - | 1.4k | reference, website |
TVQA | - | - | 0.4k | reference, website |
HT100M | - | - | 134k | reference, website |
WebVid2M | - | - | 13k | reference, website |
The following contents are adapted from this survey.
Task | Description |
---|---|
1. Classification | |
Visual Question Answering (VQA) | Given a visual input (image or video), VQA is the task of correctly answering a question about it. |
Visual Reasoning and Compositional Question Answering (GQA) | GQA is an upgraded version of VQA and aims to advance research on the visual reasoning of natural scenes. |
Natural Language for Visual Reasoning (NLVR) | The input of the NLVR task is two images and a text description, and the output is whether the corresponding relationship between the images and the text description is consistent (two labels: true or false). |
Visual Entailment (VE) | In the VE task, the image is the premise and the text is the hypothesis. The goal is to predict whether the image entails the text. There are three labels: Entailment, Neutral, and Contradiction. |
Visual Commonsense Reasoning (VCR) | VCR takes the form of multiple-choice questions. For each question, the model must choose an answer from several candidate answers and then select the rationale for that answer from several candidate rationales. |
Grounding Referring Expressions (GRE) | The GRE task is to localize an image region given a text reference. The model can output a score for each region, and the region with the highest score is used as the prediction region. |
Visual Spatial Reasoning (VSR) | The Visual Spatial Reasoning (VSR) corpus is a collection of caption-image pairs with true/false labels. Each caption describes the spatial relation of two individual objects in the image, and a vision-language model (VLM) needs to judge whether the caption is correctly describing the image (True) or not (False). |
2. Regression | |
Multi-modal Sentiment Analysis (MSA) | MSA aims to detect sentiment in videos by leveraging multi-modal signals (e.g., vision and language). The model predicts the affective orientation of an utterance as a continuous intensity variable. |
3. Retrieval | |
Vision-Language Retrieval (VLR) | VLR requires understanding both the vision (image or video) and language domains and matching them appropriately. It includes two subtasks: vision-to-text retrieval, which fetches the most relevant text descriptions for a visual input from a larger pool of descriptions, and text-to-vision retrieval, which does the reverse. |
4. Generation | |
Visual Captioning (VC) | VC aims to generate semantically and syntactically appropriate text descriptions for a given visual (image or video) input. |
Novel Object Captioning at Scale (NoCaps) | NoCaps extends the VC task to test a model's capability of describing novel objects from the Open Images dataset, which are unseen in the training corpus. |
Visual Dialogue (VD) | In VD, the model is given an image (or video), a dialogue history, and a question in natural language, and must generate an answer to the question. |
5. Others | |
Multi-modal Machine Translation (MMT) | MMT is a two-fold task of translation and text generation: translating text from one language to another with additional information from other modalities, such as images. |
Vision-Language Navigation (VLN) | VLN is a language grounding task in which an agent moves through and explores a real-world environment by following linguistic instructions, guided by what it sees. |
Optical Character Recognition (OCR) | OCR generally refers to detecting and recognizing text information in images, which includes two parts: text detection (similar to regression) and text recognition (similar to classification). |
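
As a rough illustration of two of the task formats above, the sketch below ranks a pool of captions for one image (vision-to-text retrieval) and picks the highest-scoring region (GRE). It is a minimal sketch, not taken from any listed paper: the function names `cosine`, `retrieve`, and `ground`, the use of cosine similarity as the matching score, and the toy embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query, pool, k=1):
    """Return the indices of the k pool items most similar to the query."""
    ranked = sorted(
        range(len(pool)),
        key=lambda i: cosine(query, pool[i]),
        reverse=True,
    )
    return ranked[:k]


def ground(region_scores):
    """GRE-style prediction: index of the region with the highest score."""
    return max(range(len(region_scores)), key=lambda i: region_scores[i])


# Toy embeddings and scores, made up for illustration only.
image_emb = [0.9, 0.1, 0.0]
caption_embs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]

top2 = retrieve(image_emb, caption_embs, k=2)   # vision-to-text retrieval
best_region = ground([0.12, 0.71, 0.33])        # referring expression grounding
```

Text-to-vision retrieval would use the same `retrieve` call with a caption embedding as the query and image embeddings as the pool. Real systems obtain the embeddings and region scores from a pre-trained model rather than from hand-written lists.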
To the extent possible under law, Zhihong Chen has waived all copyright and related or neighboring rights to this work.
This repo started from this survey. We thank the authors for their comprehensive review of existing studies.