Image Dataset | Num | Image Dataset | Num |
---|---|---|---|
COCO | 123K | Flickr | 31K |
SBU | 1M | VG | 108K |
CC3M | 3.3M | CUB | 11K |
Multi30K | 151K | CC14M | 14M |
Fashion-Gen | 293K | XTD | 10K |
Amazon reviews | 14M | | |

Video Dataset | Num | Video Dataset | Num |
---|---|---|---|
MSRVTT | 10K | MSVD | 2K |
LSMDC | 118K | DiDeMo | 8K |
YouCook2 | 2K | YFCC | 100M |
CrossTask | 4K | HowTo100M | 100M |
VaTeX | 41K | Mining Youtube | 20K |
WebVid2M | 2M | QueryYD | 1K |
VGGSound | 200K | LiveBot | 2K |
Kinetics-700 | 650K | FCVID | 91K |
ActivityNet | 20K | | |

Task | Paper | Dataset | Pretraining | Year |
---|---|---|---|---|
Image Retrieval | Multi-Lingual Acquisition on Multimodal Pre-training for Cross-modal Retrieval [link] | CC3M, Multi30K, COCO, XTD, MSRVTT | Yes | 2022 |
Video Retrieval | Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval [link] | MSRVTT, MSVD, DiDeMo, LSMDC | No | 2022 |
Video Retrieval | Everything at Once – Multi-modal Fusion Transformer for Video Retrieval [link] | HowTo100M, Web, CrossTask, Mining YouTube | Yes | 2022 |
Video Retrieval | Cross Modal Retrieval with Querybank Normalisation [link] | MSRVTT, LSMDC, MSVD, VaTeX, QueryYD, DiDeMo | No | 2022 |
Image Retrieval | A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval [link] | COCO, CUB, Flickr | No | 2022 |
Image Retrieval | EI-CLIP: Entity-aware Interventional Contrastive Learning for E-commerce Cross-modal Retrieval [link] | Fashion-Gen, Amazon reviews | Yes | 2022 |
Image Retrieval | COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [link] | CC3M, CC14M, SBU, VG, COCO, Flickr | Yes | 2022 |
Image Retrieval | ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval [link] | VG, Flickr, TC, CTC | No | 2022 |
Video Retrieval | X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval [link] | MSVD, LSMDC, MSRVTT | No | 2022 |
Video Retrieval | ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound [link] | WebVid2M, VGGSound | Yes | 2022 |
Video Retrieval | VTC: Improving Video-Text Retrieval with User Comments [link] | LiveBot, Kinetics-700 | Yes | 2022 |
Video Retrieval | Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video Retrieval [link] | FCVID, ActivityNet, YFCC | No | 2022 |
Video Retrieval | MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval [link] | CC3M, WebVid-2M, MSVD, LSMDC, DiDeMo | Yes | 2022 |
Video Retrieval | Multi-Query Video Retrieval [link] | MSRVTT, MSVD, VATEX | No | 2022 |
Video Retrieval | Selective Query-guided Debiasing for Video Corpus Moment Retrieval [link] | TVR, ActivityNet, DiDeMo | No | 2022 |
Video Retrieval | TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval [link] | MSRVTT, VATEX, LSMDC, ActivityNet, DiDeMo | No | 2022 |
Video Retrieval | Learning Linguistic Association Towards Efficient Text-Video Retrieval [link] | MSRVTT, MSVD, VATEX | No | 2022 |
Image Retrieval | CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval [link] | COCO, Flickr | No | 2022 |
Image Retrieval | Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval [link] | COCO, Flickr, CC14K | No | 2021 |
Image Retrieval | Learning with Noisy Correspondence for Cross-modal Matching [link] | COCO, Flickr, CC152K | No | 2022 |
Image Retrieval | Probabilistic Embeddings for Cross-Modal Retrieval [link] | COCO, CUB | No | 2021 |
Video Retrieval | TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [link] | MSRVTT | No | 2021 |
Video Retrieval | Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval [link] | COCO, MSRVTT, MSVD, LSMDC | No | 2021 |
Image Retrieval | Learning Cross-Modal Retrieval with Noisy Labels [link] | Wikipedia, INRIA-Websearch, NUS-WIDE, XMediaNet | Yes | 2021 |
Image Retrieval | Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning [link] | Recipe1M | Yes | 2021 |
Image Retrieval | Ask&Confirm: Active Detail Enriching for Cross-Modal Retrieval with Partial Query [link] | VG | No | 2021 |
Image Retrieval | Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-Modal Pretraining [link] | Product1M | Yes | 2021 |
Image Retrieval | Wasserstein Coupled Graph Learning for Cross-Modal Retrieval [link] | Flickr, COCO, Real World Scene Graph, Moviegraphs | No | 2021 |
Image Retrieval | Deep Hash Distillation for Image Retrieval [link] | ImageNet, NUS-WIDE, COCO | Yes | 2021 |
Video Retrieval | Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval [link] | MSRVTT, TGIF, MSVD, VATEX | No | 2021 |
Video Retrieval | CodeCMR: Cross-Modal Retrieval For Function-Level Binary Source Code Matching [link] | YouCook2, MSRVTT, HowTo100M, CrossTask, Mining Youtube | Yes | 2020 |
Image Retrieval | IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval [link] | KWAI-AD, Flickr, COCO | No | 2020 |
Video Retrieval | Multi-modal Transformer for Video Retrieval [link] | MSRVTT, ActivityNet, LSMDC | No | 2020 |
Image Retrieval | Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval [link] | Politics, GoodNews, CC, COCO | Yes | 2020 |
Image Retrieval | Learning Joint Visual Semantic Matching Embeddings for Language-guided Retrieval [link] | Fashion200k | Yes | 2020 |