Image Retrieval |
Multi-Lingual Acquisition on Multimodal Pre-training for Cross-modal Retrieval [link] |
CC3M, Multi30K, COCO, XTD, MSRVTT |
Yes |
2022 |
Video Retrieval |
Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval [link] |
MSRVTT, MSVD, DiDeMo, LSMDC |
No |
2022 |
Video Retrieval |
Everything at Once – Multi-modal Fusion Transformer for Video Retrieval [link] |
HowTo100M, Web, CrossTask, Mining YouTube |
Yes |
2022 |
Video Retrieval |
Cross Modal Retrieval with Querybank Normalisation [link] |
MSRVTT, LSMDC, MSVD, VATEX, QuerYD, DiDeMo |
No |
2022 |
Image Retrieval |
A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval [link] |
COCO, CUB, Flickr |
No |
2022 |
Image Retrieval |
EI-CLIP: Entity-aware Interventional Contrastive Learning for E-commerce Cross-modal Retrieval [link] |
Fashion-Gen, Amazon reviews |
Yes |
2022 |
Image Retrieval |
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [link] |
CC3M, CC14M, SBU, VG, COCO, Flickr |
Yes |
2022 |
Image Retrieval |
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval [link] |
VG, Flickr, TC, CTC |
No |
2022 |
Video Retrieval |
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval [link] |
MSVD, LSMDC, MSRVTT |
No |
2022 |
Video Retrieval |
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound [link] |
WebVid2M, VGGSound |
Yes |
2022 |
Video Retrieval |
VTC: Improving Video-Text Retrieval with User Comments [link] |
LiveBot, Kinetics-700 |
Yes |
2022 |
Video Retrieval |
Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video Retrieval [link] |
FCVID, ActivityNet, YFCC |
No |
2022 |
Video Retrieval |
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval [link] |
CC3M, WebVid-2M, MSVD, LSMDC, DiDeMo |
Yes |
2022 |
Video Retrieval |
Multi-Query Video Retrieval [link] |
MSRVTT, MSVD, VATEX |
No |
2022 |
Video Retrieval |
Selective Query-guided Debiasing for Video Corpus Moment Retrieval [link] |
TVR, ActivityNet, DiDeMo |
No |
2022 |
Video Retrieval |
TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval [link] |
MSRVTT, VATEX, LSMDC, ActivityNet, DiDeMo |
No |
2022 |
Video Retrieval |
Learning Linguistic Association Towards Efficient Text-Video Retrieval [link] |
MSRVTT, MSVD, VATEX |
No |
2022 |
Image Retrieval |
CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval [link] |
COCO, Flickr |
No |
2022 |
Image Retrieval |
Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval [link] |
COCO, Flickr, CC14K |
No |
2021 |
Image Retrieval |
Learning with Noisy Correspondence for Cross-modal Matching [link] |
COCO, Flickr, CC152K |
No |
2022 |
Image Retrieval |
Probabilistic Embeddings for Cross-Modal Retrieval [link] |
COCO, CUB |
No |
2021 |
Video Retrieval |
TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [link] |
MSRVTT |
No |
2021 |
Video Retrieval |
Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval [link] |
COCO, MSRVTT, MSVD, LSMDC |
No |
2021 |
Image Retrieval |
Learning Cross-Modal Retrieval with Noisy Labels [link] |
Wikipedia, INRIA-Websearch, NUS-WIDE, XMediaNet |
Yes |
2021 |
Image Retrieval |
Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning [link] |
Recipe1M |
Yes |
2021 |
Image Retrieval |
Ask&Confirm: Active Detail Enriching for Cross-Modal Retrieval with Partial Query [link] |
VG |
No |
2021 |
Image Retrieval |
Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-Modal Pretraining [link] |
Product1M |
Yes |
2021 |
Image Retrieval |
Wasserstein Coupled Graph Learning for Cross-Modal Retrieval [link] |
Flickr, COCO, Real World Scene Graph, Moviegraphs |
No |
2021 |
Image Retrieval |
Deep Hash Distillation for Image Retrieval [link] |
ImageNet, NUS-WIDE, COCO |
Yes |
2021 |
Video Retrieval |
Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval [link] |
MSRVTT, TGIF, MSVD, VATEX |
No |
2021 |
Video Retrieval |
CodeCMR: Cross-Modal Retrieval For Function-Level Binary Source Code Matching [link] |
YouCook2, MSRVTT, HowTo100M, CrossTask, Mining YouTube |
Yes |
2020 |
Image Retrieval |
IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval [link] |
KWAI-AD, Flickr, COCO |
No |
2020 |
Video Retrieval |
Multi-modal Transformer for Video Retrieval [link] |
MSRVTT, ActivityNet, LSMDC |
No |
2020 |
Image Retrieval |
Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval [link] |
Politics, GoodNews, CC, COCO |
Yes |
2020 |
Image Retrieval |
Learning Joint Visual Semantic Matching Embeddings for Language-guided Retrieval [link] |
Fashion200k |
Yes |
2020 |