A curated list of deep learning resources for video-text retrieval.
Contributions are welcome: please open a pull request to add papers.
Markdown format:
- `[Author Journal/Booktitle Year]` Title. Journal/Booktitle, Year. [[paper]](link) [[code]](link) [[homepage]](link)
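As an illustration, an entry for one of the listed papers would look like this (the two URLs below are placeholders, not real links):

```
- `[Dong et al. CVPR19]` Dual Encoding for Zero-Example Video Retrieval. CVPR, 2019. [[paper]](paper-url) [[code]](code-url)
```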
Contents:
- Implementations
- Papers
  - 2021 - 2020 - 2019 - 2018 - Before
- Ad-hoc Video Search
- Other Related
- Datasets

## Implementations
- hybrid_space
- dual_encoding
- w2vvpp
- Mixture-of-Embedding-Experts
- howto100m
- collaborative
- hgr
- coot
- mmt
- ClipBERT
- w2vv (Keras)
## Papers

### 2021
- `[Dong et al. TPAMI21]` Dual Encoding for Video Retrieval by Text. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. [paper] [code]
- `[Wei et al. TPAMI21]` Universal Weighting Metric Learning for Cross-Modal Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. [paper]
- `[Lei et al. CVPR21]` Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling. CVPR, 2021. [paper] [code]
- `[Wray et al. CVPR21]` On Semantic Similarity in Video Retrieval. CVPR, 2021. [paper] [code]
- `[Chen et al. CVPR21]` Learning the Best Pooling Strategy for Visual Semantic Embedding. CVPR, 2021. [paper] [code]
- `[Wang et al. CVPR21]` T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval. CVPR, 2021. [paper]
- `[Miech et al. CVPR21]` Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers. CVPR, 2021. [paper]
- `[Patrick et al. ICLR21]` Support-set Bottlenecks for Video-Text Representation Learning. ICLR, 2021. [paper]
- `[Qi et al. TIP21]` Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval. IEEE Transactions on Image Processing, 2021. [paper]
- `[Dong et al. NEUCOM21]` Multi-level Alignment Network for Domain Adaptive Cross-modal Retrieval. Neurocomputing, 2021. [paper] [code]
- `[Chen et al. AAAI21]` Mind-the-Gap! Unsupervised Domain Adaptation for Text-Video Retrieval. AAAI, 2021. [paper]
- `[Song et al. TMM21]` Spatial-temporal Graphs for Cross-modal Text2Video Retrieval. IEEE Transactions on Multimedia, 2021. [paper]
- `[Jin et al. SIGIR21]` Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval. SIGIR, 2021. [paper]
- `[He et al. SIGIR21]` Improving Video Retrieval by Adaptive Margin. SIGIR, 2021. [paper]
- `[Wang et al. IJCAI21]` Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment. IJCAI, 2021. [paper]
- `[Dzabraev et al. CVPRW21]` MDMMT: Multidomain Multimodal Transformer for Video Retrieval. CVPR Workshops, 2021. [paper]
- `[Hao et al. ICME21]` What Matters: Attentive and Relational Feature Aggregation Network for Video-Text Retrieval. ICME, 2021. [paper]
- `[Wu et al. ICME21]` Multi-Dimensional Attentive Hierarchical Graph Pooling Network for Video-Text Retrieval. ICME, 2021. [paper]
- `[Song et al. ICIP21]` Semantic-Preserving Metric Learning for Video-Text Retrieval. IEEE International Conference on Image Processing, 2021. [paper]
- `[Croitoru et al. ARXIV21]` TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval. arXiv:2104.08271, 2021. [paper]
- `[Bain et al. ARXIV21]` Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. arXiv:2104.00650, 2021. [paper] [code]
- `[Liu et al. ARXIV21]` HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval. arXiv:2103.15049, 2021. [paper]
- `[Akbari et al. ARXIV21]` VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. arXiv:2104.11178, 2021. [paper] [code]
- `[Fang et al. ARXIV21]` CLIP2Video: Mastering Video-Text Retrieval via Image CLIP. arXiv:2106.11097, 2021. [paper] [code]
### 2020
- `[Yang et al. SIGIR20]` Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval. SIGIR, 2020. [paper]
- `[Ging et al. NeurIPS20]` COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning. NeurIPS, 2020. [paper] [code]
- `[Gabeur et al. ECCV20]` Multi-modal Transformer for Video Retrieval. ECCV, 2020. [paper] [code] [homepage]
- `[Li et al. TMM20]` SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries. IEEE Transactions on Multimedia, 2020. [paper]
- `[Wang et al. TMM20]` Learning Coarse-to-Fine Graph Neural Networks for Video-Text Retrieval. IEEE Transactions on Multimedia, 2020. [paper]
- `[Chen et al. TMM20]` Interclass-Relativity-Adaptive Metric Learning for Cross-Modal Matching and Beyond. IEEE Transactions on Multimedia, 2020. [paper]
- `[Wu et al. ACMMM20]` Interpretable Embedding for Ad-Hoc Video Search. ACM Multimedia, 2020. [paper]
- `[Feng et al. IJCAI20]` Exploiting Visual Semantic Reasoning for Video-Text Retrieval. IJCAI, 2020. [paper]
- `[Wei et al. CVPR20]` Universal Weighting Metric Learning for Cross-Modal Retrieval. CVPR, 2020. [paper]
- `[Doughty et al. CVPR20]` Action Modifiers: Learning from Adverbs in Instructional Videos. CVPR, 2020. [paper]
- `[Chen et al. CVPR20]` Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning. CVPR, 2020. [paper]
- `[Zhu et al. CVPR20]` ActBERT: Learning Global-Local Video-Text Representations. CVPR, 2020. [paper]
- `[Miech et al. CVPR20]` End-to-End Learning of Visual Representations from Uncurated Instructional Videos. CVPR, 2020. [paper] [code] [homepage]
- `[Zhao et al. ICME20]` Stacked Convolutional Deep Encoding Network for Video-Text Retrieval. ICME, 2020. [paper]
- `[Luo et al. ARXIV20]` UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation. arXiv:2002.06353, 2020. [paper]
### 2019
- `[Dong et al. CVPR19]` Dual Encoding for Zero-Example Video Retrieval. CVPR, 2019. [paper] [code]
- `[Song et al. CVPR19]` Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval. CVPR, 2019. [paper]
- `[Wray et al. ICCV19]` Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings. ICCV, 2019. [paper]
- `[Xiong et al. ICCV19]` A Graph-Based Framework to Bridge Movies and Synopses. ICCV, 2019. [paper]
- `[Li et al. ACMMM19]` W2VV++: Fully Deep Learning for Ad-hoc Video Search. ACM Multimedia, 2019. [paper] [code]
- `[Liu et al. BMVC19]` Use What You Have: Video Retrieval Using Representations From Collaborative Experts. BMVC, 2019. [paper] [code]
- `[Choi et al. BigMM19]` From Intra-Modal to Inter-Modal Space: Multi-Task Learning of Shared Representations for Cross-Modal Retrieval. International Conference on Multimedia Big Data, 2019. [paper]
### 2018
- `[Dong et al. TMM18]` Predicting Visual Features from Text for Image and Video Caption Retrieval. IEEE Transactions on Multimedia, 2018. [paper] [code]
- `[Zhang et al. ECCV18]` Cross-Modal and Hierarchical Modeling of Video and Text. ECCV, 2018. [paper] [code]
- `[Yu et al. ECCV18]` A Joint Sequence Fusion Model for Video Question Answering and Retrieval. ECCV, 2018. [paper]
- `[Shao et al. ECCV18]` Find and Focus: Retrieve and Localize Video Events with Natural Language Queries. ECCV, 2018. [paper]
- `[Mithun et al. ICMR18]` Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval. ICMR, 2018. [paper] [code]
- `[Miech et al. arXiv18]` Learning a Text-Video Embedding from Incomplete and Heterogeneous Data. arXiv:1804.02516, 2018. [paper] [code]
### Before
- `[Yu et al. CVPR17]` End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering. CVPR, 2017. [paper] [code]
- `[Otani et al. ECCVW16]` Learning Joint Representations of Videos and Sentences with Web Image Search. ECCV Workshop, 2016. [paper]
- `[Xu et al. AAAI15]` Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework. AAAI, 2015. [paper]
## Other Related
- `[Rouditchenko et al. INTERSPEECH21]` AVLnet: Learning Audio-Visual Language Representations from Instructional Videos. Interspeech, 2021. [paper] [code]
- `[Li et al. arXiv20]` Learning Spatiotemporal Features via Video and Text Pair Discrimination. arXiv:2001.05691, 2020. [paper]
## Datasets
- `[MSVD]` Chen et al. Collecting Highly Parallel Data for Paraphrase Evaluation. ACL, 2011. [paper] [dataset]
- `[MSRVTT]` Xu et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. CVPR, 2016. [paper] [dataset]
- `[TGIF]` Li et al. TGIF: A New Dataset and Benchmark on Animated GIF Description. CVPR, 2016. [paper] [homepage]
- `[AVS]` Awad et al. TRECVID 2016: Evaluating Video Search, Video Event Detection, Localization, and Hyperlinking. TRECVID Workshop, 2016. [paper] [dataset]
- `[LSMDC]` Rohrbach et al. Movie Description. IJCV, 2017. [paper] [dataset]
- `[ActivityNet Captions]` Krishna et al. Dense-Captioning Events in Videos. ICCV, 2017. [paper] [dataset]
- `[DiDeMo]` Hendricks et al. Localizing Moments in Video with Natural Language. ICCV, 2017. [paper] [code]
- `[HowTo100M]` Miech et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. ICCV, 2019. [paper] [homepage]
- `[VATEX]` Wang et al. VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. ICCV, 2019. [paper] [homepage]
To the extent possible under law, danieljf24 has waived all copyright and related or neighboring rights to this repository.