Awesome Video-Text Retrieval by Deep Learning

A curated list of deep learning resources for video-text retrieval.

Contributing

Please feel free to open pull requests to add papers.

Markdown format:

- `[Author Journal/Booktitle Year]` Title. Journal/Booktitle, Year. [[paper]](link) [[code]](link) [[homepage]](link)
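For example, a filled-in entry following this template (using a paper already in the list below, with `link` as a placeholder for the actual URLs) would look like:

```markdown
- `[Dong et al. TPAMI21]` Dual Encoding for Video Retrieval by Text. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. [[paper]](link) [[code]](link)
```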

Table of Contents

Implementations

PyTorch

TensorFlow

Others

Useful Toolkit

Papers

2021

  • [Dong et al. TPAMI21] Dual Encoding for Video Retrieval by Text. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. [paper] [code]
  • [Wei et al. TPAMI21] Universal Weighting Metric Learning for Cross-Modal Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. [paper]
  • [Lei et al. CVPR21] Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling. CVPR, 2021. [paper] [code]
  • [Wray et al. CVPR21] On Semantic Similarity in Video Retrieval. CVPR, 2021. [paper] [code]
  • [Chen et al. CVPR21] Learning the Best Pooling Strategy for Visual Semantic Embedding. CVPR, 2021. [paper] [code]
  • [Wang et al. CVPR21] T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval. CVPR, 2021. [paper]
  • [Miech et al. CVPR21] Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers. CVPR, 2021. [paper]
  • [Patrick et al. ICLR21] Support-set Bottlenecks for Video-text Representation Learning. ICLR, 2021. [paper]
  • [Qi et al. TIP21] Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval. IEEE Transactions on Image Processing, 2021. [paper]
  • [Dong et al. NEUCOM21] Multi-level Alignment Network for Domain Adaptive Cross-modal Retrieval. Neurocomputing, 2021. [paper] [code]
  • [Chen et al. AAAI21] Mind-the-Gap! Unsupervised Domain Adaptation for Text-Video Retrieval. AAAI, 2021. [paper]
  • [Song et al. TMM21] Spatial-temporal Graphs for Cross-modal Text2Video Retrieval. IEEE Transactions on Multimedia, 2021. [paper]
  • [Jin et al. SIGIR21] Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval. SIGIR, 2021. [paper]
  • [He et al. SIGIR21] Improving Video Retrieval by Adaptive Margin. SIGIR, 2021. [paper]
  • [Wang et al. IJCAI21] Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment. IJCAI, 2021. [paper]
  • [Dzabraev et al. CVPRW21] MDMMT: Multidomain Multimodal Transformer for Video Retrieval. CVPR Workshops, 2021. [paper]
  • [Hao et al. ICME21] What Matters: Attentive and Relational Feature Aggregation Network for Video-Text Retrieval. ICME, 2021. [paper]
  • [Wu et al. ICME21] Multi-Dimensional Attentive Hierarchical Graph Pooling Network for Video-Text Retrieval. ICME, 2021. [paper]
  • [Song et al. ICIP21] Semantic-Preserving Metric Learning for Video-Text Retrieval. IEEE International Conference on Image Processing, 2021. [paper]
  • [Croitoru et al. ARXIV21] TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval. arXiv:2104.08271, 2021. [paper]
  • [Bain et al. ARXIV21] Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. arXiv:2104.00650, 2021. [paper] [code]
  • [Liu et al. ARXIV21] HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval. arXiv:2103.15049, 2021. [paper]
  • [Akbari et al. ARXIV21] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. arXiv:2104.11178, 2021. [paper] [code]
  • [Fang et al. ARXIV21] CLIP2Video: Mastering Video-Text Retrieval via Image CLIP. arXiv:2106.11097, 2021. [paper] [code]

2020

  • [Yang et al. SIGIR20] Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval. SIGIR, 2020. [paper]
  • [Ging et al. NeurIPS20] COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning. NeurIPS, 2020. [paper] [code]
  • [Gabeur et al. ECCV20] Multi-modal Transformer for Video Retrieval. ECCV, 2020. [paper] [code] [homepage]
  • [Li et al. TMM20] SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries. IEEE Transactions on Multimedia, 2020. [paper]
  • [Wang et al. TMM20] Learning Coarse-to-Fine Graph Neural Networks for Video-Text Retrieval. IEEE Transactions on Multimedia, 2020. [paper]
  • [Chen et al. TMM20] Interclass-Relativity-Adaptive Metric Learning for Cross-Modal Matching and Beyond. IEEE Transactions on Multimedia, 2020. [paper]
  • [Wu et al. ACMMM20] Interpretable Embedding for Ad-Hoc Video Search. ACM Multimedia, 2020. [paper]
  • [Feng et al. IJCAI20] Exploiting Visual Semantic Reasoning for Video-Text Retrieval. IJCAI, 2020. [paper]
  • [Wei et al. CVPR20] Universal Weighting Metric Learning for Cross-Modal Retrieval. CVPR, 2020. [paper]
  • [Doughty et al. CVPR20] Action Modifiers: Learning from Adverbs in Instructional Videos. CVPR, 2020. [paper]
  • [Chen et al. CVPR20] Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning. CVPR, 2020. [paper]
  • [Zhu et al. CVPR20] ActBERT: Learning Global-Local Video-Text Representations. CVPR, 2020. [paper]
  • [Miech et al. CVPR20] End-to-End Learning of Visual Representations From Uncurated Instructional Videos. CVPR, 2020. [paper] [code] [homepage]
  • [Zhao et al. ICME20] Stacked Convolutional Deep Encoding Network For Video-Text Retrieval. ICME, 2020. [paper]
  • [Luo et al. ARXIV20] UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation. arXiv:2002.06353, 2020. [paper]

2019

  • [Dong et al. CVPR19] Dual Encoding for Zero-Example Video Retrieval. CVPR, 2019. [paper] [code]
  • [Song et al. CVPR19] Polysemous visual-semantic embedding for cross-modal retrieval. CVPR, 2019. [paper]
  • [Wray et al. ICCV19] Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings. ICCV, 2019. [paper]
  • [Xiong et al. ICCV19] A Graph-Based Framework to Bridge Movies and Synopses. ICCV, 2019. [paper]
  • [Li et al. ACMMM19] W2VV++ Fully Deep Learning for Ad-hoc Video Search. ACM Multimedia, 2019. [paper] [code]
  • [Liu et al. BMVC19] Use What You Have: Video Retrieval Using Representations From Collaborative Experts. BMVC, 2019. [paper] [code]
  • [Choi et al. BigMM19] From Intra-Modal to Inter-Modal Space: Multi-Task Learning of Shared Representations for Cross-Modal Retrieval. International Conference on Multimedia Big Data, 2019. [paper]

2018

  • [Dong et al. TMM18] Predicting visual features from text for image and video caption retrieval. IEEE Transactions on Multimedia, 2018. [paper] [code]
  • [Zhang et al. ECCV18] Cross-Modal and Hierarchical Modeling of Video and Text. ECCV, 2018. [paper] [code]
  • [Yu et al. ECCV18] A Joint Sequence Fusion Model for Video Question Answering and Retrieval. ECCV, 2018. [paper]
  • [Shao et al. ECCV18] Find and focus: Retrieve and localize video events with natural language queries. ECCV, 2018. [paper]
  • [Mithun et al. ICMR18] Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval. ICMR, 2018. [paper] [code]
  • [Miech et al. arXiv18] Learning a Text-Video Embedding from Incomplete and Heterogeneous Data. arXiv preprint arXiv:1804.02516, 2018. [paper] [code]

Before

  • [Yu et al. CVPR17] End-to-end concept word detection for video captioning, retrieval, and question answering. CVPR, 2017. [paper] [code]
  • [Otani et al. ECCVW16] Learning joint representations of videos and sentences with web image search. ECCV Workshop, 2016. [paper]
  • [Xu et al. AAAI15] Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. AAAI, 2015. [paper]

Ad-hoc Video Search

  • For papers targeting ad-hoc video search in the context of TRECVID, please refer to here.

Other Related

  • [Rouditchenko et al. INTERSPEECH21] AVLnet: Learning Audio-Visual Language Representations from Instructional Videos. Interspeech, 2021. [paper] [code]
  • [Li et al. arXiv20] Learning Spatiotemporal Features via Video and Text Pair Discrimination. arXiv preprint arXiv:2001.05691, 2020. [paper]

Datasets

  • [MSVD] Chen et al. Collecting Highly Parallel Data for Paraphrase Evaluation. ACL, 2011. [paper] [dataset]
  • [MSRVTT] Xu et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. CVPR, 2016. [paper] [dataset]
  • [TGIF] Li et al. TGIF: A new dataset and benchmark on animated GIF description. CVPR, 2016. [paper] [homepage]
  • [AVS] Awad et al. TRECVID 2016: Evaluating video search, video event detection, localization, and hyperlinking. TRECVID Workshop, 2016. [paper] [dataset]
  • [LSMDC] Rohrbach et al. Movie description. IJCV, 2017. [paper] [dataset]
  • [ActivityNet Captions] Krishna et al. Dense-captioning events in videos. ICCV, 2017. [paper] [dataset]
  • [DiDeMo] Hendricks et al. Localizing Moments in Video with Natural Language. ICCV, 2017. [paper] [code]
  • [HowTo100M] Miech et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. ICCV, 2019. [paper] [homepage]
  • [VATEX] Wang et al. VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. ICCV, 2019. [paper] [homepage]

Licenses

CC0

To the extent possible under law, danieljf24 has waived all copyright and related or neighboring rights to this repository.
