friedrichor/Awesome-Multimodal-Papers

Awesome-Multimodal-Papers

A curated list of awesome Multimodal studies.

Multimodal Papers

Visual Understanding

Title Venue Date Code Supplement
✨ Apollo: An Exploration of Video Understanding in Large Multimodal Models (Exploration) (Meta) arXiv 2024-12-13 Star Project Page
CompCap: Improving Multimodal Large Language Models with Composite Captions (Meta) arXiv 2024-12-09 - -
✨ Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (InternVL 2.5) arXiv 2024-12-06 Star Project Page
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs arXiv 2024-10-21 - Project Page
✨ Video Instruction Tuning With Synthetic Data (LLaVA-Video, LLaVA-NeXT Series) arXiv 2024-10-03 Star Project Page
Dataset
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions arXiv 2024-09-26 - Project Page
Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model (MGLMM, Alibaba) arXiv 2024-09-20 Star Project Page
POINTS: Improving Your Vision-language Model with Affordable Strategies (WeChat) arXiv 2024-09-07 Star -
✨ xGen-MM (BLIP-3): A Family of Open Large Multimodal Models arXiv 2024-08-16 Star Project Page
Collections
✨ LLaVA-OneVision: Easy Visual Task Transfer (LLaVA-NeXT Series) arXiv 2024-08-06 Star Project Page
Dataset
Tarsier: Recipes for Training and Evaluating Large Video Description Models (Tarsier, Dream1k, by ByteDance) arXiv 2024-07-30 Star Dataset
✨ InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output arXiv 2024-07-03 Star -
TokenPacker: Efficient Visual Projector for Multimodal LLM arXiv 2024-07-02 Star -
✨ Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (Cambrian, Data Rationing) arXiv 2024-06-24 Star Project Page
Dataset
✨ Long Context Transfer from Language to Vision (LongVA, by Ziwei Liu, Chunyuan Li) arXiv 2024-06-24 Star Project Page
Generative Visual Instruction Tuning arXiv 2024-06-17 Star -
✨ VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding arXiv 2024-06-13 Star Collections
✨ 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (Apple) arXiv 2024-06-13 Star Project Page
Wechat
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs arXiv 2024-06-11 Star -
Wings: Learning Multimodal LLMs without Text-only Forgetting arXiv 2024-06-05 - -
Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment (MIVPG) arXiv 2024-06-05 - -
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM arXiv 2024-06-05 Star -
OLIVE: Object Level In-Context Visual Embeddings ACL 2024 2024-06-02 Star -
X-VILA: Cross-Modality Alignment for Large Language Model (by NVIDIA) arXiv 2024-05-29 - Wechat
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models arXiv 2024-05-24 Star -
Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models arXiv 2024-05-24 - -
LOVA3: Learning to Visual Question Answering, Asking and Assessment arXiv 2024-05-23 Star -
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability arXiv 2024-05-23 Star Project Page
Chameleon: Mixed-Modal Early-Fusion Foundation Models (Meta) arXiv 2024-05-16 Star Blog
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts arXiv 2024-05-09 Star Project Page
Dataset
Wechat
ImageInWords: Unlocking Hyper-Detailed Image Descriptions (Google) arXiv 2024-05-05 Star Project Page
Dataset
✨ What matters when building vision-language models? (Idefics2) arXiv 2024-05-03 Star
Collections
MANTIS: Interleaved Multi-Image Instruction Tuning arXiv 2024-05-02 Star Project Page
Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs CVPR 2024 Workshop 2024-04-23 - Wechat
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models arXiv 2024-04-19 Star Project Page
Dataset
MoVA: Adapting Mixture of Vision Experts to Multimodal Context arXiv 2024-04-19 Star -
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models arXiv 2024-04-18 - Project Page
Project Page
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? (LaDiC) NAACL 2024 2024-04-16 Star -
AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception (AesExpert, AesMMIT Dataset) arXiv 2024-04-15 Star -
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (Ferret-v2) arXiv 2024-04-11 - -
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (MiniCPM series) arXiv 2024-04-09 Star
Star
Blog
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Ferret-UI) arXiv 2024-04-08 - -
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding CVPR 2024 2024-04-08 Star Project Page
Koala: Key frame-conditioned long video-LLM CVPR 2024 2024-04-05 Star Project Page
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens arXiv 2024-04-04 Star Project Page
LongVLM: Efficient Long Video Understanding via Large Language Models arXiv 2024-04-04 Star -
VideoAgent: Long-form Video Understanding with Large Language Model as Agent (key frame) arXiv 2024-03-15 - -
✨ MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (Apple) arXiv 2024-03-14 - -
UniCode: Learning a Unified Codebook for Multimodal Large Language Models arXiv 2024-03-14 - -
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv 2024-03-08 - Project Page
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models arXiv 2024-03-05 Star -
RegionGPT: Towards Region Understanding Vision Language Model CVPR 2024 2024-03-04 - Project Page
All in an Aggregated Image for In-Image Learning arXiv 2024-02-28 Star -
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners CVPR 2024 2024-02-27 Star Project Page
TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages arXiv 2024-02-25 - -
LLMBind: A Unified Modality-Task Integration Framework arXiv 2024-02-22 - -
✨ ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model (ALLaVA) arXiv 2024-02-18 Star Demo Page
Dataset
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model arXiv 2024-02-06 Star -
MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices arXiv 2023-12-28 Star -
Gemini: A Family of Highly Capable Multimodal Models arXiv 2023-12-19 - Project Page
✨ Osprey: Pixel Understanding with Visual Instruction Tuning CVPR 2024 2023-12-15 Star -
✨ VILA: On Pre-training for Visual Language Models (NVIDIA, MIT) CVPR 2024 2023-12-12 Star -
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models arXiv 2023-12-11 Star Project Page
Prompt Highlighter: Interactive Control for Multi-Modal LLMs CVPR 2024 2023-12-07 Star Project Page
APoLLo: Unified Adapter and Prompt Learning for Vision Language Models EMNLP 2023 2023-12-04 Star Project Page
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models arXiv 2023-11-28 Star Project Page
Dataset
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models arXiv 2023-11-22 Star -
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions arXiv 2023-11-21 Star Project Page
LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge CVPR 2024 2023-11-20 Star Project Page
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection arXiv 2023-11-16 Star -
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration arXiv 2023-11-07 Star -
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning arXiv 2023-10-14 Star Project Page
Ferret: Refer and Ground Anything Anywhere at Any Granularity (Ferret) ICLR 2024 2023-10-11 Star -
✨ Improved Baselines with Visual Instruction Tuning (LLaVA-1.5) arXiv 2023-10-05 Star Project Page
Aligning Large Multimodal Models with Factually Augmented RLHF (LLaVA-RLHF, MMHal-Bench (hallucination)) arXiv 2023-09-25 Star Project Page
Dataset
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning ICLR 2024 2023-09-14 Star -
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond arXiv 2023-08-24 Star Project Page
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages (VisCPM-Chat/Paint) ICLR 2024 2023-08-23 Star -
SVIT: Scaling up Visual Instruction Tuning arXiv 2023-07-09 Star Dataset
Kosmos-2: Grounding Multimodal Large Language Models to the World (Kosmos-2, GrIT Dataset) arXiv 2023-06-26 Star Demo
Dataset
M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning arXiv 2023-06-07 - Project Page
Dataset
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning NeurIPS 2023 2023-05-11 Star -
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans arXiv 2023-05-08 Star -
VPGTrans: Transfer Visual Prompt Generator across LLMs NeurIPS 2023 2023-05-02 Star Project Page
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality arXiv 2023-04-27 Star -
✨ MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models ICLR 2024 2023-04-20 Star Project Page
✨ Visual Instruction Tuning (LLaVA) NeurIPS 2023 2023-04-17 Star Project Page
Dataset
Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1) NeurIPS 2023 2023-02-27 Star -
Multimodal Chain-of-Thought Reasoning in Language Models arXiv 2023-02-02 Star -
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models ICML 2023 2023-01-30 Star -
Flamingo: a Visual Language Model for Few-Shot Learning NeurIPS 2022 2022-04-29 Star -

Unified Understanding and Generation

Title Venue Date Code Supplement
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding arXiv 2024-12-12 coming soon -
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation (TencentARC) arXiv 2024-12-05 Star -
Liquid: Language Models are Scalable Multi-modal Generators (ByteDance) arXiv 2024-12-05 Star arXiv
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation (ByteDance) arXiv 2024-12-04 Star Project Page
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (by DeepSeek) - 2024-10-17 Star -
✨ Emu3: Next-Token Prediction is All You Need arXiv 2024-09-27 Star Project Page
✨ Show-o: One Single Transformer to Unify Multimodal Understanding and Generation arXiv 2024-08-22 Star Project Page
An Image is Worth 32 Tokens for Reconstruction and Generation (TiTok, by ByteDance) arXiv 2024-06-11 Star Project Page
✨ Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models arXiv 2024-05-27 Star Project Page
Collections
VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing - 2024-04-25 Star Project Page
YouTube
Wechat
✨ SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation arXiv 2024-04-22 Star -
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling arXiv 2024-02-19 Star Project Page
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv 2024-02-05 Star Project Page
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action arXiv 2023-12-28 Star Project Page
Generative Multimodal Models are In-Context Learners (Emu2) CVPR 2024 2023-12-20 Star Project Page
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation arXiv 2023-11-30 Star Project Page
LLMGA: Multimodal Large Language Model based Generation Assistant arXiv 2023-11-27 Star Project Page
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation arXiv 2023-12-14 Star -
Kosmos-G: Generating Images in Context with Multimodal Large Language Models ICLR 2024 2023-10-04 Star Project Page
✨ MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens arXiv 2023-10-03 Star Project Page
DreamLLM: Synergistic Multimodal Comprehension and Creation ICLR 2024 2023-09-20 Star Project Page
NExT-GPT: Any-to-Any Multimodal LLM arXiv 2023-09-11 Star Project Page
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization (LaVIT) ICLR 2024 2023-09-09 Star -
Planting a SEED of Vision in Large Language Model ICLR 2024 2023-07-16 Star Project Page
Generative Pretraining in Multimodality (Emu1) ICLR 2024 2023-07-11 Star -
Generating Images with Multimodal Language Models (GILL) NeurIPS 2023 2023-05-26 Star Project Page
Any-to-Any Generation via Composable Diffusion (CoDi-1) NeurIPS 2023 2023-05-19 Star Project Page
Grounding Language Models to Images for Multimodal Inputs and Outputs (FROMAGe) ICML 2023 2023-01-31 Star Project Page

Image Understanding Benchmark

Title Venue Date Code Supplement
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning arXiv 2024-06-18 - Dataset
LOVA3: Learning to Visual Question Answering, Asking and Assessment arXiv 2024-05-23 Star -
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI arXiv 2024-04-24 Star Project Page
Dataset
BLINK: Multimodal Large Language Models Can See but Not Perceive arXiv 2024-04-18 Star Project Page
Dataset
Ferret: Refer and Ground Anything Anywhere at Any Granularity (Ferret-Bench) ICLR 2024 2023-10-11 Star -
Aligning Large Multimodal Models with Factually Augmented RLHF (LLaVA-RLHF, MMHal-Bench (hallucination)) arXiv 2023-09-25 Star Project Page
Dataset
Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations (AffectVisDial) ECCV 2024 2023-08-30 Star Project Page
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension CVPR 2024 2023-07-30 Star -

Video Understanding Benchmark

Title Venue Date Code Supplement
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models arXiv 2024-10-30 Star Dataset
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark (AuroraCap, VDC) arXiv 2024-10-24 Star Project Page
Dataset
Dataset
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models arXiv 2024-10-14 Star Project Page
Dataset
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding NeurIPS 2024 2024-09-26 Star Project Page
Dataset
Dataset
Tarsier: Recipes for Training and Evaluating Large Video Description Models (Tarsier, Dream1k) (ByteDance) arXiv 2024-07-30 Star Dataset
VideoVista: A Versatile Benchmark for Video Understanding and Reasoning (VideoVista) arXiv 2024-06-17 Star Project Page
Dataset
VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time? arXiv 2024-06-16 Star Project Page
Dataset
MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding arXiv 2024-06-06 Star Dataset
Dataset
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (Video-MME) arXiv 2024-05-31 Star Project Page
Dataset
TempCompass: Do Video LLMs Really Understand Videos? arXiv 2024-03-01 Star Project Page
Dataset
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark (MVBench, VideoChat2) CVPR 2024 Highlight 2023-11-28 Star Dataset
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding NeurIPS 2023 2023-08-17 Star Project Page
Dataset
Perception Test: A Diagnostic Benchmark for Multimodal Video Models (Perception Test, by Google DeepMind) NeurIPS 2023 2023-05-23 Star Project Page
Dataset

Audio

Title Venue Date Code Supplement
SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities EMNLP 2023 (Findings) 2023-05-18 Star Project Page

Multimodal Dialogue

Title Venue Date Code Supplement
DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation arXiv 2024-03-13 Star -
STICKERCONV: Generating Multimodal Empathetic Responses from Scratch ACL 2024 Main 2024-01-20 Star Project Page
Dataset
Wechat
zhihu
VDialogUE: A Unified Evaluation Benchmark for Visually-grounded Dialogue arXiv 2023-09-14 - -
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts ACL 2023 2023-05-24 Star Project Page
TikTalk: A Multi-Modal Dialogue Dataset for Real-World Chitchat ACM MM 2023 2023-01-14 Star Dataset
MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation ACL 2023 2022-11-10 Star Dataset
Multimodal Dialogue Response Generation (Divter) ACL 2022 2021-10-16 - -
Maria: A Visual Experience Powered Conversational Agent ACL 2021 2021-05-27 Star -
Multi-Modal Open-Domain Dialogue EMNLP 2021 2020-10-02 - -
Open Domain Dialogue Generation with Latent Images AAAI 2021 2020-04-04 - -
Learning to Respond with Stickers: A Framework of Unifying Multi-Modality in Multi-Turn Dialog WWW 2020 2020-03-10 Star -

Multimodal Learning

Title Venue Date Code Supplement
Video as the New Language for Real-World Decision Making arXiv 2024-02-27 - -
Tokenize Anything via Prompting arXiv 2023-12-14 Star -
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment ICLR 2024 2023-10-03 Star -
ImageBind: One Embedding Space To Bind Them All CVPR 2023 2023-05-09 Star Project Page
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks CVPR 2023 2022-11-17 Star -
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks (BEiT-3) CVPR 2023 2022-08-22 Star -
BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers arXiv 2022-08-12 Star -
BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning AAAI 2023 2022-06-17 Star Dataset
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework ICML 2022 2022-02-07 Star -
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation ICML 2022 2022-01-28 Star -
Uni-Perceiver: Pre-Training Unified Architecture for Generic Perception for Zero-Shot and Few-Shot Tasks CVPR 2022 2021-12-02 Star -
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation (ALBEF) NeurIPS 2021 2021-07-16 Star Project Page
BEiT: BERT Pre-Training of Image Transformers ICLR 2022 2021-06-15 Star -
Learning Transferable Visual Models From Natural Language Supervision ICML 2021 2021-02-26 Star Blog
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision ICML 2021 2021-02-05 Star -

Image Generation

Title Venue Date Code Supplement
✨ OmniGen: Unified Image Generation arXiv 2024-09-17 Star Project Page
Demo
Wechat
✨ Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens (by Kaiming He, DeepMind, MIT) arXiv 2024-10-17 - -
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (Lumina-T2X, Flag-DiT) (Text2Any) arXiv 2024-05-09 Star YouTube
Wechat
FreeU: Free Lunch in Diffusion U-Net (FreeU, by Ziwei Liu) CVPR 2024 Oral 2023-09-20 Star Project Page
YouTube
Demo
Lazy Diffusion Transformer for Interactive Image Editing arXiv 2024-04-18 - Project Page
Salient Object-Aware Background Generation using Text-Guided Diffusion Models CVPR 2024 Workshop 2024-04-15 Star -
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing arXiv 2024-04-15 Star Project Page
Dataset
Demo
UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark (UNIAA-LLaVA, UNIAA-Bench) arXiv 2024-04-15 - -
PMG: Personalized Multimodal Generation with Large Language Models WWW 2024 2024-04-07 - -
Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models arXiv 2024-04-05 Star Project Page
Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models CVPR 2024 2024-04-05 - -
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (VAR) arXiv 2024-04-03 Star Project Page
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation (HuaWei, Enze Xie) arXiv 2024-03-07 Star Project Page
Multi-LoRA Composition for Image Generation arXiv 2024-02-26 Star Project Page
PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models (HuaWei, Enze Xie) arXiv 2024-01-10 Star Project Page
Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model AAAI 2024 2023-12-19 Star -
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models (Tencent Xintao Wang) arXiv 2023-12-11 Star Project Page
InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following arXiv 2023-12-11 Star Project Page
Emu Edit: Precise Image Editing via Recognition and Generation Tasks arXiv 2023-11-16 - Project Page
BeautifulPrompt: Towards Automatic Prompt Engineering for Text-to-Image Synthesis EMNLP 2023 2023-11-12 Star zhihu
AnyText: Multilingual Visual Text Generation And Editing ICLR 2024 2023-11-06 Star -
EasyGen: Easing Multimodal Generation with a Bidirectional Conditional Diffusion Model and LLMs arXiv 2023-10-13 Star -
Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models arXiv 2023-10-11 Star Project Page
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis (HuaWei, Enze Xie) ICLR 2024 Spotlight 2023-09-30 Star Project Page
Dataset
Usage Diffusers
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models arXiv 2023-08-13 Star Project Page
Kosmos-G: Generating Images in Context with Multimodal Large Language Models arXiv 2023-10-04 Star Project Page
Improving Image Generation with Better Captions (DALL-E 3) OpenAI 2023 - -
Scaling up GANs for Text-to-Image Synthesis (GigaGAN) CVPR 2023 2023-05-09 Star Project Page
Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet) ICCV 2023 2023-02-10 Star -
Scalable Diffusion Models with Transformers (DiT) ICCV 2023 2022-12-19 Star Project Page
InstructPix2Pix: Learning to Follow Image Editing Instructions CVPR 2023 2022-11-17 Star Project Page
All are Worth Words: A ViT Backbone for Diffusion Models (U-ViT, first Diffusion Transformer) (RUC, Chongxuan Li) CVPR 2023 2022-09-25 Star -
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation CVPR 2023 2022-08-25 Star Project Page
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen) NeurIPS 2022 2022-05-23 Star Project Page
Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2) OpenAI 2022-04-13 Star -
High-Resolution Image Synthesis with Latent Diffusion Models (LDM, Stable Diffusion) CVPR 2022 2021-12-20 Star -
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models ICML 2022 2021-12-20 Star -
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion ECCV 2022 2021-11-24 Star -
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations ICLR 2022 2021-08-02 Star Project Page
CogView: Mastering Text-to-Image Generation via Transformers NeurIPS 2021 2021-05-26 Star -
Zero-Shot Text-to-Image Generation (DALL-E 1) ICML 2021 2021-02-24 Star Project Page
Taming Transformers for High-Resolution Image Synthesis (VQ-GAN) CVPR 2021 2020-12-17 Star Project Page

Video Generation

Title Venue Date Code Supplement
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment arXiv 2024-12-06 Star Project Page
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions (Mira) arXiv 2024-07-08 Star -
VIMI: Grounding Video Generation through Multi-modal Instruction arXiv 2024-07-08 Star Project Page
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation arXiv 2024-07-02 - -
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (Lumina-T2X, Flag-DiT) (Text2Any) arXiv 2024-05-09 Star YouTube
Wechat
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text (Long Video Generation) arXiv 2024-03-21 Star Project Page
YouTube
Demo
Wechat
AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks arXiv 2024-03-21 Star Project Page
Demo
Demo Page
Wechat
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation (FRESCO) (NTU, Ziwei Liu) CVPR 2024 2024-03-19 Star Project Page
Latte: Latent Diffusion Transformer for Video Generation (Latte) (NTU, Ziwei Liu) arXiv 2024-01-05 Star Project Page
FreeInit: Bridging Initialization Gap in Video Diffusion Models (FreeInit) (NTU, Ziwei Liu) arXiv 2023-12-12 Star Project Page
YouTube
Demo
VideoBooth: Diffusion-based Video Generation with Image Prompts (VideoBooth) (NTU, Ziwei Liu) arXiv 2023-12-01 Star Project Page
VBench: Comprehensive Benchmark Suite for Video Generative Models [Benchmark] (VBench) (NTU, Ziwei Liu) CVPR 2024 2023-11-29 Star Project Page
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets (SVD) arXiv 2023-11-25 Star Project Page
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction (NTU, Ziwei Liu) ICLR 2024 2023-10-31 Star Project Page
FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling (FreeNoise) (NTU, Ziwei Liu) ICLR 2024 2023-10-23 Star Project Page
LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models (LaVie) (NTU, Ziwei Liu) 2023-09-26 Star Project Page

Multimodal Dataset

Title Venue Date Annotation Source
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data arXiv 2024-10-24 - Dataset
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions NeurIPS 2024 2024-10-14 Star Project Page
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens arXiv 2024-06-17 Star Collections
Blog
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions (Mira) arXiv 2024-07-08 Video Generation Star
GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension IJCAI 2024 2024-06-26 - Project Page
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation arXiv 2024-06-15 Star -
TextSquare: Scaling up Text-Centric Visual Instruction Tuning arXiv 2024-04-19 Visual Instruction Tuning -
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing arXiv 2024-04-15 Instruction Image Editing Star
Project Page
Dataset
Demo
AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception (AesExpert, AesMMIT Dataset) arXiv 2024-04-15 Aesthetic Multi-Modality Instruction Tuning Star
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers CVPR 2024 2024-02-29 video-caption Star
Project Page
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model arXiv 2024-02-18 GPT4V-synthesized Data Star
Demo Page
Dataset
STICKERCONV: Generating Multimodal Empathetic Responses from Scratch ACL 2024 Main 2024-01-20 Multimodal Empathetic Dialogue Star
Project Page
Dataset
Wechat
zhihu
SVIT: Scaling up Visual Instruction Tuning arXiv 2023-07-09 Instruction Tuning Star
Dataset
Kosmos-2: Grounding Multimodal Large Language Models to the World (Kosmos-2, GrIT Dataset) arXiv 2023-06-26 Grounded image-text pairs Star
Demo
Dataset
M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning arXiv 2023-06-07 Instruction Tuning Project Page
Dataset
Visual Instruction Tuning (LLaVA) NeurIPS 2023 2023-04-17 Instruction Tuning Star
Project Page
Dataset
Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text NeurIPS D&B 2023 2023-04-14 Interleaved Image-Text Star
TikTalk: A Multi-Modal Dialogue Dataset for Real-World Chitchat ACM MM 2023 2023-01-14 Multimodal Dialogue Star
MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation ACL 2023 2022-11-10 Multimodal Dialogue Star
LAION-5B: An open large-scale dataset for training next generation image-text models NeurIPS 2022 2022-10-16 Image-Text Pairs Project Page
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs NeurIPS Workshop 2021 2021-11-03 Image-Text Pairs Project Page
MMConv: An Environment for Multimodal Conversational Search across Multiple Domains ACM SIGIR 2021 2021-07 Multimodal Dialogue Star
PhotoChat: A Human-Human Dialogue Dataset With Photo Sharing Behavior For Joint Image-Text Modeling ACL 2021 2021-07-06 Open-domain Multimodal Dialogue Star
Image-Chat: Engaging Grounded Conversations ACL 2020 2018-11-02 Multimodal Dialogue Project Page

Multimodal Summary

Title Venue Date Latest Update
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding (TikTok) arXiv 2024-09-27 -
Video Diffusion Models: A Survey (Star) arXiv 2024-05-06 -
Theoretical research on generative diffusion models: an overview arXiv 2024-04-13 -
A Review of Multi-Modal Large Language and Vision Models arXiv 2024-03-28 -
The (R)Evolution of Multimodal Large Language Models: A Survey arXiv 2024-02-19 -
MM-LLMs: Recent Advances in MultiModal Large Language Models arXiv 2024-01-24 2024-02-20
Multimodal Large Language Models: A Survey IEEE BigData 2023 2023-11-22 -
Multimodal Foundation Models: From Specialists to General-Purpose Assistants CVPR 2023 2023-09-18 -
Understanding Deep Learning - 2023 -
Large Multimodal Models: Notes on CVPR 2023 Tutorial CVPR 2023 2023-06-26 -
A Survey on Multimodal Large Language Models arXiv 2023-06-23 2024-04-01
Multimodal Deep Learning arXiv 2023-01-12 -
Diffusion Models: A Comprehensive Survey of Methods and Applications ACM Computing Surveys 2022-09-02 2024-02-06
Multimodal Learning with Transformers: A Survey IEEE TPAMI 2023 2022-01-13 2023-05-10
Multimodal Machine Learning: A Survey and Taxonomy IEEE TPAMI 2019 2017-05-26 2017-08-01

Paper Notes

here