This is a repository for organizing articles related to Multimodal Large Language Models, Large Language Models, and Diffusion Models; most papers are linked to my reading notes. Feel free to visit my personal homepage and contact me for collaboration and discussion.
I'm a third-year Ph.D. student at the State Key Laboratory of Pattern Recognition, the University of Chinese Academy of Sciences, advised by Prof. Tieniu Tan. I have also spent time at Microsoft, advised by Prof. Jingdong Wang, and at Alibaba DAMO Academy, working with Prof. Rong Jin.
We have presented a comprehensive survey on the evaluation of large multi-modality models, jointly with the OpenCompass Team and LMMs-Lab 🔥🔥🔥
- Our benchmark MME-RealWorld has been released: the largest and most difficult fully human-annotated image perception benchmark to date. [Code] [Reading Notes]
- Our model SliME has been released, a high-resolution MLLM that can also be extended to video analysis. [Code] [Reading Notes]
- Our paper Debiasing Multimodal Large Language Models has been released. [Code] [Reading Notes]
- Awesome-Multimodal-Large-Language-Models
- Table of Contents (ongoing)
- Survey and Outlook
- Multimodal Large Language Models
- Benchmark and Dataset
- Unify Multimodal Understanding and Generation
- Alignment With Human Preference (MLLM)
- Alignment With Human Preference (LLM)
- An In-Depth Summary of the Latest Progress in Multimodal Large Models (Modality Bridging)
- An In-Depth Summary of the Latest Progress in Multimodal Large Models (Video)
- Aligning Large Language Models with Human
- (Meta, Stanford) Apollo: An Exploration of Video Understanding in Large Multimodal Models (what are the key factors for video understanding in MLLMs)
- (Shanghai AI Lab) Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (technical details of InternVL2.5: pushing open-source multimodal models one step further)
- (NVIDIA) NVLM: Open Frontier-Class Multimodal LLMs (an in-depth exploration of three different feature-fusion architectures)
- (Allen Institute for AI) Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models (the improvements focus on the data side, including data-synthesis methods and the release of higher-quality open multimodal data)
- (Mistral AI) Pixtral 12B (at 12B it approaches the level of Qwen2-VL 72B and Llama-3.2 90B)
- (Rhymes AI) Aria: An Open Multimodal Native Mixture-of-Experts Model (fine-grained mixture-of-experts (MoE) architecture)
- (Apple) MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning (Apple: a practical recipe book for training multimodal large models)
- (Hugging Face) Building and better understanding vision-language models: insights and future directions (Hugging Face: exploring the best technical route for multimodal large models)
- (Alibaba) Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (a fine-grained dynamic-resolution strategy plus multimodal rotary position embedding)
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture (processes nearly a thousand images on a single A100 80GB GPU)
- MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? (the hardest multimodal benchmark; Qwen2-VL ranks first yet still fails to reach a passing score!)
- VITA: Towards Open-Source Interactive Omni Multimodal LLM (VITA: the first open-source omni multimodal LLM supporting natural human-machine interaction)
- Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models (a multimodal large model that processes high-resolution images efficiently)
- Matryoshka Multimodal Models (how can a model answer visual questions correctly while using the fewest visual tokens?)
- Chameleon: Mixed-Modal Early-Fusion Foundation Models (Meta: every modality is reduced to token regression for flexible understanding/generation)
- Flamingo: a Visual Language Model for Few-Shot Learning (adds an extra block at each LLM layer to process visual information)
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (a Q-Former fuses visual and language information)
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning (Q-Former + instruction tuning)
- Visual Instruction Tuning (an MLP aligns visual features; instruction-tuning data generated with GPT-4; see the connector sketch after this list)
- Improved Baselines with Visual Instruction Tuning (preliminary scaling of the LLaVA dataset and model size)
- LLaVA-NeXT: Improved reasoning, OCR, and world knowledge (4x the resolution, a larger dataset)
- Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models (an end-to-end optimization scheme that connects the image encoder and the LLM through lightweight adapters)
- MIMIC-IT: Multi-Modal In-Context Instruction Tuning (MIMIC-IT contains inputs with multiple images or videos and supports multimodal in-context information)
- LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding (uses publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset)
- SVIT: Scaling up Visual Instruction Tuning (a dataset of 4.2 million visual instruction tuning data points)
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (cross-attention aligns features; larger first-stage training data)
- NExT-GPT: Any-to-Any Multimodal LLM (an end-to-end, general-purpose any-to-any MM-LLM (Multimodal Large Language Model) system)
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition (compressed sampling of visual information)
- CogVLM: Visual Expert for Pretrained Language Models (adds a visual expert to each LLM layer, with its own QKV and FFN parameters)
- OtterHD: A High-Resolution Multi-modality Model (designed specifically to interpret high-resolution visual input with fine-grained precision)
- Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models (Monkey proposes an effective way to raise the input resolution, up to 896 x 1344 pixels)
- LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (LLaMA-VID lets existing frameworks support hour-long videos, pushing their upper limit with an extra context token)
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models (addresses the performance degradation in multimodal sparse learning)
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (processes images of any aspect ratio and high resolution efficiently)
- Yi-VL (Yi-VL adopts the LLaVA architecture and goes through a comprehensive three-stage training process to align visual information well with the semantic space of the Yi LLM)
- Mini-Gemini (dual vision encoders: low-resolution encoder features serve as queries while high-resolution features serve as keys and values for token mining)
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding (uses a set of dynamic visual tokens to represent images and videos uniformly, letting the model exploit a limited number of visual tokens while capturing the spatial detail needed for images and the comprehensive temporal relations needed for videos)
- VILA: On Pre-training for Visual Language Models (interleaved pre-training data is beneficial, whereas plain image-text pairs are not optimal)
- ST-LLM: Large Language Models Are Effective Temporal Learners (ST-LLM proposes a dynamic masking strategy with tailored training objectives, plus a global-local input module for especially long videos to balance efficiency and effectiveness)
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection (improves video understanding with a video-specific encoder rather than an image encoder)
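Many of the models in this list (LLaVA and its descendants, SliME, Yi-VL) share one skeleton: a vision encoder whose patch features are projected into the LLM's embedding space and concatenated with the text tokens. Below is a minimal sketch of that connector in the LLaVA-1.5 style, not any model's official code; the dimensions (1024 for a CLIP ViT-L, 4096 for a 7B LLM, 576 patches) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """LLaVA-1.5-style two-layer MLP projector (a sketch, not official code).

    Maps vision-encoder patch embeddings into the LLM's token-embedding space
    so image patches can be concatenated with text tokens.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim) from, e.g., a CLIP ViT
        return self.proj(patch_feats)  # (batch, num_patches, llm_dim)

# Usage: project 576 CLIP patches and prepend them to the text embeddings.
connector = VisionLanguageConnector()
image_tokens = connector(torch.randn(1, 576, 1024))
text_embeds = torch.randn(1, 32, 4096)  # embedded text prompt (stand-in)
llm_inputs = torch.cat([image_tokens, text_embeds], dim=1)
```

The simplicity is the point: compared with a Q-Former (BLIP-2) or cross-attention layers (Flamingo, Qwen-VL), a plain MLP keeps every patch token, trading longer sequences for lossless visual detail.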
- MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? (the hardest multimodal benchmark; Qwen2-VL ranks first yet still fails to reach a passing score!)
- MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark (an upgraded version of MMMU, placing more emphasis on how image perception affects the questions)
- From Pixels to Prose: A Large Dataset of Dense Image Captions (16 million generated image-text pairs, with detailed and accurate descriptions produced by a cutting-edge vision-language model, Gemini 1.0 Pro Vision)
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions (40K captions from GPT-4V, 4,814K generated by a model they trained themselves)
- OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents (141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens)
- Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning (on the data side, collects human feedback in the form of fine-grained segment-level corrections; on the method side, proposes Dense Direct Preference Optimization (DDPO))
- Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model (on the data side, synthesizes abstract charts with code as the intermediary, and benchmarks how poorly current multimodal models understand abstract images; see the synthesis sketch after this list)
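Multimodal Self-Instruct's core trick is to let code render the image so the ground-truth answer is known by construction. A minimal sketch of that idea for a bar-chart task; the file naming and QA template are made up for illustration:

```python
import json
import random

import matplotlib
matplotlib.use("Agg")  # headless rendering for batch data synthesis
import matplotlib.pyplot as plt

def synthesize_bar_chart(idx: int) -> dict:
    """Render a random bar chart and emit a QA pair whose answer is known
    by construction (the code-as-intermediary idea, sketched)."""
    labels = ["A", "B", "C", "D"]
    values = [random.randint(1, 100) for _ in labels]
    fig, ax = plt.subplots()
    ax.bar(labels, values)
    path = f"chart_{idx}.png"  # hypothetical output location
    fig.savefig(path)
    plt.close(fig)
    top = labels[values.index(max(values))]
    return {"image": path, "question": "Which bar is the tallest?", "answer": top}

print(json.dumps(synthesize_bar_chart(0)))
```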
- Chameleon: Mixed-Modal Early-Fusion Foundation Models (Meta FAIR: an "early fusion" approach that lets the model reason across modalities and generate genuinely mixed documents)
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (NUS & ByteDance: text is modeled autoregressively as discrete tokens, while continuous image pixels are modeled with denoising diffusion)
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (Meta: combines next-token prediction for text with diffusion for images as the objectives, achieving better modality integration and generation at no extra compute cost)
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (Tsinghua & MIT: unifies video understanding and generation)
- MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts (Meta: MoE as the best choice for mixed-modality understanding/generation)
- MIO: A Foundation Model on Multimodal Tokens (01.AI: unified understanding/generation across four modalities)
- Harmonizing Visual Text Comprehension and Generation (ECNU & ByteDance: combines a vision encoder, an LLM, and an image decoder for multimodal input and output)
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation (Tencent AI Lab: uses a pre-trained visual tokenizer (e.g., ViT) to unify image understanding and generation)
- NExT-GPT: Any-to-Any Multimodal LLM (NUS: uses pre-trained encoders, diffusion decoders, and an LLM, combining modality-alignment training with LoRA instruction tuning to achieve any-to-any modality tasks)
- Any-to-Any Generation via Composable Diffusion (Microsoft: composes diffusion models of various modalities for parallel multimodal generation)
- X-VILA: Cross-Modality Alignment for Large Language Model (NVIDIA & HKUST: aligns unimodal encoders with the LLM's input and unimodal diffusion decoders with the LLM's output, enabling cross-modal understanding, reasoning, and generation)
- DreamLLM: Synergistic Multimodal Comprehension and Creation (XJTU & IIISCT: tackles the synergy between multimodal comprehension and creation in MLLMs by sampling directly in the raw multimodal space to generate language and image posteriors)
- Jointly Training Large Autoregressive Multimodal Models (Meta AI: fuses existing text and image generation models and introduces a dedicated, data-efficient instruction-tuning strategy)
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation (XJTU & Tencent AI Lab: a new image tokenizer-detokenizer framework converts raw images into sequences of continuous visual embeddings, and a next-token-prediction objective yields unified image-text pre-training)
- Emu: Generative Pretraining in Multimodality (BAAI & THU: a Transformer-based multimodal foundation model trained with a unified autoregressive objective, predicting the next element in a multimodal sequence, whether a text token or a visual embedding)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (PKU & Kuaishou: decomposes videos into keyframes and motion vectors, unifying video, image, and text data as 1D discrete tokens)
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (CUHK: dual vision encoders process high-resolution images; text is generated autoregressively while images are generated with a diffusion model)
- World Model on Million-Length Video And Language With Blockwise RingAttention (UC Berkeley: uses VQGAN to discretize images/videos, unifies understanding and generation as a next-token-prediction task, and scales the context window to 1M tokens with RingAttention and progressive training)
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action (AI2 & UIUC: tokenizes inputs and outputs of different modalities (image, text, audio, action, etc.) into a shared semantic space, then processes them with a single encoder-decoder transformer)
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling (Fudan: uses discrete tokens to represent different modalities such as image, music, speech, and text)
- Write and Paint: Generative Vision-Language Models are Unified Modal Learners (HKUST & ByteDance: the DaVinci model combines prefix language modeling and prefix image modeling)
- Gemini: A family of highly capable multimodal models (Google Gemini Team: advanced reasoning and language understanding in tasks spanning image, audio, video, and text)
- MiniGPT-5: Interleaved vision-and-language generation via generative vokens (UCSC: introduces generative visual tokens, "vokens")
- MM-Interleaved: Interleaved image-text generative modeling via multi-modal feature synchronizer (Shanghai AI Lab: integrates an image encoder, a large language model (LLM), and an image decoder)
- OMCAT: Omni Context Aware Transformer (NVIDIA: cross-modal temporal understanding, using RoTE (Rotary Time Embeddings) to embed absolute and relative time information into audio and visual features)
- Baichuan-Omni Technical Report (Baichuan & Westlake University & Zhejiang University: an omni-modal model)
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (DeepSeek-AI & HKU: decouples visual encoding for multimodal understanding and multimodal generation)
- Emu3: Next-Token Prediction is All You Need (BAAI: discretizes visual tokens and aligns with DPO; see the unified-objective sketch after this list)
- VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing (NUS & NTU: a hybrid instruction-passing method mixing discrete text and continuous signals, with pixel-level spatiotemporal vision-language contrastive learning) (NeurIPS 2024)
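A recurring recipe across this section (Chameleon, Emu3, LWM, AnyGPT) is to quantize images into discrete codes with a VQ tokenizer and then train one transformer with ordinary next-token prediction over the mixed sequence. Here is a minimal sketch of that unified objective; the vocabulary sizes, the ToyLM stand-in for the transformer, and the pre-quantized image codes are all illustrative assumptions, not any paper's released code.

```python
import torch
import torch.nn.functional as F

TEXT_VOCAB = 32000   # text token ids occupy [0, TEXT_VOCAB)
IMAGE_CODES = 8192   # VQ codebook ids, shifted to follow the text range

class ToyLM(torch.nn.Module):
    """Stand-in for the unified transformer (embedding + linear head only)."""
    def __init__(self, vocab: int = TEXT_VOCAB + IMAGE_CODES, dim: int = 64):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.head(self.emb(ids))

def mixed_sequence(text_ids: torch.Tensor, image_codes: torch.Tensor) -> torch.Tensor:
    """Concatenate text tokens and (shifted) VQ image codes into one 1D stream."""
    return torch.cat([text_ids, image_codes + TEXT_VOCAB], dim=-1)

def ntp_loss(model: torch.nn.Module, seq: torch.Tensor) -> torch.Tensor:
    """One objective for both modalities: predict the next token, whatever it is."""
    logits = model(seq[:, :-1])  # (batch, len-1, TEXT_VOCAB + IMAGE_CODES)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1))

# Usage with random ids standing in for a caption and a 16x16-token image.
model = ToyLM()
seq = mixed_sequence(torch.randint(0, TEXT_VOCAB, (1, 32)),
                     torch.randint(0, IMAGE_CODES, (1, 256)))
print(ntp_loss(model, seq))
```

Show-o and Transfusion keep this single-transformer framing but swap the image half of the loss for a diffusion objective.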
- (Apple) Understanding Alignment in Multimodal LLMs: A Comprehensive Study (analyzes each factor independently to explore how different alignment methods affect MLLM performance)
- Aligning Large Multimodal Models with Factually Augmented RLHF
- CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs (uses a pre-trained CLIP model to rank the LVLM's self-generated captions and build positive/negative pairs for DPO; see the pairing sketch after this list)
- ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models (adopts a dynamic generation approach to build an open-set benchmark, introducing the Open-Set Dynamic Evaluation protocol (ODE), dedicated to evaluating object-existence hallucination in MLLMs)
- Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization (treats eliminating hallucinations as a model preference, biasing the model toward hallucination-free outputs, and proposes HA-DPO, a hallucination-aware multimodal DPO strategy; also introduces the Sentence-level Hallucination Ratio (SHR), which is not restricted to fixed categories or scopes and provides a broad, fine-grained, quantitative measure of multimodal hallucination)
- Detecting and Preventing Hallucinations in Large Vision Language Models (to enable automatic hallucination detection, first builds M-HalDetect, a diverse human-labeled dataset of InstructBLIP's VQA responses with fine-grained annotations at the sub-sentence level of detailed image descriptions; trains multiple reward models of different densities (sentence-level, sub-sentence-level) on this dataset for hallucination detection, and also optimizes InstructBLIP directly with Fine-grained Direct Preference Optimization (FDPO))
- RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness (one large model generates multiple responses; each response is split into sentences, which are converted into questions that open-source models answer for accuracy; summing the accuracies yields preference data used for iterative DPO)
- Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement (proposes Self-Improvement Modality Alignment (SIMA), which further improves the alignment between the visual and language modalities inside an LVLM through a self-improvement mechanism)
- MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models (stitches unrelated single-image data into sequence, grid, and picture-in-picture layouts, selects preference data by how much attention lands on the correct target, and uses the filtered data for DPO)
- CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs (to align visual information, introduces a hierarchical text preference optimization module with response-level, segment-level, and token-level preference optimization, plus visual preference optimization)
- 3D-CT-GPT++: Enhancing 3D Radiology Report Generation with Direct Preference Optimization and Large Vision-Language Models
- MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine (first fine-tunes a math-specific vision encoder via contrastive learning, then aligns that encoder with the LLM, then performs instruction tuning on MAVIS-Instruct, and finally applies DPO with the annotated CoT rationales in MAVIS-Instruct)
- HomieBot: an Adaptive System for Embodied Mobile Manipulation in Open Environments (consists of 100 complex everyday tasks; 100 distinct episodes are drawn from the Replica Challenge to build scenes and design tasks, using only Replica Challenge configuration files to construct the scenes; a robot is manually controlled to complete every task, with each execution decomposed into sub-tasks, yielding 966 sub-tasks; GPT-4 regenerates the text description of each final task and the analysis of each sub-task three times, rewriting them with the same meaning but different wording to obtain 3,720 SFT samples; replacing parts of the content yields 10,104 DPO samples)
- InteractiveCOT: Aligning Dynamic Chain-of-Thought Planning for Embodied Decision-Making (first SFT-finetunes llava-v1.6-mistral-7b on the open-source LEVI-Project/sft-data dataset, then has the model interact with the environment, optimizing its CoT ability during these interactions while monitoring performance in real time throughout training)
- vVLM: Exploring Visual Reasoning in VLMs against Language Priors (corrupts images with perturbations while keeping the text (question and answer) unchanged, thereby constructing chosen and rejected preference pairs; the image perturbations include semantic editing, Gaussian blur, and pixelation)
- AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization (obtains adversarial images through iterative optimization such as PGD (adversarial images are generated by adding tiny, nearly imperceptible perturbations to the original image), uses captions generated from the original and adversarial images as preference data for DPO, and additionally introduces adversarial image optimization)
- Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization (first trains an audio aligner on a large audio dataset to align the audio modality, then performs audio-visual SFT, then applies mrDPO-based RL, and finally rebirth fine-tuning)
- Aligning Visual Contrastive learning models via Preference Optimization(Step 1: Response generation. Step 2: Scoring. Step 3: Reward Preference. Iterative Improvement.)
- SQuBa: Speech Mamba Language Model with Querying-Attention for Efficient Summarization (a two-stage training process: in the alignment stage only the projector is trained, on an ASR task; in the fine-tuning stage both the LLM backbone and the projector are trained on summarization; offline self-generated DPO follows fine-tuning)
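Several entries above (CLIP-DPO, RLAIF-V, MIA-DPO) share one pattern: score the model's own sampled outputs with an external signal and keep the best/worst as a DPO pair. A minimal sketch of CLIP-based pair construction using the Hugging Face transformers CLIP API; the checkpoint name is just an example, and the real CLIP-DPO pipeline adds filtering steps not shown here:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def build_dpo_pair(image: Image.Image, captions: list[str]) -> tuple[str, str]:
    """Rank self-generated captions by CLIP image-text similarity and return
    (chosen, rejected) — the pattern behind CLIP-DPO-style pair construction."""
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image[0]  # (num_captions,)
    order = sims.argsort(descending=True)
    return captions[int(order[0])], captions[int(order[-1])]

# Usage: feed the LVLM's sampled captions for one image (stand-in data here).
pair = build_dpo_pair(Image.new("RGB", (224, 224)),
                      ["a dog on the grass", "a cat on a sofa", "an empty image"])
print(pair)  # (chosen, rejected) for the DPO dataset
```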
- ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline (ChatGLM-Math: iterative Self-Critique alignment significantly improves math ability)
- Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization (multi-objective alignment for large language models)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model (direct preference optimization overcomes the instability of RLHF; see the loss sketch after this list)
- KTO: Model Alignment as Prospect Theoretic Optimization (preference optimization that needs no paired data)
- Direct Preference Optimization with an Offset (DPO with an offset: requires the likelihood gap between the preferred and dispreferred responses to exceed an offset value)
- Contrastive Preference Learning: Learning from Human Feedback without Reinforcement Learning (the contrastive preference learning (CPL) algorithm learns optimal policies from preferences without learning a reward function, removing the need for RL)
- Statistical Rejection Sampling Improves Preference Optimization (uses rejection sampling to draw preference data from the target optimal policy, giving a more accurate estimate of that policy)
- Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study (PPO consistently outperforms DPO across all experiments; on the most challenging code-competition tasks in particular, PPO achieves state-of-the-art results)
- Fine-tuning Aligned Language Models Compromises Safety (fine-tuning an aligned language model degrades its safety)
- ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline (reward model, rejective fine-tuning, then DPO to iteratively improve the model's math performance)
- SimPO: Simple Preference Optimization with a Reference-Free Reward (length regularization + removal of the reference model)
- Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective (why DPO's practical optimization is sensitive to the initial alignment condition of the SFT-ed LLM)
- Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level (shows that with careful design, iterative DPO (iDPO) can raise a 7B model's LC win rate to GPT-4 level)
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs (proposes an effective and economical pipeline for collecting pairwise math preference data, and introduces Step-DPO, which maximizes the probability that the next reasoning step is correct and minimizes the probability that it is wrong)
- A Novel Soft Alignment Approach for Language Models with Explicit Listwise Rewards (turns the generative modeling problem into a classification task by contrasting multiple data points under the guidance of an existing strong LLM; the SPO loss can be viewed as a k-class cross-entropy loss with soft labels provided by the stronger teacher LLM)
- Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning (the teacher model generates a dataset via Self-Instruct, the local influence of those data points on the student model is collected to form a preference dataset, and the teacher is then updated with DPO; the process can iterate over multiple rounds to keep improving the teacher according to the student's updated preferences)
- Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts (argues that answers generated for similar prompts can also be used for preference learning, studies this with a contrast matrix, and proposes three applicable algorithms)
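Most entries in this section are variants of the DPO objective, which turns preference pairs directly into a classification-style loss on the policy, with a frozen reference model as the baseline. A minimal sketch of the standard loss, computed from summed sequence log-probabilities (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
             policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
             ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), frozen reference
             ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO: -log sigmoid(beta * (policy margin minus reference margin))."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Usage with random log-probs standing in for real model outputs.
b = torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4)
print(dpo_loss(*b))
```

The variants above mostly edit this one line: DPO with an offset subtracts a margin term inside the sigmoid, SimPO drops the two reference terms and length-normalizes the log-probabilities, and Step-DPO applies the same loss at the granularity of individual reasoning steps.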