OLA-VLM


Jitesh Jain*, Zhengyuan Yang, Humphrey Shi, Jianfeng Gao, Jianwei Yang

*Work done during an internship at Microsoft Research, Redmond | Equal Advising

[Project Page] | [arXiv] | [Model Checkpoints] | [Video] | [BibTeX]

This repo contains the code for our paper OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation.

We propose distilling target visual information from a set of target encoders into the intermediate representations of the LLM. During training, we adopt a predictive embedding optimization approach at selected LLM layers to minimize the embedding losses alongside the next-token prediction (NTP) objective, resulting in a vision-centric approach to training Multimodal Large Language Models.
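As a rough, hypothetical sketch of the idea (not the released training code), the overall objective can be viewed as the NTP loss plus weighted embedding losses from small probe heads attached to selected LLM layers; the head design, token pooling, loss terms, and weighting below are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxEmbedHead(nn.Module):
    """Illustrative probe that maps a pooled LLM hidden state into a target encoder's embedding space."""
    def __init__(self, llm_dim: int, target_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, target_dim),
            nn.GELU(),
            nn.Linear(target_dim, target_dim),
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq_len, llm_dim] -> pool over tokens, then project
        return self.proj(hidden.mean(dim=1))

def embedding_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Regress the predicted embedding onto the frozen target-encoder feature
    # (smooth-L1 plus a cosine term; the exact loss is a placeholder here).
    return F.smooth_l1_loss(pred, target) + (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()

def total_loss(ntp_loss, hidden_states, target_feats, heads, aux_weight=0.5):
    # hidden_states: {layer_idx: [B, T, llm_dim]} from the selected LLM layers
    # target_feats:  {layer_idx: [B, target_dim]} from the frozen target encoders
    # heads:         {layer_idx: AuxEmbedHead} trained jointly with the MLLM
    aux = sum(embedding_loss(heads[i](hidden_states[i]), target_feats[i]) for i in heads)
    return ntp_loss + aux_weight * aux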

Contents

  1. Installation Instructions
  2. Demo
  3. Getting Started
  4. Results
  5. Citation

News

Installation Instructions

Note: We trained all our models on AMD MI300x GPUs. However, in this repo, we provide instructions for Nvidia GPUs considering their wider usage.

  • Clone this repository.

    git lfs install
    git clone https://github.com/SHI-Labs/OLA-VLM
    cd OLA-VLM
  • Set up the conda environment with the base dependencies.

    conda create -n ola_vlm -y
    conda activate ola_vlm
    pip install -e .
    pip install flash-attn --no-build-isolation
    pip install scikit-learn icecream datasets pytorch-fid lpips opencv-python-headless
    pip install setuptools==61.0.0
    pip install -e lmms-eval/
    pip install huggingface_hub==0.24.7
    pip install transformers==4.41.1

Demo

You can use the Gradio interface to interact with OLA-VLM locally. The demo also supports visualizing the representations from the selected intermediate LLM layers (embedding loss positions).

# install demo-specific libraries
pip install -e .["demo"]

# start the demo
CUDA_VISIBLE_DEVICES=0 python demo.py --model-path shi-labs/pretrain_dsg_OLA-VLM-CLIP-ViT-Llama3-8b --PT-model-path shi-labs/pretrain_dsg_OLA-VLM-CLIP-ViT-Llama3-8b

Getting Started

Note: We provide a guide to integrating the embedding losses from OLA-VLM into any custom MLLM in Custom_MLLM.md.
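As a minimal hypothetical illustration of the wiring (the authoritative recipe is in Custom_MLLM.md), one simple way to expose intermediate hidden states in a custom MLLM is a forward hook on the chosen decoder layers; the layer indices and the Llama-style model.model.layers path below are assumptions:

# Capture hidden states from the chosen decoder layers of a custom MLLM.
captured = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # For LlamaDecoderLayer-style modules, output[0] is the hidden-state tensor.
        captured[layer_idx] = output[0]
    return hook

embed_layer_ids = [8, 16, 24]  # placeholder positions for the embedding losses
handles = [
    model.model.layers[i].register_forward_hook(make_hook(i))  # assumes a Llama-style layer list
    for i in embed_layer_ids
]

# Run the usual forward pass for the NTP loss; `captured` then holds the
# per-layer hidden states to feed the auxiliary embedding heads.
# Remove the hooks when finished:
# for h in handles:
#     h.remove()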

Training

  • Please see Training.md for training commands and dataset preparation.
  • We train all our models on 16 AMD MI300X GPUs (192 GB each).

Evaluation

Please see Evaluation.md for evaluation commands.

Probing

Please see Probing.md for probing commands.

Results

| Method  | Training Stages | LLM          | Base Encoder      | CV-Bench | MMStar | RWQA | OK-VQA | Checkpoint |
|---------|-----------------|--------------|-------------------|----------|--------|------|--------|------------|
| OLA-VLM | PT + IFT        | Phi3-4k-mini | CLIP-ViT-L        | 62.5     | 36.0   | 58.0 | 56.4   | ckpt       |
| OLA-VLM | PT + IFT        | Phi3-4k-mini | CLIP-ConvNeXT-XXL | 63.9     | 38.4   | 58.4 | 56.5   | ckpt       |
| OLA-VLM | PT + IFT        | Llama3-8b    | CLIP-ViT-L        | 61.4     | 39.5   | 57.9 | 56.6   | ckpt       |
| OLA-VLM | PT + IFT        | Llama3-8b    | CLIP-ConvNeXT-XXL | 61.5     | 38.5   | 55.0 | 59.0   | ckpt       |
| OLA-VLM | PT + VPT + IFT  | Llama3-8b    | CLIP-ConvNeXT-XXL | 64.6     | 40.6   | 62.9 | 61.1   | ckpt       |

Citation

If you find OLA-VLM useful in your research, please consider starring ⭐ us on GitHub and citing 📚 our work!

@article{jain2024ola_vlm,
      title={{OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation}},
      author={Jitesh Jain and Zhengyuan Yang and Humphrey Shi and Jianfeng Gao and Jianwei Yang},
      journal={arXiv},
      year={2024}
}

Acknowledgement

We thank the authors of LLaVA-1.5, OneFormer, Depth-Anything v2, and unCLIP-SD for open-sourcing their codebase and checkpoints. We are grateful to the authors of cambrian and MMStar for releasing their code for CV-Bench and MMStar evaluation, respectively.
