MLM Filter

Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters".

Release

[10/24/2024] 🔥 We released two new MLM-Filter models based on llama3, mlm-filter-llama-3-8b and mlm-filter-llama-3.2-3b. The LLaVA codebase is upgraded to Weizhi's customized new version LLaVA-Video-Llama-3.
[2/25/2024] 🔥 We released Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters. We propose adopting fine-tuned Multimodal Language Models as effective and efficient data filters to select high-quality image-text pairs from large-scale web-crawled image-text data. Check out the paper.

Project Structure

LLaVA-Video-Llama-3: codebase for fine-tuning MLM as Data Filter
mlm_filter_scoring_single_image.py: Sample code for generating quality scores on a single image-text pair
mlm_filter_scoring_datacomp_batch_inference.py: Sample code for large-scale quality score generation on Webdataset-format image-text data
mlm_filter_scoring_datacomp_batch_inference_llama_3.py: Sample code for large-scale quality score generation on Webdataset-format image-text data with llama3-based MLM-Filter models
run_inference.sh: Sample script for large-scale quality score generation on Webdataset-format image-text data on machines with 8 GPUs

Install

We strongly recommend using python==3.10, i.e.,

conda create -n mlm_filter python=3.10

Then install the dependencies for quality score generation:

pip install -e LLaVA-Unified

Quality Score Generation

Inference on Single Image

python mlm_filter_scoring_single_image.py --image-path /path/to/image --caption "text caption"

Parameters to note:

--metric: quality scoring metric for generation, select among image_text_matching, object_detail_fulfillment, caption_text_quality, semantic_understanding, all
--image-path: path to image file or image url
--caption: text caption
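
To score one pair under a chosen metric from another Python program, the command line above can be assembled programmatically. The following is a minimal sketch that uses only the flags documented above; the helper name build_scoring_command and the example caption are hypothetical and not part of this repository.

```python
import shlex

# Metric names as listed above; "all" requests every metric in one run.
METRICS = (
    "image_text_matching",
    "object_detail_fulfillment",
    "caption_text_quality",
    "semantic_understanding",
)


def build_scoring_command(image_path, caption, metric="all"):
    """Return the argv list for one call to mlm_filter_scoring_single_image.py.

    Hypothetical helper: it only assembles the documented flags and does not
    run the script itself.
    """
    if metric != "all" and metric not in METRICS:
        raise ValueError(f"unknown metric: {metric}")
    return [
        "python",
        "mlm_filter_scoring_single_image.py",
        "--metric", metric,
        "--image-path", image_path,
        "--caption", caption,
    ]


cmd = build_scoring_command("/path/to/image", "a dog on a beach", "image_text_matching")
# shlex.join quotes the caption so the printed line can be pasted into a shell.
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root once the model dependencies are installed.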

Inference on Webdataset Large-Scale Data

bash run_inference.sh ${GPU_START_ID} ${Metric} ${Model_Path} ${Data_Path} ${Tars_Per_GPU} ${Num_GPU}

Parameters to note:

GPU_START_ID: for large-scale score generation across multiple machines, the index of the current machine
Metric: quality scoring metric for generation, select among image_text_matching, object_detail_fulfillment, caption_text_quality, semantic_understanding, all
Model_Path: path to the mlm filter model checkpoint
Data_Path: path to the webdataset image-text tars
Tars_Per_GPU: the number of webdataset image-text tars that each GPU runs inference on
Num_GPU: the number of GPUs for one machine, e.g. 1, 8, 16
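
To plan a multi-machine run, it helps to see how these three numbers could divide the tars. The sketch below shows one plausible scheme, contiguous and non-overlapping blocks per GPU. This is an assumption for illustration only: the actual assignment is defined in run_inference.sh and may differ, and the function name tar_range_for_gpu is hypothetical.

```python
def tar_range_for_gpu(machine_index, local_gpu, tars_per_gpu, num_gpu):
    """Return the half-open range of tar indices handled by one GPU.

    Assumption (not confirmed by run_inference.sh): GPUs are numbered
    globally as machine_index * num_gpu + local_gpu, and each global GPU
    takes the next contiguous block of tars_per_gpu tars.
    """
    if not 0 <= local_gpu < num_gpu:
        raise ValueError("local_gpu must be in [0, num_gpu)")
    global_gpu = machine_index * num_gpu + local_gpu
    start = global_gpu * tars_per_gpu
    return range(start, start + tars_per_gpu)


# Second machine (index 1), first GPU, 4 tars per GPU, 8 GPUs per machine.
print(list(tar_range_for_gpu(1, 0, 4, 8)))  # [32, 33, 34, 35]
```

Under this scheme, a machine with Num_GPU GPUs covers Num_GPU * Tars_Per_GPU tars, so the number of machines needed is the total tar count divided by that product, rounded up.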

Fine-Tuning MLM as Data Filter

Prepare data

Please download the 50k multimodal instructions and save them to ./data/mlm_filter_instruct_50k_gpt4v_cc12m_4k.json.

Please download the images from the constituent datasets:

COCO: train2017
GQA: images
OCR-VQA: download script; we save all files as .jpg
TextVQA: train_val_images
VisualGenome: part1, part2
CC12M: the images are available at the Huggingface Data Repo; extract them with unzip images.zip -d data/images

After downloading all of them, organize the data as follows in ./data/images,

├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
├── vg
│   ├── VG_100K
│   └── VG_100K_2
└── cc12m
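
Before launching training, the layout can be checked with a few lines of Python. This is a minimal sketch: the directory names are copied from the tree above, and the helper name missing_image_dirs is hypothetical.

```python
from pathlib import Path

# Sub-directories expected under ./data/images, taken from the tree above.
EXPECTED_DIRS = (
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "cc12m",
)


def missing_image_dirs(root="./data/images"):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


# Report anything that still needs to be downloaded or moved into place.
for name in missing_image_dirs():
    print(f"missing: {name}")
```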

Because several OCR-VQA image URLs are no longer available, you can also run check_missed_image.py to filter the unavailable images out of the instruction dataset.
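
The idea behind such a filter can be sketched as follows. This is not the code of check_missed_image.py; it assumes the instruction file is a JSON list of LLaVA-style records whose optional "image" field holds a path relative to ./data/images, and the function names are hypothetical.

```python
import json
from pathlib import Path


def drop_missing_images(records, image_root):
    """Keep records whose image file exists; text-only records are kept too.

    Assumption: each record is a dict, and its optional "image" value is a
    path relative to image_root (LLaVA-style instruction data).
    """
    image_root = Path(image_root)
    kept = []
    for record in records:
        image = record.get("image")
        if image is None or (image_root / image).is_file():
            kept.append(record)
    return kept


def filter_instruction_file(in_path, out_path, image_root="./data/images"):
    """Read a JSON list of records, drop unavailable images, write the rest."""
    with open(in_path, encoding="utf-8") as f:
        records = json.load(f)
    kept = drop_missing_images(records, image_root)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(kept, f)
    # Return how many records were removed so the caller can log it.
    return len(records) - len(kept)
```

Writing the result to a new file rather than overwriting the download keeps the original 50k instructions available if the images are recovered later.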

Start training!

You may download LLaVA's pretrained projectors in Model Zoo.

Visual instruction tuning takes around 4 hours for LLaVA-v1.5-13B on 8x A100 (80G) with the sampled 50k instruction dataset.

Training script with DeepSpeed ZeRO-3: LLaVA_ft/scripts/v1_5/finetune.sh.

We open-source our fine-tuned MLM Data Filters at MLM-Filter-GPT4V and MLM-Filter-GPT4.

Our Best CLIP Model on DataComp-Medium

We also open-sourced our pre-trained CLIP-ViT-B/32 checkpoint under the DataComp-Medium Benchmark Controlled Setting in weizhiwang/clip_datacomp_medium_itm_th_66_AND_odf_th_20_gpt4v. Our best model is trained on the data filtered by both the ITM and ODF Quality Scores.
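
Filtering on both scores amounts to a conjunction of two threshold tests. The sketch below reads the thresholds 66 (ITM) and 20 (ODF) off the checkpoint name; that reading, the inclusive comparison, and the function name keep_sample are assumptions rather than documented behavior.

```python
def keep_sample(itm_score, odf_score, itm_threshold=66, odf_threshold=20):
    """Keep a pair only if it passes both thresholds.

    Assumptions: the defaults are inferred from the checkpoint name
    "itm_th_66_AND_odf_th_20", and a score equal to the threshold passes.
    """
    return itm_score >= itm_threshold and odf_score >= odf_threshold


# Hypothetical (ITM, ODF) scores for three image-text pairs.
scores = [(85, 40), (70, 10), (50, 90)]
kept = [pair for pair in scores if keep_sample(*pair)]
print(kept)  # [(85, 40)]
```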

License

Usage and License Notices: The data and checkpoint are intended and licensed for research use only. They are also restricted to uses that follow the license agreement of LLaMA, Vicuna and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.

Contacts

For any questions or issues, please feel free to contact weizhiwang@ucsb.edu or open a GitHub issue.

Citation

Please cite our paper if you find this repository interesting or helpful in your research:

@article{mlm-filter,
    title={Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters}, 
    author={Wang, Weizhi and Mrini, Khalil and Yang, Linjie and Kumar, Sateesh and Tian, Yu and Yan, Xifeng and Wang, Heng},
    journal={arXiv preprint arXiv:2403.02677},
    year={2024},
}

Credits

MLM-Filter is developed based on

Vicuna: foundation language model for LLaVA
LLaVA: the codebase for fine-tuning LLaVA as image-text data filters
DataComp: the codebase for data filtering and CLIP pre-training
