Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters".
- [10/24/2024] 🔥 We released two new MLM-Filter models based on Llama 3: mlm-filter-llama-3-8b and mlm-filter-llama-3.2-3b. The LLaVA codebase has been upgraded to Weizhi's customized new version, LLaVA-Video-Llama-3.
- [2/25/2024] 🔥 We released Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters. We propose adopting fine-tuned Multimodal Language Models as effective and efficient data filters to select high-quality image-text pairs from large-scale web-crawled image-text data. Check out the paper.
- LLaVA-Video-Llama-3: codebase for fine-tuning an MLM as a data filter
- mlm_filter_scoring_single_image.py: sample code for performing quality score generation on a single image-text pair
- mlm_filter_scoring_datacomp_batch_inference.py: sample code for performing large-scale quality score generation on Webdataset-format image-text data
- mlm_filter_scoring_datacomp_batch_inference_llama_3.py: sample code for performing large-scale quality score generation on Webdataset-format image-text data with the Llama-3-based MLM-Filter models
- run_inference.sh: sample script for performing large-scale quality score generation on Webdataset-format image-text data on machines with 8 GPUs
We highly recommend using python==3.10, i.e.,
conda create -n mlm_filter python=3.10
Then install the dependencies for quality score generation:
pip install -e LLaVA-Unified
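Putting the environment setup together (the `conda activate` step is implied between the two commands above; this is just a sketch):

```bash
conda create -n mlm_filter python=3.10
conda activate mlm_filter
pip install -e LLaVA-Unified
```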
To generate quality scores for a single image-text pair, run:

python mlm_filter_scoring_single_image.py --image-path /path/to/image --caption "text caption"
Parameters to note:
- `--metric`: quality scoring metric for generation; select from `image_text_matching`, `object_detail_fulfillment`, `caption_text_quality`, `semantic_understanding`, or `all`
- `--image-path`: path to a local image file or an image URL
- `--caption`: text caption
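For example, to score a single image-text pair on the image-text matching metric only (the image URL and caption below are placeholders):

```bash
python mlm_filter_scoring_single_image.py \
    --image-path https://example.com/sample.jpg \
    --caption "a dog catching a frisbee on the beach" \
    --metric image_text_matching
```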
To generate quality scores for large-scale Webdataset image-text data, run:

bash run_inference.sh ${GPU_START_ID} ${Metric} ${Model_Path} ${Data_Path} ${Tars_Per_GPU} ${Num_GPU}
Parameters to note:
- `GPU_START_ID`: for large-scale score generation across multiple machines, the index of the current machine
- `Metric`: quality scoring metric for generation; select from `image_text_matching`, `object_detail_fulfillment`, `caption_text_quality`, `semantic_understanding`, or `all`
- `Model_Path`: path to the MLM-Filter model checkpoint
- `Data_Path`: path to the Webdataset image-text tars
- `Tars_Per_GPU`: number of Webdataset image-text tars for a single GPU to run inference on
- `Num_GPU`: number of GPUs per machine, e.g. 1, 8, 16
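For example, a single-machine run on 8 GPUs might look like the following (the checkpoint path, data path, and per-GPU tar count are placeholders):

```bash
bash run_inference.sh 0 image_text_matching /path/to/mlm-filter-checkpoint /path/to/webdataset-tars 64 8
```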
- Prepare data
Please download the 50k multimodal instructions and save the file to ./data/mlm_filter_instruct_50k_gpt4v_cc12m_4k.json.
Please download the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script; we save all files as `.jpg`
- TextVQA: train_val_images
- VisualGenome: part1, part2
- CC12M: the images are available at the Huggingface Data Repo; after downloading, run `unzip images.zip -d data/images`
After downloading all of them, organize the data as follows in ./data/images:
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
├── vg
│   ├── VG_100K
│   └── VG_100K_2
└── cc12m
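As a sketch, the empty directory layout above can be created with the following command before moving each dataset's images into place:

```bash
mkdir -p data/images/{coco/train2017,gqa/images,ocr_vqa/images,textvqa/train_images,vg/VG_100K,vg/VG_100K_2,cc12m}
```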
As several images from the OCR-VQA data URLs are no longer available, you can also run check_missed_image.py to filter unavailable images out of the instruction dataset.
- Start training!
You may download LLaVA's pretrained projectors from the Model Zoo.
Visual instruction tuning takes around 4 hours for LLaVA-v1.5-13B on 8x A100 (80G) with the sampled 50k instruction dataset.
Training script with DeepSpeed ZeRO-3: LLaVA_ft/scripts/v1_5/finetune.sh.
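A minimal launch sketch, assuming you have edited the paths inside the script to point to the downloaded instruction file (./data/mlm_filter_instruct_50k_gpt4v_cc12m_4k.json) and the image folder (./data/images); the exact variable names inside the script may differ:

```bash
bash LLaVA_ft/scripts/v1_5/finetune.sh
```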
We open-source our fine-tuned MLM Data Filters at MLM-Filter-GPT4V and MLM-Filter-GPT4.
We also open-sourced our pre-trained CLIP-ViT-B/32 checkpoint under the DataComp-Medium Benchmark Controlled Setting in weizhiwang/clip_datacomp_medium_itm_th_66_AND_odf_th_20_gpt4v. Our best model is trained on the data filtered by both the ITM and ODF quality scores.
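To pull that checkpoint locally, one option is the Hugging Face CLI (a sketch; the local directory name is arbitrary and `huggingface_hub` must be installed):

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download weizhiwang/clip_datacomp_medium_itm_th_66_AND_odf_th_20_gpt4v --local-dir ./clip_datacomp_checkpoint
```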
Usage and License Notices: The data and checkpoint are intended and licensed for research use only. They are also restricted to uses that follow the license agreement of LLaMA, Vicuna and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.
For any question or issue, please feel free to contact weizhiwang@ucsb.edu or submit a GitHub issue.
Please cite our paper if you find this repository interesting or helpful in your research:
@article{mlm-filter,
title={Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters},
author={Wang, Weizhi and Mrini, Khalil and Yang, Linjie and Kumar, Sateesh and Tian, Yu and Yan, Xifeng and Wang, Heng},
journal={arXiv preprint arXiv:2403.02677},
year={2024},
}
MLM-Filter is developed based on