We use the following hyperparameters for training ConvLLaVA.
| Hyperparameter | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|
| Learning Rate | 3e-4 | 2e-5 | 2e-5 |
| Batch Size | 256 | 256 | 128 |
| Epochs | 1 | 1 | 1 |
| Warmup Ratio | 0.03 | 0.03 | 0.03 |
| Weight Decay | 0 | 0 | 0 |
| Optimizer | AdamW | AdamW | AdamW |
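As a rough illustration, here is how the Stage 1 row might map onto HuggingFace `TrainingArguments` (ConvLLaVA builds on the LLaVA training stack). This is a minimal sketch: the output path, per-device batch size, GPU count, and cosine schedule are assumptions, not values from the table.

```python
# Minimal sketch, assuming a HuggingFace Trainer-style setup (not the exact repo config).
from transformers import TrainingArguments

stage1_args = TrainingArguments(
    output_dir='./checkpoints/convllava-stage1',  # hypothetical path
    learning_rate=3e-4,                # from the table above
    per_device_train_batch_size=16,    # assumption: 16 per GPU x 16 GPUs = 256 global
    num_train_epochs=1,
    warmup_ratio=0.03,
    weight_decay=0.0,
    optim='adamw_torch',               # AdamW
    lr_scheduler_type='cosine',        # assumption: LLaVA's default schedule
)
```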
The training data for each stage:
- Stage 1: captions from ShareGPT4V-PT, ShareGPT4V, and ALLaVA.
- Stage 2: ShareGPT4V-PT, ShareGPT4V, ALLaVA, and a part of VFLAN.
- Stage 3: the LLaVA-1.5 SFT 665k dataset. We will update the results when the LLaVA-NeXT data is released.
First, download all images and instruction files.
- ALLaVA: images
- COCO: train2017
- LLaVA: llava
- WebData: images (for academic usage only)
- SAM: images. We only use 000000~000050.tar for now (see the extraction sketch after this list). If downloading is slow for you in China, please refer to opendatalab to download it.
- GQA: images
- OCR-VQA: download script. We save all files as `.jpg`
- TextVQA: trainvalimages
- VisualGenome: part1, part2
- vflan: vflan
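For unpacking the SAM shards mentioned above, a minimal sketch is below; the `sa_XXXXXX.tar` shard naming and the target directory are assumptions based on the SAM release and the layout that follows:

```python
# Minimal sketch: extract SAM shards 000000.tar..000050.tar into data/sam/images.
# Shard naming (sa_000000.tar) and paths are assumptions; adjust to your download.
import os
import tarfile

out_dir = 'ShareGPT4V/data/sam/images'
os.makedirs(out_dir, exist_ok=True)
for i in range(51):  # shards 000000 through 000050
    shard = f'sa_{i:06d}.tar'
    if os.path.exists(shard):
        with tarfile.open(shard) as tf:
            tf.extractall(out_dir)
```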
Then, organize the data as follows:
```
ShareGPT4V
├── ...
├── data
│   ├── allava
│   │   ├── allava_laion
│   │   │   ├── images
│   │   │   ├── ALLaVA-Caption-LAION-4V.json
│   │   │   ├── ALLaVA-Instruct-LAION-4V.json
│   │   ├── allava_vflan
│   │   │   ├── ALLaVA-Caption-VFLAN-4V.json
│   │   │   ├── ALLaVA-Instruct-VFLAN-4V.json
│   ├── coco
│   │   ├── train2017
│   ├── llava
│   │   ├── llava_v1_5_mix665k.json
│   ├── sam
│   │   ├── images
│   ├── gqa
│   │   ├── images
│   ├── ocr_vqa
│   │   ├── images
│   ├── textvqa
│   │   ├── train_images
│   ├── vg
│   │   ├── VG_100K
│   │   ├── VG_100K_2
│   ├── vflan
│   │   ├── images_191task_1k
│   │   ├── annotation_191-task_1k.json
│   ├── sharegpt4v
│   │   ├── share-captioner_coco_lcs_sam_1246k_1107.json
│   │   ├── sharegpt4v_instruct_gpt4-vision_cap100k.json
│   ├── share_textvqa
│   │   ├── images
│   ├── web-celebrity
│   │   ├── images
│   ├── web-landmark
│   │   ├── images
│   ├── wikiart
│   │   ├── images
├── ...
```
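Before training, a quick sanity check like the sketch below can confirm the layout is in place; the root path and the subset of paths checked are assumptions based on the tree above:

```python
# Minimal sketch: verify the expected data layout exists under the dataset root.
import os

root = 'ShareGPT4V'  # adjust to where you placed the dataset root
required = [
    'data/allava/allava_laion/images',
    'data/coco/train2017',
    'data/llava/llava_v1_5_mix665k.json',
    'data/sam/images',
    'data/gqa/images',
    'data/ocr_vqa/images',
    'data/textvqa/train_images',
    'data/vg/VG_100K',
    'data/vg/VG_100K_2',
    'data/vflan/annotation_191-task_1k.json',
    'data/sharegpt4v/share-captioner_coco_lcs_sam_1246k_1107.json',
]
for rel in required:
    path = os.path.join(root, rel)
    print(('OK      ' if os.path.exists(path) else 'MISSING ') + path)
```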
If downloading the OCR-VQA images is slow, you could refer to this issue and download the images in parallel:
```python
import concurrent.futures
import json
import os
import urllib.request as ureq

# Load the OCR-VQA annotations (dataset.json ships with the official download script)
with open('dataset.json', 'r') as f:
    data = json.load(f)

download = 1  # set to 0 to skip downloading

def download_image(k):
    ext = os.path.splitext(data[k]['imageURL'])[1]
    output_file = 'images/%s%s' % (k, ext)
    # Only download the image if it doesn't already exist
    if not os.path.exists(output_file):
        ureq.urlretrieve(data[k]['imageURL'], output_file)

if download == 1:
    # Create the output directory if it doesn't exist
    if not os.path.exists('./images'):
        os.mkdir('./images')
    # Create a thread pool and download the images in parallel
    with concurrent.futures.ThreadPoolExecutor() as executor:
        executor.map(download_image, data.keys())
```
For OCR-VQA, some GIF images should be converted to JPG. You could follow the code below:
```python
import os
from PIL import Image

def convert_gif_to_jpg(folder_path):
    for filename in os.listdir(folder_path):
        if filename.endswith('.gif'):
            file_path = os.path.join(folder_path, filename)
            with Image.open(file_path) as img:
                jpg_filename = os.path.splitext(filename)[0] + '.jpg'
                jpg_path = os.path.join(folder_path, jpg_filename)
                # Convert to RGB (drops the GIF palette/alpha) and save as JPEG
                img.convert('RGB').save(jpg_path, 'JPEG', quality=95)
                print(f'Converted {filename} to {jpg_filename}')

folder_path = 'path_to_your_folder'
convert_gif_to_jpg(folder_path)
```
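After verifying the converted files, you may remove the original `.gif` files so the folder matches the all-`.jpg` convention noted above.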
You could modify the file data.py to add the datasets. Replace the placeholders with the true paths:
```python
def build_sharegpt4v(tokenizer, data_args):
    data_path = 'path_to_sharegpt4v_pt.json'
    image_folder = 'folder_to_sharegpt4v_pt'
    dataset = SampleDataset(data_path, tokenizer, data_args,
                            image_folder)
    return dataset
```
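Other datasets can be registered analogously. For example, a hypothetical builder for the VFLAN subset following the same pattern (the paths come from the layout above; `SampleDataset` is the repository's class):

```python
# Hypothetical sketch following the same pattern for the VFLAN subset;
# adjust the paths to where you placed the files in the layout above.
def build_vflan(tokenizer, data_args):
    data_path = 'ShareGPT4V/data/vflan/annotation_191-task_1k.json'
    image_folder = 'ShareGPT4V/data/vflan/images_191task_1k'
    return SampleDataset(data_path, tokenizer, data_args, image_folder)
```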