
We use the following hyperparameters for training ConvLLaVA.

Hyperparameters   Stage 1   Stage 2   Stage 3
Learning Rate     3e-4      2e-5      2e-5
Batch Size        256       256       128
Epochs            1         1         1
Warmup Ratio      0.03      0.03      0.03
Weight Decay      0         0         0
Optimizer         AdamW     AdamW     AdamW
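
For illustration, the table above can be written as a plain Python mapping, which makes it easy to derive quantities such as the number of warmup steps. This is a minimal sketch, not the project's training configuration: the names `STAGE_HPARAMS` and `warmup_steps` are hypothetical, and the step count assumes one optimizer step per global batch with rounding up, which may differ from what your training framework does.

```python
import math

# Values copied from the hyperparameter table; the dict layout is hypothetical.
_SHARED = {"epochs": 1, "warmup_ratio": 0.03, "weight_decay": 0.0, "optimizer": "AdamW"}
STAGE_HPARAMS = {
    "stage_1": {"learning_rate": 3e-4, "batch_size": 256, **_SHARED},
    "stage_2": {"learning_rate": 2e-5, "batch_size": 256, **_SHARED},
    "stage_3": {"learning_rate": 2e-5, "batch_size": 128, **_SHARED},
}


def warmup_steps(num_samples, stage):
    """Estimate warmup steps, assuming one optimizer step per global batch."""
    hp = STAGE_HPARAMS[stage]
    steps_per_epoch = math.ceil(num_samples / hp["batch_size"])
    total_steps = steps_per_epoch * hp["epochs"]
    return math.ceil(total_steps * hp["warmup_ratio"])
```

Calling `warmup_steps(num_samples, "stage_3")` with the size of your own instruction-tuning set gives a rough idea of how short the warmup phase is at a 0.03 ratio.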

Projector Initialization

We use captions from ShareGPT4V-PT, ShareGPT4V, and ALLaVA.

Vision Language Pretraining

We use ShareGPT4V-PT, ShareGPT4V, ALLaVA, and a part of VFLAN.

Instruction Tuning

We use the LLaVA-1.5 SFT 665k dataset. We will update the results when LLaVA-NeXT is released.

Prepare Images

First, download all images and instruction files.

Then, organize the data as follows:

├── ...
├── data
│   ├── allava
│   │   ├── allava_laion
│   │   │   ├── images
│   │   │   ├── ALLaVA-Caption-LAION-4V.json
│   │   │   ├── ALLaVA-Instruct-LAION-4V.json
│   │   ├── allava_vflan
│   │   │   ├── ALLaVA-Caption-VFLAN-4V.json
│   │   │   ├── ALLaVA-Instruct-VFLAN-4V.json
│   ├── coco
│   │   ├── train2017
│   ├── llava
│   │   ├── llava_v1_5_mix665k.json
│   ├── sam
│   │   ├── images
│   ├── gqa
│   │   ├── images
│   ├── ocr_vqa
│   │   ├── images
│   ├── textvqa
│   │   ├── train_images
│   ├── vg
│   │   ├── VG_100K
│   │   ├── VG_100K_2
│   ├── vflan
│   │   ├── images_191task_1k
│   │   ├── annotation_191-task_1k.json
│   ├── sharegpt4v
│   │   ├── share-captioner_coco_lcs_sam_1246k_1107.json
│   │   ├── sharegpt4v_instruct_gpt4-vision_cap100k.json
│   ├── share_textvqa
│   │   ├── images
│   ├── web-celebrity
│   │   ├── images
│   ├── web-landmark
│   │   ├── images
│   ├── wikiart
│   │   ├── images
├── ...
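
Before starting training, it can help to confirm that the layout matches the tree above. The following is a minimal sketch: the helper name `find_missing` and the `data` root used in the example call are assumptions, and the path list only mirrors the entries shown in the tree.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_PATHS = [
    "allava/allava_laion/images",
    "allava/allava_laion/ALLaVA-Caption-LAION-4V.json",
    "allava/allava_laion/ALLaVA-Instruct-LAION-4V.json",
    "allava/allava_vflan/ALLaVA-Caption-VFLAN-4V.json",
    "allava/allava_vflan/ALLaVA-Instruct-VFLAN-4V.json",
    "coco/train2017",
    "llava/llava_v1_5_mix665k.json",
    "sam/images",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "vflan/images_191task_1k",
    "vflan/annotation_191-task_1k.json",
    "sharegpt4v/share-captioner_coco_lcs_sam_1246k_1107.json",
    "sharegpt4v/sharegpt4v_instruct_gpt4-vision_cap100k.json",
    "share_textvqa/images",
    "web-celebrity/images",
    "web-landmark/images",
    "wikiart/images",
]


def find_missing(data_root):
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    # "data" is the folder name used in the tree; adjust it to your setup.
    for path in find_missing("data"):
        print(f"missing: {path}")
```

The check only tests for existence, so it will not catch partially downloaded image folders.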

If downloading the OCR-VQA images is slow, you can refer to this issue. Using a thread pool speeds up the download:

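import concurrent.futures
import os
import urllib.request as ureq  # assumed to be the module behind the `ureq` alias

# `data` (image id -> metadata dict) and `download` are assumed to be defined
# earlier in the download script that this snippet modifies.

def download_image(k):
    # Keep the file extension from the source URL
    ext = os.path.splitext(data[k]['imageURL'])[1]
    outputFile = 'images/%s%s' % (k, ext)

    # Only download the image if it doesn't exist
    if not os.path.exists(outputFile):
        ureq.urlretrieve(data[k]['imageURL'], outputFile)

if download == 1:
    # Create the directory if it doesn't exist
    os.makedirs('./images', exist_ok=True)

    # Create a thread pool and download the images in parallel
    with concurrent.futures.ThreadPoolExecutor() as executor:
        executor.map(download_image, data.keys())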

For OCR-VQA, some GIF images need to be converted to JPG. You can use the code below:

import os
from PIL import Image

def convert_gif_to_jpg(folder_path):
    for filename in os.listdir(folder_path):
        if filename.endswith('.gif'):
            file_path = os.path.join(folder_path, filename)
            # Open the GIF; convert('RGB') below uses its first frame
            with Image.open(file_path) as img:
                jpg_filename = os.path.splitext(filename)[0] + '.jpg'
                jpg_path = os.path.join(folder_path, jpg_filename)
                img.convert('RGB').save(jpg_path, 'JPEG', quality=95)
                print(f'Converted {filename} to {jpg_filename}')

folder_path = 'path_to_your_folder'
convert_gif_to_jpg(folder_path)

Data Configuration

You can modify the file to add datasets. Replace the placeholder paths with the actual paths:

def build_sharegpt4v(tokenizer, data_args):
    data_path = 'path_to_sharegpt4v_pt.json'
    image_folder = 'folder_to_sharegpt4v_pt'
    dataset = SampleDataset(data_path, tokenizer, data_args,
                            image_folder)
    return dataset
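
After replacing the paths, a quick sanity check can catch mismatches between an annotation file and its image folder. This is a minimal sketch under an assumption: the JSON file is a list of records in the LLaVA style, where each record may carry an `image` field holding a path relative to the image folder. The helper name `missing_images` is hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def missing_images(data_path, image_folder, limit=None):
    """Return image paths referenced in data_path that are not in image_folder.

    Assumes data_path holds a JSON list of dicts with an optional "image" key.
    """
    with open(data_path, "r", encoding="utf-8") as f:
        records = json.load(f)

    folder = Path(image_folder)
    missing = []
    for record in records[:limit]:  # limit=None checks every record
        image = record.get("image")  # text-only records have no "image" key
        if image is not None and not (folder / image).is_file():
            missing.append(image)
    return missing
```

An empty result for the `data_path` and `image_folder` pair used in a builder function suggests the two paths are consistent; passing a small `limit` keeps the check fast on large files.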