This project was carried out at Yonsei AI (YAI) as part of YAICON.
Large Language Models (LLMs) have become popular recently: they generate text remarkably well and are useful in many real-world applications. Several lines of research now combine a visual encoder with an LLM, producing models such as Flamingo, CLIP, Kosmos, BLIP, and LLaVA. However, these models are so large that they are hard to train or even run on small hardware. Our goal is therefore to minimize GPU usage for a VLM. We chose to implement LLaVA because of its simple model architecture and strong performance. Building on existing research on LLM quantization, we shrink the LLM and connect it to a visual encoder so that the whole model runs on a single GPU; we verified that it even runs in Colab.
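The LLaVA-style connection between the visual encoder and the LLM is essentially a single trainable projection that maps visual-encoder patch features into the LLM's token-embedding space. The sketch below illustrates this with numpy; the dimensions (1024 for CLIP ViT-L features, 4096 for LLaMA embeddings) and shapes are assumptions for illustration, not the repo's actual code.

```python
import numpy as np

# Hypothetical dimensions: 256 CLIP patch features of size 1024,
# projected into a 4096-dim LLaMA embedding space.
rng = np.random.default_rng(0)
visual_feats = rng.standard_normal((256, 1024))    # [num_patches, d_vision]
W_proj = rng.standard_normal((1024, 4096)) * 0.01  # trainable projector weight

visual_tokens = visual_feats @ W_proj              # [num_patches, d_llm]

# The projected visual tokens are prepended to the text token embeddings
# before the combined sequence is fed to the (frozen or LoRA-tuned) LLM.
text_embeds = rng.standard_normal((16, 4096))      # [num_text_tokens, d_llm]
llm_input = np.concatenate([visual_tokens, text_embeds], axis=0)
print(llm_input.shape)  # (272, 4096)
```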
Here is the list of research and code that we used to implement this project:
- LLaVA
- transformers: Hugging Face library
- GPTQ: large language model quantization
- LoRA: parameter-efficient fine-tuning method
- xturing: library for efficient LLMs (we cloned this and modified it slightly)
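GPTQ itself performs error-compensated, layer-by-layer quantization; as a much simpler illustration of why int4 storage is attractive, here is a plain round-to-nearest 4-bit quantization round trip in numpy. This is only a sketch of the general idea, not the GPTQ algorithm or this repo's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)  # a fake weight matrix

# Per-row symmetric scale so each row maps into the int4 range [-8, 7].
scale = np.abs(W).max(axis=1, keepdims=True) / 7.0
W_int4 = np.clip(np.round(W / scale), -8, 7).astype(np.int8)  # fits in 4 bits
W_deq = W_int4.astype(np.float32) * scale                     # dequantized

# Round-to-nearest loses precision; GPTQ reduces this error by
# compensating across columns using second-order information.
err = np.abs(W - W_deq).mean()
print(f"mean abs error: {err:.4f}")
```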
TBD
```shell
git clone https://github.com/ta3h30nk1m/xturing_LLAVA.git
pip install -r ./xturing_LLAVA/requirements.txt
```
We used the same datasets as LLaVA, namely CC3M and the LLAVA_instruct dataset.
```shell
python simple_train.py --dataset ./dataset_folder --weights_path ./checkpoint --first_stage True \
    --output ./output_path --epochs 1 --bs 32 --lr 1e-3
```
- --dataset: dataset folder path
- --weights_path: (optional) path to pretrained weights; if omitted, the default LLaMA checkpoint is loaded
- --first_stage: True to train only the projector layer that connects the visual encoder and the LLM; False to train both the projector and the LLM
- --output: checkpoint output path
- --lr: learning rate
- --bs: batch size
- --epochs: total number of epochs
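A hedged sketch of what --first_stage amounts to: when True, only the projector parameters are marked trainable and everything else stays frozen. The parameter names and structure below are illustrative stand-ins, not the repo's actual module names.

```python
# Toy parameter registry standing in for a real model's named parameters.
params = {
    "visual_encoder.layer0.weight": "frozen CLIP weights",
    "projector.weight": "visual-to-LLM projection",
    "projector.bias": "visual-to-LLM projection bias",
    "llm.layers.0.attn.weight": "LLaMA weights (int4 + LoRA)",
}

def trainable_names(first_stage: bool):
    if first_stage:
        # Stage 1: train only the projector bridging vision and language.
        return [n for n in params if n.startswith("projector")]
    # Stage 2: projector plus the LLM (in practice, its LoRA adapters);
    # the visual encoder stays frozen in both stages.
    return [n for n in params if not n.startswith("visual_encoder")]

print(trainable_names(True))   # ['projector.weight', 'projector.bias']
print(trainable_names(False))
```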
To change other hyperparameters, edit ./xturing_LLAVA/config/finetuning_config.yaml and modify the 'llama_lora_int4' section.
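As a rough illustration only, that section might look like the fragment below; the key names here are assumptions, so check the actual file for the schema it uses.

```yaml
# Hypothetical sketch -- verify key names against finetuning_config.yaml.
llama_lora_int4:
  learning_rate: 1e-3
  batch_size: 32
  num_train_epochs: 1
```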
```shell
python simple_generate.py --weights_path ./checkpoint --image_file ./image.png --text "input text to the model"
```
- --weights_path: model checkpoint path
- --image_file: input image file
- --text: input text
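LoRA (listed in the references above) avoids full-weight updates by learning a low-rank correction, W_eff = W + (alpha/r) * B A, which is why fine-tuning the quantized LLM stays cheap. A minimal numpy sketch of the parameter saving, with made-up dimensions:

```python
import numpy as np

d, r = 4096, 8                  # hidden size and LoRA rank (illustrative)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))                    # trainable; zero init, so the
alpha = 16.0                            # correction starts as a no-op

W_eff = W + (alpha / r) * (B @ A)       # effective weight during training

full = W.size
lora = A.size + B.size
print(f"trainable params: {lora} vs full fine-tune: {full} "
      f"({100 * lora / full:.2f}%)")
```

With rank 8 the trainable parameters are a fraction of a percent of the full matrix, which is what makes fine-tuning on a single GPU feasible.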
- Second-stage training
- RLHF
- Command-line option for selecting the vision tower (CLIP, DeepFloyd, ...)
- Gradio WebUI (chat)
- AutoGPT (backed by xturing-LLAVA instead of the GPT-4 API)
- Code refactoring