This project was carried out at Yonsei AI (YAI) as part of YAICON.
Large Language Models (LLMs) have become popular recently: they generate text remarkably well and are useful in many real-world applications. Several lines of research now combine a visual encoder with an LLM, producing models such as Flamingo, CLIP, Kosmos, BLIP, and LLaVA. However, these models are so large that they are hard to train or even run on small hardware. Our goal is therefore to minimize GPU usage for a VLM. We chose to implement LLaVA because of its simple model architecture and strong performance. Building on existing research on LLM quantization, we shrink the LLM and connect it to a visual encoder so that the whole model runs on a single GPU; we verified that it even runs in Colab.
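The LLaVA-style connection between the visual encoder and the LLM is essentially a single trainable projection that maps visual-encoder patch features into the LLM's token-embedding space. The sketch below illustrates this with numpy; the dimensions (1024 for CLIP ViT-L features, 4096 for LLaMA embeddings) and shapes are assumptions for illustration, not the repo's actual code.

```python
import numpy as np

# Hypothetical dimensions: 256 CLIP patch features of size 1024,
# projected into a 4096-dim LLaMA embedding space.
rng = np.random.default_rng(0)
visual_feats = rng.standard_normal((256, 1024))    # [num_patches, d_vision]
W_proj = rng.standard_normal((1024, 4096)) * 0.01  # trainable projector weight

visual_tokens = visual_feats @ W_proj              # [num_patches, d_llm]

# The projected visual tokens are prepended to the text token embeddings
# before the combined sequence is fed to the (frozen or LoRA-tuned) LLM.
text_embeds = rng.standard_normal((16, 4096))      # [num_text_tokens, d_llm]
llm_input = np.concatenate([visual_tokens, text_embeds], axis=0)
print(llm_input.shape)  # (272, 4096)
```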
Here is the list of research and code that we used to implement this project:
- LLaVA
- transformers: Hugging Face library
- GPTQ: large language model quantization
- LoRA: parameter-efficient fine-tuning method
- xturing: library for efficient LLMs (we cloned this and modified it slightly)
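GPTQ itself performs error-compensated, layer-by-layer quantization; as a much simpler illustration of why int4 storage is attractive, here is a plain round-to-nearest 4-bit quantization round trip in numpy. This is only a sketch of the general idea, not the GPTQ algorithm or this repo's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)  # a fake weight matrix

# Per-row symmetric scale so each row maps into the int4 range [-8, 7].
scale = np.abs(W).max(axis=1, keepdims=True) / 7.0
W_int4 = np.clip(np.round(W / scale), -8, 7).astype(np.int8)  # fits in 4 bits
W_deq = W_int4.astype(np.float32) * scale                     # dequantized

# Round-to-nearest loses precision; GPTQ reduces this error by
# compensating across columns using second-order information.
err = np.abs(W - W_deq).mean()
print(f"mean abs error: {err:.4f}")
```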
TBD
```shell
git clone https://github.com/ta3h30nk1m/xturing_LLAVA.git
pip install -r ./xturing_LLAVA/requirements.txt
```
We used the same datasets as LLaVA, namely CC3M and the LLAVA_instruct dataset.
```shell
python simple_train.py --dataset ./dataset_folder --weights_path ./checkpoint --first_stage True \
    --output ./output_path --epochs 1 --bs 32 --lr 1e-3
```
- --dataset: dataset folder path
- --weights_path: (optional) path to pretrained weights; if omitted, the default LLaMA checkpoint is loaded
- --first_stage: True to train only the projector layer that connects the visual encoder and the LLM; False to train both the projector and the LLM
- --output: checkpoint output path
- --lr: learning rate
- --bs: batch size
- --epochs: total number of epochs
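A hedged sketch of what --first_stage amounts to: when True, only the projector parameters are marked trainable and everything else stays frozen. The parameter names and structure below are illustrative stand-ins, not the repo's actual module names.

```python
# Toy parameter registry standing in for a real model's named parameters.
params = {
    "visual_encoder.layer0.weight": "frozen CLIP weights",
    "projector.weight": "visual-to-LLM projection",
    "projector.bias": "visual-to-LLM projection bias",
    "llm.layers.0.attn.weight": "LLaMA weights (int4 + LoRA)",
}

def trainable_names(first_stage: bool):
    if first_stage:
        # Stage 1: train only the projector bridging vision and language.
        return [n for n in params if n.startswith("projector")]
    # Stage 2: projector plus the LLM (in practice, its LoRA adapters);
    # the visual encoder stays frozen in both stages.
    return [n for n in params if not n.startswith("visual_encoder")]

print(trainable_names(True))   # ['projector.weight', 'projector.bias']
print(trainable_names(False))
```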
To change other hyperparameters, edit ./xturing_LLAVA/config/finetuning_config.yaml and modify the 'llama_lora_int4' section.
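As a rough illustration only, that section might look like the fragment below; the key names here are assumptions, so check the actual file for the schema it uses.

```yaml
# Hypothetical sketch -- verify key names against finetuning_config.yaml.
llama_lora_int4:
  learning_rate: 1e-3
  batch_size: 32
  num_train_epochs: 1
```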
```shell
python simple_generate.py --weights_path ./checkpoint --image_file ./image.png --text "input text to the model"
```
- --weights_path: model checkpoint path
- --image_file: input image file
- --text: input text
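LoRA (listed in the references above) avoids full-weight updates by learning a low-rank correction, W_eff = W + (alpha/r) * B A, which is why fine-tuning the quantized LLM stays cheap. A minimal numpy sketch of the parameter saving, with made-up dimensions:

```python
import numpy as np

d, r = 4096, 8                  # hidden size and LoRA rank (illustrative)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))                    # trainable; zero init, so the
alpha = 16.0                            # correction starts as a no-op

W_eff = W + (alpha / r) * (B @ A)       # effective weight during training

full = W.size
lora = A.size + B.size
print(f"trainable params: {lora} vs full fine-tune: {full} "
      f"({100 * lora / full:.2f}%)")
```

With rank 8 the trainable parameters are a fraction of a percent of the full matrix, which is what makes fine-tuning on a single GPU feasible.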
- Second-stage training
- RLHF
- Command-line option for selecting the vision tower (CLIP, DeepFloyd, ...)
- Gradio WebUI (chat)
- AutoGPT (backed by xturing-LLAVA instead of the GPT-4 API)
- Code refactoring