Jinheng Xie1*
Weijia Mao1*
Zechen Bai1*
David Junhao Zhang1*
Weihao Wang2
Kevin Qinghong Lin1
Yuchao Gu1
Zhijie Chen2
Zhenheng Yang2
Mike Zheng Shou1
1 Show Lab, National University of Singapore 2 Bytedance
- [2024-10-15] Update arXiv paper to include new features and experimental results.
  - Support image generation in a resolution of 512x512.
  - Improve the multimodal understanding capabilities of purely discrete Show-o.
  - Improve the performance on the GenEval benchmark.
  - Explore the impact of dataset scale and image resolution on multimodal understanding capabilities of discrete image tokens. For more information, please refer to the paper.
  - We release the weights of Show-o before fine-tuning on LLaVA instructional tuning datasets. You can fine-tune it following the configurations in ./configs.
- [2024-09-12] arXiv paper updated to include preliminaries about discrete diffusion.
- [2024-09-03] We deploy an online demo on Hugging Face Space. 🤗 Have fun!
- [2024-09-02] We release the training code for pre-training and instruction tuning! 🔥🔥
- [2024-09-01] Add FlexAttention implementation for acceleration. Thanks to @Horace for providing examples.
- [2024-08-28] We maintain a repo of Awesome Unified Multimodal Models. If you are interested in unified models, star and watch it to get the latest updates!
- [2024-08-27] Add integration to Hugging Face! Thanks to @NielsRogge.
- [2024-08-26] We build two community platforms to facilitate discussion, requests, and collaboration! Reach us on Discord and WeChat!
- [2024-08-23] We release the inference code of Show-o (1.3B) for multimodal understanding and generation, including image captioning, visual question answering (VQA), text-to-image generation, text-guided inpainting, and extrapolation.
Below is a characteristics comparison among understanding only, generation only, and unified (understanding & generation) models. Vision and Language indicate the representations from specific input modalities. In this context, Diffusion represents both continuous and discrete diffusion.
Below is an overview of Show-o. The input data, regardless of its modalities, is tokenized and then prompted into a formatted input sequence. Show-o processes text tokens autoregressively with causal attention and image tokens in (discrete) denoising diffusion modeling via full attention, and then generates the desired output. Specifically, Show-o is capable of handling image captioning, visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed modality generation.
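To make the attention pattern described above concrete, here is a minimal sketch in plain Python. It is not taken from the Show-o code base: the function name and the "text"/"image" labels are hypothetical, and it assumes that every position attends causally to earlier positions while tokens inside the same contiguous image block additionally attend to each other in full.

```python
from typing import List, Optional


def build_mixed_attention_mask(token_types: List[str]) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    token_types holds "text" or "image" for each position (hypothetical labels).
    """
    n = len(token_types)

    # Give every contiguous run of image tokens its own block id; text gets None.
    block_ids: List[Optional[int]] = []
    current_block = -1
    previous = None
    for token_type in token_types:
        if token_type == "image":
            if previous != "image":
                current_block += 1
            block_ids.append(current_block)
        else:
            block_ids.append(None)
        previous = token_type

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i  # standard causal rule
            same_image = block_ids[i] is not None and block_ids[i] == block_ids[j]
            mask[i][j] = causal or same_image  # full attention inside an image block
    return mask


if __name__ == "__main__":
    types = ["text", "text", "image", "image", "image", "text"]
    for row in build_mixed_attention_mask(types):
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions, text rows keep the lower-triangular causal shape, while the rows of an image block share one fully connected square on the diagonal.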
- Release the inference code.
- Release the training code.
- Support image generation in a resolution of 512x512.
- Scale up the model size (based on LLaMA3) and increase the amount of training data.
The Show-o checkpoints can be found on Hugging Face:
- showlab/show-o-512x512
- showlab/show-o-w-clip-vit-512x512
- showlab/show-o-512x512-wo-llava-tuning
- showlab/show-o
- showlab/show-o-w-clip-vit
- showlab/magvitv2
- Journeydb-Annotation
First, set up the environment:
pip3 install -r requirements.txt
Log in to your wandb account on your machine or server.
wandb login <your wandb keys>
Inference demo for Multimodal Understanding. You can view the results on wandb.
option (c)
python3 inference_mmu.py config=configs/showo_demo_w_clip_vit_512x512.yaml \
max_new_tokens=100 \
mmu_image_root=./mmu_validation question='Please describe this image in detail. *** Do you think the image is unusual or not?'
or option (a)
python3 inference_mmu.py config=configs/showo_demo_512x512.yaml \
max_new_tokens=100 \
mmu_image_root=./mmu_validation question='Please describe this image in detail. *** Do you think the image is unusual or not?'
Inference demo for Text-to-Image Generation. You can view the results (in a resolution of 512x512) on wandb.
python3 inference_t2i.py config=configs/showo_demo_512x512.yaml \
batch_size=1 validation_prompts_file=validation_prompts/showoprompts.txt \
guidance_scale=5 generation_timesteps=50 \
mode='t2i'
Inference demo for Text-guided Inpainting. You can view the results (in a resolution of 256x256) on wandb.
python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 \
guidance_scale=1.75 generation_timesteps=16 \
mode='inpainting' prompt='A blue sports car with sleek curves and tinted windows, parked on a bustling city street.' \
image_path=./inpainting_validation/bus.jpg inpainting_mask_path=./inpainting_validation/bus_mask.webp
Inference demo for Text-guided Extrapolation. You can view the results (in a resolution of 256x256) on wandb.
python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 \
guidance_scale=1.75 generation_timesteps=16 \
mode='extrapolation' extra_direction='left *** left *** left *** right *** right *** right' offset=0 prompt='a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees.' \
image_path=./inpainting_validation/alpine_lake.jpg
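The extrapolation command above passes several directions and prompts in one string, separated by ` *** `. The following sketch builds both strings from a list of directions. It assumes, based only on the example command, that one prompt is expected per direction; the helper name is hypothetical and the script itself may accept other combinations.

```python
from typing import List, Tuple

SEPARATOR = " *** "  # delimiter as it appears in the example commands


def build_extrapolation_args(prompt: str, directions: List[str]) -> Tuple[str, str]:
    """Return (extra_direction, prompt) strings with one prompt copy per direction.

    Assumption: the number of prompts should match the number of directions,
    as in the README example (six of each).
    """
    if not directions:
        raise ValueError("at least one direction is required")
    extra_direction = SEPARATOR.join(directions)
    repeated_prompt = SEPARATOR.join([prompt] * len(directions))
    return extra_direction, repeated_prompt


if __name__ == "__main__":
    direction_arg, prompt_arg = build_extrapolation_args(
        "a serene natural landscape featuring a clear, blue lake surrounded by lush green trees.",
        ["left", "left", "left", "right", "right", "right"],
    )
    print(f"extra_direction='{direction_arg}'")
    print(f"prompt='{prompt_arg}'")
```

The printed values can be pasted into the `extra_direction=` and `prompt=` arguments of the command.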
Prepare your training data and change the data path in configs/xx.yaml.
Note that our training process is based on accelerate. Please make sure to configure accelerate for distributed training. We provide config examples below for (distributed) training on a single GPU or multiple GPUs.
├── accelerate_configs/
| ├── multi_nodes (6x8 GPUs)
| | ├── ...
| ├── 1_gpu.yaml
| └── 8_gpu_deepspeed_zero2.yaml
Stage 1 - Pre-training on ImageNet-1K dataset. Change the data path to ImageNet-1K in configs/showo_pretraining_stage1.yaml. Note that we use internal packages to process the RefinedWeb dataset, so you must manually comment out the code related to language modeling in training/train.py or write a new dataloader.
accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train.py config=configs/showo_pretraining_stage1.yaml
Once trained, the checkpoint folder is structured as follows:
├── show-o-training-stage1/
| ├── ...
| ├── checkpoint-500000
| └── config.yaml
Moving to the next stage is a bit cumbersome: create a new output folder (set in the yaml config) for stage 2, copy the latest checkpoint of stage 1 to this folder, and rename it to checkpoint-0. Training will then resume from it automatically in the next stage. Apply the same procedure when resuming training in the following stages.
├── show-o-training-stage2/
| └── checkpoint-0
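The hand-off between stages can also be scripted. The sketch below is not part of the repository: the function names are hypothetical, and it assumes that checkpoints are directories named `checkpoint-<step>` directly under the stage output folder, as in the tree shown earlier.

```python
import re
import shutil
from pathlib import Path
from typing import Optional, Union

CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(stage_dir: Path) -> Optional[Path]:
    """Return the checkpoint directory with the highest step number, if any."""
    best: Optional[Path] = None
    best_step = -1
    for child in stage_dir.iterdir():
        match = CHECKPOINT_PATTERN.match(child.name)
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best = child
    return best


def seed_next_stage(prev_stage_dir: Union[str, Path],
                    next_stage_dir: Union[str, Path]) -> Path:
    """Copy the latest checkpoint of the previous stage to <next>/checkpoint-0."""
    prev_dir = Path(prev_stage_dir)
    next_dir = Path(next_stage_dir)
    source = latest_checkpoint(prev_dir)
    if source is None:
        raise FileNotFoundError(f"no checkpoint-<step> folder found in {prev_dir}")
    target = next_dir / "checkpoint-0"
    if target.exists():
        # Refuse to overwrite an existing starting checkpoint.
        raise FileExistsError(f"{target} already exists")
    next_dir.mkdir(parents=True, exist_ok=True)
    shutil.copytree(source, target)
    return target


if __name__ == "__main__":
    print(seed_next_stage("show-o-training-stage1", "show-o-training-stage2"))
```

Remember that the new output folder still has to be set in the yaml config of the next stage; the script only prepares the directory.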
Stage 2 - Pre-training on Image-Text dataset. The default dataloader is based on WebDataset. Change the data path in configs/showo_pretraining_stage2.yaml.
accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train.py config=configs/showo_pretraining_stage2.yaml
Stage 3 - Pre-training on High-quality Image-Text dataset. Change the data path in configs/showo_pretraining_stage3.yaml.
Copy the pre-trained weights to the output_dir (specified in the config):
├── show-o-training-stage3/
| └── checkpoint-0
accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train.py config=configs/showo_pretraining_stage3.yaml
[Option a] Stage 3 - Instruction tuning on LLaVA dataset (llava-pretrain). Change the data path in llava/llava_data_vq_unified.py.
accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train.py config=configs/showo_instruction_tuning_1.yaml
[Option a] Stage 3 - Instruction tuning on LLaVA dataset (llava-tuning). Change the data path in llava/llava_data_vq_unified.py.
accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train.py config=configs/showo_instruction_tuning_2.yaml
[Option c] Stage 3 - Instruction tuning on LLaVA dataset (llava-pretrain) with CLIP-ViT. Change the data path in llava/llava_pretrain_data.py.
accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train_w_clip_vit.py config=configs/showo_instruction_tuning_1_w_clip_vit.yaml
[Option c] Stage 3 - Instruction tuning on LLaVA dataset (llava-tuning) with CLIP-ViT. Change the data path in llava/llava_instuct_data.py.
accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train_w_clip_vit.py config=configs/showo_instruction_tuning_2_w_clip_vit.yaml
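All training commands above share one shape and differ only in the training script and the config file. As a convenience, the sketch below assembles the command line for a given stage. The stage labels used as dictionary keys are hypothetical names chosen for this example; the script and config paths are the ones listed in the commands above.

```python
import shlex
from typing import Dict, List, Tuple

# Hypothetical stage label -> (training script, config file), as listed above.
STAGES: Dict[str, Tuple[str, str]] = {
    "stage1": ("training/train.py", "configs/showo_pretraining_stage1.yaml"),
    "stage2": ("training/train.py", "configs/showo_pretraining_stage2.yaml"),
    "stage3": ("training/train.py", "configs/showo_pretraining_stage3.yaml"),
    "tuning1": ("training/train.py", "configs/showo_instruction_tuning_1.yaml"),
    "tuning2": ("training/train.py", "configs/showo_instruction_tuning_2.yaml"),
    "tuning1_clip": ("training/train_w_clip_vit.py",
                     "configs/showo_instruction_tuning_1_w_clip_vit.yaml"),
    "tuning2_clip": ("training/train_w_clip_vit.py",
                     "configs/showo_instruction_tuning_2_w_clip_vit.yaml"),
}


def build_launch_command(stage: str, accelerate_config: str, port: int = 8888) -> List[str]:
    """Return the accelerate launch command for a stage as an argument list."""
    if stage not in STAGES:
        raise KeyError(f"unknown stage {stage!r}; choose from {sorted(STAGES)}")
    script, config = STAGES[stage]
    return [
        "accelerate", "launch",
        "--config_file", accelerate_config,
        f"--main_process_port={port}",
        script,
        f"config={config}",
    ]


if __name__ == "__main__":
    # Print the commands; run them manually or pass the list to subprocess.run.
    for name in STAGES:
        print(shlex.join(build_launch_command(name, "accelerate_configs/1_gpu.yaml")))
```

Printing rather than executing keeps the sketch side-effect free; the checkpoint hand-off between stages still has to happen before each launch.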
We welcome your brave new ideas and contributions! If you would like to see any new features in Show-o, or you want to contribute to this project, please fill in this form!
Pending Requested Features
- Mixed-modal generation
- Support training on more datasets
- Visual tokenizer training
Find more at Contributing and Roadmap.
You are welcome to discuss with us and help continuously improve the user experience of Show-o. Reach us through this Discord channel or the WeChat QR code below!
To cite the paper and model, please use the BibTeX entry below:
@article{xie2024showo,
title={Show-o: One Single Transformer to Unify Multimodal Understanding and Generation},
author={Xie, Jinheng and Mao, Weijia and Bai, Zechen and Zhang, David Junhao and Wang, Weihao and Lin, Kevin Qinghong and Gu, Yuchao and Chen, Zhijie and Yang, Zhenheng and Shou, Mike Zheng},
journal={arXiv preprint arXiv:2408.12528},
year={2024}
}
This work is heavily based on open-muse, Phi-1.5, muse-maskgit-pytorch, maskgit, taming-transformers, transformers, accelerate, diffusers, and webdataset. Thanks to all the authors for their great work.