Yunjie Tian*1, Tianren Ma*1, Lingxi Xie2, Jihao Qiu1, Xi Tang1, Yuan Zhang1, Jianbin Jiao1, Qi Tian2, Qixiang Ye1
1 University of Chinese Academy of Sciences, 2 HUAWEI Inc.
Paper: arXiv 2401.13307
In this study, we establish a baseline for a new task named multimodal multi-round referring and grounding (MRG), opening up a promising direction for instance-level multimodal dialogues. We present a new benchmark and an efficient vision-language model for this purpose. The new benchmark, named CB-300K, spans challenges including multi-round dialogue, complex spatial relationships among multiple instances, and consistent reasoning, which are beyond those shown in existing benchmarks. The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks. By tokenizing instance regions, the language branch acquires the ability to perceive referential information. Meanwhile, ChatterBox feeds a query embedding in the vision branch to a token receiver for visual grounding. A two-stage optimization strategy is devised, making use of both CB-300K and auxiliary external data to improve the model's stability and capacity for instance-level understanding. Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with complicated and precise interactions.
Figure: The architecture of the ChatterBox model.
Key Contributions:
- CB-300K - We establish the CB-300K benchmark to facilitate research in multi-round referring and grounding.
- ChatterBox Model - We build the ChatterBox model with a dual-branch architecture to solve the multi-round referring and grounding problem.
- Clone this repository and navigate to the ChatterBox folder
git clone https://github.com/sunsmarterjie/ChatterBox
cd ChatterBox
- Install Packages
conda create -n chatterbox python=3.11.5
conda activate chatterbox
pip install --upgrade pip # enable PEP 660 support
pip install -r requirements.txt
pip install deepspeed==0.11.1
unzip mmcv-1.4.7.zip
cd mmcv-1.4.7/
MMCV_WITH_OPS=1 pip install -e .
cd ../model/GroundingDINO/ops
python setup.py build install
We build the visual branch of ChatterBox using GroundingDINO and DINO; only the GroundingDINO version is provided for now.
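After the install steps above, a quick import check can confirm that the environment is usable. The snippet below is a minimal sketch and not part of the repository: the helper name `check_imports` is hypothetical, and the module list assumes that `torch` is installed through `requirements.txt`.

```python
import importlib.util


def check_imports(module_names):
    """Return a dict mapping each top-level module name to True if it is installed."""
    status = {}
    for name in module_names:
        # find_spec returns None when a top-level module cannot be located.
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    # Assumed module names; adjust the list to match your environment.
    for name, ok in check_imports(["torch", "deepspeed", "mmcv"]).items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

The check only verifies that each module can be located; it does not confirm that the compiled ops from `MMCV_WITH_OPS=1` or the GroundingDINO build work on your GPU.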
- Prepare datasets/models:
Download CB-300K, VG, COCO2017, COCO2014, RefCOCO, RefCOCO+, RefCOCOg, Flickr30K, OpenSource, clip-vit-large-patch14, LLaVA-Instruct-150K, llava-llama-2-13b, CB-materials, and groundingdino_swinb, then arrange them as follows:
├── datasets
│   ├── CB-300K
│   │   ├── CB-MRG
│   │   ├── CB-LC
│   │   └── ...
│   ├── VG
│   │   ├── VG_100K
│   │   ├── VG_100K_2
│   │   └── ...
│   ├── MSCOCO2017
│   │   ├── train2017
│   │   └── ...
│   ├── MSCOCO2014
│   │   ├── train2014
│   │   └── ...
│   ├── Flickr30K
│   │   ├── flickr30k-images
│   │   └── ...
│   ├── llava_instruct_150k.json
│   ├── CB_materials
│   ├── CB-refcoco-GND
│   ├── CB-coco-GND
│   ├── CB-refcoco-REF
│   └── ...
├── clip-vit-large-patch14
│   ├── config.json
│   └── ...
├── llava-llama-2-13b-chat-lightning-preview
│   ├── config.json
│   └── ...
├── OpenSource
│   ├── finetune_refcoco_train.json
│   ├── finetune_refcoco+_train.json
│   └── ...
└── groundingdino_swinb_cogcoor.pth
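Before training, it can help to verify that the downloaded files match the layout above. The following is a minimal sketch under stated assumptions: the helper `missing_paths` and the `EXPECTED` list are hypothetical (not part of the repository), the list covers only entries named explicitly in the tree, and the root directory is an assumption because the tree does not say where it is rooted.

```python
from pathlib import Path

# Entries copied from the layout above; elided items ("...") are not guessed.
EXPECTED = [
    "datasets/CB-300K/CB-MRG",
    "datasets/CB-300K/CB-LC",
    "datasets/VG/VG_100K",
    "datasets/VG/VG_100K_2",
    "datasets/MSCOCO2017/train2017",
    "datasets/MSCOCO2014/train2014",
    "datasets/Flickr30K/flickr30k-images",
    "datasets/llava_instruct_150k.json",
    "datasets/CB_materials",
    "clip-vit-large-patch14/config.json",
    "llava-llama-2-13b-chat-lightning-preview/config.json",
    "OpenSource/finetune_refcoco_train.json",
    "OpenSource/finetune_refcoco+_train.json",
    "groundingdino_swinb_cogcoor.pth",
]


def missing_paths(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Assumes the current directory is the root of the layout.
    missing = missing_paths(".")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All listed entries are present.")
```

The check tests only for existence, so a partially downloaded dataset would still pass.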
- Train ChatterBox on 8xA800 GPUs (80GB).
python startup_stage1.py # stage1
python startup_stage2.py # stage2
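The two stages can also be chained so that stage 2 starts only if stage 1 exits successfully. This is a minimal sketch, not the authors' launcher: `run_stages` is a hypothetical helper, and it assumes both startup scripts run without extra arguments, as in the commands above.

```python
import subprocess
import sys


def run_stages(commands):
    """Run each command in order and stop at the first non-zero exit code."""
    for cmd in commands:
        print("Running:", " ".join(cmd))
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Skip the remaining stages and report the failing exit code.
            return result.returncode
    return 0


# Example usage (assumes the scripts are in the current directory):
# exit_code = run_stages([
#     [sys.executable, "startup_stage1.py"],
#     [sys.executable, "startup_stage2.py"],
# ])
```

Using `sys.executable` keeps both stages on the interpreter of the active conda environment.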
See evaluation for details.
Coming soon
If this project has been helpful or if you've used our dataset, please cite:
@article{tian2024chatterbox,
title={ChatterBox: Multi-round Multimodal Referring and Grounding},
author={Tian, Yunjie and Ma, Tianren and Xie, Lingxi and Qiu, Jihao and Tang, Xi and Zhang, Yuan and Jiao, Jianbin and Tian, Qi and Ye, Qixiang},
journal={arXiv preprint arXiv:2401.13307},
year={2024}
}
This project is based on LLaVA, LISA, and GPT4RoI. We thank the authors for their excellent work.