We will release our benchmark code soon.
- [16/06/2024] 📄 Our paper has been released on arXiv!
- Updates & News
- Contents
- Dataset: GUI-World
- GUI-Vid: A GUI-Oriented VideoLLM
- Contribution
- Acknowledgments
- Citation
GUI-World introduces a comprehensive benchmark for evaluating MLLMs in dynamic and complex GUI environments. It features extensive annotations covering six GUI scenarios and eight types of GUI-oriented questions. The dataset assesses state-of-the-art ImageLLMs and VideoLLMs, highlighting their limitations in handling dynamic and multi-step tasks. It provides valuable insights and a foundation for future research in enhancing the understanding and interaction capabilities of MLLMs with dynamic GUI content. This dataset aims to advance the development of robust GUI agents capable of perceiving and interacting with both static and dynamic GUI elements.
GUI-World is split into a train set and a test set, both of which can be accessed from Hugging Face.
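Fetching the dataset can also be scripted. The following is a minimal sketch, assuming the dataset is hosted as a regular Hugging Face dataset repository and that the huggingface_hub package is installed. The helper name download_gui_world and the default target directory are illustrative and not part of this repository. The repository ID is deliberately left as a placeholder, so copy the real one from the dataset page.

```python
from pathlib import Path


def download_gui_world(repo_id: str, local_dir: str = "data/GUI-World") -> Path:
    """Download a dataset repository from the Hugging Face Hub into local_dir.

    repo_id is NOT given in this README; pass the ID shown on the dataset page.
    """
    # Reject the unfilled placeholder and anything that is not 'owner/name'.
    if repo_id.count("/") != 1 or "<" in repo_id:
        raise ValueError("repo_id must look like 'owner/name'; copy it from the dataset page")

    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=local_dir)
    return Path(path)


if __name__ == "__main__":
    try:
        # Placeholder ID: replace it before running.
        print(download_gui_world("<owner>/<dataset-name>"))
    except ValueError as err:
        print("Set a real repository ID first:", err)
```

The returned path can then be used as the root directory that the fine-tuning configuration expects.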
GUI-Vid is a VideoLLM fine-tuned from Videochat2. You can reproduce our experimental results by following the instructions below.
Prepare the Environment
git clone https://github.com/Dongping-Chen/GUI-World.git
cd GUI-World/GUI_Vid
conda create -n gui python=3.9
conda activate gui
pip install -r requirements.txt
GUI-Oriented Finetuning
- Download GUI-World and modify the root path in GUI_Vid/configs/instruction_data.py so that it points to the root directory of your downloaded GUI-World.
- Set vit_blip_model_path, llama_model_path, and videochat2_model_path in scripts/videochat_vicuna/config_7b_stage3.py. These checkpoints can be downloaded from GUI-Vid.
# Vicuna
bash scripts/videochat_vicuna/run_7b_stage3.sh
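A mistyped checkpoint path typically surfaces only once the training script starts loading models. The following is a minimal pre-flight sketch, not part of the GUI-World repository, that takes the three path settings named above as a plain dictionary and reports which ones do not exist on disk. The helper name find_missing_checkpoints and the example paths are hypothetical. How config_7b_stage3.py actually stores these values is not shown here, so adapt the dictionary accordingly.

```python
from pathlib import Path
from typing import Dict, List

# Setting names come from the README; everything else is illustrative.
REQUIRED_KEYS = ("vit_blip_model_path", "llama_model_path", "videochat2_model_path")


def find_missing_checkpoints(paths: Dict[str, str]) -> List[str]:
    """Return the required keys that are unset or point to a nonexistent path."""
    missing = []
    for key in REQUIRED_KEYS:
        value = paths.get(key, "")
        if not value or not Path(value).expanduser().exists():
            missing.append(key)
    return missing


if __name__ == "__main__":
    # Hypothetical locations; replace with where you saved the checkpoints.
    example = {
        "vit_blip_model_path": "/path/to/vit_blip_checkpoint",
        "llama_model_path": "/path/to/llama_checkpoint",
        "videochat2_model_path": "/path/to/videochat2_checkpoint",
    }
    problems = find_missing_checkpoints(example)
    if problems:
        print("Missing or invalid paths:", ", ".join(problems))
    else:
        print("All checkpoint paths exist.")
```

Running a check like this before the fine-tuning script gives a clear message about which setting needs attention.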
Inference with GUI-Vid
First, download the checkpoint from Hugging Face. You also need to set the config following the guidance in Videochat2.
Then set model_path in scripts/demo_local.py. Use the following command to run inference on a GUI video:
python demo_local.py \
--ckpt_path <path to GUI-Vid> \
--keyframe 8 \
--video_path <path to your video> \
--qs <your query>
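To run the demo over several videos or queries, the command above can be wrapped in a small Python helper. This is a sketch under stated assumptions: it uses only the four flags shown above, assumes it is launched from the directory containing demo_local.py, and the function names build_demo_command and run_demo are illustrative rather than part of the repository.

```python
import shlex
import subprocess
import sys
from typing import List


def build_demo_command(ckpt_path: str, video_path: str, query: str, keyframe: int = 8) -> List[str]:
    """Assemble the demo_local.py invocation using the flags listed in the README."""
    if keyframe <= 0:
        raise ValueError("keyframe must be a positive integer")
    return [
        sys.executable, "demo_local.py",
        "--ckpt_path", ckpt_path,
        "--keyframe", str(keyframe),
        "--video_path", video_path,
        "--qs", query,
    ]


def run_demo(ckpt_path: str, video_path: str, query: str, keyframe: int = 8) -> int:
    """Run the demo as a subprocess and return its exit code."""
    cmd = build_demo_command(ckpt_path, video_path, query, keyframe)
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Placeholder arguments; substitute your own checkpoint, video, and question.
    print(shlex.join(build_demo_command("<path to GUI-Vid>", "<path to your video>", "<your query>")))
```

Passing the arguments as a list avoids shell quoting problems when the query contains spaces or punctuation.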
Contributions to this project are welcome. Please consider the following ways to contribute:
- Proposing new features or improvements
- Benchmarking other mainstream MLLMs
Many thanks to Yinuo Liu, Zhengyan Fu, Shilin Zhang, Yu, Tianhe Gu, Haokuan Yuan, and Junqi Wang for their invaluable efforts on this project. This project is based on the methodologies and code presented in Videochat2.
@misc{chen2024guiworld,
title={GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents},
author={Dongping Chen and Yue Huang and Siyuan Wu and Jingyu Tang and Liuyi Chen and Yilin Bai and Zhigang He and Chenlong Wang and Huichi Zhou and Yiqiang Li and Tianshuo Zhou and Yue Yu and Chujie Gao and Qihui Zhang and Yi Gui and Zhen Li and Yao Wan and Pan Zhou and Jianfeng Gao and Lichao Sun},
year={2024},
eprint={2406.10819},
archivePrefix={arXiv},
}