GUI-World: A Dataset for GUI-Oriented Multimodal Large Language Models


Updates & News

We will release our benchmark code soon.

  • [16/06/2024] 📄 Paper released on arXiv!

Contents

Dataset: GUI-World

Overview

GUI-World introduces a comprehensive benchmark for evaluating MLLMs in dynamic and complex GUI environments. It features extensive annotations covering six GUI scenarios and eight types of GUI-oriented questions. The dataset assesses state-of-the-art ImageLLMs and VideoLLMs, highlighting their limitations in handling dynamic and multi-step tasks. It provides valuable insights and a foundation for future research in enhancing the understanding and interaction capabilities of MLLMs with dynamic GUI content. This dataset aims to advance the development of robust GUI agents capable of perceiving and interacting with both static and dynamic GUI elements.

How to use GUI-World

GUI-World is split into a train and a test set, both of which can be accessed from Hugging Face.
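The splits can be loaded with the Hugging Face datasets library. Below is a minimal sketch; the Hub ID and split names are assumptions, so check the dataset page for the exact identifier:

from datasets import load_dataset

# Hypothetical Hub ID -- verify against the GUI-World dataset page on Hugging Face.
dataset = load_dataset("Dongping-Chen/GUI-World")

train_set = dataset["train"]  # assumed split name
test_set = dataset["test"]    # assumed split name

print(train_set)     # inspect features and number of examples
print(train_set[0])  # look at a single annotated GUI example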

GUI-Vid: A GUI-Oriented VideoLLM

GUI-Vid is a VideoLLM finetuned from Videochat2. You can reproduce our experimental results by following these instructions.

Prepare the Environment

git clone https://github.com/Dongping-Chen/GUI-World.git
cd GUI-World/GUI_Vid
conda create -n gui python=3.9
conda activate gui
pip install -r requirements.txt

GUI-Oriented Finetuning

  • Download [GUI-World] and modify the root path in GUI_Vid/configs/instruction_data.py so that it points to the directory where you downloaded GUI-World.
  • Set vit_blip_model_path, llama_model_path, and videochat2_model_path in scripts/videochat_vicuna/config_7b_stage3.py; these checkpoints can be downloaded from GUI-Vid. A sketch of these edits follows the command below.
# Vicuna
bash scripts/videochat_vicuna/run_7b_stage3.sh
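For reference, the config edits described above look roughly like the sketch below. The paths are placeholders and the data-root variable name is an assumption based on Videochat2-style configs; verify the exact field names against your checkout:

# GUI_Vid/configs/instruction_data.py -- point the data root at your GUI-World download
# (variable name is an assumption; the path is a placeholder)
data_root = "/path/to/GUI-World"

# scripts/videochat_vicuna/config_7b_stage3.py -- checkpoints downloaded from GUI-Vid
# (filenames are placeholders)
vit_blip_model_path = "/path/to/vit_blip_checkpoint.pth"
llama_model_path = "/path/to/vicuna_7b"
videochat2_model_path = "/path/to/videochat2_checkpoint.pth"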

Inference with GUI-Vid

First, download the checkpoint from Hugging Face. You also need to set the config according to the guidance in Videochat2. Then, set the model_path in scripts/demo_local.py. Use the following script to run inference on a GUI video:

python demo_local.py \
--ckpt_path <path to GUI-Vid> \
--keyframe 8 \
--video_path <path to your video> \
--qs <your query> 
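If you want to run inference over many videos, the documented command can be wrapped in a small driver script. The sketch below simply shells out to demo_local.py with the flags shown above; the paths and query are placeholders:

# Sketch: batch inference by invoking the demo_local.py CLI once per video.
# Paths and the query are placeholders; the flags match the command shown above.
import subprocess
from pathlib import Path

CKPT_PATH = "/path/to/GUI-Vid"           # checkpoint downloaded from Hugging Face
VIDEO_DIR = Path("/path/to/gui_videos")  # directory of GUI screen recordings
QUERY = "Describe the user's actions in this video."

for video in sorted(VIDEO_DIR.glob("*.mp4")):
    subprocess.run(
        [
            "python", "demo_local.py",
            "--ckpt_path", CKPT_PATH,
            "--keyframe", "8",
            "--video_path", str(video),
            "--qs", QUERY,
        ],
        check=True,
    )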

Contribution

Contributions to this project are welcome. Please consider the following ways to contribute:

  • Proposing new features or improvements
  • Benchmarking other mainstream MLLMs

Acknowledgments

Many thanks to Yinuo Liu, Zhengyan Fu, Shilin Zhang, Yu, Tianhe Gu, Haokuan Yuan, and Junqi Wang for their invaluable effort on this project. This project is based on methodologies and code presented in Videochat2.

Citation

@misc{chen2024guiworld,
      title={GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents}, 
      author={Dongping Chen and Yue Huang and Siyuan Wu and Jingyu Tang and Liuyi Chen and Yilin Bai and Zhigang He and Chenlong Wang and Huichi Zhou and Yiqiang Li and Tianshuo Zhou and Yue Yu and Chujie Gao and Qihui Zhang and Yi Gui and Zhen Li and Yao Wan and Pan Zhou and Jianfeng Gao and Lichao Sun},
      year={2024},
      eprint={2406.10819},
      archivePrefix={arXiv},
}

About

The Official Code Repository for GUI-World.
