AgentStudio is a trinity of environments, tools, and benchmarks for general virtual agents to interact with any computer software. AgentStudio targets the desiderata for robust, general, and open-ended virtual agents by providing:
- A lightweight, interactive environment with highly generic observation and action spaces, e.g., video observations and GUI/API actions
- Tools for creating online benchmark tasks, annotating GUI elements, and labeling actions in videos
- Online benchmark tasks that evaluate both GUI interactions and function calling with auto-evaluation and language feedback
- Three benchmark datasets: GroundUI, IDMBench, and CriticBench, for fundamental agent abilities, including GUI grounding, learning from videos, and success detection
[Figure: comparisons with existing work]
News:
- Oct 3, 2024: Released the arXiv paper v2 and a full version of AgentStudio, including comprehensive documentation, complete tasks, and datasets!!
- Aug 18, 2024: Major update to clean up the codebase and datasets.
- Mar 30, 2024: Released the beta version of AgentStudio.
Install requirements:
apt-get install gnome-screenshot xclip xdotool # If using Ubuntu 22.04
conda create --name agent-studio python=3.11 -y
conda activate agent-studio
pip install -e '.[client]'
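After installing, it can be worth confirming that the system tools named in the Ubuntu step are actually on the PATH. The following is a minimal sketch and not part of AgentStudio: the helper name `missing_tools` is hypothetical, and it only checks that the executables can be found, not that they work correctly.

```python
import shutil

# System tools listed in the Ubuntu install step above.
REQUIRED_TOOLS = ("gnome-screenshot", "xclip", "xdotool")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("All required system tools were found.")
```

On platforms other than Ubuntu these tools may not apply, so a non-empty result there does not necessarily indicate a problem.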
All confidential API keys (e.g., OpenAI, Claude, and Gemini API keys) should be stored in agent_studio/config/api_key.json. An example config is provided in agent_studio/config/api_key_template.json.
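A quick way to check the key file before running anything is to load it and look for empty entries. The sketch below is illustrative only: it assumes the file is a flat JSON object mapping key names to strings (the actual field names are defined by api_key_template.json and are not reproduced here), and `load_api_keys` is a hypothetical helper rather than an AgentStudio function.

```python
import json
from pathlib import Path

# Path taken from the text above, relative to the repository root.
API_KEY_PATH = Path("agent_studio/config/api_key.json")


def load_api_keys(path=API_KEY_PATH):
    """Load the key file and return (keys, names_of_empty_entries).

    Assumes a flat JSON object; the real schema may differ.
    """
    with open(path, encoding="utf-8") as f:
        keys = json.load(f)
    if not isinstance(keys, dict):
        raise ValueError(f"{path} should contain a JSON object")
    # Entries left blank (e.g. unchanged from a template) are reported.
    empty = [name for name, value in keys.items() if not value]
    return keys, empty
```

A typical workflow would presumably be to copy the template to api_key.json, fill in the keys you need, and call `load_api_keys()` to see which entries are still blank.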
AgentStudio provides highly generic observation and action spaces, which significantly expand the task space and allow agents to be developed and evaluated in real-world settings. We introduce a benchmark suite of 205 tasks, spanning API usage (e.g., terminal and Gmail) and GUI software (e.g., VS Code). More details are provided in eval_online_benchmarks/README.md. The task-related files are available at our project page.
To gain deeper insights into agent capabilities beyond the overall performance measured by online benchmark tasks, we develop three datasets using AgentStudio: GroundUI, IDMBench, and CriticBench. These datasets target general UI grounding, learning from videos, and success detection. More details are provided in eval_agent_desiderata/README.md. All data are available at our project page.
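To make the idea of a UI grounding check concrete, the sketch below counts a predicted click point as correct when it falls inside the target element's bounding box. This is an assumed, simplified metric for illustration and not necessarily the exact scoring used by GroundUI; the function names and the `(left, top, right, bottom)` box format are hypothetical. See eval_agent_desiderata/README.md for the actual evaluation.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]              # (x, y) in pixels
Box = Tuple[float, float, float, float]  # (left, top, right, bottom), assumed format


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box (edges included)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Fraction of predicted points that land inside their target box."""
    if len(predictions) != len(boxes):
        raise ValueError("predictions and boxes must have the same length")
    if not boxes:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, boxes))
    return hits / len(boxes)
```

For example, `grounding_accuracy([(5, 5), (50, 50)], [(0, 0, 10, 10), (0, 0, 10, 10)])` scores one hit out of two predictions.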
To facilitate the development and evaluation of agents within the AgentStudio environment, we provide three tools for:
- Benchmark task creation and validation
- Step-level GUI element annotation
- Trajectory-level video-action recording and refinement
These tools, combined with the realistic environment of AgentStudio, help generate rich, structured data for training and evaluating agents. Please refer to docs/annotate_ground_ui.md for the GUI annotation tool, agent_studio/recorder/README.md for the video-action annotation tool, and eval_online_benchmarks/README.md for task creation and validation.
Contributions and feedback from everyone on how to make this into a better tool are more than welcome. Please check out CONTRIBUTING.md for how to get involved.
We would like to thank the following projects for their inspiration and contributions to the open-source community: Open Interpreter, WebArena, Cradle, Synapse, SeeClick, ScreenAgent, OSWorld, etc.
If you find AgentStudio useful, please cite our paper:
@article{zheng2024agentstudio,
title={AgentStudio: A Toolkit for Building General Virtual Agents},
author={Longtao Zheng and Zhiyuan Huang and Zhenghai Xue and Xinrun Wang and Bo An and Shuicheng Yan},
journal={arXiv preprint arXiv:2403.17918},
year={2024}
}