✨✨Latest Papers and Datasets on Mobile and PC GUI Agent

aialt/awesome-mobile-agents

Recent Trends in Multimodal Mobile Agents: A Survey

Static Datasets and Benchmarks

| Dataset | Templates | Attach | Task | Reward | Platform |
|---|---|---|---|---|---|
| RICOSCA | 259k | - | Grounding | - | Android |
| ANDROIDHOWTO | 10k | - | Extraction | - | Android |
| PixelHelp | 187 | - | Apps | - | Android |
| Screen2Words | 112k | XML | Summarization | - | Android |
| META-GUI | 1,125 | - | Apps+Web | - | Android |
| MoTIF | 4,707 | - | Apps | - | Android |
| UGIF | 4,184 | XML | Grounding | - | Android |
| AitW | 1000k | - | Apps+Web | - | Android |
| AitZ | 2,504 | - | Apps+Web | - | Android |
| AMEX | 3k | XML | Apps+Web | - | Android |
| Ferret-UI | 120k | - | Apps | - | iOS |
| GUI-World | 12k | - | Apps+Web | - | Multi Platforms |
| Mobile3M | 3M | - | Apps | - | Android |
| Odyssey | 7,735 | - | Apps+Web | - | Multi Platforms |
| AndroidControl | 15,283 | - | Apps+Web | - | Android |
| ScreenSpot | - | - | Apps+Web | - | Multi Platforms |
| MobileViews-600K | 600k | - | Apps | - | Android |

Interactive Datasets and Benchmarks

| Dataset | Templates | Attach | Task | Reward | Platform |
|---|---|---|---|---|---|
| MiniWoB++ | 114 | - | Web (synthetic) | Sparse Rewards | - |
| AndroidEnv | 100 | - | Apps | Sparse Rewards | Android |
| AppBuddy | 35 | - | Apps | Sparse Rewards | Android |
| Mobile-Env | 224 | XML | Apps+Web | Dense Rewards | Android |
| AndroidArena | 221 | XML | Apps+Web | Sparse Rewards | Android |
| AndroidWorld | 116 | - | Apps+Web | Sparse Rewards | Android |
| DroidTask | 158 | XML | Apps+Web | - | Android |
| B-MoCA | 60 | XML | Apps+Web | - | Android |
| Mobile-Bench | 832 | XML | Apps+Web | - | Android |
| MobileAgentBench | 100 | - | Apps+Web | Dense Rewards | Android |
| SPA-BENCH | 340 | - | Apps+Web | Dense Rewards | Android |
| CRAB | 23 | - | Apps+Web | - | Android + Linux |

Comparison of datasets and benchmarks by number of templates, attached information, task, reward, and supported platform. Reward mechanisms fall into two categories: Sparse Rewards and Dense Rewards. Sparse Rewards are given only when the agent reaches a specific goal or completes the task, which makes learning harder because there is no immediate feedback. Dense Rewards provide feedback after each step or action, which helps the agent learn the correct strategy more quickly.
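The difference between the two reward categories can be illustrated with a toy episode. The sketch below is not taken from any benchmark listed here. It assumes a hypothetical task split into `num_subgoals` ordered subgoals, with `progress[t]` holding the number of subgoals completed after step `t`. The function names `sparse_rewards` and `dense_rewards` are illustrative only.

```python
from typing import List, Sequence


def sparse_rewards(progress: Sequence[int], num_subgoals: int) -> List[float]:
    """Return 1.0 at the first step where the whole task is complete, else 0.0."""
    rewards: List[float] = []
    finished = False
    for completed in progress:
        if not finished and completed >= num_subgoals:
            rewards.append(1.0)
            finished = True
        else:
            rewards.append(0.0)
    return rewards


def dense_rewards(progress: Sequence[int], num_subgoals: int) -> List[float]:
    """Return a per-step reward proportional to newly completed subgoals."""
    rewards: List[float] = []
    previous = 0
    for completed in progress:
        rewards.append((completed - previous) / num_subgoals)
        previous = completed
    return rewards


if __name__ == "__main__":
    # Hypothetical 5-step episode on a task with 3 subgoals.
    episode = [0, 1, 1, 2, 3]
    print("sparse:", sparse_rewards(episode, 3))
    print("dense: ", dense_rewards(episode, 3))
```

In this toy setup the sparse signal is zero until the final step, while the dense signal is non-zero at every step that makes progress. Both sum to the same total return for a completed episode.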

For datasets and environments that target general operating systems, see the General OS Systems section in the Appendix.

Mobile Agents

| Method | Input Type | Model | Training | Memory | Multi-agents |
|---|---|---|---|---|---|
| **Prompt-based Methods** | | | | | |
| ResponsibleTA (Zhang et al., 2023c) | Image&Text | GPT-4 | None | | |
| DroidGPT (Wen et al., 2023b) | Text | ChatGPT | None | | |
| AppAgent (Yang et al., 2023) | Image&Text | GPT-4 | None | | |
| MobileAgent (Wang et al., 2024b) | Image&Text | GPT-4 | None | | |
| MobileAgent v2 (Wang et al., 2024a) | Image&Text | GPT-4 | None | | |
| AutoDroid (Wen et al., 2024) | Image&Text | GPT-4 | None | | |
| AppAgent V2 (Li et al., 2024) | Image&Text | GPT-4 | None | | |
| VLUI (Lee et al., 2024) | Image&Text | GPT-4 | None | | |
| **Training-based Methods** | | | | | |
| MiniWob (Liu et al., 2018) | Image | DOMNET | RL-based | | |
| MetaGUI (Sun et al., 2022) | Image&Text | VLM | Pre-trained | | |
| CogAgent (Hong et al., 2023) | Image&Text | CogVLM | Pre-trained | | |
| AutoGUI (Zhang and Zhang, 2023) | Image&Text | MMT5 | Finetune | | |
| ResponsibleTA (Zhang et al., 2023c) | Image&Text | VLM | Finetune | | |
| UI-VLM (Dorka et al., 2024) | Image&Text | LLaMA | Finetune | | |
| Coco-Agent (Ma et al., 2024) | Image&Text | MMT5 | Finetune | | |
| DigiRL (Bai et al., 2024) | Image&Text | MMT5 | RL-based | | |
| SphAgent (Chai et al., 2024) | Image&Text | VLM | Finetune | | |
| Octopus v2 (Chen and Li, 2024) | Text | Gemma | Finetune | | |
| Octo-planner (Chen et al., 2024c) | Text | Gemma | Finetune | | |
| MobileVLM (Wu et al., 2024) | Image&Text | Qwen-VL | Finetune | | |
| OdysseyAgent (Lu et al., 2024) | Image&Text | Qwen-VL | Finetune | | |

Comparison of Mobile Agents: A Detailed Overview of Input Types, Models, Training Methods, Memory Capabilities, and Multi-agent Support.

Base Model

Prompt Based Framework

LLM-SFT Based Framework

LLM-RL Based Framework

UI Understanding and Automation

Dataset and Benchmark

2017

2022

2023

2024

Web & PC & OS Dataset

Web & PC & OS Framework

To Do List

  • Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms
  • OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
  • Inferring Alt-text For UI Icons With Large Language Models During App Development
  • TinyClick: Single-Turn Agent for Empowering GUI Automation

Citation

@article{wu2024foundations,
  title={Foundations and Recent Trends in Multimodal Mobile Agents: A Survey},
  author={Wu, Biao and Li, Yanda and Fang, Meng and Song, Zirui and Zhang, Zhiwei and Wei, Yunchao and Chen, Ling},
  journal={arXiv preprint arXiv:2411.02006},
  year={2024}
}

Appendix

General OS Systems

| Dataset | Templates | Attach | Task | Reward | Platform |
|---|---|---|---|---|---|
| **Static Dataset** | | | | | |
| RICOSCA (Deka et al., 2017) | 259k | - | Grounding | - | Android |
| ANDROIDHOWTO (Deka et al., 2017) | 10k | - | Extraction | - | Android |
| PixelHelp (Li et al., 2020a) | 187 | - | Apps | - | Android |
| WebSRC (Chen et al., 2021) | 400k | HTML | Web | - | Windows |
| Screen2Words (Wang et al., 2021) | 112k | XML | Summarization | - | Android |
| META-GUI (Lee et al., 2021) | 1,125 | - | Apps+Web | - | Android |
| MoTIF (Wang et al., 2022) | 4,707 | - | Apps | - | Android |
| UGIF (Venkatesh et al., 2022) | 4,184 | XML | Grounding | - | Android |
| WebUI (Wu et al., 2023) | 400k | HTML | Web | - | Windows |
| Mind2Web (Deng et al., 2024) | 2,350 | HTML | Web | - | Windows |
| AitW (Rawles et al., 2024b) | 30k | - | Apps+Web | - | Android |
| AitZ (Zhang et al., 2024b) | 2,504 | - | Apps+Web | - | Android |
| AMEX (Chai et al., 2024) | 3k | XML | Apps+Web | - | Android |
| Ferret-UI (You et al., 2024) | 120k | HTML | Apps | - | Multi Platforms |
| OmniAct (Kapoor et al., 2024) | 9,802 | Org/Seg | Web | - | Windows |
| WebLINX (Roßner et al., 2020) | 2,337 | HTML | Web | - | Windows |
| ScreenAgent (Niu et al., 2024) | 3,005 | HTML | Web | - | Windows |
| GUI-World (Chen et al., 2024a) | 12k | - | Apps+Web | - | Multi Platforms |
| Mobile3M (Chen et al., 2024a) | 3M | - | Apps | - | Android |
| **Interactive Environment** | | | | | |
| MiniWoB++ (Liu et al., 2018) | 114 | - | Web (synthetic) | HTML/JS state | - |
| AndroidEnv (Toyama et al., 2021) | 100 | - | Apps | Device state | Android |
| WebShop (Yao et al., 2022a) | 12k | - | Web | Product Attrs Match | Windows |
| WebArena (Zhou et al., 2023) | 241 | HTML | Web | url/text-match | Windows |
| Mobile-Env (Zhang et al., 2023a) | 224 | XML | Apps+Web | Intermediate state | Android |
| VisualWebArena (Koh et al., 2024) | 314 | HTML | Web | url/text/image-match | Windows |
| Ferret-UI (You et al., 2024) | 314 | HTML | Web | url/text/image-match | Windows |
| AndroidArena (Wang et al., 2024c) | 221 | XML | Apps+Web | Device state | Android |
| AndroidWorld (Rawles et al., 2024a) | 116 | - | Apps+Web | Device state | Android |
| OSWorld (Xie et al., 2024) | 369 | - | Web | Device/Cloud state | Linux |
| DroidTask (Wen et al., 2024) | 158 | XML | Apps+Web | - | Android |

Comparison of static datasets and interactive environments for general OS systems by number of templates, attached information, task, reward, and supported platform.
