A curated list of papers and libraries on computer GUI use via LLMs.
Highly opinionated, with a focus on quality over quantity.
- Try computer use on your Mac in one click.
- WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning (Tsinghua U) (11/24)
- Anthropic Claude Computer Use API (Anthropic) (10/24)
- OmniParser for Pure Vision Based GUI Agent (code) (Microsoft) (08/24)
- ECLAIR: Enterprise sCaLe AI for woRkflows (code) (Stanford U) (05/24)
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (code) (HKU) (05/24)
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (code) (Apple) (04/24)
- SeeAct: GPT-4V(ision) is a Generalist Web Agent, if Grounded (code) (OSU) (01/24)
- CogAgent: A Visual Language Model for GUI Agents (ZhiPu) (12/23)
- AppAgent: Multimodal Agents as Smartphone Users (code) (Tencent) (12/23)
- SoM: Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V (code) (Microsoft) (10/23)