Refine ux again #150

Merged
merged 57 commits into from
Nov 25, 2024
a032854
yet another way to kill timedout jobs
recursix Nov 6, 2024
ac1a461
Improve timeout handling in task polling logic
recursix Nov 6, 2024
6384b85
Merge branch 'dev' into clean-pipeline
recursix Nov 6, 2024
b0594ab
Merge branch 'dev' into clean-pipeline
recursix Nov 7, 2024
e9fffc2
Merge branch 'dev' into clean-pipeline
recursix Nov 7, 2024
290b88d
Add method to override max_steps in Study class
recursix Nov 7, 2024
3f05803
add support for tab visibility in observation flags and update relate…
recursix Nov 8, 2024
2fe585f
fix tests
recursix Nov 8, 2024
4a8cbb2
black
recursix Nov 8, 2024
17fc3d1
Improve timeout handling in task polling logic
recursix Nov 6, 2024
1e07d3e
yet another way to kill timedout jobs (#108)
recursix Nov 6, 2024
63d8deb
Add method to override max_steps in Study class
recursix Nov 7, 2024
b88a058
add support for tab visibility in observation flags and update relate…
recursix Nov 8, 2024
e97d023
fix tests
recursix Nov 8, 2024
ccd7b8b
black
recursix Nov 8, 2024
1aa4916
black
gasse Nov 8, 2024
c990e76
recursix Nov 8, 2024
8de36e2
Fix sorting bug.
recursix Nov 8, 2024
c4e8acb
fix test
recursix Nov 8, 2024
c9f184c
black
recursix Nov 8, 2024
2ab4f3a
Merge branch 'fix-tabs' of github.com:ServiceNow/AgentLab into fix-tabs
recursix Nov 9, 2024
b465e63
Merge branch 'dev' into fix-gradio
recursix Nov 9, 2024
d0fcb39
Merge branch 'fix-tabs' into fix-gradio
recursix Nov 9, 2024
3a96d56
tmp
recursix Nov 9, 2024
2b4775b
Merge branch 'dev' into clean-pipeline
recursix Nov 13, 2024
a16aea0
add error report, add cum cost to summary and ray backend by default
recursix Nov 13, 2024
a18e8e5
displaying exp names in ray dashboard (#123)
ThibaultLSDC Nov 14, 2024
a7d6467
enabling chat o_0 (#124)
ThibaultLSDC Nov 15, 2024
bd12318
from previous
recursix Nov 15, 2024
6a50756
Merge branch 'dev' into Study-to-multi-eval
recursix Nov 15, 2024
50d4571
sequential studies
recursix Nov 15, 2024
d0919dc
little bug
recursix Nov 18, 2024
0e2b752
more flexible requirement
recursix Nov 18, 2024
041fd68
imrove readme
recursix Nov 18, 2024
f3c031d
Merge branch 'main' into refine-ux
recursix Nov 18, 2024
085ca51
Merge branch 'main' into refine-ux
recursix Nov 20, 2024
79ac418
Enhance agent configuration and logging in study setup
recursix Nov 22, 2024
654a8d7
Merge branch 'main' into refine-ux
recursix Nov 22, 2024
f4f9e25
get_text was added by mistake
recursix Nov 22, 2024
8677f48
Update README and Jupyter notebook with improved documentation and re…
recursix Nov 22, 2024
ab949e6
Update requirements to include Jupyter support for black
recursix Nov 22, 2024
2d079a9
Merge branch 'main' into refine-ux
recursix Nov 22, 2024
c244b4e
Update README.md
gasse Nov 22, 2024
adc13a4
Fix formatting and improve clarity in README.md
recursix Nov 23, 2024
4183f6b
Fix formatting and improve clarity in README.md
recursix Nov 23, 2024
6273e34
Update README.md to enhance visuals and improve navigation
recursix Nov 23, 2024
561951b
Add badges to README.md for PyPI, GitHub stars, and CI status
recursix Nov 23, 2024
450e0ba
Add video demonstration to AgentXray section in README.md
recursix Nov 23, 2024
73c8193
test video
recursix Nov 23, 2024
e288f78
xray video test
recursix Nov 23, 2024
8472a89
Update AgentXray section in README.md with new asset link
recursix Nov 23, 2024
026da43
minor
recursix Nov 23, 2024
5d976c6
fix setup link ... again
recursix Nov 23, 2024
d6229ac
remove upper case letter before getting the benchmark
recursix Nov 23, 2024
e5769c4
minor
recursix Nov 23, 2024
d1b7efa
Update ReproducibilityAgent link in README.md for better accessibility
recursix Nov 23, 2024
a5aa28b
Merge branch 'main' into refine-ux
recursix Nov 25, 2024
51 changes: 30 additions & 21 deletions README.md
@@ -1,5 +1,6 @@



<a href="https://github.com/user-attachments/assets/c2bc0b80-89da-4afb-9120-2feb018df19d"> <img
src="https://github.com/user-attachments/assets/c2bc0b80-89da-4afb-9120-2feb018df19d" width="800"
/> </a>
@@ -13,30 +14,40 @@
[🤖 Make Your Own Agent](#-implement-a-new-agent) &nbsp;&nbsp;|&nbsp;&nbsp;
[↻ Reproducibility](#-reproducibility) &nbsp;&nbsp;|&nbsp;&nbsp;

[![pypi](https://badge.fury.io/py/agentlab.svg)](https://pypi.org/project/agentlab/)
[![PyPI - License](https://img.shields.io/pypi/l/agentlab?style=flat-square)]([https://opensource.org/licenses/MIT](http://www.apache.org/licenses/LICENSE-2.0))
[![PyPI - Downloads](https://img.shields.io/pypi/dm/agentlab?style=flat-square)](https://pypistats.org/packages/agentlab)
[![GitHub star chart](https://img.shields.io/github/stars/ServiceNow/AgentLab?style=flat-square)](https://star-history.com/#ServiceNow/AgentLab)
[![Code Format](https://github.com/ServiceNow/AgentLab/actions/workflows/code_format.yml/badge.svg)](https://github.com/ServiceNow/AgentLab/actions/workflows/code_format.yml)
[![Tests](https://github.com/ServiceNow/AgentLab/actions/workflows/unit_tests.yml/badge.svg)](https://github.com/ServiceNow/AgentLab/actions/workflows/unit_tests.yml)


[🛠️ Setup](#%EF%B8%8F-setup-agentlab) &nbsp;&nbsp;|&nbsp;&nbsp;
[🤖 Assistant](#-ui-assistant) &nbsp;&nbsp;|&nbsp;&nbsp;
[🚀 Launch Experiments](#-launch-experiments) &nbsp;&nbsp;|&nbsp;&nbsp;
[🔍 Analyse Results](#-analyse-results) &nbsp;&nbsp;|&nbsp;&nbsp;
&nbsp;&nbsp;|&nbsp;&nbsp;
[🤖 Build Your Agent](#-implement-a-new-agent) &nbsp;&nbsp;|&nbsp;&nbsp;
[↻ Reproducibility](#-reproducibility)


<video controls style="max-width: 700px;">
<source src="https://github.com/ServiceNow/BrowserGym/assets/26232819/e0bfc788-cc8e-44f1-b8c3-0d1114108b85" type="video/mp4">
Your browser does not support the video tag.
</video>

https://github.com/ServiceNow/BrowserGym/assets/26232819/e0bfc788-cc8e-44f1-b8c3-0d1114108b85

AgentLab is a framework for developing and evaluating agents on a variety of
[benchmarks](#🎯-supported-benchmarks) supported by
[benchmarks](#-supported-benchmarks) supported by
[BrowserGym](https://github.com/ServiceNow/BrowserGym).

AgentLab Features:
* Easy large scale parallel [agent experiments](#🚀-launch-experiments) using [ray](https://www.ray.io/)
* Easy large scale parallel [agent experiments](#-launch-experiments) using [ray](https://www.ray.io/)
* Building blocks for making agents over BrowserGym
* Unified LLM API for OpenRouter, OpenAI, Azure, or self hosted using TGI.
* Prefered way for running benchmarks like WebArena
* Various [reproducibility features](#reproducibility-features)
* Unified LeaderBoard (soon)

## 🎯 Supported Benchmarks

| Benchmark | Setup <br> Link | # Task <br> Template| Seed <br> Diversity | Max <br> Step | Multi-tab | Hosted Method | BrowserGym <br> Leaderboard |
|-----------|------------|---------|----------------|-----------|-----------|---------------|----------------------|
| [WebArena](https://webarena.dev/) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/webarena/README.md) | 812 | None | 30 | yes | self hosted (docker) | soon |
@@ -45,10 +56,11 @@ AgentLab Features:
| [WorkArena](https://github.com/ServiceNow/WorkArena) L3 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 341 | High | 50 | no | demo instance | soon |
| [WebLinx](https://mcgill-nlp.github.io/weblinx/) | - | 31586 | None | 1 | no | self hosted (dataset) | soon |
| [VisualWebArena](https://github.com/web-arena-x/visualwebarena) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/visualwebarena/README.md) | 910 | None | 30 | yes | self hosted (docker) | soon |
| [Assistant Bench](https://assistantbench.github.io/) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/assistantbench/README.md) | 214 | None | 30 | yes | live web | soon |
| [AssistantBench](https://assistantbench.github.io/) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/assistantbench/README.md) | 214 | None | 30 | yes | live web | soon |
| [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard) (soon) | - | - | None | - | - | live web | soon |
| [Mind2Web-live](https://huggingface.co/datasets/iMeanAI/Mind2Web-Live) (soon) | - | - | None | - | - | live web | soon |
| [MiniWoB](https://miniwob.farama.org/index.html) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/miniwob/README.md) | 125 | Medium | 10 | no | self hosted (static files) | soon |

## 🛠️ Setup

```bash
@@ -61,7 +73,7 @@ playwright install
```

Make sure to prepare the required benchmark according to instructions provided in the [setup
column](#🎯-supported-benchmarks).
column](#-supported-benchmarks).

```bash
export AGENTLAB_EXP_ROOT=<root directory of experiment results> # defaults to $HOME/agentlab_results
@@ -86,6 +98,7 @@ export AZURE_OPENAI_ENDPOINT=<your endpoint> # if using azure models
</details>

## 🤖 UI-Assistant

Use an assistant to work for you (at your own cost and risk).

```bash
@@ -178,23 +191,15 @@ result_df = inspect_results.load_result_df("path/to/your/study")


### AgentXray
Inspect the behaviour of your agent using xray. You can load previous or ongoing experiments. The refresh mechanism is currently a bit clunky, but you can refresh the page, refresh the experiment directory list and select again your experiment to see an updated version of your currently running experiments.

https://github.com/user-attachments/assets/06c4dac0-b78f-45b7-9405-003da4af6b37

In a terminal, execute:
```bash
agentlab-xray
```

**⚠️ Note**: Gradio is still in developement and unexpected behavior have been frequently noticed. Version 5.5 seems to work properly so far. If you're not sure that the proper information is displaying, refresh the page and select your experiment again.


<video controls style="max-width: 800px;">
<source src="https://github.com/user-attachments/assets/06c4dac0-b78f-45b7-9405-003da4af6b37" type="video/mp4">
Your browser does not support the video tag.
</video>


You will be able to select the recent experiments in the directory `AGENTLAB_EXP_ROOT` and visualize
You can load previous or ongoing experiments in the directory `AGENTLAB_EXP_ROOT` and visualize
the results in a gradio interface.

In the following order, select:
@@ -206,14 +211,18 @@ In the following order, select:
Once this is selected, you can see the trace of your agent on the given task. Click on the profiling
image to select a step and observe the action taken by the agent.


**⚠️ Note**: Gradio is still in developement and unexpected behavior have been frequently noticed. Version 5.5 seems to work properly so far. If you're not sure that the proper information is displaying, refresh the page and select your experiment again.


## 🤖 Implement a new Agent

Get inspiration from the `MostBasicAgent` in
[agentlab/agents/most_basic_agent/most_basic_agent.py](src/agentlab/agents/most_basic_agent/most_basic_agent.py).
For a better integration with the tools, make sure to implement most functions in the
[AgentArgs](src/agentlab/agents/agent_args.py#L5) API and the extended `bgym.AbstractAgentArgs`.

If you think your agent should be included directly in AgenLab, let use know and it can be added in
If you think your agent should be included directly in AgenLab, let us know and it can be added in
agentlab/agents/ with the name of your agent.

## ↻ Reproducibility
@@ -243,7 +252,7 @@ dynamic benchmarks.
* **Reproduced results in the leaderboard**. For agents that are repdocudibile, we encourage users
to try to reproduce the results and upload them to the leaderboard. There is a special column
containing information about all reproduced results of an agent on a benchmark.
* **ReproducibilityAgent**: You can run this agent on an existing study and it will try to re-run
* **ReproducibilityAgent**: [You can run this agent](src/agentlab/agents/generic_agent/reproducibility_agent.py) on an existing study and it will try to re-run
the same actions on the same task seeds. A vsiual diff of the two prompts will be displayed in the
AgentInfo HTML tab of AgentXray. You will be able to inspect on some tasks what kind of changes
between to two executions. **Note**: this is a beta feature and will need some adaptation for your
4 changes: 2 additions & 2 deletions src/agentlab/experiments/study.py
@@ -76,7 +76,7 @@ def make_study(
         agent_args = [agent_args]
 
     if isinstance(benchmark, str):
-        benchmark = bgym.DEFAULT_BENCHMARKS[benchmark]()
+        benchmark = bgym.DEFAULT_BENCHMARKS[benchmark.lower()]()
 
     if "webarena" in benchmark.name and len(agent_args) > 1:
         logger.warning(
Expand Down Expand Up @@ -220,7 +220,7 @@ def __post_init__(self):
         """Initialize the study. Set the uuid, and generate the exp_args_list."""
         self.uuid = uuid.uuid4()
         if isinstance(self.benchmark, str):
-            self.benchmark = bgym.DEFAULT_BENCHMARKS[self.benchmark]()
+            self.benchmark = bgym.DEFAULT_BENCHMARKS[self.benchmark.lower()]()
         if isinstance(self.dir, str):
             self.dir = Path(self.dir)
         self.make_exp_args_list()
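
Both hunks in `study.py` make the same change: a benchmark given as a string is lowercased before it is looked up in `bgym.DEFAULT_BENCHMARKS`, so that a name such as `"MiniWoB"` resolves to the same entry as `"miniwob"`. The sketch below illustrates that pattern in isolation. It does not import `bgym`; the registry, the `FakeBenchmark` class, and the `resolve_benchmark` helper are hypothetical stand-ins, and it assumes the real registry is keyed by lowercase names.

```python
from dataclasses import dataclass


@dataclass
class FakeBenchmark:
    """Hypothetical stand-in for a benchmark object exposing a `name`."""

    name: str


# Hypothetical registry mimicking the shape of bgym.DEFAULT_BENCHMARKS:
# lowercase keys mapped to zero-argument factories (an assumption).
DEFAULT_BENCHMARKS = {
    "miniwob": lambda: FakeBenchmark("miniwob"),
    "webarena": lambda: FakeBenchmark("webarena"),
}


def resolve_benchmark(benchmark):
    """Return a benchmark object, accepting either an object or a name.

    String names are lowercased before the lookup, mirroring the
    `.lower()` call added in this PR.
    """
    if isinstance(benchmark, str):
        benchmark = DEFAULT_BENCHMARKS[benchmark.lower()]()
    return benchmark


if __name__ == "__main__":
    print(resolve_benchmark("MiniWoB").name)
    print(resolve_benchmark("WEBARENA").name)
```

An unknown name still raises `KeyError`, as a plain dictionary lookup would; only the casing of the key is normalized.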