Help us build challenges! #3835
Comments
The section titled "others" should at least mention "planning", which is severely lacking at the moment. I bet all of us can come up with dozens of ai_settings files (objectives) where it fails to plan properly, fails to recognize dependencies between tasks, or fails to recognize that it has already completed a task and should proceed: #3593 (comment) For starters, we probably need to have challenges that involve:
(not getting into async/concurrency for now)
Yeah, this is a great suggestion @Boostrix, can I talk to you on Discord? My Discord is merwanehamadi.
Boostrix has a preference for staying off of Discord, based on my prior interactions with them.
@Boostrix can you think of a challenge we could build in order to test planning skills?
@Androbin has suggested a very nice memory challenge involving files that are read in the wrong order. More details coming soon hopefully.
For starters, I would consider a plan to be a multi-objective task with multiple dimensions of leeway and constraints that the agent needs to explore. So I suppose anything involving detangling dependencies (or lack thereof) should work to get this started. That would involve organizing steps but also coordinating between steps. In #3593, @dschonholtz is juggling some nice ideas. And given the current state of things, it might actually be good to have working/non-working examples of plans for the agent, so that we can see which approach(es) look promising. And in fact, GPT itself seems pretty good at coming up with potential candidates:
PS: I would suggest extending the logging feature so that we can optionally log resource utilization to CSV files for these challenges - that way, we can trivially plot the performance of different challenges over time (different versions of Auto-GPT) - and in conjunction with support for #3466 (constraint awareness), we could also log how many steps [thinking] were needed for each challenge/benchmark (plan/task). EDIT: To promote this effort a little more, I would suggest adding a few of these issues to the MOTD displayed when starting Auto-GPT, so that more users are encouraged to get involved - we could randomly alternate between a handful of relevant issues and invite folks to participate.
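A minimal sketch of the CSV logging idea, assuming one row per challenge run. The `ChallengeRun` fields, the challenge name, and the numbers are hypothetical placeholders for illustration, not an existing Auto-GPT logging API.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class ChallengeRun:
    """One benchmark result row (hypothetical schema)."""
    challenge: str
    version: str
    steps: int
    tokens: int
    seconds: float


def append_runs(stream, runs, write_header=False):
    """Write runs as CSV rows to any text stream (file or StringIO)."""
    writer = csv.DictWriter(stream, fieldnames=[f.name for f in fields(ChallengeRun)])
    if write_header:
        writer.writeheader()
    for run in runs:
        writer.writerow(asdict(run))


# Example with made-up values; a real run would open a file in append mode.
buffer = io.StringIO()
append_runs(buffer, [ChallengeRun("memory_a", "0.3.0", 12, 4100, 35.2)], write_header=True)
print(buffer.getvalue())
```

Keeping the version in every row means results from different Auto-GPT versions can share one file and be plotted per challenge over time.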
One of the most impressive examples posted around here is this one by @adam-paterson
@Boostrix great suggestions! How do you measure success in a very deterministic way for these items? The example of regression you gave (entire website) is great! I guess we could use Selenium to test the website. It looks like a great project.
I was thinking of using a simple test case for starters, one where we ask the "planner" to coordinate a meeting between N different people (say 3-4) who are constrained by availability (date/time) - later on, N can be increased, with more constraints added, such as certain people never being available on the same date. So, the input for the unit test will be date/time for now (for each participant). To verify that the solution is valid, we merely need to execute the unit test for each participant, which will tell us if that participant is available - that way, we're reducing the problem to running tests. Once we have that fleshed out/working for 3-4 participants, it would make sense to add more complexity by adding more options and constraints, including dependencies between these options and constraints. We could then adapt this framework for other, more complex examples (see above).
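The per-participant check described above could be sketched roughly as follows. The participant names and availability windows are invented for illustration; the agent's proposed slot would be the value under test.

```python
from datetime import datetime

# Hypothetical availability: participant -> list of (start, end) windows.
AVAILABILITY = {
    "alice": [(datetime(2023, 5, 8, 9), datetime(2023, 5, 8, 12))],
    "bob": [(datetime(2023, 5, 8, 10), datetime(2023, 5, 8, 15))],
    "carol": [(datetime(2023, 5, 8, 11), datetime(2023, 5, 8, 13))],
}


def is_available(person, start, end, availability=AVAILABILITY):
    """True if the meeting fits entirely inside one of the person's windows."""
    return any(s <= start and end <= e for s, e in availability[person])


def unavailable_participants(start, end, availability=AVAILABILITY):
    """Return everyone who cannot attend; an empty list means the plan is valid."""
    return [p for p in availability if not is_available(p, start, end, availability)]


# A slot the planner might propose (11:00-12:00 on 2023-05-08).
proposal = (datetime(2023, 5, 8, 11), datetime(2023, 5, 8, 12))
print(unavailable_participants(*proposal))
```

Extra constraints (for example, two people who must never meet on the same date) would become additional predicates checked the same way.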
For API costs/token (budget) we have several open PRs, number of steps taken is part of at least 2 PRs that I am aware of.
I didn't even think about using Selenium. I was thinking of treating the final HTML like any XHTML/XML document and querying it from Python to see if it's got the relevant tags and attributes as requested by the specs. Personally, I find the whole Selenium stuff mainly useful for highly dynamic websites; static HTML can probably be queried "as is" from Python - no need for Selenium? That being said, we could also use @adam-paterson's example and add an outer agent to mutate his ai_settings / project.txt file to tinker with 3-5 variations of each step (file names, technology stack, functionality etc) - that way, you end up with a ton of regression tests at the mere cost of a few nested loops. The "deliverables" portion of his specs is so succinct that we could probably use it "as is" to create a corresponding Python unit test via Auto-GPT (I have used this for several toy projects, and it's working nicely).
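As a rough sketch of querying static HTML without Selenium, the standard library's `html.parser` is enough to check for required tags and attributes. The sample page and the `has_tag` helper are made up for illustration; real specs would supply the tags to look for.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []  # list of (tag, attrs dict)

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))


def has_tag(html, tag, **required_attrs):
    """True if the document contains `tag` with all the given attribute values."""
    collector = TagCollector()
    collector.feed(html)
    return any(
        seen_tag == tag
        and all(attrs.get(k) == v for k, v in required_attrs.items())
        for seen_tag, attrs in collector.seen
    )


# Hypothetical deliverable produced by the agent.
page = '<html><body><form action="/contact"><input name="email"></form></body></html>'
print(has_tag(page, "form", action="/contact"))
```

Each line of a "deliverables" spec could then map to one such assertion in a pytest test.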
FWIW, I can get it to bail out rather quickly by letting an outer agent mutate the following directive: list of X open source software packages in the area of {audio, video, office, games, flight simulation}, [released under the {GPL, BSD ...}], [working on {Windows, Mac OSX, Linux}]. Which is pretty interesting once you think about it, since this is the sort of stuff that LLMs like GPT are generally good at - and in fact, GPT can answer any of these easily, just not in combination with Auto-GPT. It seems like some sort of "intra-llm-regression" due to the interaction with the AI agent mechanism.
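The "outer agent" mutation can be approximated with a plain Cartesian product over the placeholder values. This is only a sketch: the license list is elided in the directive above, so only the two named licenses are used, and the count of 5 is an arbitrary stand-in for X.

```python
from itertools import product

# Template reconstructed from the directive above; wording is an approximation.
TEMPLATE = (
    "list of {count} open source software packages in the area of {area}, "
    "released under the {license}, working on {platform}"
)

AREAS = ["audio", "video", "office", "games", "flight simulation"]
LICENSES = ["GPL", "BSD"]  # the original list is elided; only the named ones are used
PLATFORMS = ["Windows", "Mac OSX", "Linux"]


def directive_variants(count=5):
    """Return one directive string per (area, license, platform) combination."""
    return [
        TEMPLATE.format(count=count, area=area, license=lic, platform=platform)
        for area, lic, platform in product(AREAS, LICENSES, PLATFORMS)
    ]


variants = directive_variants()
print(len(variants), "variants, e.g.:", variants[0])
```

Each variant could be written into its own ai_settings file and run as a separate regression case.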
@Boostrix I would agree that our current prompting artificially limits GPT-4's abilities. The issue I see is that we actively discourage long-form chain-of-thought reasoning.
The idea to use dynamic prompting sounds rather promising:
Here's another good description by @hashratez of an intra-llm-regression that should be suitable to benchmark the system against GPT itself:
We should update the list of potential challenges to add a new category for "experiential tests" - i.e. tests where the agent should be able to apply its experience of previously completing a task, but fails to do so. The most straightforward example I can think of is it editing/updating a file successfully and 5 minutes later wanting to use interactive editors like nano, vim etc - that's a recurring topic here. So we should add a benchmark to see how good an agent is at applying experience to future tasks. A simple test case would be telling it to update a file with some hand holding, and afterwards leaving out the hand holding and counting how many times it succeeds (e.g. 3/10 attempts). Being able to apply past experience is an essential part of learning. Here's another suggestion for a new category: "Tool use". Likewise, after disabling download_file it should be able to figure out how to use python or the shell to download stuff. There are often several ways to "skin a cat" - when disabling git operations or the shell, the agent must be capable of figuring out alternatives, and the number of steps it needs to do so tells us just how effective the agent is. From a pytest standpoint, we would ideally be able to disable some BIFs and then run a test to see how many steps the test needs to complete - if, over time, that number increases, the agent is performing worse. We can also ask the agent to perform tasks for which it does not have any BIFs at all, such as doing mathematical calculations #3412, and then count the number of steps it needs to come up with a solution using different options (python or shell).
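One way to turn "the number of steps should not increase over time" into a check is to compare step counts against a stored baseline. The challenge names and step counts below are invented; how the counts are collected from an agent run is left open.

```python
def find_regressions(baseline, current, tolerance=0):
    """Return {challenge: (old_steps, new_steps)} for challenges that got slower.

    A challenge counts as regressed when it needs more than
    `tolerance` extra steps compared to the baseline.
    """
    return {
        name: (baseline[name], steps)
        for name, steps in current.items()
        if name in baseline and steps > baseline[name] + tolerance
    }


# Hypothetical step counts with certain commands (BIFs) disabled.
baseline = {"download_without_download_file": 4, "math_without_bif": 3}
current = {"download_without_download_file": 7, "math_without_bif": 3}

print(find_regressions(baseline, current))
```

A pytest test could simply assert that the returned dict is empty, with `tolerance` absorbing run-to-run noise.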
CI pipeline 4 times faster thanks to @AndresCdo and parallelized tests!
Suggestion for new challenge type:
This may involve tinkering with different variable/argument substitutions: #3904 (comment) Basically, we need to keep track of commands that previously worked/failed and types of arguments/params that are known to work - including some optional timeout option to retry a command once in a while. Also, commands really should get access to the error/exception string - because at that point, the LLM can actually help us!
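A minimal sketch of such bookkeeping, assuming commands are identified by name plus keyword arguments. The `CommandHistory` class, its method names, and the example command call are hypothetical and not part of Auto-GPT.

```python
import time


class CommandHistory:
    """Remember which command/argument combinations worked or failed."""

    def __init__(self, retry_after=300.0, clock=time.monotonic):
        self.retry_after = retry_after  # seconds before a failed call may be retried
        self.clock = clock              # injectable so tests can control time
        self.successes = set()
        self.failures = {}              # key -> (error string, timestamp)

    @staticmethod
    def _key(command, args):
        return (command, tuple(sorted(args.items())))

    def record(self, command, args, error=None):
        """Store the outcome; keep the error string so it can be shown to the LLM."""
        key = self._key(command, args)
        if error is None:
            self.successes.add(key)
            self.failures.pop(key, None)
        else:
            self.failures[key] = (str(error), self.clock())

    def should_try(self, command, args):
        """False while a known-failing call is still inside its retry timeout."""
        failure = self.failures.get(self._key(command, args))
        return failure is None or self.clock() - failure[1] >= self.retry_after

    def last_error(self, command, args):
        failure = self.failures.get(self._key(command, args))
        return failure[0] if failure else None


# Example with a fake clock so the timeout is easy to follow.
now = [0.0]
history = CommandHistory(retry_after=60, clock=lambda: now[0])
history.record("browse_website", {"url": "http://exa mple.com"}, error="InvalidURL")
print(history.should_try("browse_website", {"url": "http://exa mple.com"}))
```

The stored error string is what would be fed back into the prompt so the LLM can propose different arguments.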
@Boostrix yeah that sounds useful. We just have to be careful about inserting plausible mistakes
@Boostrix could you create an issue for the 2 challenges you mentioned above and label it "challenge"?
Which ones exactly? Coming up with a challenge where we mutate a URL should be a pretty straightforward way to make the agent fail in a way it can still fix - using URL validation/patching. The most obvious example would be adding whitespace into the URL without escaping.
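For illustration, URL patching for the whitespace case could look roughly like this using only `urllib.parse`. The `patch_url` helper and the example URL are assumptions for the sketch, and it only handles unescaped characters in the path and query.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def patch_url(url):
    """Percent-encode unsafe characters (such as spaces) in the path and query.

    '%' is kept as safe so that already-escaped URLs are not escaped twice.
    """
    parts = urlsplit(url.strip())
    return urlunsplit(
        parts._replace(
            path=quote(parts.path, safe="/%"),
            query=quote(parts.query, safe="=&%"),
        )
    )


# A "mutated" URL of the kind the challenge would feed to the agent.
broken = "https://example.com/my docs/read me.txt"
print(patch_url(broken))
```

The challenge would pass if the agent arrives at an equivalent fix on its own after seeing the failure.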
All the ones you suggested. We don't have a next step for them, so can I just have the links to the issues where the challenge ideas are written, so I can put them in this epic?
New CI pipeline ready: now you can test challenges by creating a Pull Request.
thanks @AndresCdo for #3868
thank you @PortlandKyGuy for your work! #4261
thank you @erik-megarad #4469
thank you @dschonholtz @gravelBridge #4286
This issue has automatically been marked as stale because it has not had any activity in the last 50 days. You can unstale it by commenting or removing the label. Otherwise, this issue will be closed in 10 days.
This issue was closed automatically because it has been stale for 10 days with no activity.
Unless I am mistaken, this should not be closed or marked stale at all. I believe this remains relevant, or has something changed over the course of the last couple of months that I am missing entirely?
Summary 💡
Challenges are tasks Auto-GPT is not able to achieve. Challenges will help us improve Auto-GPT.
We need your help to build these challenges and the ecosystem around them.
Here is a breakdown of the help we need.
A-Challenge Creation
1-Submit challenges within existing categories.
Memory
[Challenge Creation] Memory Challenge C #3838 => @dschonholtz
Information Retrieval
[Challenge Creation] Information Retrieval Challenge B #3837 => @PortlandKyGuy
Research
Psychological
Psychological challenge: Sally and Anne's Test. #3871 => unassigned
Debug Code
Create Debug Code Challenge A #3836 => @gravelBridge
Adaptability
Website Navigation Challenge
Self Improvement (Solve challenges automatically)
Automated Challenge Creation
Basic Abilities
This model's maximum context length is 8191 tokens, however you requested 21485 tokens (21485 in your prompt; 0 for the completion). Please reduce your prompt; or completion length. #4233 create a BIG FILE challenge => unassigned
2-Design brand new challenge categories
DM me if interested (discord below)
Challenges Auto-GPT can already perform
We currently have what we call regression tests. You can imagine regression tests as challenges Auto-GPT is already able to perform.
UX experience around challenges
Improve logs/DEBUG folder to help people beat challenges more easily
The logs/DEBUG folder allows everyone to understand what Auto-GPT is doing at every cycle
We need to:
Log Self Feedback in logs/Debug folder #3842 => @AndresCdo done thank you!
Log User input in logs/Debug folder #3843 => @AndresCdo done thank you!
"Fix Auto-GPT" challenges!
The vision of the fix auto-gpt challenges is to give the tools for the community to report wrong behaviors and create challenges around them.
We need to:
Make it easy for people to change the prompt
Build the CI pipeline !
DM me on discord if you have devops experience and want to help us build the pipeline that will allow people to submit their challenges and see how they perform!
Pytest Improvements
Run Pytest Parallel mode to increase speed #3863 => @AndresCdo
Challenges refactorization
Challenges should not fail after X number of seconds, but only after X number of cycles #4161 => @merwanehamadi
Generating cassettes locally is not necessary #4189 => unassigned
CI pipeline improvements
Find a solution to make isort and black NOT slow down challenge contribution #4163 => canceled
We shouldn't have to merge 2 PRs when a cassette changes #4196 => @merwanehamadi
Allow CI pipeline to make calls to API providers => @merwanehamadi
VCR shouldn't record 500, 408, 429 when running challenges #4461 => @erik-megarad
CI pipeline cache dependencies #4258 => @merwanehamadi
Run the test suite every 4 hours on master, without using VCR #4590
#####################
(issue to create:
)
Discord (merwanehamadi)
Join Auto-GPT's discord channel: https://discord.gg/autogpt