
Added -accurate, reflective mouse click mode #57

Merged: 22 commits into OthersideAI:main on Dec 3, 2023

Conversation

@klxu03 (Contributor) commented Dec 2, 2023

Implemented -accurate, reflective mouse click mode.

If you add the -accurate flag when running operate, it enables accurate mode. Now, whenever the model tries to click on anything, it sends one additional request to GPT, giving the model a chance to adjust its initial percentage guess.

I first extract the model's initial guess of where it tries to click, then take a screenshot of a smaller 200 x 200 pixel rectangle around that guess. I upsample the image by doubling its dimensions so it appears bigger to GPT (all done in capture_mini_screenshot_with_cursor), and then ask the model to refine its guess, giving it the previous image/message as context along with the previous X/Y coordinate guess, and letting it add or subtract minute percentages (accurate_mode_double_check).
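For anyone reading along, here's a simplified sketch of that crop-and-upsample step. This is illustrative only: the real code lives in capture_mini_screenshot_with_cursor, and the signature, clamping, and file names here are my assumptions, not the exact implementation.

```python
from PIL import Image

ACCURATE_PIXEL_COUNT = 200  # side length of the crop around the initial guess

def capture_mini_screenshot(screenshot_path, x_percent, y_percent,
                            crop_px=ACCURATE_PIXEL_COUNT):
    """Crop a small square around the model's guess and upsample it 2x."""
    screen = Image.open(screenshot_path)
    width, height = screen.size

    # Convert the model's percentage guess into pixel coordinates.
    cx = int(width * x_percent / 100)
    cy = int(height * y_percent / 100)

    # Crop a crop_px square centered on the guess, clamped to screen bounds.
    left = max(0, min(cx - crop_px // 2, width - crop_px))
    top = max(0, min(cy - crop_px // 2, height - crop_px))
    mini = screen.crop((left, top, left + crop_px, top + crop_px))

    # Double the dimensions so the crop is easier for GPT to read.
    mini = mini.resize((crop_px * 2, crop_px * 2), Image.LANCZOS)
    mini.save("screenshot_mini.png")

    # The offsets are needed to map a refined guess back to screen space.
    return left, top
```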

Locally, this has significantly improved clicking accuracy on my desktop setup (two monitors). Currently I've only implemented it for Linux, but if this approach is liked, it can easily be adapted to other OSes.

The idea could be improved further if add_grid_to_image were extended so that, when adding the grid to the mini screenshot in accurate mode, the first intersection read, for example, (-3%, -5%) instead of (25%, 25%), i.e. just the relative percentage change. That would probably make it easier for the model to add or subtract the proper amount in the refined click.
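Sketching what I mean (a hypothetical helper, not something add_grid_to_image computes today): each grid intersection of the mini screenshot would be labeled with its offset from the previous guess, as a percentage of the full screen.

```python
def relative_grid_label(frac_x, frac_y, crop_px, screen_w, screen_h):
    """Label for a grid intersection at (frac_x, frac_y) within the crop,
    expressed as a +/- percentage offset from the crop center (the
    previous guess) relative to the full screen."""
    dx_px = (frac_x - 0.5) * crop_px  # pixel offset from the previous guess
    dy_px = (frac_y - 0.5) * crop_px
    dx_pct = 100 * dx_px / screen_w   # as a percentage of the whole screen
    dy_pct = 100 * dy_px / screen_h
    return f"({dx_pct:+.1f}%, {dy_pct:+.1f}%)"

# On a 1920 x 1080 screen with a 200 px crop, the intersection a quarter of
# the way in would read roughly "(-2.6%, -4.6%)" instead of "(25%, 25%)".
print(relative_grid_label(0.25, 0.25, 200, 1920, 1080))
```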

Also, I chose 200 x 200 as the rectangle size almost arbitrarily. I noticed that on my desktop the model would often be off by more than 100 pixels but less than 200, so I picked that as the size.

I would be happy to improve my code or explain anything as needed!

PS: I also added Poetry support, but I can delete this and just add all of those files to .gitignore.

@joshbickett (Contributor) commented

@klxu03 thank you for this PR. It looks promising. I'll let you know if I have any questions!

@michaelhhogue added the enhancement (New feature or request) label on Dec 2, 2023
@michaelhhogue (Collaborator) commented

@klxu03 Using mss for screenshots on Linux appears to be a much better solution, especially for just getting the active monitor.

@klxu03 (Contributor, Author) commented Dec 2, 2023

> @klxu03 Using mss for screenshots on Linux appears to be a much better solution, especially for just getting the active monitor.

Yeah, earlier I did just use mss, but then I noticed the clicking percentage system was configured globally, so the two didn't align. But if you think this is worth exploring, it probably wouldn't be bad to do a conversion (like scaling GPT's output: divide by two and add 50% to the percentage).
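For concreteness, the conversion I have in mind for two equal-width side-by-side monitors would look something like this (a hypothetical helper, not code in this PR):

```python
def monitor_to_global_x(x_percent, monitor_index, num_monitors=2):
    """Map a percentage within one monitor to a percentage across the
    combined desktop, assuming equal-width side-by-side monitors."""
    return (monitor_index * 100 + x_percent) / num_monitors

# 60% across the right monitor (index 1) of two monitors becomes 80%
# globally, i.e. divide by two and add 50:
print(monitor_to_global_x(60, 1))  # 80.0
```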

@michaelhhogue (Collaborator) commented

@klxu03 Ah okay I see the removal of mss now. mss is probably the best multi-platform screenshot solution and the easiest way to just get the active monitor. So looking into the conversion in order to adopt mss could definitely help!

@klxu03
Copy link
Contributor Author

klxu03 commented Dec 2, 2023

> @klxu03 Ah okay I see the removal of mss now. mss is probably the best multi-platform screenshot solution and the easiest way to just get the active monitor. So looking into the conversion in order to adopt mss could definitely help!

Yeah, makes sense. I'll probably try it another time (in a different PR). BTW, I didn't do much prompt engineering for -accurate, so definitely feel free to change it.

Curious: is my code readable enough that you can follow the logic of what is happening? Are there any glaring architectural/design choices you didn't like in this PR?

@joshbickett (Contributor) commented

@klxu03 Just reviewing this PR now; the project grew more than expected and it has been busy.

Code looks good at a high level, but my pip install . was breaking, maybe something to do with the new pyproject.toml. I honestly don't know Poetry well. We probably should be compatible with it, but my main concern is that this same thing would break for other users.

Here's a ChatGPT thread about it: https://chat.openai.com/share/994b3d20-5bc4-4954-8abe-f53ddabc90ca

I deleted the pyproject.toml and it runs for me now, so I'll test the -accurate mode. One thought: maybe we could move the Poetry support to another PR and keep this one just about accuracy. Anyway, I'll have additional input soon.

@joshbickett (Contributor) commented Dec 3, 2023

I like what I see so far.

The 200x200 mini screenshot may be a little small. It appears to me GPT-4V can roughly guess which of the four quadrants of the screen to move into, so it may make sense to make mini_screenshot.jpg the one of the four quadrants that is correct.

We could imagine a system where we loop over this function, breaking the screenshot into ever-smaller quadrants until we have just the right button to click... but maybe that's for another PR.

Anyway, going to keep reviewing the PR and will share more thoughts.

@klxu03 (Contributor, Author) commented Dec 3, 2023

> I like what I see so far.
>
> The 200x200 mini screenshot may be a little small. It appears to me GPT-4V can roughly guess which of the four quadrants of the screen to move into, so it may make sense to make mini_screenshot.jpg the one of the four quadrants that is correct.
>
> We could imagine a system where we loop over this function, breaking the screenshot into ever-smaller quadrants until we have just the right button to click... but maybe that's for another PR.
>
> Anyway, going to keep reviewing the PR and will share more thoughts.

I love that idea! Like a sniper slowly scoping in. So the idea is scoping into, say, 400 x 400, adjusting, then 200 x 200, adjusting, and then 100 x 100? I feel like -accurate could also take a number, the number of scopes of precision you'd like: -accurate = 3 would mean three layers of scoping.

I coded this mini-screenshot system so that the scoping amounts can all be variables; it only depends on ACCURATE_PIXEL_COUNT, which can easily be a parameter passed in. A rough sketch of the loop is below.
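Something like this (hypothetical; this PR only does a single refinement pass, and ask_model_to_refine is a stand-in for the GPT call, not a real function in the repo):

```python
def refine_click(screenshot_path, x_percent, y_percent, scopes=3,
                 start_px=400):
    """Repeatedly crop a shrinking window around the current guess and let
    the model nudge its coordinates, like a sniper scoping in."""
    crop_px = start_px
    for _ in range(scopes):
        # Crop and upsample a crop_px square around the guess (see the
        # capture_mini_screenshot sketch above).
        capture_mini_screenshot(screenshot_path, x_percent, y_percent,
                                crop_px=crop_px)
        # Placeholder for the GPT request that returns adjusted percentages.
        x_percent, y_percent = ask_model_to_refine(
            "screenshot_mini.png", x_percent, y_percent)
        crop_px //= 2  # 400 -> 200 -> 100, as discussed above
    return x_percent, y_percent
```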

Also, yeah, I will just delete Poetry. 100% not important.

@joshbickett (Contributor) commented Dec 3, 2023

Ok great. If you can push up your Poetry changes, I think this is ready to merge into the main project.

I think this is a good architecture. It doesn't always perform well for me, but I think we can iterate on it to improve performance.

-accurate = 3 meaning three layers of scoping sounds like a good approach. We can do this in a later PR.

@klxu03 (Contributor, Author) commented Dec 3, 2023

> Ok great. If you can push up your Poetry changes, I think this is ready to merge into the main project.
>
> I think this is a good architecture. It doesn't always perform well for me, but I think we can iterate on it to improve performance.
>
> -accurate = 3 meaning three layers of scoping sounds like a good approach. We can do this in a later PR.

Awesome, sounds good! I just cleaned up the repo. Going to do a quick round of testing to make sure it all works.

Update: it works

@joshbickett (Contributor) commented

Ok, looks good! I'm going to merge it. Can you create a new PR for one thing I noticed?

Can you add more "app prints" so it shows a log of the click logic? Something like this below. Does that make sense?

[Self-Operating Computer] [Act] CLICK
[Self-Operating Computer] [Act] CLICK REFLECTION

@joshbickett merged commit 51d9993 into OthersideAI:main on Dec 3, 2023
@klxu03 (Contributor, Author) commented Dec 3, 2023

> Ok, looks good! I'm going to merge it. Can you create a new PR for one thing I noticed?
>
> Can you add more "app prints" so it shows a log of the click logic? Something like this below. Does that make sense?
>
> [Self-Operating Computer] [Act] CLICK
> [Self-Operating Computer] [Act] CLICK REFLECTION

Yup! This makes sense

@joshbickett (Contributor) commented

Anyway, I think you get the vision. The ideas we've discussed are all great. If you want to iterate on what you've built so far, that'd be great!

@klxu03 (Contributor, Author) commented Dec 3, 2023

> Anyway, I think you get the vision. The ideas we've discussed are all great. If you want to iterate on what you've built so far, that'd be great!

Of course! Happy to contribute and improve :)

@joshbickett (Contributor) commented

Also, in the quick start section of README.MD, could you create an "additional features" section or something and add details about this -accurate flag?

@klxu03 (Contributor, Author) commented Dec 3, 2023

> Also, in the quick start section of README.MD, could you create an "additional features" section or something and add details about this -accurate flag?

For sure! I also just DMed you on Twitter an additional thought I had after this thread: later on, accuracy could be converted into a pure classification problem.

@michaelhhogue (Collaborator) commented

@klxu03 Just wanted to comment that I've tested accurate mode on Linux and it's working great. I'm noticing significant improvements already. However, it sometimes seems to prioritize the mini-screenshot over the whole screen, so it occasionally gets "stuck" in the 200 x 200 area of the screen where the previous guess was.

Tomorrow I'm going to look into refining the accurate mode vision prompt a bit to reduce how often it gets stuck in the 200 x 200 box.

Great work on this!
