Added -accurate, reflective mouse click mode #57
Conversation
@klxu03 thank you for this PR. It looks promising. I'll let you know if I have any questions!
@klxu03 Using mss for screenshots on Linux appears to be a much better solution, especially for just getting the active monitor.
Yeah, earlier I did use mss, but then I noticed the clicking percentage system was configured globally, so the two didn't align. If you think it's worth exploring, though, a conversion probably wouldn't be bad (e.g. scaling GPT's output: divide by two and add 50% to the percentage).
@klxu03 Ah okay, I see the removal of mss now. mss is probably the best multi-platform screenshot solution and the easiest way to just get the active monitor. So looking into the conversion in order to adopt mss could definitely help!
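The monitor-local to global-screen percentage conversion discussed above could look something like this minimal sketch. The function name and the monitor/virtual-screen dict shapes are assumptions for illustration, not the project's actual code; the geometry matches mss's convention where each monitor has a left/top offset within the combined virtual screen.

```python
def monitor_percent_to_global(x_pct, y_pct, monitor, virtual):
    """Convert a click expressed as percentages of one monitor into
    percentages of the combined virtual screen.
    monitor/virtual are dicts with left, top, width, height in pixels."""
    # Absolute pixel position of the click on the virtual screen.
    abs_x = monitor["left"] + (x_pct / 100.0) * monitor["width"]
    abs_y = monitor["top"] + (y_pct / 100.0) * monitor["height"]
    # Re-express that position as a percentage of the whole virtual screen.
    gx = (abs_x - virtual["left"]) / virtual["width"] * 100.0
    gy = (abs_y - virtual["top"]) / virtual["height"] * 100.0
    return gx, gy
```

For two equal 1920x1080 monitors side by side, a 50% guess on the right monitor maps to 75% of the virtual screen, which is exactly the "divide by two and add 50%" intuition above.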
Yeah, makes sense. I'll probably try it another time (in a diff PR). BTW, I didn't do much prompt engineering for -accurate, so definitely feel free to change it. Curious: is my code readable enough that you can follow the logic of what is happening? Are there any glaring architectural/design choices that you guys didn't like in this PR?
@klxu03 Just reviewing this PR now.. the project grew more than expected and it has been busy. Code looks good at a high level. Here's a ChatGPT thread about it: https://chat.openai.com/share/994b3d20-5bc4-4954-8abe-f53ddabc90ca
I like what I see so far. The 200 x 200 mini screenshot may be a little small. It appears to me GPT-4v can roughly guess which of the 4 quadrants of the screen to move into, so it may make sense to make the mini screenshot larger. We could imagine a system where we loop over this function, breaking the screenshot into ever-smaller quadrants until we have just the right button to click.. but maybe that's for another PR. Anyway, going to keep reviewing the PR and will share more thoughts.
I love that idea! Like a sniper slowly scoping in. So the idea is scoping into 400 x 400, adjusting, then 200 x 200, adjusting, and then 100 x 100? I feel like we could have -accurate also take a number: the number of scopes of precision you'd like, so -accurate = 3 means 3 layers of scoping. I coded this mini-screenshot system such that the scoping amounts can all be variables; it only depends on ACCURATE_PIXEL_COUNT, which can easily be a param passed in. Also, yeah, I will just delete poetry. 100% not important
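The multi-layer scoping idea sketched in this exchange could be structured roughly like the loop below. This is a hypothetical sketch, not the PR's code: `ask_model_to_refine` stands in for the real GPT-4v refinement call, and the halving schedule (400, 200, 100) is just the example discussed above.

```python
ACCURATE_PIXEL_COUNT = 200  # side length of the default refinement crop

def refine_click(initial_xy, layers=3, ask_model_to_refine=None):
    """Iteratively refine a click guess by 'scoping in' on ever-smaller
    crops around the current guess. ask_model_to_refine(x, y, size) is a
    stand-in for screenshotting a size x size box around (x, y), upsampling
    it, and asking the model for an adjusted coordinate."""
    x, y = initial_xy
    size = ACCURATE_PIXEL_COUNT * 2  # e.g. 400, then 200, then 100
    for _ in range(layers):
        x, y = ask_model_to_refine(x, y, size)
        size //= 2  # tighten the scope each layer
    return x, y
```

With `layers` exposed as the value of `-accurate = N`, each extra layer costs one more model round-trip but halves the search box.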
Ok great. If you can push up your poetry changes, I think this is ready to merge into the main project. I think this is a good architecture. It doesn't always perform well for me, but I think we can iterate on it to improve performance. -accurate = 3 meaning 3 layers of scoping sounds like a good approach. We can do this in a later PR.
Awesome, sounds good! I just cleaned up the repo. Going to do a quick round of testing to make sure it all works.
Update: it works
Ok, looks good! I'm going to merge it. Can you create a new PR for one thing I noticed? Can you add more "app prints" so it shows a log of click logic? Something like this below. Does that make sense?
Yup! This makes sense
Anyway, I think you get the vision. The ideas we've discussed are all great. If you want to iterate on what you've built so far that'd be great!!
of course! happy to contribute and improve :)
for sure! i just Twitter DMed you an additional thought I had after this thread: later on, fully converting accuracy into a pure classification problem!
@klxu03 Just wanted to comment that I've tested accurate mode on Linux and it's working great. I'm noticing significant improvements already. However, I've noticed that it sometimes seems to prioritize looking at the mini-screenshot over the whole screen, so it sometimes gets "stuck" in the 200 x 200 area of the screen where the previous guess was. Tomorrow I'm going to look into refining the accurate-mode vision prompt a bit to reduce how often it gets stuck in the 200 x 200 box. Great work on this!
Implemented -accurate, reflective mouse click mode.
If you add the -accurate flag when running operate, it will enable accurate mode. Now, whenever the model tries to click on anything, it sends an additional request to GPT, giving it a chance to adjust its initial percentage guesses.
I first extract the model's initial guess of where it tries to click, then take a screenshot of a smaller 200 x 200 pixel rectangle around that guess. I upsample the image by doubling its dimensions so it appears bigger to GPT (all done in capture_mini_screenshot_with_cursor). I then ask the model to refine its guess, giving it the previous image/message as context along with the previous X/Y coordinate guess it attempted, and a chance to add or subtract minute percentages (accurate_mode_double_check).
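The crop-and-remap geometry described above can be sketched as below. This is an illustrative outline under stated assumptions, not the PR's actual `capture_mini_screenshot_with_cursor` / `accurate_mode_double_check` code: the helper names are hypothetical, and the crop is clamped to the screen edges, which the description does not specify.

```python
ACCURATE_PIXEL_COUNT = 200  # side length of the mini-screenshot crop

def mini_screenshot_box(guess_x, guess_y, screen_w, screen_h,
                        size=ACCURATE_PIXEL_COUNT):
    """Return (left, top, right, bottom) of a size x size crop centred on
    the initial guess, clamped so it stays on screen. The cropped image
    would then be upsampled 2x before being sent back to the model."""
    half = size // 2
    left = max(0, min(guess_x - half, screen_w - size))
    top = max(0, min(guess_y - half, screen_h - size))
    return left, top, left + size, top + size

def local_to_screen_percent(local_x_pct, local_y_pct, box, screen_w, screen_h):
    """Map the model's refined percentage inside the crop back to
    whole-screen percentages."""
    left, top, right, bottom = box
    x = left + (local_x_pct / 100.0) * (right - left)
    y = top + (local_y_pct / 100.0) * (bottom - top)
    return x / screen_w * 100.0, y / screen_h * 100.0
```

Because the crop is upsampled by a fixed factor, percentages inside the upsampled image and the raw crop are identical, so only the crop box is needed to map back to the full screen.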
Locally, this configuration has significantly improved clicking accuracy on my desktop setup (two monitors). Currently I've only implemented it for Linux, but if this approach is well received it can easily be adapted to other OSes.
The idea could be further improved by changing add_grid_to_image so that, in the accurate-mode case of adding a grid to the mini screenshot, each intersection is labelled with the relative percentage change, e.g. (-3%, -5%), instead of the absolute (25%, 25%). This would probably make it easier for the model to add or subtract the proper amount in the refined click.
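The relative-label idea could be prototyped along these lines. This is a hedged sketch, not the project's add_grid_to_image: the function name, the grid spacing (`steps`), and the label format are all assumptions for illustration.

```python
def relative_grid_labels(box, guess_x, guess_y, screen_w, screen_h, steps=4):
    """Return (px, py, label) for each interior grid intersection of the
    mini-screenshot crop box, where label is the intersection's percentage
    offset from the initial guess rather than an absolute screen percentage."""
    left, top, right, bottom = box
    labels = []
    for i in range(1, steps):
        for j in range(1, steps):
            # Pixel position of this grid intersection inside the crop.
            px = left + (right - left) * i // steps
            py = top + (bottom - top) * j // steps
            # Signed offset from the original guess, in whole-screen percent.
            dx = (px - guess_x) / screen_w * 100.0
            dy = (py - guess_y) / screen_h * 100.0
            labels.append((px, py, f"({dx:+.1f}%, {dy:+.1f}%)"))
    return labels
```

A signed label like "(-3.0%, -5.0%)" reads directly as the adjustment the model should output, rather than forcing it to subtract two absolute percentages itself.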
Also, I chose 200 x 200 as the rectangle size almost arbitrarily: I noticed that on my desktop the model was often wrong by more than 100 pixels but less than 200, so I picked that as the size.
I would be happy to improve my code, or explain anything as needed!
PS: I also added poetry support. But I can delete this and just add all of those files to .gitignore