A zero-shot captcha solver built on top of CLIP, a multi-modal vision and language model, and hosted as a Streamlit application. It is zero-shot because it requires no labelled examples and works out of the box for arbitrary captcha images.
Given a captcha image like the one below and an object to look for ("chimney" in this case), the goal is to classify each individual image in the 3x3 grid as either a hit (it contains the object) or not.
Below is a list of steps that the application follows:
- Split the single image of the 3x3 grid into 9 individual images
- Compute an embedding vector for each of the 9 individual images, and a single embedding vector for the object to search for
- Compute the cosine similarity between the embedding of each of the 9 images and the text embedding
- Cluster the resulting similarity scores (one per image) into "match" or "no match":
  - Sort the similarity scores and the corresponding images in ascending order
  - Compute the difference between each element and the next one in this sorted list
  - Find the largest difference (gap) in similarity scores, then classify all images below the gap as "no match" and all images above it as "match"
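The steps above can be sketched in plain Python. This is a minimal illustration and not the code from the repository: the function names `split_grid`, `cosine_similarity` and `classify_by_largest_gap` are hypothetical, the image is assumed to be a 2D list of pixel rows, and the similarity scores in the usage example are made-up numbers standing in for real CLIP output.

```python
import math


def split_grid(pixels, rows=3, cols=3):
    """Split a 2D list of pixel rows into rows*cols tiles, in row-major order."""
    tile_h = len(pixels) // rows
    tile_w = len(pixels[0]) // cols
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tile = [
                row[c * tile_w:(c + 1) * tile_w]
                for row in pixels[r * tile_h:(r + 1) * tile_h]
            ]
            tiles.append(tile)
    return tiles


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_by_largest_gap(scores):
    """Return one boolean per score: True means "match".

    The scores are sorted in ascending order, the largest gap between
    neighbouring scores is located, and everything above that gap is
    labelled as a match.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to find a gap")
    # Indices of the scores, ordered from lowest to highest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    sorted_scores = [scores[i] for i in order]
    # Difference between each element and the next one.
    gaps = [
        sorted_scores[i + 1] - sorted_scores[i]
        for i in range(len(sorted_scores) - 1)
    ]
    # Position of the largest gap; elements after it are matches.
    split = max(range(len(gaps)), key=lambda i: gaps[i])
    is_match = [False] * len(scores)
    for i in order[split + 1:]:
        is_match[i] = True
    return is_match


# Made-up similarity scores for the 9 tiles (hypothetical values).
example_scores = [0.18, 0.31, 0.17, 0.19, 0.33, 0.16, 0.20, 0.30, 0.18]
example_labels = classify_by_largest_gap(example_scores)
print(example_labels)
# [False, True, False, False, True, False, False, True, False]
```

Splitting at the largest gap avoids a fixed similarity threshold, which would need tuning per object. One consequence of this approach is that at least one image is always labelled "match" and at least one "no match".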
Currently there are two demos.
I wrapped this project in a Streamlit application and hosted it on Streamlit's servers: https://zero-shot-captcha-solver.streamlit.app/
A notebook that walks through the code using one example captcha image is available at demo.ipynb.
There are two ways to install the required dependencies and run the code locally:
The first option uses Poetry and installs zero-shot-captcha-solver as a library within the Poetry virtual environment.
- Clone the repository
- Install poetry
- From the root of the repository, install zero-shot-captcha-solver by executing poetry install
The second option uses pip and won't install zero-shot-captcha-solver as a library.
- Clone the repository
- Install the (dev) requirements with pip install -r requirements(-dev).txt
Special thanks to:
- The whole Streamlit team for making it possible to host Streamlit applications for free on their infrastructure.
The code itself is licensed under the MIT License.