This project allows users to search for arXiv papers either by ID or abstract. The search functionality is powered by a machine learning embedding model and Milvus, a vector database. Gradio is used to create a user-friendly web interface for interaction.
See the live demo at papermatch.mitanshu.tech
For a full explanation, see the corresponding blog post: mitanshu.tech/posts/papermatch
- Search by Abstract: Convert the abstract into a vector using the `mixedbread-ai/mxbai-embed-large-v1` model and find similar papers based on cosine similarity.
- Search by ID: Retrieve information directly by arXiv ID.
- Top K Results: Display the top K results from Milvus based on similarity.
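To make the similarity step concrete, here is a minimal sketch of cosine-similarity top-K ranking in plain Python. In this project the ranking is performed by Milvus over stored embeddings, so the function names `cosine_similarity` and `top_k` and the toy 3-dimensional vectors are illustrative assumptions, not code from this repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k=3):
    """Return the k (paper_id, score) pairs most similar to `query`.

    `corpus` maps a paper id to its embedding vector.
    """
    scored = [(pid, cosine_similarity(query, vec)) for pid, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real ones have many more dimensions.
corpus = {
    "paper-a": [1.0, 0.0, 0.0],
    "paper-b": [0.9, 0.1, 0.0],
    "paper-c": [0.0, 1.0, 0.0],
}
results = top_k([1.0, 0.0, 0.0], corpus, k=2)
```

A vector database such as Milvus avoids this brute-force scan by using an index, but the ranking criterion is the same idea.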
- Python 3.7+
- Gradio
- Milvus
- `mixedbread-ai/mxbai-embed-large-v1` (or any compatible embedding model)
- Clone the repository:
    git clone https://github.com/mitanshu7/search_arxiv.git
    cd search_arxiv
- Create a virtual environment (optional but recommended):
    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
- Install the required packages:
    pip install -r requirements.txt
- Set up Milvus:
  - Follow the Milvus installation guide to get Milvus up and running.
  - Configure Milvus with your preferred settings.
  - Or use the `standalone_embed.sh` script in this repo, which has been made compatible with Fedora.
- Prepare Milvus:
    # Command to prepare Milvus
    python prepare_milvus.py
- Set up the API key: Get your key from Mixedbread and paste it into the `.env` file. See `.env.sample` for the configuration format.
- Run the Gradio app:
    python app.py
- Interact with the web interface:
  - Open your web browser and go to http://localhost:7860 to access the Gradio interface.
  - Use the search bar to input an arXiv ID or abstract and view the search results.
- Embedding Model: The embedding model used is `mixedbread-ai/mxbai-embed-large-v1`.
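As a rough illustration of how settings such as the API key in `.env` could be read, here is a small standard-library sketch. The helper names `load_env_file` and `get_setting` and the variable name `MXBAI_API_KEY` are hypothetical; check `.env.sample` for the names the app actually expects.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from `path` into a dict.

    Blank lines and lines starting with '#' are skipped, and quotes
    around a value are removed. A missing file yields an empty dict.
    """
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_setting(name, env_values, default=None):
    """Prefer a real environment variable, then the .env value, then default."""
    return os.environ.get(name, env_values.get(name, default))


env_values = load_env_file(".env")
# "MXBAI_API_KEY" is a hypothetical variable name used for illustration.
api_key = get_setting("MXBAI_API_KEY", env_values)
```

Letting a real environment variable override the file makes it possible to supply the key from a service manager or CI system without editing `.env`.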
Here is a basic example of how to use the search feature:
- Search by Abstract:
  - Enter the abstract of the paper in the provided text box.
  - The system will convert it to a vector, query Milvus, and return the most relevant papers.
- Search by ID:
  - Input an arXiv ID directly.
  - Retrieve and display the corresponding paper details.
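The two search modes above could be routed roughly as follows. This is a sketch under stated assumptions, not the code in `app.py`: the collection name `arxiv_abstracts` is hypothetical, the collection is assumed to use the arXiv ID string as its primary key, `client` is assumed to behave like `pymilvus.MilvusClient`, and `embed` stands for any callable that turns text into a list of floats.

```python
import re

# New-style arXiv IDs look like 2401.01234 (optionally with a version, e.g. v2);
# old-style IDs look like hep-th/9901001.
ARXIV_ID_PATTERN = re.compile(
    r"^(\d{4}\.\d{4,5}|[a-z\-]+(\.[A-Z]{2})?/\d{7})(v\d+)?$"
)

COLLECTION = "arxiv_abstracts"  # hypothetical collection name


def looks_like_arxiv_id(text):
    """Return True if `text` has the shape of an arXiv identifier."""
    return ARXIV_ID_PATTERN.match(text.strip()) is not None


def search(client, embed, query, k=5):
    """Route a query to an ID lookup or a vector similarity search.

    `client` is expected to behave like pymilvus.MilvusClient;
    `embed` is any callable that maps text to a list of floats.
    """
    query = query.strip()
    if looks_like_arxiv_id(query):
        # Direct lookup, assuming the arXiv ID is the primary key.
        return client.get(collection_name=COLLECTION, ids=[query])
    vector = embed(query)
    hits = client.search(collection_name=COLLECTION, data=[vector], limit=k)
    return hits[0]  # one result list per query vector
```

With a running Milvus server, `client` could be created with `MilvusClient(uri="http://localhost:19530")` from the `pymilvus` package, assuming the default port; `embed` would wrap whichever embedding model is configured.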
- Create the file `~/.config/systemd/user/search_arxiv.service` using `nano ~/.config/systemd/user/search_arxiv.service` with the following contents (assuming the user is `milvus` and the app runs in a conda environment named `search_arxiv`):
[Unit]
Description=Search ArXiv Web App
After=network.target
[Service]
WorkingDirectory=/home/milvus/search_arxiv/
ExecStart=/bin/bash -c "source /home/milvus/miniforge3/bin/activate search_arxiv && python app.py"
Restart=always
[Install]
WantedBy=default.target
- Run `systemctl --user daemon-reload` to reload systemd.
- Run `systemctl --user start search_arxiv.service` to start the app.
- Run `systemctl --user enable search_arxiv.service` to enable the app at startup.
- Automate embedding of new metadata each month.
- Learn how to update the database incrementally.
- Automate setting up of the app.
- Find more sources to integrate.
Feel free to contribute to the project by submitting issues, pull requests, or suggestions.
This project is licensed under the MIT License. See the LICENSE file for details.
For any questions or feedback, please contact mitanshu.sukhwani@gmail.com.