mitanshu7/PaperMatch

PaperMatch: arXiv Search with Embeddings and Milvus

This project allows users to search for arXiv papers either by ID or abstract. The search functionality is powered by a machine learning embedding model and Milvus, a vector database. Gradio is used to create a user-friendly web interface for interaction.

See the live demo at papermatch.mitanshu.tech

See full explanation at the corresponding blog post: mitanshu.tech/posts/papermatch

Features

  • Search by Abstract: Convert the abstract into a vector using the mixedbread-ai/mxbai-embed-large-v1 model and find similar papers based on cosine similarity.
  • Search by ID: Retrieve information directly by arXiv ID.
  • Top K Results: Display the top K results from Milvus based on similarity.
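The ranking idea behind the features above can be sketched in plain Python. This is only an illustration of cosine similarity and top-K selection on toy 3-dimensional vectors; in the app itself the vectors come from the embedding model and the search is done by Milvus. The names `cosine_similarity`, `top_k`, and the toy `corpus` are hypothetical and not taken from this repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k=3):
    """Return the k (paper_id, score) pairs most similar to the query."""
    scored = [(paper_id, cosine_similarity(query, vec))
              for paper_id, vec in corpus.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy stand-ins for abstract embeddings (real embeddings are much larger).
corpus = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [0.9, 0.1, 0.0],
}
results = top_k([1.0, 0.0, 0.0], corpus, k=2)
```

A vector database does the same ranking with an index, so it does not have to score every stored vector for each query.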

Requirements

  • Python 3.7+
  • Gradio
  • Milvus
  • mixedbread-ai/mxbai-embed-large-v1 (or any compatible embedding model)

Installation

  1. Clone the repository:

    git clone https://github.com/mitanshu7/search_arxiv.git
    cd search_arxiv
  2. Create a virtual environment (optional but recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  3. Install the required packages:

    pip install -r requirements.txt
  4. Set up Milvus:

    • Follow the Milvus installation guide to get Milvus up and running.
    • Configure Milvus with your preferred settings.
    • Alternatively, use the standalone_embed.sh script in this repo, which has been adapted for Fedora.

Usage

  1. Prepare Milvus:

    # Command to prepare Milvus 
    python prepare_milvus.py
  2. Set up the API key: get your key from Mixedbread and paste it into the .env file. See .env.sample for the expected format.

  3. Run the Gradio app:

    python app.py
  4. Interact with the web interface:

    • Open your web browser and go to http://localhost:7860 to access the Gradio interface.
    • Use the search bar to enter an arXiv ID or an abstract, then view the search results.
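For the API key step, the following is a minimal sketch of how a .env file of KEY=VALUE lines could be read using only the standard library. It is an assumption about the file format, not the loader used by app.py; the function name `load_env` is hypothetical, and the real variable name should be taken from .env.sample.

```python
from pathlib import Path


def load_env(path=".env"):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

Calling `load_env()` from the repository root returns a dictionary of the configured values, which can then be passed to whatever client needs the key.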

Configuration

  • Embedding Model: The embedding model used is mixedbread-ai/mxbai-embed-large-v1.

Example

Here is a basic example of how to use the search feature:

  1. Search by Abstract:

    • Enter the abstract of the paper in the provided text box.
    • The system will convert it to a vector, query Milvus, and return the most relevant papers.
  2. Search by ID:

    • Input an arXiv ID directly.
    • Retrieve and display the corresponding paper details.
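Since a single input can be either an arXiv ID or an abstract, the app has to tell the two apart. The sketch below shows one possible way to do that with a regular expression covering new-style IDs (such as 1706.03762) and old-style IDs (such as hep-th/9901001). How app.py actually distinguishes the two is not stated here, so `ARXIV_ID_PATTERN` and `classify_query` are hypothetical names.

```python
import re

# New style: YYMM.NNNN or YYMM.NNNNN; old style: archive[.SC]/YYMMNNN.
# Both may carry an optional version suffix such as v2.
ARXIV_ID_PATTERN = re.compile(
    r"^(?:\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?$"
)


def classify_query(text):
    """Return "id" if the input looks like an arXiv ID, else "abstract"."""
    stripped = text.strip()
    return "id" if ARXIV_ID_PATTERN.match(stripped) else "abstract"
```

An "id" result would lead to a direct lookup, while an "abstract" result would be embedded and sent to Milvus as a similarity query.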

Run at startup (systemd):

  1. Create the file ~/.config/systemd/user/search_arxiv.service (for example with nano ~/.config/systemd/user/search_arxiv.service) with the following contents. This assumes the user is milvus and that a conda environment named search_arxiv is available through the Miniforge install under /home/milvus/miniforge3:
[Unit]
Description=Search ArXiv Web App
After=network.target

[Service]
WorkingDirectory=/home/milvus/search_arxiv/
ExecStart=/bin/bash -c "source /home/milvus/miniforge3/bin/activate search_arxiv && python app.py"
Restart=always

[Install]
WantedBy=default.target
  2. Run systemctl --user daemon-reload to reload systemd.
  3. Run systemctl --user start search_arxiv.service to start the app.
  4. Run systemctl --user enable search_arxiv.service to start the app automatically at startup.

To Do:

  • Automate embedding of new metadata each month.
  • Learn incremental database updates.
  • Automate setting up of the app.
  • Find more sources to integrate.

Contributing

Feel free to contribute to the project by submitting issues, pull requests, or suggestions.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For any questions or feedback, please contact mitanshu.sukhwani@gmail.com.

About

Semantic search app for arxiv abstracts
