Skip to content

API for semantic search and retrieval of documents using natural language queries.

Notifications You must be signed in to change notification settings

joshdschneider/doc-retrieval-api

Repository files navigation

Document Retrieval API

This repository contains an API for semantic search and retrieval of documents using natural language queries.

Details

This API allows users to obtain the most relevant document snippets from their data sources, such as files, notes, or emails, by asking questions or expressing needs in natural language. Enterprises can make their internal documents available to their employees using this plugin.

The plugin uses OpenAI's text-embedding-ada-002 embeddings model to generate embeddings of document chunks, and then stores and queries them using a vector database on the backend. A FastAPI server exposes the plugin's endpoints for upserting, querying, and deleting documents. Users can refine their search results by using metadata filters by source, date, author, or other criteria.

API Endpoints

  • /upsert: This endpoint allows uploading one or more documents and storing their text and metadata in the vector database. The documents are split into chunks of around 200 tokens, each with a unique ID. The endpoint expects a list of documents in the request body, each with a text field, and optional id and metadata fields. The metadata field can contain the following optional subfields: source, source_id, url, created_at, and author. The endpoint returns a list of the IDs of the inserted documents (an ID is generated if not initially provided).

  • /upsert-file: This endpoint allows uploading a single file (PDF, TXT, DOCX, PPTX, or MD) and storing its text and metadata in the vector database. The file is converted to plain text and split into chunks of around 200 tokens, each with a unique ID. The endpoint returns a list containing the generated id of the inserted file.

  • /query: This endpoint allows querying the vector database using one or more natural language queries and optional metadata filters. The endpoint expects a list of queries in the request body, each with a query and optional filter and top_k fields. The filter field should contain a subset of the following subfields: source, source_id, document_id, url, created_at, and author. The top_k field specifies how many results to return for a given query, and the default value is 3. The endpoint returns a list of objects that each contain a list of the most relevant document chunks for the given query, along with their text, metadata and similarity scores.

  • /delete: This endpoint allows deleting one or more documents from the vector database using their IDs, a metadata filter, or a delete_all flag. The endpoint expects at least one of the following parameters in the request body: ids, filter, or delete_all. The ids parameter should be a list of document IDs to delete; all document chunks for the document with these IDS will be deleted. The filter parameter should contain a subset of the following subfields: source, source_id, document_id, url, created_at, and author. The delete_all parameter should be a boolean indicating whether to delete all documents from the vector database. The endpoint returns a boolean indicating whether the deletion was successful.

The detailed specifications and examples of the request and response models can be found by running the app locally and navigating to http://0.0.0.0:8000/openapi.json, or in the OpenAPI schema here.

About

API for semantic search and retrieval of documents using natural language queries.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published