Document Retrieval API

This repository contains an API for semantic search and retrieval of documents using natural language queries.

Details

This API allows users to obtain the most relevant document snippets from their data sources, such as files, notes, or emails, by asking questions or expressing needs in natural language. Enterprises can make their internal documents available to their employees using this plugin.

The plugin uses OpenAI's text-embedding-ada-002 embeddings model to generate embeddings of document chunks, and then stores and queries them using a vector database on the backend. A FastAPI server exposes the plugin's endpoints for upserting, querying, and deleting documents. Users can refine their search results by using metadata filters by source, date, author, or other criteria.

API Endpoints

/upsert: This endpoint allows uploading one or more documents and storing their text and metadata in the vector database. The documents are split into chunks of around 200 tokens, each with a unique ID. The endpoint expects a list of documents in the request body, each with a text field, and optional id and metadata fields. The metadata field can contain the following optional subfields: source, source_id, url, created_at, and author. The endpoint returns a list of the IDs of the inserted documents (an ID is generated if not initially provided).
/upsert-file: This endpoint allows uploading a single file (PDF, TXT, DOCX, PPTX, or MD) and storing its text and metadata in the vector database. The file is converted to plain text and split into chunks of around 200 tokens, each with a unique ID. The endpoint returns a list containing the generated id of the inserted file.
/query: This endpoint allows querying the vector database using one or more natural language queries and optional metadata filters. The endpoint expects a list of queries in the request body, each with a query and optional filter and top_k fields. The filter field should contain a subset of the following subfields: source, source_id, document_id, url, created_at, and author. The top_k field specifies how many results to return for a given query, and the default value is 3. The endpoint returns a list of objects that each contain a list of the most relevant document chunks for the given query, along with their text, metadata and similarity scores.
/delete: This endpoint allows deleting one or more documents from the vector database using their IDs, a metadata filter, or a delete_all flag. The endpoint expects at least one of the following parameters in the request body: ids, filter, or delete_all. The ids parameter should be a list of document IDs to delete; all document chunks for the document with these IDS will be deleted. The filter parameter should contain a subset of the following subfields: source, source_id, document_id, url, created_at, and author. The delete_all parameter should be a boolean indicating whether to delete all documents from the vector database. The endpoint returns a boolean indicating whether the deletion was successful.

The detailed specifications and examples of the request and response models can be found by running the app locally and navigating to http://0.0.0.0:8000/openapi.json, or in the OpenAPI schema here.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.well-known		.well-known
datastore		datastore
models		models
server		server
services		services
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile		Dockerfile
Makefile		Makefile
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Document Retrieval API

Details

API Endpoints

About

Releases

Packages

Languages

joshdschneider/doc-retrieval-api

Folders and files

Latest commit

History

Repository files navigation

Document Retrieval API

Details

API Endpoints

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages