feat: simplify document insertion #6

Merged 1 commit on Aug 16, 2024
15 changes: 7 additions & 8 deletions README.md
@@ -10,12 +10,12 @@ RAGLite is a Python package for Retrieval-Augmented Generation (RAG) with SQLite
 2. 🔒 Fully local RAG with [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) as an LLM provider and [SQLite](https://github.com/sqlite/sqlite) as a local database
 3. 🚀 Acceleration with Metal on macOS and with CUDA on Linux and Windows
 4. 📖 PDF to Markdown conversion on top of [pdftext](https://github.com/VikParuchuri/pdftext) and [pypdfium2](https://github.com/pypdfium2-team/pypdfium2)
-5. ✂️ Optimal [level 4 semantic chunking](https://medium.com/@anuragmishra_27746/five-levels-of-chunking-strategies-in-rag-notes-from-gregs-video-7b735895694d)
+5. ✂️ Optimal [level 4 semantic chunking](https://medium.com/@anuragmishra_27746/five-levels-of-chunking-strategies-in-rag-notes-from-gregs-video-7b735895694d) by solving a [binary integer programming problem](https://en.wikipedia.org/wiki/Integer_programming)
 6. 📌 Markdown-based [contextual chunk headings](https://d-star.ai/solving-the-out-of-context-chunk-problem-for-rag)
-7. 🌈 [Multi-vector chunk retrieval](https://python.langchain.com/v0.2/docs/how_to/multi_vector/)
-8. 🌀 Optimal closed-form linear query adapter by solving an [(orthogonal) Procrustes problem](https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem)
+7. 🌈 Sub-chunk matching with [multi-vector chunk retrieval](https://python.langchain.com/v0.2/docs/how_to/multi_vector/)
+8. 🌀 Optimal [closed-form linear query adapter](src/raglite/_query_adapter.py) by solving an [orthogonal Procrustes problem](https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem)
 9. 🔍 [Hybrid search](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) that combines [SQLite's BM25 full-text search](https://sqlite.org/fts5.html) with [PyNNDescent's ANN vector search](https://github.com/lmcinnes/pynndescent)
-10. ✍️ Optional support for automatic conversion of any input document to Markdown with [Pandoc](https://github.com/jgm/pandoc)
+10. ✍️ Optional support for conversion of any input document to Markdown with [Pandoc](https://github.com/jgm/pandoc)
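
Item 9 in the feature list links to the reciprocal rank fusion (RRF) paper as the basis for hybrid search. The sketch below illustrates the general RRF idea of merging a keyword ranking and a vector ranking. It is not RAGLite's implementation: the function name `reciprocal_rank_fusion`, the chunk ids, and the tie-breaking rule are assumptions made for illustration, and `k=60` is the constant used in the linked paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several rankings of chunk ids into one (hypothetical helper)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically by chunk id.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Illustrative rankings, e.g. one from keyword search and one from vector search.
keyword_ranking = ["a", "b", "c"]
vector_ranking = ["b", "c", "a"]
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)  # ['b', 'a', 'c']
```

Because RRF only uses ranks, the BM25 scores and vector distances never need to be put on a common scale.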

## Installing

@@ -49,11 +49,10 @@ my_config = RAGLiteConfig(db_url="sqlite:///raglite.sqlite")

 # Index documents:
 from pathlib import Path
-from raglite import insert_document, update_vector_index
+from raglite import insert_document

 insert_document(Path("On the Measure of Intelligence.pdf"), config=my_config)
-insert_document(Path("Situational Awareness.pdf"), config=my_config)
-update_vector_index(config=my_config)
+insert_document(Path("Special Relativity.pdf"), config=my_config)

 # Search for chunks:
 from raglite import hybrid_search, keyword_search, vector_search
@@ -66,7 +65,7 @@ results_hybrid = hybrid_search(prompt, num_results=5, config=my_config)
 # Answer questions with RAG:
 from raglite import rag

-prompt = "What is a 'SkillProgram'?"
+prompt = "What does it mean for two events to be simultaneous?"
 stream = rag(prompt, search=hybrid_search, config=my_config)
 for update in stream:
     print(update, end="")
9 changes: 7 additions & 2 deletions src/raglite/_index.py
@@ -54,8 +54,10 @@ def _create_chunk_records(
     return chunk_records


-def insert_document(doc_path: Path, *, config: RAGLiteConfig | None = None) -> None:
-    """Insert a document into the database."""
+def insert_document(
+    doc_path: Path, *, update_index: bool = True, config: RAGLiteConfig | None = None
+) -> None:
+    """Insert a document into the database and update the index."""
     # Use the default config if not provided.
     config = config or RAGLiteConfig()
     # Preprocess the document into chunks.
@@ -96,6 +98,9 @@ def insert_document(doc_path: Path, *, config: RAGLiteConfig | None = None) -> N
                 continue
             session.add(chunk_record)
         session.commit()
+    # Update the vector search chunk index.
+    if update_index:
+        update_vector_index(config)


 def update_vector_index(config: RAGLiteConfig | None = None) -> None:
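
The change above makes `insert_document` refresh the vector index by default, with a keyword-only `update_index` flag to opt out. One plausible use of the flag is bulk loading: skip the refresh on each insert and call `update_vector_index` once at the end. The sketch below models that pattern with a toy in-memory store. `ToyStore` and the `store` parameter are hypothetical stand-ins for illustration only; the two functions here merely mirror the names in the diff and are not RAGLite's code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Hypothetical in-memory stand-in for the database and vector index."""

    chunks: list[str] = field(default_factory=list)
    indexed: list[str] = field(default_factory=list)
    index_rebuilds: int = 0


def update_vector_index(store: ToyStore) -> None:
    """Rebuild the toy index from all stored chunks."""
    store.indexed = list(store.chunks)
    store.index_rebuilds += 1


def insert_document(text: str, *, update_index: bool = True, store: ToyStore) -> None:
    """Store one chunk per non-empty line, then optionally refresh the index."""
    store.chunks.extend(line for line in text.splitlines() if line.strip())
    if update_index:
        update_vector_index(store)


docs = ["a\nb", "c", "d\ne"]

# Default behaviour: every insert refreshes the index.
eager = ToyStore()
for doc in docs:
    insert_document(doc, store=eager)

# Bulk behaviour: defer the refresh and run it once at the end.
lazy = ToyStore()
for doc in docs:
    insert_document(doc, update_index=False, store=lazy)
update_vector_index(lazy)

print(eager.index_rebuilds, lazy.index_rebuilds)  # 3 1
```

Both stores end up with the same indexed chunks; only the number of index rebuilds differs, which is the trade-off the flag exposes.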