
Important

πŸ‘‰ Now part of Docling!

Quackling



Easily build document-native generative AI applications, such as RAG, leveraging Docling's efficient PDF extraction and rich data model β€” while still using your favorite framework, πŸ¦™ LlamaIndex or πŸ¦œπŸ”— LangChain.

Features

  • 🧠 Enables rich gen AI applications by providing capabilities on native document level β€” not just plain text / Markdown!
  • ⚑️ Leverages Docling's conversion quality and speed.
  • βš™οΈ Plug-and-play integration with LlamaIndex and LangChain for building powerful applications like RAG.

Doc-native RAG

Installation

To use Quackling, simply install it via your package manager, e.g. pip:

pip install quackling

Usage

Quackling offers core capabilities (quackling.core) as well as framework integration components (quackling.llama_index and quackling.langchain). Examples of both follow below.

Basic RAG

Here is a basic RAG pipeline using LlamaIndex:

Note

To run the example as is, first install llama-index-embeddings-huggingface and llama-index-llms-huggingface-api in addition to quackling, so that the default models below are available:
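pip install quackling llama-index-embeddings-huggingface llama-index-llms-huggingface-api

Alternatively, you can set EMBED_MODEL & LLM to any other models as desired, e.g. using local models.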

import os

from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
from quackling.llama_index.node_parsers import HierarchicalJSONNodeParser
from quackling.llama_index.readers import DoclingPDFReader

DOCS = ["https://arxiv.org/pdf/2206.01062"]
QUESTION = "How many pages were human annotated?"
EMBED_MODEL = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
LLM = HuggingFaceInferenceAPI(
    token=os.getenv("HF_TOKEN"),
    model_name="mistralai/Mistral-7B-Instruct-v0.3",
)

index = VectorStoreIndex.from_documents(
    documents=DoclingPDFReader(parse_type=DoclingPDFReader.ParseType.JSON).load_data(DOCS),
    embed_model=EMBED_MODEL,
    transformations=[HierarchicalJSONNodeParser()],
)
query_engine = index.as_query_engine(llm=LLM)
result = query_engine.query(QUESTION)
print(result.response)
# > 80K pages were human annotated
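The LangChain integration works analogously, via a document loader instead of a reader. Here is a minimal sketch; the DoclingPDFLoader name and its file_path parameter are assumptions based on the quackling.langchain package mentioned above, so check the package for the exact API:

from quackling.langchain.loaders import DoclingPDFLoader

# Load a PDF as LangChain documents (loader name and signature assumed,
# verify against quackling.langchain):
loader = DoclingPDFLoader(file_path="https://arxiv.org/pdf/2206.01062")
docs = loader.load()

The resulting documents can then be passed to any LangChain indexing or RAG pipeline, analogously to the LlamaIndex example above.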

Chunking

You can also use Quackling standalone with any pipeline. For instance, to split a document into chunks based on its structure, while retaining pointers to the respective Docling document nodes:

from docling.document_converter import DocumentConverter
from quackling.core.chunkers import HierarchicalChunker

doc = DocumentConverter().convert_single("https://arxiv.org/pdf/2408.09869").output
chunks = list(HierarchicalChunker().chunk(doc))
# > [
# >     ChunkWithMetadata(
# >         path='$.main-text[4]',
# >         text='Docling Technical Report\n[...]',
# >         page=1,
# >         bbox=[117.56, 439.85, 494.07, 482.42]
# >     ),
# >     [...]
# > ]
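Each chunk carries the text alongside provenance metadata: a JSON path into the Docling document (path), the page number, and the bounding box. This makes it possible to trace any retrieved chunk back to its exact source location. A minimal sketch of feeding the chunks into an arbitrary embedding pipeline (embed_texts is a hypothetical placeholder for whichever embedding backend you use):

# Texts to embed with any embedding model / vector store:
texts = [chunk.text for chunk in chunks]

# Provenance metadata, kept so retrieved chunks can be grounded in the source document:
metadata = [{"path": chunk.path, "page": chunk.page, "bbox": chunk.bbox} for chunk in chunks]

# vectors = embed_texts(texts)  # hypothetical embedding call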

More examples

  • LlamaIndex
  • LangChain

Contributing

Please read Contributing to Quackling for details.

References

If you use Quackling in your projects, please consider citing the following:

@techreport{Docling,
  author = "Deep Search Team",
  month = 8,
  title = "Docling Technical Report",
  url = "https://arxiv.org/abs/2408.09869",
  eprint = "2408.09869",
  doi = "10.48550/arXiv.2408.09869",
  version = "1.0.0",
  year = 2024
}

License

The Quackling codebase is under MIT license. For individual component usage, please refer to the component licenses found in the original packages.