Paper: HiQA: A Hierarchical Contextual Augmentation RAG for Massive Documents QA, arXiv:2402.01767.
HiQA provides a comprehensive toolkit for document processing, enabling the segmentation of documents into sections, enrichment with metadata, and embedding for in-depth analysis. It leverages a multi-route retrieval system to identify relevant knowledge in response to specific queries. This knowledge, along with the query, is then processed by a large language model (LLM) to generate answers. Although document processing incurs some initial costs, this investment significantly improves the quality of the results.
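To make that flow concrete, here is a minimal sketch of a retrieve-then-generate step, assuming an OpenAI-style client; the function names, prompt, and the stub retriever are illustrative, not HiQA's actual API:

```python
# Minimal sketch of a retrieve-then-generate flow. Names are illustrative,
# not HiQA's actual API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def retrieve(query: str) -> list[str]:
    # Stand-in for multi-route retrieval: HiQA merges candidate sections
    # from several routes (e.g. vector search and keyword search).
    return ["<relevant section 1>", "<relevant section 2>"]

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    resp = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system",
             "content": "Answer the question using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```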
Ensure your environment meets the following prerequisites:
- Python version 3.9
- Install dependencies from `requirements.txt` using the following command:

  ```bash
  pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
  ```
- Set your OpenAI API key in the environment variables (e.g. `OPENAI_API_KEY`).
- To start the demo, execute:

  ```bash
  PYTHONUNBUFFERED=1 nohup streamlit run app_streamlit.py --server.port 8080 --server.address 0.0.0.0 > logs/run.log 2>&1 &
  ```

  Note: Before running the above command, manually create a `logs` directory (e.g. `mkdir logs`).
To build a dataset, follow these steps:
- Utilize the tools in the `build_tool` directory.
- Begin with a PDF file that is text-extractable.
- Step 1: Convert the PDF to a well-formatted markdown file using `pdf2md`, leveraging the `gpt-4-turbo-preview (0125)` model. (Note that this process is costly! This step can also be performed manually.) A conversion sketch appears after this list.
- Step 2: Convert the markdown file into a CSV file with `md2csv`, organizing content into sections with hierarchical metadata and labeling tables (see the sectioning sketch below).
- Step 3: Use `section2embedding` to append embedding vectors to sections (see the embedding sketch below).
- Step 4: Place all processed CSV files into a dataset directory. Load this dataset in `knowledge_client.py` for querying in the `app_streamlit.py` demo.

Note: File names and titles are processed through Named Entity Detection models to generate critical keywords, which are stored in `utils.filter.critic_keywords`.
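As a rough illustration of Step 1, here is a minimal sketch assuming PyMuPDF for text extraction and an OpenAI-style client; the prompt and function names are assumptions, not `pdf2md`'s actual code:

```python
# Sketch of a pdf2md-style conversion: extract raw text per page, then ask
# the model to reformat it as markdown. Names and prompt are illustrative.
import fitz  # PyMuPDF
from openai import OpenAI

client = OpenAI()

def page_to_markdown(page_text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system",
             "content": "Reformat the following extracted PDF text as clean "
                        "markdown. Preserve headings, lists, and tables."},
            {"role": "user", "content": page_text},
        ],
    )
    return resp.choices[0].message.content

def pdf_to_markdown(path: str) -> str:
    doc = fitz.open(path)
    return "\n\n".join(page_to_markdown(p.get_text()) for p in doc)
```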
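For Step 2, the following sketch shows one way to section a markdown file while carrying the heading hierarchy along as metadata; the column names are assumptions, not `md2csv`'s exact schema:

```python
# Sketch of md2csv-style sectioning: walk the markdown headings and emit one
# CSV row per section, prefixed with its full heading path as metadata.
import csv
import re

def md_to_rows(md_text: str, doc_title: str) -> list[dict]:
    path = [doc_title]      # current heading hierarchy, root = document title
    body, rows = [], []

    def flush():
        if body:
            rows.append({"hierarchy": " > ".join(path),
                         "text": "\n".join(body).strip()})
            body.clear()

    for line in md_text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:                                   # a heading closes the section
            flush()
            level = len(m.group(1))
            path[:] = path[:level] + [m.group(2).strip()]
        else:
            body.append(line)
    flush()
    return rows

with open("doc.csv", "w", newline="") as f:
    rows = md_to_rows(open("doc.md").read(), "Example Manual")
    w = csv.DictWriter(f, fieldnames=["hierarchy", "text"])
    w.writeheader()
    w.writerows(rows)
```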
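For Step 3, a minimal embedding pass could look like the sketch below, assuming an OpenAI embedding model and the hypothetical CSV columns from the previous sketch:

```python
# Sketch of section2embedding: append an embedding column to each section
# row. The model choice and column names are assumptions.
import csv
import json
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

with open("doc.csv") as f:
    rows = list(csv.DictReader(f))

for row in rows:
    # Embed the hierarchy path together with the body so the vector also
    # carries the section's surrounding context.
    row["embedding"] = json.dumps(embed(row["hierarchy"] + "\n" + row["text"]))

with open("doc_embedded.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["hierarchy", "text", "embedding"])
    w.writeheader()
    w.writerows(rows)
```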
For image processing:
- In `image_service`, execute load, build, and commit operations to create an `/indexes` directory for Whoosh (see the index sketch after this list).
- Use a VLM (such as `llava:34b` via Ollama) to generate descriptions for each extracted image (see the captioning sketch below).
- Image searches can be conducted using `app_streamlit.search_images_from_response`.
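The load/build/commit cycle maps naturally onto Whoosh's API. Here is a minimal sketch; the schema fields are assumptions, not necessarily what `image_service` stores:

```python
# Sketch of building and querying a Whoosh index of image descriptions.
import os
from whoosh.fields import Schema, ID, TEXT
from whoosh.index import create_in
from whoosh.qparser import QueryParser

schema = Schema(path=ID(stored=True), description=TEXT(stored=True))
os.makedirs("indexes", exist_ok=True)
ix = create_in("indexes", schema)           # build: create the index on disk

writer = ix.writer()
writer.add_document(path="figs/fig1.png",   # load: one document per image
                    description="Wiring diagram for the power module")
writer.commit()                             # commit: flush writes to disk

# Search: find images whose VLM-generated description matches a query.
with ix.searcher() as searcher:
    query = QueryParser("description", ix.schema).parse("wiring diagram")
    for hit in searcher.search(query):
        print(hit["path"], hit["description"])
```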
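For the captioning step, a sketch using the `ollama` Python client might look like this; the model tag is an assumption, and any vision-capable model served by Ollama should work:

```python
# Sketch of VLM captioning via Ollama. The model tag is an assumption.
import ollama

def describe_image(image_path: str) -> str:
    resp = ollama.chat(
        model="llava:34b",
        messages=[{
            "role": "user",
            "content": "Describe this image in one or two sentences.",
            "images": [image_path],   # attach the image file to the prompt
        }],
    )
    return resp["message"]["content"]

print(describe_image("figs/fig1.png"))
```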