This is an MVP of an LLM document-search RAG pipeline.
Requirements Doc:
- Scan PDFs (pypdf, or AWS Textract as an alternative)
- Create pages
- Chunk pages (LangChain)
- Embeddings (OpenAI)
- Store in a vector DB (Chroma)
- Test our embeddings (pytest)
- Retrieve with a search query (Mistral)
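The chunking step above can be sketched in plain Python. This is a minimal, hypothetical illustration of overlapping character windows (the chunk_size and overlap values are made up); the real project uses LangChain's text splitters:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows.

    Overlap keeps context that straddles a chunk boundary
    visible in both neighbouring chunks.
    """
    chunks = []
    step = chunk_size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

page = "word " * 100          # a 500-character "page"
chunks = chunk_text(page)
print(len(chunks), len(chunks[0]))  # → 3 200
```

Each chunk is what later gets embedded and stored, so its size bounds how much context a single retrieval hit can carry.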
Run the following commands to install the dependencies listed in the
requirements.txt file:
pip install -r requirements.txt
pip install pytest
pip install pypdf
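The exact contents of requirements.txt are not shown here; based on the stack listed above, a plausible version might look like this (package names only, no pins verified):

```text
# assumed dependencies for the pipeline described above
pypdf
langchain
langchain-community
openai
chromadb
mistralai
pytest
```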
To scan all the PDF files in the data folder and load them into the RAG, run:
python load_pdf.py
This scans the PDFs with pypdf via the LangChain document loader, splits the documents into pages, and chunks them. The chunks are then embedded and stored in Chroma.
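The embed-and-store stage can be illustrated without any external services. This toy sketch uses a deterministic hash-based vector as a stand-in for OpenAI embeddings and a plain list as a stand-in for Chroma; it shows the data flow only, not the real APIs:

```python
import hashlib
import math

def fake_embed(text, dim=8):
    """Toy embedding: a unit vector derived from a SHA-256 hash.

    A deterministic stand-in for a real embedding model; it has
    the right shape but carries no semantic meaning.
    """
    digest = hashlib.sha256(text.encode()).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# In-memory "vector DB" standing in for Chroma: each record
# pairs the chunk text with its embedding.
store = []
for chunk in ["chunk one about cats", "chunk two about dogs"]:
    store.append({"text": chunk, "embedding": fake_embed(chunk)})

print(len(store), len(store[0]["embedding"]))  # → 2 8
```

In the real pipeline the embedding call goes to OpenAI and the records are persisted by Chroma, but the record shape (text plus vector) is the same idea.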
To query the Chroma DB and have Mistral generate an answer, run:
python query_data.py "Your question relevant to the context of the application"
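The retrieval half of that command boils down to ranking stored chunks by similarity to the query embedding. This self-contained sketch uses a toy bag-of-words embedding and cosine similarity in place of OpenAI embeddings and Chroma's search (the documents and helper names are illustrative):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words "embedding": token counts in place of a real vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "Chroma stores embedded chunks",
    "Mistral generates the answer",
    "pypdf reads PDF pages",
]

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

print(retrieve("which model generates the answer", docs))
```

In query_data.py the top-k chunks retrieved this way would be passed to Mistral as context for answer generation.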
To test Mistral's answers with pytest, run:
pytest test_cases.py
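The contents of test_cases.py are not shown here; a test in it might look roughly like the following. The query_rag helper is hypothetical (here stubbed out so the sketch runs standalone); in the real suite it would call the Chroma-plus-Mistral pipeline:

```python
def query_rag(question):
    """Stub standing in for the real retrieve-and-generate pipeline."""
    if "capital of France" in question:
        return "Paris"
    return "unknown"

def test_known_answer():
    # Assert the expected fact appears in the generated answer.
    assert "Paris" in query_rag("What is the capital of France?")

def test_unknown_question():
    # Questions outside the indexed documents should not fabricate facts.
    assert query_rag("What is my cat's name?") == "unknown"
```

Because LLM output varies between runs, assertions on substrings or judged equivalence are usually more robust than exact string matches.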