A tool to deconstruct unstructured data in PDFs into JSON for RAG

vijdaancoding/wreck-it-rag

Wreck-it-RAG

Wreck-it-RAG Logo

This repo is an attempt to create an automated pipeline for extracting information from different documents and converting it into JSON.

To-Do List

📝 Add OpenAI API Key Support
📝 Switch to Django
📝 Make the Streamlit app configurable to choose between OCR and LLM summaries
📝 Concatenate JSON blocks for page-by-page chunking
📝 Use a package manager for requirements.txt
📝 Convert Tables from HTML to JSON
📝 Integrate SQL database to store JSON
📝 Look into Apache Spark or Hadoop
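
As a rough illustration of the page-by-page chunking item above, the sketch below groups element dictionaries by page and joins their text. It assumes each element is a dict with a `text` field and a `metadata.page_number` field, which is the shape UNSTRUCTURED.IO elements take when converted to dicts. The function name `chunk_by_page` and the output keys `page` and `text` are hypothetical, not part of this repo.

```python
from collections import defaultdict


def chunk_by_page(elements: list[dict]) -> list[dict]:
    """Concatenate element text into one JSON-ready block per page."""
    pages: dict[int, list[str]] = defaultdict(list)
    for element in elements:
        # Assumption: elements without a page number are filed under page 1.
        page = element.get("metadata", {}).get("page_number") or 1
        text = element.get("text", "").strip()
        if text:
            pages[page].append(text)
    # One block per page, in page order.
    return [{"page": page, "text": "\n".join(pages[page])} for page in sorted(pages)]
```

Each returned block can then be serialized with `json.dumps` and stored or embedded as a single chunk.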

1. Downloading UNSTRUCTURED.IO Dependencies

Follow UNSTRUCTURED.IO's own installation guide to download all dependencies.

Quick Summary of Installation Guide

Windows

1. libmagic-dev

Use WSL to run the following commands:

sudo apt update
sudo apt install libmagic-dev

2. Poppler

Check out the pdf2image docs for instructions on installing Poppler on various platforms.

3. LibreOffice

Check out the official LibreOffice page for download guides.

Once the .msi or .exe file is downloaded, follow the on-screen instructions.

4. Tesseract

The latest installer for Tesseract on Windows can be found here

Make sure to add C:\Program Files\Tesseract-OCR to your PATH.
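
To confirm the four dependencies above are visible before moving on, a small check like the following may help. It is a sketch using only the Python standard library and is not part of this repo. The executable names `pdftoppm` (shipped with Poppler) and `soffice` (LibreOffice's launcher) are the usual ones but may differ on your system. Run it in the same environment you will run the app in, since a library installed inside WSL will not be found from a Windows Python.

```python
import shutil
from ctypes.util import find_library


def check_dependencies() -> dict[str, bool]:
    """Report which system dependencies appear to be installed."""
    return {
        # libmagic is a shared library, not an executable.
        "libmagic": find_library("magic") is not None,
        # pdftoppm is one of the command-line tools shipped with Poppler.
        "poppler": shutil.which("pdftoppm") is not None,
        # soffice is LibreOffice's command-line launcher.
        "libreoffice": shutil.which("soffice") is not None,
        # tesseract must be on PATH for OCR to work.
        "tesseract": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

A `MISSING` entry usually means the tool is either not installed or not on your PATH.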

2. Installing pip Requirements

Run the following command to install all Python libraries:

pip install -r requirements.txt

3. Create .env File

Create a .env file with the following variable:

GEMINI_API_KEY = your-gemini-api-key-here
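
If the app fails to pick up the key, a quick check of the file can rule out formatting problems. The sketch below is a minimal standard-library parser for simple `KEY = value` lines. It is not how the app itself loads the file, and the function names `read_env_file` and `has_gemini_key` are hypothetical.

```python
from pathlib import Path


def read_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY = value lines, ignoring blanks and # comments."""
    values: dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip whitespace around "=" and any surrounding quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def has_gemini_key(path: str = ".env") -> bool:
    """True if GEMINI_API_KEY is set and is not the placeholder text."""
    key = read_env_file(path).get("GEMINI_API_KEY", "")
    return bool(key) and key != "your-gemini-api-key-here"
```

Calling `has_gemini_key()` from the project root returns `False` if the variable is missing or still holds the placeholder value.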

4. Run Streamlit App

Run the Streamlit app using the following command:

streamlit run app.py
