Wreck-it-RAG


This repo is an attempt to create an automated pipeline for extracting information from different document types and converting it into JSON.

To-Do List

📝 Add OpenAI API Key Support
📝 Switch to Django
📝 Make the Streamlit app configurable to choose between OCR and LLM summaries
📝 Concatenate JSON blocks for page-by-page chunking
📝 Use a package manager for requirements.txt
📝 Convert Tables from HTML to JSON
📝 Integrate SQL database to store JSON
📝 Look into Apache Spark or Hadoop

Downloading UNSTRUCTURED.IO Dependencies

Follow UNSTRUCTURED.IO's own installation guide to download all dependencies.

Quick Summary of Installation Guide

Windows

1. libmagic-dev

Use WSL to enter the following commands

sudo apt update
sudo apt install libmagic-dev

2. Poppler

See the pdf2image docs for instructions on installing Poppler on your platform.

3. LibreOffice

See the official LibreOffice page for download guides.

Once the .msi or .exe file is downloaded, run it and follow the on-screen instructions.

4. Tesseract

The latest installer for Tesseract on Windows can be found here

Make sure to add C:\Program Files\Tesseract-OCR to your PATH.
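Once the steps above are done, a quick check that the tools are reachable can save a debugging round later. The sketch below is not part of this repo. It assumes the usual executable names (`tesseract` for Tesseract, `pdftoppm` for Poppler, `soffice` for LibreOffice), which may differ on your install. libmagic is a library rather than a command, so it is not checked here.

```python
import shutil

# Executable names are assumptions, not taken from this repo:
# `tesseract` (Tesseract OCR), `pdftoppm` (a Poppler utility),
# `soffice` (LibreOffice).
REQUIRED_TOOLS = {
    "Tesseract": "tesseract",
    "Poppler": "pdftoppm",
    "LibreOffice": "soffice",
}


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool label to its resolved path, or None if not on PATH."""
    return {label: shutil.which(exe) for label, exe in tools.items()}


if __name__ == "__main__":
    for label, path in find_tools().items():
        print(f"{label}: {path or 'NOT FOUND on PATH'}")
```

If a tool is reported as not found, its install directory is most likely missing from PATH. Open a new terminal after editing PATH so the change is picked up.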

2. Installing pip Requirements

Run the following command to install all Python libraries:

pip install -r requirements.txt
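To confirm the install worked, one option is to compare `requirements.txt` against what is installed in the active environment. The sketch below is a hypothetical helper, not something shipped with this repo. It only handles simple lines such as `name`, `name==version`, or `name[extra]>=version`, and skips pip options. URL or VCS requirements are not supported.

```python
import re
from importlib import metadata
from pathlib import Path


def requirement_names(text):
    """Extract bare package names from requirements.txt-style text."""
    names = []
    for raw_line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Skip blanks and pip options such as "-r other.txt".
        if not line or line.startswith("-"):
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that importlib.metadata cannot find."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    req_file = Path("requirements.txt")
    if req_file.exists():
        names = requirement_names(req_file.read_text(encoding="utf-8"))
        missing = missing_packages(names)
        print("Missing:", ", ".join(missing) if missing else "none")
```

Run it from the repo root, in the same environment where `pip install` was executed.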

3. Create .env File

Create a .env file in the project root with the following variable:

GEMINI_API_KEY = your-gemini-api-key-here
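For reference, here is a minimal sketch of how a `KEY = value` line like the one above can be read into the environment using only the standard library. `load_env_file` is a hypothetical helper written for illustration. The app itself may load the file differently (for example with a package such as python-dotenv), so treat this as an explanation of the file format rather than the project's actual loader.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY = value lines and copy them into os.environ."""
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Tolerate spaces around "=" and optional surrounding quotes.
        value = value.strip().strip("\"'")
        loaded[key] = value
        # Variables already set in the environment take precedence.
        os.environ.setdefault(key, value)
    return loaded


if __name__ == "__main__":
    load_env_file()
    print("GEMINI_API_KEY set:", "GEMINI_API_KEY" in os.environ)
```

The check at the end only reports whether the key is present. It never prints the key itself, which keeps the secret out of terminal logs.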

4. Run Streamlit App

Run the Streamlit app using the following command:

streamlit run app.py