Wreck-it-RAG

The repo is an attempt to create an automated pipleine for extracting infromation from different documents and converting them into JSON

To-Do List

📝 Add OpenAI API Key Support
📝 Switch to Django
📝 ~~Make streamlit editable to choose between OCR or LLM summaries~~
📝 Concatenate JSON blocks for page-by-page chunking
📝 ~~Use a package manager for requirements.txt~~
📝 ~~Convert Tables from HTML to JSON~~
📝 Integrate SQL database to store JSON
📝 Look into Apache Spark or Hadoop

Downloading UNSTRUCTURED.IO Dependancies

Follow UNSTRUCTURED.IO's own installation guide to download all dependancies

Quick Summary of Installation Guide

Windows

1. libmagic-dev

Use WSL to enter the following commands

sudo apt update
sudo apt install libmagic-dev

2. Poppler

Check out the pdf2image docs on how to install Poppler on various devices

3. libreoffice

Check out the official page of libreoffice for download guides.

Once the .msi or .exe file is downloaded follow the on-screen instructions

4. Tesseract

The latest installer for Tesseract on windows can be found here

Make sure to add the C:\Program Files\Tesseract-OCR to your Path.

2. Installing pip Requirements

Enter the following code to install all python libraries

pip install -r requirements.txt

3. Create .env File

Create an .env file with the following variable

GEMINI_API_KEY = your-gemini-api-key-here

4. Run Streamlit App

Run the streamlit app using the following command

streamlit run app.py

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
Other		Other
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
app.py		app.py
converter.py		converter.py
extraction_ocr.py		extraction_ocr.py
prompt.py		prompt.py
requirements.txt		requirements.txt
vision_llm_summarizer.py		vision_llm_summarizer.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Wreck-it-RAG

To-Do List

Downloading UNSTRUCTURED.IO Dependancies

Quick Summary of Installation Guide

Windows

1. libmagic-dev

2. Poppler

3. libreoffice

4. Tesseract

2. Installing pip Requirements

3. Create .env File

4. Run Streamlit App

About

Releases

Packages

Languages

License

vijdaancoding/wreck-it-rag

Folders and files

Latest commit

History

Repository files navigation

Wreck-it-RAG

To-Do List

Downloading UNSTRUCTURED.IO Dependancies

Quick Summary of Installation Guide

Windows

1. libmagic-dev

2. Poppler

3. libreoffice

4. Tesseract

2. Installing pip Requirements

3. Create .env File

4. Run Streamlit App

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages