This repo is an attempt to create an automated pipeline for extracting information from different kinds of documents and converting it into JSON.
📝 Add OpenAI API Key Support
📝 Switch to Django
📝 Make the Streamlit app let the user choose between OCR and LLM summaries
📝 Concatenate JSON blocks for page-by-page chunking
📝 Use a package manager for requirements.txt
📝 Convert Tables from HTML to JSON
📝 Integrate SQL database to store JSON
📝 Look into Apache Spark or Hadoop
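As a starting point for the "Convert Tables from HTML to JSON" item above, here is a minimal sketch using only the Python standard library. It assumes a single, non-nested table whose first row holds the column names. The names `TableParser` and `html_table_to_records` are hypothetical and not part of this repo.

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of a single, non-nested HTML table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row currently being filled, if any
        self._cell = None  # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed HTML


def html_table_to_records(html):
    """Return a list of dicts, using the first table row as the keys."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


sample = "<table><tr><th>Item</th><th>Qty</th></tr><tr><td>Pen</td><td>2</td></tr></table>"
print(json.dumps(html_table_to_records(sample), indent=2))
```

All values come out as strings. Merged cells (`rowspan`, `colspan`) and nested tables are not handled and would need extra logic.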
Follow UNSTRUCTURED.IO's own installation guide to install all dependencies
Use WSL to enter the following commands
sudo apt update
sudo apt install libmagic-dev
Check out the pdf2image docs on how to install Poppler on various devices
Check out the official LibreOffice page for download guides.
Once the .msi or .exe file is downloaded, follow the on-screen instructions
The latest installer for Tesseract on Windows can be found here
Make sure to add C:\Program Files\Tesseract-OCR to your PATH.
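To confirm that the external tools installed above are actually reachable, a small check like the following may help. It is only a sketch: the executable names are assumptions (`tesseract` for Tesseract, `pdftoppm` as one of the binaries shipped with Poppler, `soffice` for LibreOffice), and `find_missing_tools` is a hypothetical helper, not part of this repo.

```python
import shutil

# Assumed executable names: Tesseract OCR, Poppler's pdftoppm, LibreOffice.
REQUIRED_TOOLS = ("tesseract", "pdftoppm", "soffice")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools were found on PATH.")
```

If a tool is reported as missing right after installation, opening a new terminal usually picks up the updated PATH.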
Run the following command to install all Python libraries
pip install -r requirements.txt
Create a .env file with the following variable
GEMINI_API_KEY = your-gemini-api-key-here
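For illustration, the key could be read at startup roughly as follows. This is a standard-library sketch and not necessarily how app.py does it; the project may well rely on a package such as python-dotenv instead. The helper names `read_env_file` and `get_gemini_key` are hypothetical.

```python
import os


def read_env_file(path=".env"):
    """Parse simple KEY = value lines; blank lines and # comments are skipped."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Tolerate spaces around "=" and optional surrounding quotes.
            values[key.strip()] = value.strip().strip("'\"")
    return values


def get_gemini_key(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("GEMINI_API_KEY")
    if not key and os.path.exists(path):
        key = read_env_file(path).get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set")
    return key
```

Failing early with a clear error tends to be easier to debug than an authentication failure deep inside an API call.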
Run the Streamlit app using the following command
streamlit run app.py