This RAG Document Processing repository ingests documents into watsonx Discovery and queries them. It provides two tools: the RAG-LLM-Service API and the Document-Processing script.
The RAG-LLM-Service API ingests files from IBM Cloud Object Storage into watsonx Discovery (ingestDocs endpoint) and performs RAG on a file (queryLLM endpoint).
The Document-Processing script performs RAG across all files in an IBM Cloud Object Storage bucket. Additionally, you can modify the "answer_processing_instructions.txt" file to further process the retrieved answers.
The following prerequisites are required to interface with watsonx Discovery and run the document processing script:
- Python 3
- IBM Cloud API key
  - This must be for the same cloud account that hosts your Cloud Object Storage, watsonx.ai, and watsonx Discovery instances
- watsonx.ai project ID
  - This can be found in Watson Studio -> select your project -> Manage
- IBM Cloud Object Storage endpoint URL, resource instance ID, and bucket name
  - Your endpoint URL can be found by selecting your Cloud Object Storage resource within IBM Cloud -> select your bucket -> Configuration -> scroll to the "public endpoint"
  - Your instance ID can be generated by selecting your Cloud Object Storage resource within IBM Cloud -> Service Credentials -> New Credential -> Role: Content Reader
- watsonx Discovery username, password, and URL
- watsonx Discovery index name and pipeline name
  - These are created as part of your ingestDocs API call
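Before running either tool, it can help to confirm that every prerequisite value has actually been supplied. The following is a minimal sketch only: the variable names are hypothetical placeholders chosen for illustration, not the keys used by this repository, so substitute the names that appear in the repository's env file.

```python
import os

# ASSUMPTION: these names are hypothetical and chosen only for illustration.
# Replace them with the keys that actually appear in the repository's env file.
REQUIRED_SETTINGS = [
    "IBM_CLOUD_API_KEY",
    "WX_PROJECT_ID",
    "COS_ENDPOINT_URL",
    "COS_INSTANCE_ID",
    "COS_BUCKET_NAME",
    "WXD_USERNAME",
    "WXD_PASSWORD",
    "WXD_URL",
    "WXD_INDEX_NAME",
    "WXD_PIPELINE_NAME",
]


def missing_settings(environ, required=REQUIRED_SETTINGS):
    """Return the required names that are absent or blank in `environ`."""
    return [name for name in required if not environ.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings(os.environ)
    if missing:
        print("Missing settings: " + ", ".join(missing))
    else:
        print("All prerequisite settings are present.")
```

Passing the mapping in as an argument, rather than reading `os.environ` inside the function, keeps the check easy to run against any dictionary of settings.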
- Clone the repo
  git clone git@github.com:ibm-ecosystem-engineering/RAG-Document-Processing.git
- Change directory into RAG-Document-Processing
  cd RAG-Document-Processing
- Create a Python virtual environment, activate it, and install the requirements
  python3 -m venv virtual-env
  source virtual-env/bin/activate
  pip3 install -r requirements.txt
- Copy env file to .env
  cp env .env
- Configure parameters in .env based on your prerequisites
- Change directory into RAG-LLM-Service and spin up the API
  cd RAG-LLM-Service
  python3 app.py
  - To access Swagger: http://0.0.0.0:4050/docs
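Once the API is running, the ingestDocs and queryLLM endpoints can also be called from a script instead of the Swagger page. The sketch below is hedged: it assumes the endpoints live at `/ingestDocs` and `/queryLLM` on the host and port shown above and accept JSON over POST. It deliberately does not define any payload fields or authentication headers, since those are not described here; take the real request schema from the Swagger page.

```python
import json
import urllib.request

# Host and port are taken from the Swagger URL above. On some systems you may
# need to use localhost in place of 0.0.0.0 when connecting as a client.
BASE_URL = "http://0.0.0.0:4050"


def build_request(endpoint, payload, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for an API endpoint.

    ASSUMPTION: the endpoint accepts JSON via POST at base_url/endpoint.
    The payload fields are whatever the Swagger page documents; none are
    defined by this sketch.
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating `build_request` from `send` lets you inspect the URL and body before anything goes over the network. A call would then look like `send(build_request("queryLLM", payload))`, where `payload` follows the schema shown in Swagger.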
- Open a new terminal, change directory into the cloned repo, activate the virtual environment, and change directory into Document-Processing
  cd RAG-Document-Processing
  source virtual-env/bin/activate
  cd Document-Processing
- Configure the files within the config directory
  - You can modify doc_processing_config.json to
    - Choose what questions to process across documents
    - Only process specific documents
    - Only process newly added (unprocessed) documents
  - You can modify answer_processing_instructions.txt to specify what you want to extract from the watsonx Discovery answer chunks
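The document-selection options above (specific documents only, newly added documents only) amount to a filter over the bucket's file list. The sketch below illustrates that idea only. The function name and its parameters are hypothetical, and it is not the script's actual logic or the real key names in doc_processing_config.json.

```python
def select_documents(available, only=None, processed=None, new_only=False):
    """Pick which documents to process (illustrative, hypothetical helper).

    available: file names found in the bucket
    only:      optional list restricting processing to specific documents
    processed: names already handled in an earlier run
    new_only:  when True, skip anything listed in `processed`
    """
    wanted = set(only) if only else None
    done = set(processed or [])
    selected = []
    for name in available:
        if wanted is not None and name not in wanted:
            continue  # not one of the specifically requested documents
        if new_only and name in done:
            continue  # already processed in an earlier run
        selected.append(name)
    return selected


docs = ["a.pdf", "b.pdf", "c.pdf"]
print(select_documents(docs, only=["a.pdf", "b.pdf"], processed=["a.pdf"], new_only=True))
```

With the example inputs, only `b.pdf` is both requested and not yet processed, so it is the single document selected.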
- Run batch Document-Processing
  python3 document_processing_script.py
- When you are done with the tools, run the following command to exit the Python virtual environment:
  deactivate