RAG Doll

Rag Doll is a chat-with-your-documents style Retrieval Augmented Generation (RAG), which is a specialised use of a Large Language Model (LLM) where items from a knowledge base get added to the prompt for better answers.

There are many RAG implementations out there and I don't proclaim this one to be better than any of the others. Rag Doll does not support multi-modal chat at this time. Maybe later, feel free to suggest a pull request. :-)

The implementation is mostly Python, although the heavy lifting is done by pre-trained machine learning models. You'll want to run this on something with a decent GPU, or you will find this all to be very slow. Rag Doll is broken up into several containers, each with a single responsibility (or as close to that as I could get). Containerising makes it easier to upgrade and improve individual components.

Assistant

The assistant handles queries to the RAG for us. It awaits messages from the user chat queue, queries the knowledge base and builds a prompt for the LLM.

We use OpenAI as the model run-time. OpenAI provides robust capabilities for managing multiple models and handling large model files. It simplifies the integration process by managing registrations and pulling models as needed.

`.env`	default	description
`ASSISTANT_LANGUAGES`	`en`, CHANGEME, CHANGEME	A comma-separated list of ISO 639-1 language codes. Be sure to add a section to the system prompt that describes these languages.
`OPENAI_API_KEY`	CHANGEME	The API key for authenticating with OpenAI services.
`OPENAI_CHAT_MODEL`	`gpt-4o`	The LLM model that is used to handle chat messages. Read more about OpenAI models
`CHROMADB_DISTANCE_CUTOFF`	`1.5`	The minimum vector distance needed for a chunk for the chunk to be included in the prompt as RAG context. Chunks with a higher distance are discarded from the RAG query results.

For the final configuration, be sure to add one each of system prompt, RAG prompt and RAG-less prompt for all langauges in ASSISTANT_LANGUAGES. This gives the system a specific set of prompts for each language. All language codes are ISO 639-1 codes.

Note: The OPENAI_API_KEY does not need to be explicitly called in assistant.py because the openai library automatically reads it from the environment variables when openai.OpenAI() is instantiated.

EPPO Librarian

The EPPO librarian is responsible for getting the EPPO Global Database data sheet data into the vector database. It runs at startup, recreating the data set that is to be used for the retrieval part of the system.

The EPPO Global Database is a collection of technical resources that researchers can use in their work. As quoted from their website: EPPO Global Database is maintained by the Secretariat of the European and Mediterranean Plant Protection Organization (EPPO). The aim of the database is to provide all pest-specific information that has been produced or collected by EPPO. The database contents are constantly being updated by the EPPO Secretariat.

`.env`	default	description
`CHROMADB_COLLECTION_TEMPLATE`	EPPO-datasheets-{}	The template for the names of the ChromaDB collections where each translation of the EPPO datasheets will be stored. This should have one `{}` placeholder.
`EPPO_COUNTRY_ORGANISM_URL`	https://gd.eppo.int/country/{country}/organisms.csv	The URL to the per-country organism list on the EPPO database. Use `{country}` as placeholder for the country to query for.
`EPPO_DATASHEET_URL`	https://gd.eppo.int/taxon/{eppo_code}/datasheet	The URL to the organism datasheet in the EPPO database. Use `{eppo_code}` as placeholder for the EPPO code.
`EPPO_COUNTRIES`	CHANGEME	A comma-separated list of ISO 3166-1 alpha-2 country codes of countries that you are interested in.
`OPENAI_API_KEY`	CHANGEME	The API key for authenticating with OpenAI services.
`OPENAI_CHAT_MODEL`	`gpt-4o`	The LLM model that is used to handle plain text translation
`CHUNK_SIZE`	5	For small data sets, a few sentences will have to do.
`OVERLAP_SIZE`	1	The EPPO librarian uses rooftiling. This is the overlap.
`PLAIN_TEXT_SYSTEM_PROMPT`	CHANGEME	The system prompt for translating the scientific text into plain language.
`PLAIN_TEXT_PROMPT`	CHANGEME	The prompt template for translating the scientific text into plain language. Must have a `{text}` placeholder.

EPPO is not completely clear on what license they expect. They do not restrict accessing the datasheets. They do ask for citation, which we provide.

ChromaDB Vector Database

The vector database takes care of embedding and semantic search on the knowledge base library. Rag doll uses Chroma DB, being lightweigth and easy to interface with.

RabbitMQ Message Queueing

In order to communicate between the services we use message queues. This allows us to organise and scale workloads, while having each component have only a single responsibility.

Queue Message Format

user message:

field	data type	description
`id`	string	Message identification number as the originating platform knows it.
`timestamp`	ISO8601 UTC	Message timestamp as the originating platform knows it.
`platform`	enum: `SLACK`/`WHATSAPP`/`SMS`/`VOICE`	Originating platform. Intended to be able to parse the platform-specific fields.
`from`	platform-specific address	Enough information for the originating platform to be able to route a reply to this message to where the user expects it.
`text`	UTF-8 string	The text as provided by the user.

from field (where platform equals SLACK or WHATSAPP):

platform Slack format...

platform WhatsApp format... E.164 numbers

`.env`	default	description
`RABBITMQ_USER`	rabbit	The user name for RabbitMQ.
`RABBITMQ_PASS`	CHANGEME	The default password for accessing queues. Use a generated string.
`RABBITMQ_QUEUE_USER_CHATS`	user_chats	The queue for chat messages that the user typed.
`RABBITMQ_QUEUE_USER_CHAT_REPLIES`	user_chat_replies	The queue for chat messages that the assisant got from the LLM.
`RABBITMQ_EXCHANGE_USER_CHATS`	user_chats_exchange	The topic exchange that routes messages to queues.
`RABBITMQ_HOST`	rabbitmq	The host that RabbitMQ runs on.
`RABBITMQ_PORT`	5672	The AMQP port of RabbitMQ.
`RABBITMQ_MANAGEMENT_PORT`	15672	The HTTP port for the management web UI of RabbitMQ.

Backend (Fast API)

The backend of this project is built using FastAPI, a modern and high-performance web framework for building APIs with Python 3.12.3. The backend communicates with a PostgreSQL database to manage and store application data. The PostgreSQL database is initialized with predefined scripts located in the ./postgres/docker-entrypoint-initdb.d directory, ensuring that the database schema and initial data are set up automatically. Additionally, a PgAdmin4 service is provided to offer a user-friendly interface for managing the PostgreSQL database. PgAdmin4 is configured to run on port 5050 and can be accessed using the default credentials specified in the environment variables.

`.env`	default	description
`BACKEND_PORT`	5000	The external port used by the Backend
`JWT_SECRET`	CHANGEME	JWT-based auth secret key, used in the process of signing a token
`WEBDOMAIN`	"http://localhost"	The base URL of the web application
`MAGIC_LINK_CHAT_TEMPLATE`	CHANGEME	A template for magic link message, e.g. "You can login into Agriconnect by clicking this link: {magic_link}"
`GOOGLE_APPLICATION_CREDENTIALS_PATH`	CHANGEME	Path to the service account JSON key file location used for authentication and accessing Google Cloud services (Development only)
`GOOGLE_APPLICATION_CREDENTIALS`	CHANGEME	JSON key file name used for authentication and accessing Google Cloud services
`BUCKET_NAME`	CHANGEME	Bucket name for a storage object (offered by Google Cloud)
`TESTING`	`None`	An environment variable used for testing purposes when running backend tests. This variable is automatically set to `1` by `conftest` to mock or skip certain steps related to third-party services. Please note that `TESTING` should not be included in the Docker Compose environment.
`INITIAL_CHAT_TEMPLATE`	CHANGEME	A template for initial chat message, e.g. "Hi {farmer_name}, I'm {officer_name} the extension officer. Welcome to Agriconnect, send us a message here to start chatting." The template should contains `{farmer_name}` and `{officer_name}`
`LAST_MESSAGES_LIMIT`	10	The maximum number of last messages to resend to a user in a chat session.
`ASSISTANT_LAST_MESSAGES_LIMIT`	10	The maximum number of previous chat messages to retrieve and feed into the assistant for generating suggestions.
`NEXT_PUBLIC_VAPID_PUBLIC_KEY`	CHANGEME	The public key for web push notification generated by `web-push`
`NEXT_PUBLIC_VAPID_PRIVATE_KEY`	CHANGEME	The private key for web push notification generated by `web-push`

Chat Session Seeder

Before using the application, you can seed the database with user, client, and chat session data using the chat_session seeder. Follow the instructions below to set up and run the seeder.

Google Sheet Template

Prepare a Google Sheet with the following columns (or you can use this template):

client_phone_number: Phone number of the client (including the + sign).
client_name: Name of the client (can be empty).
linked_to_user_phone_number: Phone number of the user linked to the client (including the + sign).
user_name: Name of the user (can be empty).

Ensure that the Google Sheet is publicly accessible.

Running the Seeder

Save your data in the prepared Google Sheet template.
From the backend directory, run the following command:
```
python -m seeder.chat_session
```
The script will prompt you for the Google Sheet ID, which can be found in the URL of the Google Sheet. Enter the ID and press Enter.
The seeder will process the data and populate your database with the user, client, and chat session information.

Twilio Channel

In the backend, we handle Twilio's send and receive messages through a service called TwilioClient. Currently, we only support WhatsApp text messages.

When started, TwilioClient listens to incoming messages from Twilio using a webhook. TwilioClient will use the frontend port proxy to point to the Twilio callback URL. In Twilio, configure the sandbox webhook URL to be the external URL for your TwilioClient routes.

The TwilioClient connects to the message queue to interact with the rest of the system, notably the assistant. Incoming messages are forwarded to the RABBITMQ_QUEUE_USER_CHATS queue and replies coming from the RABBITMQ_QUEUE_USER_CHAT_REPLIES queue are posted back to the user via Twilio.

`.env`	default	description
`TWILIO_ACCOUNT_SID`	CHANGEME	The Account SID for your Twilio account.
`TWILIO_AUTH_TOKEN`	CHANGEME	Your Twilio authorization token.
`TWILIO_WHATSAPP_NUMBER`	CHANGEME	The Twilio WhatsApp number from your Twilio account in international format.
`VERIFICATION_TEMPLATE_ID_en`	NULL	The Twilio message template ID for the verification message in English. This template should contain two content variables: `{"1": extension_officer_name, "2": verification_link}`. Leave blank for local development.
`VERIFICATION_TEMPLATE_ID_sw`	NULL	The Twilio message template ID for the verification message in Swahili. This template should contain two content variables: `{"1": extension_officer_name, "2": verification_link}`. Leave blank for local development.
`VERIFICATION_TEMPLATE_ID_fr`	NULL	The Twilio message template ID for the verification message in French. This template should contain two content variables: `{"1": extension_officer_name, "2": verification_link}`. Leave blank for local development.
`BROADCAST_TEMPLATE_ID_en`	NULL	The Twilio message template ID for the broadcast message in English. This template should contain two content variables: `{"1": farmer_name, "2": broadcast_message_without_new_line}`. Leave blank for local development.
`BROADCAST_TEMPLATE_ID_sw`	NULL	The Twilio message template ID for the broadcast message in Swahili. This template should contain two content variables: `{"1": farmer_name, "2": broadcast_message_without_new_line}`. Leave blank for local development.
`BROADCAST_TEMPLATE_ID_fr`	NULL	The Twilio message template ID for the broadcast message in French. This template should contain two content variables: `{"1": farmer_name, "2": broadcast_message_without_new_line}`. Leave blank for local development.
`INTRO_TEMPLATE_ID_en`	NULL	The Twilio message template ID for the introduction message in English. This template should contain two content variables: `{"1": farmer_name, "2": extension_officer_name}`. Leave blank for local development.
`INTRO_TEMPLATE_ID_sw`	NULL	The Twilio message template ID for the introduction message in Swahili. This template should contain two content variables: `{"1": farmer_name, "2": extension_officer_name}`. Leave blank for local development.
`INTRO_TEMPLATE_ID_fr`	NULL	The Twilio message template ID for the introduction message in French. This template should contain two content variables: `{"1": farmer_name, "2": extension_officer_name}`. Leave blank for local development.
`CONVERSATION_RECONNECT_TEMPLATE_en`	NULL	The Twilio message template ID for the conversation reconnect message in English. This is used when an officer sends a message to a farmer beyond the 24-hour window. The template should contain one content variable: `{"1": farmer_name}`. Leave blank for local development.
`CONVERSATION_RECONNECT_TEMPLATE_sw`	NULL	The Twilio message template ID for the conversation reconnect message in Swahili. This is used when an officer sends a message to a farmer beyond the 24-hour window. The template should contain one content variable: `{"1": farmer_name}`. Leave blank for local development.
`CONVERSATION_RECONNECT_TEMPLATE_fr`	NULL	The Twilio message template ID for the conversation reconnect message in French. This is used when an officer sends a message to a farmer beyond the 24-hour window. The template should contain one content variable: `{"1": farmer_name}`. Leave blank for local development.

Twilio Message Template

By default, when the app starts, a command is executed to fetch the Twilio message templates. This command generates a JSON file located in the ./sources folder. The purpose of this file is to minimize Twilio API calls when saving message templates into our database as part of the chat history.

One important consideration is that if you update the message template in the Twilio Console, you must also update the Message Template ID environment variable to match the new template ID in the Twilio Console > Content Template Builder. After updating the environment variable, run the following command inside the backend container to refresh the JSON file:

python -m command.get_twilio_message_template

Slack Channel

Slack is one of the messaging platforms that can be used to chat with Rag Doll. Most of the Slack interface code was taken from Getting started with Bolt for Python.

The Slack client in the backend listens to incoming messages using a web hook, which is handled nicely by the Bolt framework.

`.env`	default	description
`SLACK_BOT_TOKEN`	CHANGEME	The token for your Slack bot.
`SLACK_SIGNING_SECRET`	CHANGEME	The signing secret for your Slack bot.

When installing the the backend as Slack App, you can use backend/slackbot-app-manifest.yml as a template. Before using it, change the following values:

`backend/slackbot-app-manifest.yml`	default	description
`description`	CHANGEME	A brief description of the purpose of the bot.
`background_color`	CHANGEME	The 6-digit hex colour code for the Slack bot background.
`display_name`	CHANGEME	The display name of the Slack bot. This is wat people in your workspace will see.
`request_url`	`http://`CHANGEME`/slack/events`	The external URL that Slack's servers will use to call the Slack bot component. Replace CHANGEME with the external IP address you reserved for your Google Cloud VM running the components.

You will also want to upload a nice avatar image to go with your bot.

Frontend (Next JS)

The frontend of this project is developed using React with Next.js. In the development environment, the frontend and backend services are configured to facilitate efficient and streamlined development. The frontend, built with React and Next.js, communicates with the backend API using a proxy setup defined in the next.config.js file. This configuration rewrites requests matching the pattern /api/:path* to be forwarded to the backend service at http://backend:5000/api/:path*. This proxy setup simplifies the API call structure during development, allowing developers to interact with the backend as if it were part of the same application.

`.env`	default	description
`FRONTEND_PORT`	3001	The external port used by the Frontend
`NEXT_PUBLIC_VAPID_PUBLIC_KEY`	CHANGEME	The public key for web push notification generated by `web-push`
`NEXT_PUBLIC_VAPID_PRIVATE_KEY`	CHANGEME	The private key for web push notification generated by `web-push`

Frontend: http://localhost:${FRONTEND_PORT} API Docs: http://localhost:${FRONTEND_PORT}/api/docs#/

/** @type {import('next').NextConfig} */
const nextConfig = {
  async rewrites() {
    return [
      {
        source: "/api/:path*",
        destination: "http://backend:5000/api/:path*", // Proxy to Backend
      },
    ];
  },
};

export default nextConfig;

In the production environment, the interaction between the frontend and backend is handled differently to optimize performance and security. Instead of using the proxy setup defined in the development configuration, the frontend and backend services communicate through an Nginx server. The Nginx configuration, located in the frontend folder, acts as a reverse proxy, efficiently routing requests from the frontend to the backend.

PostgreSQL

This project uses PostgreSQL as the backend database.

`.env`	default	description
`POSTGRES_PORT`	5432	The external port used by the Database
`POSTGRES_PASS`	CHANGEME	The default password for accessing Database
`PGADMIN_PORT`	5050	The external port used by pgadmin page

Google Cloud Deployment

This chapter gives a list of items that you should consider as you deploy the code from this repository. The description assumes you will be deploying to Google Cloud, so if you deploy on a different cloud provider you may see things that are different.

Disk Space

Cached Docker files and images consume a lot of disk space. The stock 10GB disks won't be large enough for Rag Doll, so you probably want to allocate 100GB instead. Depending on how you like to organise disks you can get extra attached storage or just start with larger root disks.

Reserved IP Address

Reserve a static IP address for the webhook calls from Twilio and Slack.

Rag Doll using Docker Compose on a Virtual Machine

Getting Rag Doll running is a two-step process: first set up your .env file. The repository contains a template that you can use. It has reasonably sane defaults for most variables. All that you have to do is add keys and passwords and you should be good to go.

Copy env.template to .env and edit that file with your favourite editor. In the template, search for CHANGEME and replace that placeholder with your own key or generated password. Please do not reuse passwords from other places, but us a password generator. You won't have to type them, so making them strong is just as much and as little work as making them weak.

$ cp env.template .env
$ vi .env

All variables are documented in the component documentation sections above. With .env set up, all but one component of Rag Doll can be started with the following command:

$ docker compose up

Docker Logger Crash Prevention

It is well know that Docker eats disk space relentlessly. One particular problem is that the default logger format for Docker, json-file, does not support log rotation. Instead, switch Docker over to using the local logging driver. That does support log rotation. See Configure logging drivers for instructions.

$ docker info --format '{{.LoggingDriver}}'
json-file
$ sudo vi /etc/docker/daemon.json
{
    "log-driver": "local"
}
$ sudo systemctl restart docker
$ docker info --format '{{.LoggingDriver}}'
local

Name		Name	Last commit message	Last commit date
Latest commit History 161 Commits
.github/workflows		.github/workflows
assistant		assistant
backend		backend
ci		ci
eppo-librarian		eppo-librarian
frontend		frontend
nginx		nginx
packages		packages
postgres		postgres
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
docker-compose-rag-doll-only.yml		docker-compose-rag-doll-only.yml
docker-compose.test.yml		docker-compose.test.yml
docker-compose.yml		docker-compose.yml
env.template		env.template
pylsp.toml		pylsp.toml
readme.md		readme.md
setup-pre-commit.sh		setup-pre-commit.sh
setup.cfg		setup.cfg

`.env`	default	description
`CHROMADB_HOST`	chromadb	The hostname of the vector database container.
`CHROMADB_HOST`	8000	The port that the vector database container listens on.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RAG Doll

Assistant

EPPO Librarian

ChromaDB Vector Database

RabbitMQ Message Queueing

Queue Message Format

Backend (Fast API)

Chat Session Seeder

Google Sheet Template

Running the Seeder

Twilio Channel

Twilio Message Template

Slack Channel

Frontend (Next JS)

PostgreSQL

Google Cloud Deployment

Disk Space

Reserved IP Address

Rag Doll using Docker Compose on a Virtual Machine

Docker Logger Crash Prevention

About

Releases

Packages

Contributors 4

Languages

License

akvo/rag-doll

Folders and files

Latest commit

History

Repository files navigation

RAG Doll

Assistant

EPPO Librarian

ChromaDB Vector Database

RabbitMQ Message Queueing

Queue Message Format

Backend (Fast API)

Chat Session Seeder

Google Sheet Template

Running the Seeder

Twilio Channel

Twilio Message Template

Slack Channel

Frontend (Next JS)

PostgreSQL

Google Cloud Deployment

Disk Space

Reserved IP Address

Rag Doll using Docker Compose on a Virtual Machine

Docker Logger Crash Prevention

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages