Bouzyges (pronounced boo-zee-jes) is a Python program for interactively generating semantic graphs of medical terms using SNOMED CT attribute-value pairs. The script can be interfaced with an LLM to generate graphs in an automated fashion. The end result is a set of SNOMED CT concepts that serve as the closest possible strict supertypes and together fully capture the meaning of the input term.
In its current form, Bouzyges serves as a proof of concept of a novel approach to automating ontology mapping and standardization. In the future, possible applications include:
- Mapping of medical terms to SNOMED CT concepts
- SNOMED CT authoring support
- Automated SNOMED CT quality assurance
- Automated creation of custom local Standard concepts in OMOP CDM.
Bouzyges requires Python 3.12 or later. To install the script, clone the repository, initialize a virtual environment and install the required packages:
git clone https://github.com/OHDSI/Bouzyges.git
cd Bouzyges
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
The current implementation of Bouzyges relies on the Snowstorm REST API to interface with SNOMED CT. To use the API, you need to provide the endpoint and the API key either as environment variables or inside a .env file in the root directory of the project (see below).
Snowstorm version 10 with SNOMED International (July 2024 release) was tested. We recommend using the Docker image provided by SNOMED International to run Snowstorm locally and loading the SNOMED RF2 release archive via Swagger UI.
- Snowstorm GitHub repository
- Using Snowstorm with Docker
- SNOMED International release in RF2 format (hosted by NLM)
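To check that a local Snowstorm instance is reachable before running Bouzyges, a single concept lookup is usually enough. The following is a minimal sketch, not code from Bouzyges itself: the function names are hypothetical, the branch name MAIN is an assumption about how the release was loaded, and the request path follows the concept lookup route described in the Snowstorm REST API documentation.

```python
import json
import os
from urllib.request import Request, urlopen


def concept_url(endpoint: str, concept_id: str, branch: str = "MAIN") -> str:
    """Build the URL of a single-concept lookup (hypothetical helper)."""
    # Tolerate endpoints given with or without a trailing slash
    return f"{endpoint.rstrip('/')}/{branch}/concepts/{concept_id}"


def fetch_concept(concept_id: str, branch: str = "MAIN") -> dict:
    """Fetch one concept as JSON from the server named in SNOWSTORM_ENDPOINT."""
    endpoint = os.environ["SNOWSTORM_ENDPOINT"]
    request = Request(
        concept_url(endpoint, concept_id, branch),
        headers={"Accept": "application/json"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)


# Example (requires a running server): 138875005 is the SNOMED CT root concept.
# print(fetch_concept("138875005"))
```

If the lookup fails with a connection error, the endpoint value or the Docker port mapping is the first thing to check.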
Bouzyges works by generating LLM prompts and parsing the responses; currently, three options are supported:
- Manual input: the user is shown each LLM prompt and is expected to provide the response manually. This can be used to debug the script or to test different LLMs interactively. To use this, set the PROMPTER_OPTION constant to "human" in the body of the script. A better configuration interface is coming soon.
- OpenAI: to use this API, ensure that a valid OPENAI_API_KEY is set either as an environment variable or (recommended) in the .env file (see below). To use this, set PROMPTER_OPTION to "openai".
- Azure: the Azure OpenAI API is also supported. To use this API, provide the API information either by explicitly setting environment variables or (preferred) inside the .env file. PROMPTER_OPTION should be set to "azure".
It is possible to implement additional API interfaces (e.g. for locally available models) by inheriting from the PromptFormat class to generate prompts in the correct format and from the Prompter class to provide an interface for sending prompts to the LLM.
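The extension pattern can be sketched as follows. This is an illustration only: the real PromptFormat and Prompter base classes are defined in bouzyges.py and their methods may differ, so the stand-in base classes and every method name below (wrap, send, ask) are assumptions, as are the PlainTextFormat and LocalModelPrompter subclasses.

```python
from abc import ABC, abstractmethod
from collections.abc import Callable


# Stand-ins for the base classes in bouzyges.py; method names are hypothetical.
class PromptFormat(ABC):
    @abstractmethod
    def wrap(self, question: str, options: list[str]) -> str:
        """Turn a question and its allowed answers into prompt text."""


class Prompter(ABC):
    def __init__(self, prompt_format: PromptFormat) -> None:
        self.prompt_format = prompt_format

    @abstractmethod
    def send(self, prompt: str) -> str:
        """Deliver the prompt to a model and return its raw reply."""

    def ask(self, question: str, options: list[str]) -> str:
        # Format the prompt, send it, and accept only one of the listed options
        reply = self.send(self.prompt_format.wrap(question, options)).strip()
        if reply not in options:
            raise ValueError(f"Unexpected reply: {reply!r}")
        return reply


class PlainTextFormat(PromptFormat):
    def wrap(self, question: str, options: list[str]) -> str:
        lines = [question, "Answer with exactly one of:"]
        lines += [f"- {option}" for option in options]
        return "\n".join(lines)


class LocalModelPrompter(Prompter):
    """Delegates to any callable that maps prompt text to reply text."""

    def __init__(self, prompt_format: PromptFormat,
                 generate: Callable[[str], str]) -> None:
        super().__init__(prompt_format)
        self.generate = generate

    def send(self, prompt: str) -> str:
        return self.generate(prompt)
```

Passing the model as a callable keeps the subclass independent of any particular local inference library; a test can supply a lambda, and a real deployment can supply a function that calls the local model.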
To avoid accidental exposure of API keys, we strongly recommend using a .env file to manage environment variables. Bouzyges will try to automatically load the .env file in the working directory using the python-dotenv library.
Example content of the file:
# Snowstorm endpoint is always required
# An example for the default local/Docker installation is given
export SNOWSTORM_ENDPOINT="https://localhost:8080/"
# OpenAI requirements
# Project API key created at https://platform.openai.com/api-keys
export OPENAI_API_KEY="sk-abc...def"
# Azure OpenAI interface requirements
# Attainable at your organization's infrastructure team
export AZURE_OPENAI_API_KEY="123abcd...789"
export AZURE_OPENAI_API_VERSION="2024-06-01" # Most recent version
export AZURE_OPENAI_ENDPOINT="https://example.openai.azure.com/"
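A quick way to catch a missing variable before any API call is made is to check the environment against the chosen prompter option. The sketch below is an assumption about which variables each option needs, derived only from the example file above; the REQUIRED table and the missing_variables helper are hypothetical and not part of Bouzyges.

```python
import os
from collections.abc import Mapping

# Variables assumed necessary for each PROMPTER_OPTION value,
# based on the example .env file; adjust if the script expects others.
REQUIRED = {
    "human": ["SNOWSTORM_ENDPOINT"],
    "openai": ["SNOWSTORM_ENDPOINT", "OPENAI_API_KEY"],
    "azure": [
        "SNOWSTORM_ENDPOINT",
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_API_VERSION",
        "AZURE_OPENAI_ENDPOINT",
    ],
}


def missing_variables(option: str,
                      env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names that are unset or empty for the given option."""
    return [name for name in REQUIRED[option] if not env.get(name)]


# In a real run, python-dotenv would populate os.environ first:
#     from dotenv import load_dotenv
#     load_dotenv()
#     print(missing_variables("openai"))
```

An empty list means every variable assumed for that option is present.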
Bouzyges caches all calls to LLM APIs in an SQLite database, prompt_cache.db. Prompts to the same model with the same API options are reused across runs. The database file can be read and analyzed by any tool that supports SQLite. The schema DDL is stored in the init_prompt_cache.sql file.
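Because the cache is an ordinary SQLite file, it can be inspected with the Python standard library. The sketch below makes no assumption about the schema (see init_prompt_cache.sql for the actual tables); it only lists whatever tables exist and counts their rows. The table_row_counts helper is hypothetical.

```python
import sqlite3


def table_row_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Map each user table in the database to its number of rows."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    counts = {}
    for (name,) in rows:
        if name.startswith("sqlite_"):
            continue  # skip SQLite's internal bookkeeping tables
        # Quote the identifier so unusual table names cannot break the query
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(
            f"SELECT COUNT(*) FROM {quoted}"
        ).fetchone()[0]
    return counts


# Example: open the cache read-only so an inspection can never modify it.
#     conn = sqlite3.connect("file:prompt_cache.db?mode=ro", uri=True)
#     print(table_row_counts(conn))
```

Opening the file with mode=ro also avoids creating an empty database by accident when the path is mistyped.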
Warning
Bouzyges is at an early development stage and is not yet ready for production use. The script makes many API calls and may consume a LOT of tokens. Currently, processing one concept consumes on the order of 150,000 tokens (about 3 cents with gpt-4o-mini).
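The cost figure can be sanity-checked with simple arithmetic. In the sketch below, the per-million-token prices (0.15 USD for input, 0.60 USD for output) are the gpt-4o-mini list prices assumed at the time of writing and may have changed; the 140,000 / 10,000 split between input and output tokens is purely illustrative, since only the 150,000 total is stated above.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate the API cost in USD from token counts and per-million prices."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Illustrative split of ~150,000 tokens for one concept (assumed, not measured)
cost = estimate_cost_usd(140_000, 10_000, 0.15, 0.60)
print(f"Estimated cost per concept: ${cost:.3f}")
```

Under these assumptions the estimate comes to a little under 3 cents per concept, consistent with the figure quoted above; a batch of 1,000 concepts would then cost on the order of 30 USD.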
Currently, only the example usage inside the script is supported; a batch loading interface is planned for the near future. To run the script, execute the following command:
$ python bouzyges.py
The code is not yet licensed and is provided as-is, for educational purposes only; it is not intended for production use. Please refrain from distributing the code or using it in any commercial or production environment.
- Batch processing interface
- Reproducible run instructions
- Licensing and release preparation
- RAG support with SNOMED authoring documentation
- SNOMED CT API optimization
- OpenAI token consumption profiling
- OpenAI API token consumption optimization
- Automated LLM interface
- SNOMED CT API interface
- SNOMED CT hierarchy traversal