Dashbirth - Data Shack 2024
Team Members Yuqin (Bailey) Bai, Luca Bottani, Sara Merengo, and Li Yao
Project Objectives This project proposes to utilize advanced language models to analyze postpartum surveys completed by thousands of mothers. The aim is to derive actionable insights that can further enhance the TeamBirth program, particularly focusing on reducing maternal mortality rates and addressing mistreatment and disparities in maternal health.
- To deploy language models in summarizing and analyzing postpartum surveys from mothers who have participated in the TeamBirth program.
- To identify areas where TeamBirth can be refined to further reduce maternal mortality and mistreatment.
- To develop targeted interventions based on the insights gleaned from the survey analysis.
Following the Ethical Principles & Guidelines for Research Involving Human Subjects, maintaining the confidentiality of patient data is paramount. In our efforts to utilize advanced language models for analyzing surveys, we are committed to ensuring that no patient-related data is exposed to external APIs, including ChatGPT. We understand the risks associated with data privacy and have implemented robust solutions to safeguard information.
To address different data storage and usage needs while ensuring data privacy, we offer the following solutions:
For a straightforward setup, we provide the option to locally host the dashboard. This setup includes all modules except for the open feedback and chatbot features.
It is highly suggested to store any patient-related data in a dedicated folder outside the Git repository. This practice prevents the accidental upload of confidential data to version control systems where it could be exposed to unauthorized access.
Setup Instructions
- Ensure you have Docker installed on your system.
- Have your data stored securely as suggested, outside of your project’s Git repository.
- Follow the instructions in our Frontend Simple Container Setup to deploy locally.
For enhanced security and control, we recommend deploying our solution using the Google Cloud Platform (GCP). This approach utilizes Google Cloud's robust security measures, including isolated storage and controlled access.
Why Choose Google Cloud Storage (GCS) and GCP Virtual Machine (VM)?
- GCS buckets offer advanced security features that make it a preferable choice over options like Google Drive. GCS provides fine-grained access controls and is designed for high durability and availability.
- GCP VM instances provide a secure environment for hosting our custom language models, eliminating the risks associated with using external APIs.
An interactive dashboard website that TeamBirth can use in order to visualize the data and share it to the hospitals.
- Have Docker installed
-
Store & config your data files in
</absolute/path/to/your/data>
following the instructions -
Build & run container by running the commands with the absolute path to your data folder
cd src/frontend_simple docker build -t frontend_simple -f Dockerfile . docker run -d -p 5001:5000 -v </absolute/path/to/your/data>:/data frontend_simple
-
Go to page
http://localhost:5001/
GCS Buckets
├── data_ma
├── data_nj
├── data_ok
└── data_wa
Create a local src/secrets
folder because we do not include any secure information in Git. Rename and add the GCP service account private key, data-server-account.json
, to this folder.
The Vector DB container is designed to handle vector embeddings for different states, processing data from designated GCS buckets.
generate_embeddings.py
- This script contains the primary logic for generating vector embeddings from data files.
- --name (-n): Flag specifies the name of the GCS bucket from which the script should retrieve data. Default is set to WA.
- --clean_up (-c): Flag controls whether the script should clean up (delete) the local metadata directories after processing is complete.
-
Navigate to Vector DB directory
cd src/vec_db
-
Config the
Dockerfile
for state selection: Modify the last CMD line to pass the --name flag. For example, replaceCMD ["data_wa"]
withCMD ["data_nj"]
for NJ state, orCMD ["data_nj", "-c", "False"]
to preserve metadata. -
Build & run container
./docker-shell.sh
-
Change
-
Build & run container
docker-compose build
This container is configured to serve a Large Language Model (LLM), specifically Llama2-7B-chat, on a Google Cloud Platform (GCP) Virtual Machine (VM). We opted to deploy the LLM in a cloud environment to address data privacy concerns and because hosting it locally is challenging. The model requires 15GB of memory and a T4 GPU to function optimally. Below, you will find instructions on how to set up a similar server on your own.
-
Build Docker Image
Navigate to LLM directory in your terminal and run the following command to build the Docker image:
cd src/llm_server docker build -t teambirth-llm-server .
-
Tag the Docker Image
After the build is complete, tag your Docker image with the Google Cloud Registry path:
docker tag teambirth-llm-server gcr.io/<project_id>/teambirth-llm-server
Replace
<project_id>
with your Google Cloud project ID. -
Authenticate Docker to GCR
Authenticate Docker to the Google Cloud Registry using the following command:
gcloud auth configure-docker
-
Push Docker Image to GCR
Push the Docker image to the Google Cloud Registry:
docker push gcr.io/<project_id>/teambirth-llm-server
-
Create a Google Cloud VM
Navigate to the Google Cloud Console and create a new Virtual Machine instance. Make sure to select the appropriate region, machine type, and other settings according to your requirements.
e.g. Region: us-east1, Zone: us-east1-c, GPU type: NVIDIA T4, Machine Type: n1-standard-4 (4 vCPU, 2 core, 15 GB memory), VM provisioning model: Spot (Save you some money but loss some stability)
-
Install Docker on the VM
SSH into your VM and install Docker:
sudo apt-get update sudo apt-get install docker.io
-
Pull Docker Image from GCR
Pull the Docker image from the Google Cloud Registry to your VM:
docker pull gcr.io/<project_id>/teambirth-llm-server
Now, you need to add other packages and drivers to communicate with the GPUs.
-
Install GPU drivers on VM
Follow the intstruction in the Google Cloud website: https://cloud.google.com/compute/docs/gpus/install-drivers-gpu to install GPU drivers on the VM. You cannot use your GPUs without the driver. Verify the GPU driver install by this command:
sudo nvidia-smi
-
Download the NVIDIA CUDA Toolkit
Follow the instruction in the NVIDIA website to download the NVIDIA CUDA Toolkit (section 2.7) https://docs.nvidia.com/cuda/cuda-installation-guide-linux/
-
Run Docker Container on VM
Finally, run the Docker container on your VM:
docker run -d -p <host_port>:<container_port> gcr.io/<project_id>/teambirth-llm-server
Replace <host_port> and <container_port> with the ports you want to expose on the host and container, respectively.
Make sure we do not have any running containers and clear up an unused images.
- Run
docker container ls
- Stop any container that is running
- Run
docker system prune
- Run
docker image ls