A simple RESTful service for the Meta-Llama-3-8B-Instruct language model.
- A CUDA-enabled GPU machine with at least 24 GB of GPU RAM
- Access to the Llama-3 weights on Hugging Face
- Install Docker on the machine: https://docs.docker.com/engine/install/ubuntu/
- Check the CUDA and NVIDIA driver versions (important for choosing the base Docker image).
  - Run this in your terminal to check the CUDA version:
    nvcc --version
  - Run this in your terminal to check the driver version:
    nvidia-smi
- Install the NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
- Clone this repo to your GPU machine.
- Adapt the base CUDA image in the `Dockerfile` based on the installation on your own machine (see the example `FROM` line after this list).
- Clone the Llama-3 weights from Hugging Face into `/your/path/to/data/models/` if you want to store the weights locally. An HF token (read access is sufficient) is needed in this step (see the clone example after this list).
- (Optional) Change the number of workers in `start_app.sh` if you want to enable multiple workers to handle simultaneous requests. Keep in mind that each worker loads the model into its own memory, so you need approximately 20 GB * number of workers of GPU RAM available (e.g., two workers require roughly 40 GB). See the worker sketch after this list.
- In the `app/` folder, run `docker build -t restful-llama-3 .` to build the Docker image.
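For example, if `nvidia-smi` reports CUDA 12.1, the first line of the `Dockerfile` might point at a matching NVIDIA base image. The tag below is only an illustration, not necessarily the one this repo ships with; pick one that matches your own CUDA version:

FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04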
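One way to fetch the weights locally is a Git LFS clone of the official `meta-llama/Meta-Llama-3-8B-Instruct` repository; git will prompt for your Hugging Face username and token:

cd /your/path/to/data/models/
git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct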
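The exact worker setting depends on the server command inside `start_app.sh`. If, for instance, the app were served with gunicorn (an assumption; check the script for the actual command and module path), the worker count would be its `--workers` flag:

# Hypothetical sketch: 2 workers need ~40 GB of GPU RAM; "app.main:app" is a placeholder module path
gunicorn --workers 2 --bind 0.0.0.0:5000 app.main:app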
Run the following command to start the Docker container. Configure the run options as desired. It takes a couple of minutes for the container to start and load the model.
docker run --gpus all -d -it -p 5000:5000 -v /your/path/to/data:/restful-llama-3/data -e GRANT_SUDO=yes --user root --restart always --name restful-llama-3 restful-llama-3
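While the model loads, you can follow the container output to see when the service is ready:

docker logs -f restful-llama-3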
If the container starts without problems, you should see a welcome message generated by the model at http://localhost:5000/home.
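You can also check this from the terminal, assuming the page answers plain GET requests:

curl http://localhost:5000/home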
To interact with the model, send POST requests to http://localhost:5000/chat.
Here is an example with curl:
curl -X POST http://localhost:5000/chat -H 'Content-Type: application/json' -d '{"messages":[{"role":"system","content":"You are a helpful assistant called Llama-3. Write out your answer short and succinct!"}, {"role":"user", "content":"What is the capital of Germany?"}], "temperature": 0.6, "top_p":0.75, "max_new_tokens":256}'
A simpler example that relies on the default generation parameters:
curl -X POST http://localhost:5000/chat -H 'Content-Type: application/json' -d '{"messages":[{"role":"user", "content":"Write a short essay about Istanbul."}]}'