
RESTful-LLaMa-3-8B-app

A simple RESTful service for the Meta-Llama-3-8B-Instruct language model.

Prerequisites

  1. A CUDA-enabled GPU machine with at least 24 GB of GPU RAM
  2. Access to the LLaMa-3 weights on Hugging Face

Getting Started

  1. Install Docker on the machine: https://docs.docker.com/engine/install/ubuntu/
  2. Check your CUDA and NVIDIA driver versions (important for choosing the base Docker image):
    1. Run nvcc --version in your terminal to check the CUDA version.
    2. Run nvidia-smi in your terminal to check the driver version.
  3. Install the NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
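
To verify that Docker can see the GPU after installing the toolkit (step 3), you can run a throwaway CUDA container and check that nvidia-smi lists your GPU. The image tag below is only an example; pick one that matches your CUDA version:

# example tag; substitute one that matches your nvcc/nvidia-smi output
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi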

Installation

  1. Clone this repo to your GPU machine.
  2. Adapt the base CUDA image in the Dockerfile to match the CUDA installation on your machine (see the example after this list).
  3. Clone the LLaMa-3 weights from Hugging Face into /your/path/to/data/models/ if you want to store the weights locally; your HF token with write permissions is needed for this step (see the download example after this list).
  4. (Optional) Change the number of workers in start_app.sh if you want multiple workers handling simultaneous requests. Keep in mind that each worker loads its own copy of the model into memory, so you need roughly 20 GB of GPU RAM per worker (see the sketch after this list).
  5. In the app/ folder, run docker build -t restful-llama-3 . to build the Docker image.
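
For step 2, the FROM line in the Dockerfile should match the CUDA version reported by nvcc and nvidia-smi. For instance, on a machine with CUDA 12.2 it might look like this (tag shown purely as an illustration):

# illustrative only; pick the tag that matches your local CUDA version
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04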
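
For step 3, one way to fetch the weights is the Hugging Face CLI, after logging in with your HF token. The model ID is the official meta-llama repo; the target path is the same placeholder used elsewhere in this README:

# log in with your HF token, then download into the local models folder
huggingface-cli login
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir /your/path/to/data/models/Meta-Llama-3-8B-Instruct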
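
For step 4, the contents of start_app.sh are not reproduced here; as a hypothetical sketch, a gunicorn launch line with a configurable worker count could look like the following (the app:app module path and flags are assumptions, not the actual script):

# hypothetical sketch; the real entrypoint and flags live in start_app.sh
gunicorn --bind 0.0.0.0:5000 --workers 1 --timeout 300 app:app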

How to start

Run the following command to start the Docker container. Configure the run options as desired. It takes a couple of minutes for the container to start and load the model.

docker run --gpus all -d -it -p 5000:5000 -v /your/path/to/data:/restful-llama-3/data -e GRANT_SUDO=yes --user root --restart always --name restful-llama-3 restful-llama-3
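
While the model loads, you can follow the container logs to see when the service is ready:

docker logs -f restful-llama-3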

How to use

If the container is running without problems, you should see a welcome message generated by the model at http://localhost:5000/home.
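
A quick way to check this from the command line:

curl http://localhost:5000/home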

To interact with the model, send POST requests to http://localhost:5000/chat.

Here is an example with curl:

curl -X POST http://localhost:5000/chat -H 'Content-Type: application/json' -d '{"messages":[{"role":"system","content":"You are a helpful assistant called Llama-3. Write out your answer short and succinct!"}, {"role":"user", "content":"What is the capital of Germany?"}], "temperature": 0.6, "top_p":0.75, "max_new_tokens":256}'

A simpler example, omitting the optional generation parameters:

curl -X POST http://localhost:5000/chat -H 'Content-Type: application/json' -d '{"messages":[{"role":"user", "content":"Write a short essay about Istanbul."}]}'
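
Equivalently, from Python, here is a minimal sketch using the requests library, mirroring the first curl call above. The service is assumed to return JSON; its exact shape is not documented here, so the example just prints whatever comes back:

import requests

# Example chat request mirroring the curl call above.
payload = {
    "messages": [
        {"role": "user", "content": "What is the capital of Germany?"},
    ],
    "temperature": 0.6,
    "top_p": 0.75,
    "max_new_tokens": 256,
}

response = requests.post("http://localhost:5000/chat", json=payload, timeout=300)
response.raise_for_status()
print(response.json())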
