Examples for serving LLaMA 2 on Cloud TPUs with Ray Serve.
Before running, make sure you have set up your Ray Serve cluster:
ray up -y cluster/serve.yaml
This sample is based on https://github.com/facebookresearch/llama/tree/llama_v2, with a few Google Cloud/XLA improvements.
By default, this code will NOT load a checkpoint. Please ensure that you request access to the checkpoint (and accept the Meta AI license).
Once done, you can upload the checkpoint to a GCS bucket and set that as your checkpoint path within llama_serve.py. This should help simplify the setup.
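For reference, here is a minimal sketch of how that might look inside llama_serve.py; the variable names and bucket path below are assumptions, so check the actual script for the real ones:

```python
# Hypothetical configuration block for llama_serve.py.
# The names below are illustrative and may not match the script exactly.
CKPT_DIR = "gs://your-bucket/llama-2-7b/"            # GCS path to the converted weights
TOKENIZER_PATH = "gs://your-bucket/tokenizer.model"  # GCS path to the tokenizer
```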
Note: This currently only supports serving the 7B model.
To deploy this model, run:

./scripts/start_gradio.sh

to submit the Serve deployment. A sketch of what such a deployment might look like is shown below.
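For context, this is a minimal sketch of a Ray Serve deployment like the one in llama_serve.py; the class name, constructor arguments, and generation logic are assumptions, not the script's actual code:

```python
# Minimal Ray Serve deployment sketch (not the actual llama_serve.py).
# Checkpoint paths and the generation call are placeholders.
from ray import serve
from starlette.requests import Request


@serve.deployment(num_replicas=1)
class LlamaDeployment:
    def __init__(self, ckpt_dir: str, tokenizer_path: str):
        # The real script would build the LLaMA 2 generator from the
        # checkpoint here, on the TPU hosts.
        self.ckpt_dir = ckpt_dir
        self.tokenizer_path = tokenizer_path

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        prompt = body["prompt"]
        # Placeholder response; the real deployment would run the
        # model's text-generation routine on this prompt.
        return {"prompt": prompt, "generated_text": "..."}


app = LlamaDeployment.bind(
    "gs://your-bucket/llama-2-7b/", "gs://your-bucket/tokenizer.model"
)
# `serve run` (or serve.run(app)) starts this application on the cluster.
```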
Once the deployment is running, attach to the Gradio deployment with port forwarding:

ray attach -p 8000 cluster/serve.yaml

and go to http://localhost:8000 to view the Gradio app.
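If you would rather query the app programmatically than through the browser, something like the following should work once the port is forwarded; the exact endpoint depends on how the Gradio app is defined, so treat this as a sketch:

```python
# Query the forwarded Gradio app from Python (sketch).
# Assumes `pip install gradio_client` and a single text-in/text-out
# prediction function; pass api_name=... if the app exposes several.
from gradio_client import Client

client = Client("http://localhost:8000")
result = client.predict("What are the largest cities in Europe?")
print(result)
```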