You can either use our hosted Gorilla or chat with it locally using the CLI. We also provide instructions for evaluating batched prompts. Here are the instructions to run it locally.
New: We release gorilla-mpt-7b-hf-v0 and gorilla-falcon-7b-hf-v0, two Apache 2.0 licensed models (commercially usable).
gorilla-7b-hf-v0 is the first set of weights we released 🎉 It chooses from 925 HF APIs in a 0-shot fashion (without any retrieval). Update: We released gorilla-7b-th-v0 with 94 (exhaustive) APIs from Torch Hub and gorilla-7b-tf-v0 with 626 (exhaustive) APIs from TensorFlow. In the spirit of openness, we neither filter nor post-process the prompt or the response 🎁 Keep in mind that the current gorilla-7b-* models do not have any generic chat capability. We do have a model with all 1600+ APIs that also has chat capability, which we are releasing slowly to accommodate server demand.
All Gorilla weights are hosted at https://huggingface.co/gorilla-llm/.
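As a rough sketch, the directly downloadable models can also be fetched from a script with the huggingface_hub library. The helper names `repo_id` and `download_weights` and the `weights` target directory are illustrative assumptions, not part of the Gorilla repository; only the model names and the gorilla-llm organization come from the text above.

```python
from pathlib import Path

# Model names taken from the text above; the org prefix matches
# https://huggingface.co/gorilla-llm/.
DIRECT_DOWNLOAD_MODELS = ("gorilla-mpt-7b-hf-v0", "gorilla-falcon-7b-hf-v0")


def repo_id(model_name: str, org: str = "gorilla-llm") -> str:
    """Build the Hugging Face repo id for a Gorilla model name."""
    return f"{org}/{model_name}"


def download_weights(model_name: str, target_dir: str = "weights") -> Path:
    """Download a full model snapshot into target_dir/<model_name>.

    Hypothetical helper: needs `pip install huggingface_hub` and network access.
    """
    from huggingface_hub import snapshot_download  # imported lazily

    local_dir = Path(target_dir) / model_name
    snapshot_download(repo_id=repo_id(model_name), local_dir=str(local_dir))
    return local_dir


# Usage (downloads several GB, so it is left commented out):
# path = download_weights("gorilla-falcon-7b-hf-v0")
```

The returned directory can then be passed as the --model-path argument of the CLI scripts.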
You should install dependencies using the following command:
conda create -n gorilla python=3.10
conda activate gorilla
pip install -r requirements.txt
We release the weights for gorilla-mpt-7b-hf-v0 and gorilla-falcon-7b-hf-v0 on Hugging Face, so you can download them directly. For the LLaMA-finetuned models, we release the weights as a delta to comply with the LLaMA model license. You can prepare the Gorilla weights using the following steps:
- Get the original LLaMA weights using the link here.
- Download the Gorilla delta weights from our Hugging Face.
Run the following python command to apply the delta weights to your LLaMA model:
python3 apply_delta.py \
    --base-model-path path/to/hf_llama/ \
    --target-model-path path/to/gorilla-7b-hf-v0 \
    --delta-path path/to/models--gorilla-llm--gorilla-7b-hf-delta-v0
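Conceptually, applying delta weights means adding the released delta to the base LLaMA parameters, tensor by tensor. The toy sketch below illustrates that idea with plain Python lists; it assumes the delta is an element-wise difference between the fine-tuned and base weights, which is the common convention for delta releases. It is not the code in apply_delta.py, and `add_delta` is a hypothetical name.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # toy stand-in for a model state dict


def add_delta(base: Weights, delta: Weights) -> Weights:
    """Return base + delta, parameter by parameter (illustrative only)."""
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same parameter names")
    merged: Weights = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [b + d for b, d in zip(base_values, delta_values)]
    return merged


# Made-up numbers, chosen only to show the arithmetic.
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}
delta = {"layer.weight": [0.25, -1.0], "layer.bias": [0.5]}
merged = add_delta(base, delta)
print(merged)
```

Distributing only the delta means the original LLaMA weights never have to be redistributed, which is why the base model path is a required input.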
Simply run the command below to start chatting with Gorilla:
python3 serve/gorilla_cli.py --model-path path/to/gorilla-7b-{hf,th,tf}-v0
For the falcon-7b model, you can use the following command:
python3 serve/gorilla_falcon_cli.py --model-path path/to/gorilla-falcon-7b-hf-v0
Add "--device mps" if you are running on a Mac with Apple silicon (M1, M2, etc.).
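If you launch the CLI from a script, the choice of whether to pass "--device mps" can be automated. This is a minimal sketch: `device_args` and `detect_device_args` are hypothetical helpers, and only the "--device mps" flag itself comes from the instructions above.

```python
from typing import List


def device_args(cuda_available: bool, mps_available: bool) -> List[str]:
    """Return extra CLI arguments: add '--device mps' only on Apple silicon."""
    if mps_available and not cuda_available:
        return ["--device", "mps"]
    return []  # keep the CLI's default device


def detect_device_args() -> List[str]:
    """Probe PyTorch, if it is installed, and build the extra arguments."""
    try:
        import torch
    except ImportError:
        return []
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return device_args(torch.cuda.is_available(), mps_ok)


# Example: assemble the command line shown above.
command = ["python3", "serve/gorilla_cli.py",
           "--model-path", "path/to/gorilla-7b-hf-v0"] + device_args(False, True)
print(" ".join(command))
```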
After downloading the model, create a jsonl file containing all the questions you want to run through Gorilla. Here is an example:
{"question_id": 1, "text": "I want to generate image from text."}
{"question_id": 2, "text": "I want to generate text from image."}
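A question file in this format can be generated from a list of strings. The sketch below assumes that question ids are simply consecutive integers starting at 1, as in the example; `write_questions` and `read_questions` are illustrative names.

```python
import json
from pathlib import Path
from typing import Iterable, List


def write_questions(questions: Iterable[str], path: str) -> int:
    """Write one JSON object per line with the keys used in the example above."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for question_id, text in enumerate(questions, start=1):
            record = {"question_id": question_id, "text": text}
            f.write(json.dumps(record) + "\n")
            count += 1
    return count


def read_questions(path: str) -> List[dict]:
    """Load the file back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Usage:
# write_questions(["I want to generate image from text."], "questions.jsonl")
```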
After that, use the following command to get the results:
python3 gorilla_eval.py --model-path path/to/gorilla-7b-hf-v0 \
    --question-file path/to/questions.jsonl \
    --answer-file path/to/answers.jsonl
You could use your own questions and get Gorilla responses. We also provide a set of questions that we used for evaluation.
K-quantized Gorilla models can be found on Hugging Face: Llama-based, MPT-based, Falcon-based
K-quantized gorilla-openfunctions-v0 and gorilla-openfunctions-v1 models can be found on Hugging Face: gorilla-openfunctions-v0-gguf, gorilla-openfunctions-v1-gguf
For an in-depth walkthrough of how this quantization was done, follow the accompanying tutorial. It is a fully self-contained space that gives an under-the-hood walkthrough of the quantization pipeline (using llama.cpp) and lets you test your own prompts with different quantized versions of Gorilla. The models don't take up local space and use a CPU runtime.
Running local inference with Gorilla on a clean interface is simple. Follow the instructions below to set up text-generation-webui, add your desired models, and run inference.
My specs, M1 MacBook Air 2020
Model Name: MacBook Air
Model Identifier: MacBookAir10,1
Model Number: Z125000NMCH/A
Chip: Apple M1
Total Number of Cores: 8 (4 performance and 4 efficiency)
Memory: 16 GB
System Firmware Version: 10151.61.4
OS Loader Version: 10151.61.4
Step 1: Clone text-generation-webui, a Gradio web UI for Large Language Models. It supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), and Llama models. It hides many complexities of llama.cpp and has a well defined interface that is easy to use.
git clone https://github.com/oobabooga/text-generation-webui.git
Step 2: Follow text-generation-webui instructions to run the application locally.
- Go to the cloned folder.
- Run ./start_macos.sh and wait for its startup output.
- Open a browser and go to the URL it reports, http://127.0.0.1:7860/ as an example.
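Before opening the browser, you can check from a script that the web UI is actually listening. This is a small sketch using only the standard library; the host and port are the ones from the example URL above and may differ on your machine, and `is_port_open` is a hypothetical helper.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 7860,
                 timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# Usage:
# if not is_port_open():
#     print("text-generation-webui is not reachable yet")
```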
Step 3: Select the quantization method you want to use, download the quantized model and run the inference on the quantized Gorilla models.
- Go to Model and find Download model or LoRA. For example, to get the q3_K_M GGUF quantized model for gorilla-7b-hf-v1, enter gorilla-llm/gorilla-7b-hf-v1 as the model and gorilla-7b-hf-v1-q3_K_M as the filename, then click Download. The UI reports that it is downloading the file to models/.
- After downloading the model, select it under Model (gorilla-7b-hf-v1-q3_K_M for this demonstration) and click Load. If you have a laptop GPU available, increasing n-gpu-layers accelerates inference.
- After loading, the UI shows a confirmation message.
- Then go to the Chat page and use the default settings for Llama-based quantized models.
- Real-time inference video demo
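The same downloaded GGUF file can also be queried without the web UI. The sketch below uses the llama-cpp-python package, which is an assumption on our part and not part of the steps above; the model path is a placeholder, and reusing the "USER: <<question>> ... ASSISTANT:" prompt format from the Replicate predict.py in this document for the Llama-based model is likewise an assumption.

```python
def build_prompt(user_query: str) -> str:
    """Prompt format borrowed from predict.py in this document (assumed here)."""
    return f"USER: <<question>> {user_query}\nASSISTANT: "


def generate(model_path: str, user_query: str,
             n_gpu_layers: int = 0, max_tokens: int = 256) -> str:
    """Run one completion on a local GGUF file.

    Hypothetical helper: needs `pip install llama-cpp-python`.
    n_gpu_layers plays the same role as the n-gpu-layers setting in the UI.
    """
    from llama_cpp import Llama  # imported lazily

    llm = Llama(model_path=model_path, n_gpu_layers=n_gpu_layers)
    result = llm(build_prompt(user_query), max_tokens=max_tokens)
    return result["choices"][0]["text"]


# Usage (the path is a placeholder for wherever the file was saved):
# print(generate("path/to/quantized-gorilla.gguf",
#                "I want to generate image from text."))
```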
This section provides a guide for setting up a private inference endpoint for a Gorilla model hosted on Replicate, a cloud platform for running machine learning models. Replicate offers a secure and scalable alternative to the publicly hosted zanino.berkeley.edu endpoint, enabling private and controlled model deployment. Replicate's open source Cog tool is used for containerizing and deploying the Gorilla model, which streamlines the process of turning Gorilla models into scalable, production-ready services.
To install Cog, run the following command:
sudo curl -o /usr/local/bin/cog -L https://github.com/replicate/cog/releases/latest/download/cog_`uname -s`_`uname -m`
sudo chmod +x /usr/local/bin/cog
To configure Gorilla for use with Cog, the cog.yaml file is used. This file defines system requirements, Python package dependencies, and more. Below is the cog.yaml file for Gorilla models:
build:
  gpu: true
  python_version: "3.10"
  python_packages:
    - "torch==2.0.1"
    - "transformers==4.28.1"
    - "huggingface-hub==0.14.1"
    - "sentencepiece==0.1.99"
    - "accelerate==0.19.0"
    - "einops"
predict: "predict.py:Predictor"
Note: Cog uses the nvidia-docker base image, which automatically figures out what versions of CUDA and cuDNN to use based on the version of Python and PyTorch that you specify.
predict.py is used to describe the prediction interface for Gorilla. It includes the implementation of the Predictor class, which defines how the model is set up and how predictions are generated. Below is the content of the predict.py file:
from cog import BasePredictor, Input
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline


def get_prompt(user_query: str) -> str:
    """
    Generates a conversation prompt based on the user's query.

    Parameters:
    - user_query (str): The user's query.

    Returns:
    - str: The formatted conversation prompt.
    """
    return f"USER: <<question>> {user_query}\nASSISTANT: "


class Predictor(BasePredictor):
    def setup(self):
        """
        Load the model into memory to make running multiple predictions efficient.
        Sets up the device, model, tokenizer, and pipeline for text generation.
        """
        # Device setup
        self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

        # Model and tokenizer setup
        model_id = "gorilla-llm/gorilla-falcon-7b-hf-v0"
        self.tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=self.torch_dtype,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )

        # Move model to device
        self.model.to(self.device)

        # Pipeline setup
        self.pipe = pipeline(
            "text-generation",
            model=self.model,
            tokenizer=self.tokenizer,
            max_new_tokens=256,
            batch_size=16,
            torch_dtype=self.torch_dtype,
            device=self.device,
        )

    def predict(self, user_query: str = Input(description="User's query")) -> str:
        """
        Run a single prediction on the model using the provided user query.

        Parameters:
        - user_query (str): The user's query for the model.

        Returns:
        - str: The model's generated text based on the query.
        """
        prompt = get_prompt(user_query)
        output = self.pipe(prompt)
        # The pipeline returns a list of dicts; return the generated string
        # so the value matches the declared `str` return type.
        return output[0]["generated_text"]
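The text-generation pipeline used in predict.py returns a list of dictionaries whose generated_text field, by default, repeats the prompt before the completion. If you only want the model's answer, a small post-processing step can strip the prompt. This is a sketch, `extract_response` is a hypothetical helper, and the sample output below is a made-up stand-in rather than real model output.

```python
from typing import Dict, List


def extract_response(pipe_output: List[Dict[str, str]], prompt: str) -> str:
    """Return the completion only, without the echoed prompt."""
    generated = pipe_output[0]["generated_text"]
    if generated.startswith(prompt):
        generated = generated[len(prompt):]
    return generated.strip()


# Stand-in for what `self.pipe(prompt)` returns; the text is a placeholder.
example_prompt = "USER: <<question>> I want to generate image from text.\nASSISTANT: "
example_output = [{"generated_text": example_prompt + "PLACEHOLDER RESPONSE"}]
print(extract_response(example_output, example_prompt))
```

An alternative with the same effect is to pass return_full_text=False when calling the pipeline, which asks it to omit the prompt itself.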
To deploy the Gorilla model on Replicate, you first need to build a Docker image using Cog. This image encapsulates the model and its dependencies. Run the following command in your terminal to build the Docker image, replacing <image-name> with a name of your choice for the image:
cog build -t <image-name>
Once the Docker image is built, the next step is to publish it to Replicate's registry. This will allow you to run the model on Replicate's platform. If you haven't already, log in to your Replicate account via the command line:
cog login
Push the built image to Replicate using the following command. Replace <your-username> with your Replicate username and <your-model-name> with the name you gave your model on Replicate:
cog push r8.im/<your-username>/<your-model-name>
Once the Gorilla model is successfully pushed to Replicate, it will be visible on the Replicate website. To run inference on the hosted Gorilla model, you can use Replicate's Python client library.
First, install the Replicate Python client library:
pip install replicate
Before using the Python client, authenticate by setting your Replicate API token in an environment variable:
export REPLICATE_API_TOKEN=<your-token-here>
After setting up the client and authenticating, you can now run inference using Python. Replace <your-username>, <your-model-name>, and <model-version> with your Replicate username, the model name, and the specific model version you want to use:
import replicate

output = replicate.run(
    "<your-username>/<your-model-name>:<model-version>",
    input={"user_query": "<add-your-query-here>"},
)
print(output)
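For repeated calls, it can help to wrap the client call in a small function that builds the model reference and fails early when the API token is missing. This is a minimal sketch: `model_ref` and `ask_gorilla` are hypothetical helpers, and the `user_query` input name matches the predict.py signature shown earlier.

```python
import os


def model_ref(username: str, model_name: str, version: str) -> str:
    """Build the '<username>/<model-name>:<version>' reference string."""
    return f"{username}/{model_name}:{version}"


def ask_gorilla(ref: str, user_query: str):
    """Send one query to a Gorilla model hosted on Replicate."""
    if not os.environ.get("REPLICATE_API_TOKEN"):
        raise RuntimeError("Set REPLICATE_API_TOKEN before calling Replicate")
    import replicate  # imported lazily; needs `pip install replicate`

    return replicate.run(ref, input={"user_query": user_query})


# Usage (placeholders must be replaced with your own values):
# ref = model_ref("<your-username>", "<your-model-name>", "<model-version>")
# print(ask_gorilla(ref, "I want to generate image from text."))
```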