Add Configure Text Generation Models guide #313
# Configure Text Generation Models

KubeAI supports the following engines for text generation models (LLMs, VLMs, ...):

- vLLM (recommended for GPU)
- Ollama (recommended for CPU)
- Need something else? Please file an issue on [GitHub](https://github.com/substratusai/kubeai).

There are two ways to install a text generation model in KubeAI:

- Use Helm with the `kubeai/models` chart.
- Use `kubectl apply -f model.yaml` to install a Model custom resource.

KubeAI comes with pre-validated and optimized Model configurations for popular text generation models. These models are available in the `kubeai/models` Helm chart and are also published as raw manifests in the `manifests/model` directory.

You can also define your own models by writing a Model custom resource directly or by using the `kubeai/models` Helm chart.

## Install a Text Generation Model using Helm

You can take a look at all of the pre-configured models in the chart's default values file:

```bash
helm show values kubeai/models
```
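
This assumes the `kubeai` chart repository has already been added, which normally happens when installing KubeAI itself. If it is missing, a minimal sketch of adding it (repository URL assumed from the KubeAI install instructions):

```bash
# Add the KubeAI chart repository (URL assumed; see the KubeAI install docs).
helm repo add kubeai https://www.kubeai.org
helm repo update
```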

### Install Text Generation Model using CPU

Enable the `gemma2-2b-cpu` model using the Helm chart:

```bash
helm upgrade --install --reuse-values kubeai-models kubeai/models -f - <<EOF
catalog:
  gemma2-2b-cpu:
    enabled: true
    engine: OLlama
    resourceProfile: cpu:2
    minReplicas: 1 # by default this is 0
EOF
```
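
To check that the model was installed, you can inspect the resulting resources; a quick sketch (assuming the Model resource's plural name is `models`):

```bash
# List installed Model resources (plural name "models" assumed).
kubectl get models
# Watch the model server Pod come up.
kubectl get pods -w
```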

### Install Text Generation Model using L4 GPU

Enable the Llama 3.1 8B model using the Helm chart:

```bash
helm upgrade --install --reuse-values kubeai-models kubeai/models -f - <<EOF
catalog:
  llama-3.1-8b-instruct-fp8-l4:
    enabled: true
    engine: VLLM
    resourceProfile: nvidia-gpu-l4:1
    minReplicas: 1 # by default this is 0
EOF
```

## Install a Text Generation Model using kubectl

You can write a Model custom resource directly and install it using `kubectl apply -f model.yaml`.

### Install Text Generation Model using CPU

Apply the following manifest to install the Gemma 2 2B model using Ollama on CPU:

```yaml
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: gemma2-2b-cpu
spec:
  features: [TextGeneration]
  url: ollama://gemma2:2b
  engine: OLlama
  resourceProfile: cpu:2
```

### Install Text Generation Model using L4 GPU

Apply the following manifest to install the Llama 3.1 8B model using vLLM on an L4 GPU:

```yaml
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-instruct-fp8-l4
spec:
  features: [TextGeneration]
  url: hf://neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
  engine: VLLM
  args:
    - --max-model-len=16384
    - --max-num-batched-tokens=16384
    - --gpu-memory-utilization=0.9
    - --disable-log-requests
  resourceProfile: nvidia-gpu-l4:1
```
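
Save either manifest as `model.yaml` and apply it:

```bash
kubectl apply -f model.yaml
# Confirm the Model resource was created (plural name "models" assumed).
kubectl get models
```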

## Interact with the Text Generation Model

The KubeAI service exposes an OpenAI-compatible API that you can use to list the available models and interact with them.

The KubeAI service is available at `http://kubeai/openai/v1` within the Kubernetes cluster.

You can also port-forward the KubeAI service to your local machine to interact with the models:

```bash
kubectl port-forward svc/kubeai 8000:80
```

You can now query the available models using curl:

```bash
curl http://localhost:8000/openai/v1/models
```
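
Since the API is OpenAI compatible, the response should follow the standard model-list shape; for example, you can extract just the model IDs with `jq` (an extra tool, not required by KubeAI):

```bash
# Print only the model IDs from the OpenAI-style list response.
curl -s http://localhost:8000/openai/v1/models | jq -r '.data[].id'
```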

### Using curl to interact with the model

Run the following curl command to interact with the model named `llama-3.1-8b-instruct-fp8-l4`:

```bash
curl "http://localhost:8000/openai/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama-3.1-8b-instruct-fp8-l4",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "Write a haiku about recursion in programming."
            }
        ]
    }'
```
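
Because the response follows the standard OpenAI chat completion format, you can pull out just the generated text with `jq`; a small sketch:

```bash
# Extract only the assistant's reply from the chat completion response.
curl -s "http://localhost:8000/openai/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model": "llama-3.1-8b-instruct-fp8-l4", "messages": [{"role": "user", "content": "Say this is a test"}]}' \
    | jq -r '.choices[0].message.content'
```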

### Using the OpenAI Python SDK to interact with the model

All OpenAI SDKs work with KubeAI since the KubeAI service is OpenAI API compatible. Once the model Pod is ready, you can use the OpenAI Python SDK to interact with the model:

```python
import os

from openai import OpenAI

# Assumes port-forward of the kubeai service to localhost:8000.
kubeai_endpoint = "http://localhost:8000/openai/v1"
model_name = "llama-3.1-8b-instruct-fp8-l4"

# If running inside a Kubernetes cluster, use the kubeai service endpoint.
if os.getenv("KUBERNETES_SERVICE_HOST"):
    kubeai_endpoint = "http://kubeai/openai/v1"

# KubeAI does not require an API key by default, but the client expects one.
client = OpenAI(api_key="ignored", base_url=kubeai_endpoint)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    model=model_name,
)

print(chat_completion.choices[0].message.content)
```