
KubeAI: AI Inferencing Operator

The easiest way to serve ML models in production. Supports LLMs, embeddings, and speech-to-text.

✅️ OpenAI API Compatibility: Drop-in replacement for OpenAI
⚖️ Autoscaling: Scale from zero, autoscale based on load
🧠 Serve text generation models with vLLM or Ollama
🔌 Dynamic LoRA adapter loading
⛕ Inference-optimized load balancing
💬 Speech to Text API with FasterWhisper
🧮 Embedding/Vector API with Infinity
🚀 Multi-platform: CPU, GPU, TPU
💾 Model caching with shared filesystems (EFS, Filestore, etc.)
🛠️ Zero dependencies (does not depend on Istio, Knative, etc.)
💬 Chat UI included (OpenWebUI)
✉ Stream/batch inference via messaging integrations (Kafka, PubSub, etc.)

Quotes from the community:

"reusable, well abstracted solution to run LLMs" - Mike Ensor

Architecture

KubeAI serves an OpenAI-compatible HTTP API. Admins can configure ML models via kind: Model Kubernetes Custom Resources. KubeAI can be thought of as a Model Operator (see Operator Pattern) that manages vLLM and Ollama servers.
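As a sketch of what such a resource might look like (field names are illustrative and the model/resource-profile values are assumptions; see the docs on kubeai.org for the authoritative schema):

```yaml
# Hypothetical Model custom resource; field names are illustrative.
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-instruct
spec:
  features: [TextGeneration]
  url: hf://meta-llama/Llama-3.1-8B-Instruct
  engine: VLLM
  resourceProfile: nvidia-gpu-l4:1
  minReplicas: 0   # scale from zero
  maxReplicas: 3
```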

Adopters

List of known adopters:

| Name | Description | Link |
| --- | --- | --- |
| Telescope | Telescope uses KubeAI for multi-region, large-scale batch LLM inference. | trytelescope.ai |
| Google Cloud Distributed Edge | KubeAI is included as a reference architecture for inferencing at the edge. | LinkedIn, GitLab |
| Lambda | You can try KubeAI on the Lambda AI Developer Cloud. See Lambda's tutorial and video. | Lambda |
| Vultr | KubeAI can be deployed on Vultr Managed Kubernetes using the application marketplace. | Vultr |
| Arcee | Arcee uses KubeAI for multi-region, multi-tenant SLM inference. | Arcee |

If you are using KubeAI and would like to be listed as an adopter, please make a PR.

Local Quickstart

(Demo video: kubeai-quickstart-demo.mp4)

Create a local cluster using kind or minikube.

TIP: If you are using Podman for kind, make sure your Podman machine can use up to 6G of memory (by default it is capped at 2G):
# You might need to stop and remove the existing machine:
podman machine stop
podman machine rm

# Init and start a new machine:
podman machine init --memory 6144 --disk-size 120
podman machine start
kind create cluster # OR: minikube start

Add the KubeAI Helm repository.

helm repo add kubeai https://www.kubeai.org
helm repo update

Install KubeAI and wait for all components to be ready (may take a minute).

helm install kubeai kubeai/kubeai --wait --timeout 10m

Install some predefined models.

cat <<EOF > kubeai-models.yaml
catalog:
  gemma2-2b-cpu:
    enabled: true
    minReplicas: 1
  qwen2-500m-cpu:
    enabled: true
  nomic-embed-text-cpu:
    enabled: true
EOF

helm install kubeai-models kubeai/models \
    -f ./kubeai-models.yaml

Before progressing to the next steps, start a watch on Pods in a standalone terminal to see how KubeAI deploys models.

kubectl get pods --watch

Interact with Gemma2

Because we set minReplicas: 1 for the Gemma model, you should see a model Pod already coming up.

Start a local port-forward to the bundled chat UI.

kubectl port-forward svc/openwebui 8000:80

Now open your browser to localhost:8000 and select the Gemma model to start chatting with.
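You can also query the same model programmatically through KubeAI's OpenAI-compatible API. A minimal sketch using only the Python standard library; it assumes you also port-forward the kubeai service (e.g. kubectl port-forward svc/kubeai 8080:80) and that the API is mounted under /openai/v1 (a path taken from the docs, worth double-checking for your version):

```python
import json
from urllib import request

# Build a chat completion request against KubeAI's OpenAI-compatible API.
# Assumes `kubectl port-forward svc/kubeai 8080:80` is running; the
# /openai/v1 prefix is an assumption -- verify against the KubeAI docs.
base_url = "http://localhost:8080/openai/v1"
payload = {
    "model": "gemma2-2b-cpu",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}
req = request.Request(
    f"{base_url}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a live cluster, uncomment to send the request:
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```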

Scale up Qwen2 from Zero

If you go back to the browser and start a chat with Qwen2, you will notice that it will take a while to respond at first. This is because we set minReplicas: 0 for this model and KubeAI needs to spin up a new Pod (you can verify with kubectl get models -oyaml qwen2-500m-cpu).
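The replica floor is set per model in the same values file used above; a sketch of how the two behaviors differ (same keys as the quickstart's kubeai-models.yaml):

```yaml
# Sketch: Qwen2 scales to zero, Gemma keeps one replica warm.
catalog:
  qwen2-500m-cpu:
    enabled: true
    minReplicas: 0   # default; a Pod starts only when requests arrive
  gemma2-2b-cpu:
    enabled: true
    minReplicas: 1   # always keep one replica running
```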

Documentation

Check out our documentation on kubeai.org to find info on:

  • Installing KubeAI in the cloud
  • How-to guides (e.g. how to manage models and resource profiles).
  • Concepts (how the components of KubeAI work).
  • How to contribute

OpenAI API Compatibility

# Implemented #
/v1/chat/completions
/v1/completions
/v1/embeddings
/v1/models
/v1/audio/transcriptions

# Planned #
# /v1/assistants/*
# /v1/batches/*
# /v1/fine_tuning/*
# /v1/images/*
# /v1/vector_stores/*
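As an example of one of the implemented endpoints, here is a sketch of an embeddings request using only the Python standard library. It assumes kubectl port-forward svc/kubeai 8080:80 is running and that the API is served under /openai/v1 (an assumption; check the docs for your version):

```python
import json
from urllib import request

# Sketch: an embeddings request against the /v1/embeddings endpoint
# listed above, using the quickstart's nomic-embed-text-cpu model.
base_url = "http://localhost:8080/openai/v1"
payload = {
    "model": "nomic-embed-text-cpu",
    "input": ["KubeAI serves ML models on Kubernetes."],
}
req = request.Request(
    f"{base_url}/embeddings",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a live cluster, uncomment to send the request:
# with request.urlopen(req) as resp:
#     embedding = json.loads(resp.read())["data"][0]["embedding"]
```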

Immediate Roadmap

  • Model caching
  • LoRA finetuning (compatible with OpenAI finetuning API)
  • Image generation (compatible with OpenAI images API)

NOTE: KubeAI was born out of a project called Lingo which was a simple Kubernetes LLM proxy with basic autoscaling. We relaunched the project as KubeAI (late August 2024) and expanded the roadmap to what it is today.

🌟 Don't forget to drop us a star on GitHub and follow the repo to stay up to date!


Contact

Let us know about features you are interested in seeing or reach out with questions. Visit our Discord channel to join the discussion!

Or just reach out on LinkedIn if you want to connect.