
4. Compatibility

av edited this page Sep 14, 2024 · 1 revision

Known compatibility issues between the services and models.


The format of the page is as follows:

## Service | Model

Short description of nature/cause of the issue

### Affected Service, Model (or combination)

Description of how the issue affects the service or combination of services, and a possible workaround.

## Gemma 2 - System Prompt

Gemma 2 models lack a system prompt.

### vllm x searxng x webui

When WebRAG is enabled, Open WebUI sends requests that use the system role (for RAG) to the respective backend. With Gemma 2 (or other models without system prompt support), such requests fail when running against vLLM.


cmdh needs a system prompt to outline the task.


HuggingFace ChatUI uses system prompt for chat title generation.


Workaround: unfortunately, the only general fix is to switch to another model. Alternatively, disable WebRAG.

## LiteLLM - Dynamic Dependencies

When LiteLLM starts, it tries to install Node.js and some other packages at runtime. This may cause issues when running in a restricted environment or when the related CDNs are unavailable.


The output of `harbor logs litellm` is stuck at one of these lines:

# First candidate
litellm  | Installing Prisma CLI

# Another candidate
litellm  |  * Install prebuilt node (22.5.1) ..... done.

### litellm x webui

When LiteLLM is stuck at the setup phase, WebUI won't load any of the proxied models.


Workaround: restart the service a few times until it starts successfully.
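The restart-until-it-works advice can be automated with a small retry helper. This is a minimal sketch, not part of Harbor: `retry_until_ready` is a hypothetical function, and the commands in the usage comment (`harbor restart litellm`, `harbor url litellm`, and LiteLLM's `/health/liveliness` endpoint) are assumptions that should be checked against your installed versions.

```shell
#!/usr/bin/env bash
# Retry a start command until a readiness check passes or attempts run out.
# Usage: retry_until_ready <max_attempts> <delay_seconds> <start_cmd> <check_cmd>
retry_until_ready() {
  local max_attempts="$1" delay="$2" start_cmd="$3" check_cmd="$4"
  local attempt
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    echo "Attempt ${attempt}/${max_attempts}"
    # Run the (re)start command; its exit code is ignored, the check decides.
    eval "$start_cmd" || true
    # Give the service time to finish its setup phase.
    sleep "$delay"
    if eval "$check_cmd"; then
      echo "Service is ready"
      return 0
    fi
  done
  echo "Service did not become ready after ${max_attempts} attempts" >&2
  return 1
}

# Example usage (all three commands below are assumptions, see the note above):
# retry_until_ready 5 60 "harbor restart litellm" \
#   "curl -sf \"\$(harbor url litellm)/health/liveliness\" > /dev/null"
```

The start and check commands are passed as strings so the same helper can be reused for other services that have a flaky setup phase.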

## WebUI - Missing models in the list

WebUI v0.3.11 fails to load models from OpenAI-compatible endpoints when the API key is specified as an empty string (or missing).

Example configuration that will not work:

  "openai": {
		"api_base_urls": [
		"api_keys": [
		"enabled": true


Set up the endpoint to use an actual API key, or a fake API key if the service supports it.

  "openai": {
		"api_base_urls": [
		"api_keys": [
		"enabled": true

## WebUI - Same audio after audio config change

Open WebUI has a built-in cache for TTS. It is sometimes applied at the sentence level, so after changing the audio model, generating audio for a previously seen sentence returns the same cached audio.


Use a new sentence when testing audio config changes, for example by regenerating the model response.

## Exllama2 - GPTQ

Exllama2 (and related engines such as Aphrodite Engine and TabbyAPI) only supports 4-bit GPTQ. You can detect this problem when running, for example, a 2-bit GPTQ model and seeing logs like these:

RuntimeError: q_weight and gptq_qzeros have incompatible shapes


Switch to a 4-bit GPTQ model if possible. Otherwise, switch to another inference backend.
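Before loading a model, it can help to check its quantization width up front. The sketch below assumes the model directory ships a `quantize_config.json` with a `bits` field, as AutoGPTQ-style repositories commonly do; some repositories store the same field under `quantization_config` in `config.json` instead. The `gptq_bits` helper and the path in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Print the integer value of the first "bits" key found in a JSON config file.
# Assumption: the file is a GPTQ quantization config with a numeric "bits" field.
gptq_bits() {
  local config_file="$1"
  sed -n 's/.*"bits"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$config_file" | head -n 1
}

# Example usage (the path is a placeholder for your local model directory):
# bits="$(gptq_bits /path/to/model/quantize_config.json)"
# if [ "$bits" != "4" ]; then
#   echo "GPTQ model is ${bits}-bit; Exllama2-based engines expect 4-bit" >&2
# fi
```

The helper uses `sed` rather than a JSON parser to avoid extra dependencies, so it is only suitable for simple, conventionally formatted config files.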

## vLLM - Out of workspace memory in AlignedAllocator

This error arises from the interaction between vLLM and the FlashInfer attention backend.


Ensure you're running the latest vLLM version.

harbor pull vllm

## TTS - xtts-v2


openedai-speech only starts downloading the xtts-v2 model when the first generation request is made, not on startup. There are no logs or other indication of download progress.


Here are sample logs and steps for the case when xtts-v2 has not been downloaded yet:

# Initial startup, xtts-v2 isn't downloaded yet
harbor.tts  | First startup may download 2GB of speech models. Please wait.
harbor.tts  | INFO:     Started server process [27]
harbor.tts  | INFO:     Waiting for application startup.
harbor.tts  | INFO:     Application startup complete.
harbor.tts  | INFO:     Uvicorn running on (Press CTRL+C to quit)
harbor.tts  | INFO: - "POST /v1/audio/speech HTTP/1.1" 200 OK

# 1. Configure Open WebUI to use tts-1-hd
# 2. Generate speech from some uncached text
# 3. openedai-speech will log this:
harbor.tts  | 2024-08-26 08:51:44.737 | INFO     | __main__:__init__:59 - Loading model xtts to cuda
# ... Takes some time to download the model

# Check the folder size to see download progress
# voices/tts/tts_models--multilingual--multi-dataset--xtts
du -h $(harbor home)/tts

# Sample output when download is complete
user@os:~/code/harbor$ du -h $(harbor home)/tts
12K	/home/user/code/harbor/tts/config
1.8G	/home/user/code/harbor/tts/voices/tts/tts_models--multilingual--multi-dataset--xtts
1.8G	/home/user/code/harbor/tts/voices/tts
1.9G	/home/user/code/harbor/tts/voices
1.9G	/home/user/code/harbor/tts
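Since the download produces no logs, the manual `du` check above can be wrapped in a polling loop. This is a rough sketch under stated assumptions: `dir_size_kb` and `wait_until_size_stable` are hypothetical helpers, and "size unchanged between two checks" is only a heuristic, so a stalled download looks the same as a finished one.

```shell
#!/usr/bin/env bash
# Print the size of a directory in KiB, or 0 if it does not exist yet.
dir_size_kb() {
  local dir="$1"
  if [ -d "$dir" ]; then
    du -sk "$dir" | cut -f1
  else
    echo 0
  fi
}

# Poll a directory until its size is non-zero and unchanged between two
# consecutive checks. Prints a timestamped size on every check.
wait_until_size_stable() {
  local dir="$1" interval="${2:-10}"
  local previous=-1 current
  while true; do
    current="$(dir_size_kb "$dir")"
    echo "$(date +%T) ${current} KiB"
    if [ "$current" -gt 0 ] && [ "$current" -eq "$previous" ]; then
      return 0
    fi
    previous="$current"
    sleep "$interval"
  done
}

# Example usage, run after triggering the first speech generation:
# wait_until_size_stable "$(harbor home)/tts/voices" 10
```

A longer interval reduces the chance of mistaking a slow connection for a completed download; comparing the final size against the roughly 1.8G shown in the sample output is a useful sanity check.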

## Ollama - truncated input

When using Ollama's OpenAI-compatible endpoints, there is no way to specify num_ctx (context size) for the model. This parameter affects how the model is loaded into memory, so it must be set before inference; Ollama can't change it per request or dynamically. Input that exceeds the configured context size is truncated.


Create a new model from the base model's Modelfile with the desired num_ctx set as a default parameter. Here's an example:

# 1. Export Modelfile for the LLM:
harbor ollama show --modelfile model > Modelfile

# 2. Edit the Modelfile to include the desired num_ctx:
# FROM model
# PARAMETER num_ctx 128000
code ./Modelfile

# 3. Put the Modelfile into a folder that is shared with the ollama service:
cp Modelfile $(harbor home)/ollama/modelfiles/Modelfile

# 4. Import the Modelfile back:
# "/modelfiles" is where the shared folder from above is mounted
harbor ollama create -f /modelfiles/Modelfile model-128k

# 5. Verify the import
harbor ollama show model-128k
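Step 2 above edits the Modelfile by hand. For scripted setups, the same change can be made non-interactively. The sketch below is an illustration only: `set_num_ctx` is a hypothetical helper that removes any existing `PARAMETER num_ctx` line and appends the desired value, leaving the rest of the Modelfile untouched.

```shell
#!/usr/bin/env bash
# Set (or replace) the num_ctx parameter in a Modelfile.
# Usage: set_num_ctx <modelfile> <num_ctx>
set_num_ctx() {
  local modelfile="$1" num_ctx="$2"
  local tmp
  tmp="$(mktemp)"
  # Copy every line except an existing num_ctx parameter.
  # grep exits non-zero when nothing is copied, which is fine here.
  grep -v -i '^PARAMETER[[:space:]]\{1,\}num_ctx[[:space:]]' "$modelfile" > "$tmp" || true
  # Append the desired context size.
  printf 'PARAMETER num_ctx %s\n' "$num_ctx" >> "$tmp"
  mv "$tmp" "$modelfile"
}

# Example usage, replacing the manual edit in step 2:
# set_num_ctx ./Modelfile 128000
```

Appending the parameter at the end keeps the `FROM` line and any template or system sections in their original order.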