
# 4. Compatibility


Known compatibility issues between the services and models.

## Format

The format of the page is as follows:

```
## Service | Model

Short description of the nature/cause of the issue.

### Affected Service, Model (or combination)

Description of how the issue affects the service or combination of services, and a possible workaround.
```

## Gemma 2 - System Prompt

Gemma 2 models do not support a system prompt.

### vllm x searxng x webui

When WebRAG is enabled, Open WebUI sends requests that use the system role (for RAG) to the respective backend. Such requests will fail with Gemma 2 (or other models without system prompt support) when running against vLLM.
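
For reference, a request of roughly this shape is what fails; the port and model name below are illustrative, not Harbor defaults:

```bash
# Sketch of a RAG-style request that uses the system role; the Gemma 2 chat
# template rejects it when the model is served via vLLM.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-2-9b-it",
    "messages": [
      {"role": "system", "content": "Answer using the provided web context."},
      {"role": "user", "content": "What is Harbor?"}
    ]
  }'
```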

### cmdh

cmdh needs a system prompt to outline the task.

### chatui

HuggingFace ChatUI uses the system prompt for chat title generation.

### Workaround

Unfortunately, the only real fix is to switch to another model. Alternatively, disable WebRAG.

## LiteLLM - Dynamic Dependencies

When LiteLLM starts, it tries to install Node.js and some other packages at runtime. This may cause issues when running in a restricted environment or when the related CDNs are unavailable.

Diagnostic:

`harbor logs litellm` is stuck at:

```bash
# First candidate
litellm  | Installing Prisma CLI

# Another candidate
litellm  |  * Install prebuilt node (22.5.1) ..... done.
```

### litellm x webui

When LiteLLM is stuck at the setup phase, WebUI won't load any of the proxied models.

### Workaround

Restart the service a few times until it starts successfully.
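
For example (assuming `harbor restart` accepts a service name in your Harbor version):

```bash
# Restart only the litellm service, then watch its logs until the setup completes
harbor restart litellm
harbor logs litellm
```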

## WebUI - Missing models in the list

WebUI v0.3.11 fails to load models from OpenAI-compatible endpoints when the API key is specified as an empty string (or missing).

Example configuration that will not work:

```json
{
  "openai": {
    "api_base_urls": [
      "http://mistralrs:8021/v1"
    ],
    "api_keys": [
      ""
    ],
    "enabled": true
  }
}
```

### Workaround

Set up the endpoint to use an actual API key, or a fake API key if supported by the service.

```json
{
  "openai": {
    "api_base_urls": [
      "http://mistralrs:8021/v1"
    ],
    "api_keys": [
      "sk-mistralrs"
    ],
    "enabled": true
  }
}
```
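
To confirm that the backend itself serves models with that key, you can query its models endpoint directly. The host and port below mirror the config above and may differ when calling from outside the Docker network:

```bash
# List models straight from the backend using the same key WebUI is configured with
curl http://localhost:8021/v1/models \
  -H "Authorization: Bearer sk-mistralrs"
```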

## WebUI - Same audio after audio config change

Open WebUI has a built-in cache for TTS output. The cache is sometimes applied at the sentence level, so after changing the audio model, generating audio for a previously seen sentence will return the same (cached) audio.

### Workaround

Use a new sentence when testing audio config changes, for example by re-generating the model response.

## Exllama2 - GPTQ

Exllama2 (and related engines such as the Aphrodite engine and TabbyAPI) only supports 4-bit GPTQ. You can detect this problem when running, for example, a 2-bit GPTQ model and seeing logs like this:

```
RuntimeError: q_weight and gptq_qzeros have incompatible shapes
```

### Workaround

Switch to a 4-bit GPTQ model if possible. Otherwise, switch to another inference backend.
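
Most GPTQ repositories on Hugging Face include a `quantize_config.json` that records the quantization bit width, so you can check a candidate model before downloading it. The repository name below is only an example:

```bash
# Inspect the "bits" field of a GPTQ repo without downloading the weights
curl -s https://huggingface.co/TheBloke/Llama-2-7B-GPTQ/raw/main/quantize_config.json | grep '"bits"'
```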

## vLLM - Out of workspace memory in AlignedAlloactor

This is an error in the interaction between vLLM and the FlashInfer attention backend.

### Workaround

Ensure you're running the latest vLLM version.

```bash
harbor pull vllm
```

## TTS - xtts-v2

### webui

openedai-speech only starts downloading the xtts-v2 model when the first generation request is made, not on startup. There are no logs or other indications of download progress.

### Workaround

Here are sample logs/steps for when xtts-v2 is not downloaded yet:

```bash
# Initial startup, xtts-v2 isn't downloaded yet
harbor.tts  | First startup may download 2GB of speech models. Please wait.
harbor.tts  | INFO:     Started server process [27]
harbor.tts  | INFO:     Waiting for application startup.
harbor.tts  | INFO:     Application startup complete.
harbor.tts  | INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
harbor.tts  | INFO:     192.168.64.3:40698 - "POST /v1/audio/speech HTTP/1.1" 200 OK

# 1. Configure Open WebUI to use tts-1-hd
# 2. Generate speech from some uncached text
# 3. openedai-speech will log this:
harbor.tts  | 2024-08-26 08:51:44.737 | INFO     | __main__:__init__:59 - Loading model xtts to cuda
# ... Takes some time to download the model

# Check the folder size to see download progress
# voices/tts/tts_models--multilingual--multi-dataset--xtts
du -h $(harbor home)/tts

# Sample output when the download is complete
user@os:~/code/harbor$ du -h $(harbor home)/tts
12K	/home/user/code/harbor/tts/config
1.8G	/home/user/code/harbor/tts/voices/tts/tts_models--multilingual--multi-dataset--xtts
1.8G	/home/user/code/harbor/tts/voices/tts
1.9G	/home/user/code/harbor/tts/voices
1.9G	/home/user/code/harbor/tts
```
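
You can also trigger the download ahead of time by sending a single request directly to openedai-speech. The host port below is an example; check how Harbor exposes the tts service in your setup:

```bash
# One warm-up request so xtts-v2 starts downloading before it's needed in the UI
curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1-hd", "input": "Warm-up sentence.", "voice": "alloy"}' \
  -o /dev/null
```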

## Ollama - truncated input

When using Ollama's OpenAI-compatible endpoints, there's no way to specify `num_ctx` (context size) for the model. This parameter affects how the model is loaded into memory, so it must be set ahead of inference; Ollama can't change it on a per-request basis or dynamically. As a result, inputs longer than the configured context are silently truncated.
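
For comparison, the native Ollama API does accept a per-request context size via `options`; the port and model name below are examples:

```bash
# The native /api/generate endpoint accepts num_ctx per request,
# while the OpenAI-compatible /v1/* endpoints have no equivalent field
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Summarize this document...",
  "options": { "num_ctx": 16384 }
}'
```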

### Workaround

Create a new Modelfile from the base model, specifying the desired `num_ctx` parameter as a default. Here's an example:

```bash
# 1. Export the Modelfile for the LLM:
harbor ollama show --modelfile model > Modelfile

# 2. Edit the Modelfile to include the desired num_ctx:
# FROM model
# PARAMETER num_ctx 128000
code ./Modelfile

# 3. Put the Modelfile into a folder that is shared with the ollama service:
cp Modelfile $(harbor home)/ollama/modelfiles/Modelfile

# 4. Import the Modelfile back:
# "/modelfiles" is where the shared folder from above is mounted
harbor ollama create -f /modelfiles/Modelfile model-128k

# 5. Verify the import
harbor ollama show model-128k
```
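
To double-check that the new default context size was picked up, you can filter the exported Modelfile of the new model (mirroring the command from step 1):

```bash
# Should print the PARAMETER line with the configured context size
harbor ollama show --modelfile model-128k | grep num_ctx
```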