
2.2.6 Backend: mistral.rs


Handle: mistralrs
URL: http://localhost:33951

Blazingly fast LLM inference.


Mistral.rs is a fast LLM inference platform that supports inference on a variety of devices, quantization, and easy application integration via an OpenAI-compatible HTTP server and Python bindings.

Starting

# [Optional] Pull the mistralrs images
harbor pull mistralrs

# Start the service
harbor up mistralrs

# Verify service health and logs
harbor mistralrs health
harbor logs mistralrs -n 200
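
Since mistralrs serves an OpenAI-compatible API on the URL above, you can also query it directly once the container is up. A minimal check, assuming the standard /v1/models route is exposed on the default port:

# List the models currently served by the OpenAI-compatible API
curl http://localhost:33951/v1/models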

Models

# Open HF Search to find compatible models
harbor hf find gemma2

Plain models

# For "plain" models:
# Download the model to the global HF cache
harbor hf download IlyaGusev/gemma-2-2b-it-abliterated
# Set model/type/arch
harbor mistralrs model IlyaGusev/gemma-2-2b-it-abliterated
harbor mistralrs type plain
harbor mistralrs arch gemma2
# Gemma 2 doesn't support paged attention
harbor mistralrs args --no-paged-attn
# Launch mistralrs
# The running model will be available in the webui
harbor up mistralrs
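
Once the model is up, you can also talk to it over the OpenAI-compatible endpoint instead of the webui. A sketch, assuming the default port from this page and that the model is addressed by the same ID it was configured with:

# Send a chat completion request to the running model
curl http://localhost:33951/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "IlyaGusev/gemma-2-2b-it-abliterated",
    "messages": [{ "role": "user", "content": "Hello!" }]
  }'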

ISQ

mistral.rs supports in situ quantization (ISQ): the weights of a plain model are quantized on the fly at load time, reducing VRAM usage without the need for a pre-quantized checkpoint.

# Gemma2 from the previous example
# IlyaGusev/gemma-2-2b-it-abliterated
# nvidia-smi > 5584MiB

# Enable ISQ
harbor mistralrs isq Q2K
# Restart the service
harbor restart mistralrs
# nvidia-smi > 2094MiB

# Disable ISQ if not needed
harbor mistralrs isq ""

The savings grow for models with larger linear layers. Note that ISQ is lossy and will affect the quality of the model's output.
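
Q2K is one of the most aggressive settings; if the quality drop is too noticeable, a milder level is a reasonable middle ground. A sketch, assuming your mistral.rs version accepts the usual GGML-style levels such as Q4K (run `harbor run mistralrs --help` to see the exact list):

# Trade some of the VRAM savings back for quality
harbor mistralrs isq Q4K
harbor restart mistralrs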

GGUF models

Harbor mounts the global llama.cpp cache into the mistralrs service as the gguf folder, so you can download models in the same way as for llama.cpp.
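
For example, the Phi-3 GGUF used below can be pulled into the global HF cache with the same `harbor hf download` command as before; the extra filename argument is an assumption that the command passes through to the underlying HF CLI, which lets you grab a single file instead of the whole repository:

# Download a single GGUF file into the global HF cache
harbor hf download microsoft/Phi-3-mini-4k-instruct-gguf Phi-3-mini-4k-instruct-q4.gguf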

# Set the model type to GGUF
harbor mistralrs type gguf

# - Disable ISQ, as it's not supported for GGUF models
# - Clear the architecture, as it's inferred from the GGUF file
harbor mistralrs isq ""
harbor mistralrs arch ""

# Example 1: llama.cpp cache

# [Optional] List the GGUFs already downloaded for llama.cpp;
# `config get llamacpp.cache` is the folder Harbor also mounts into the mistralrs service
ls $(eval echo "$(harbor config get llamacpp.cache)")

# Use "folder" specifier to point to the model
# "gguf"          - mounted llama.cpp cache
# "-f Model.gguf" - the model file
harbor mistralrs model "gguf -f Phi-3-mini-4k-instruct-q4.gguf"

# Example 2: HF cache

# [Optional] Grab the folder where the model is located
harbor hf scan-cache

# Use "folder" specifier to point to the model
# "hf/full/path"  - mounted HF cache. Note that you need
#                   a full path to the folder with .gguf
# "-f Model.gguf" - the model file
harbor mistralrs model "hf/hub/models--microsoft--Phi-3-mini-4k-instruct-gguf/snapshots/999f761fe19e26cf1a339a5ec5f9f201301cbb83/ -f Phi-3-mini-4k-instruct-q4.gguf"

# When configured, launch
harbor up mistralrs
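
To go back to the plain-model setup from the earlier section, restore the settings that the GGUF configuration required you to clear:

# Revert to the plain Gemma 2 example
harbor mistralrs type plain
harbor mistralrs arch gemma2
harbor mistralrs model IlyaGusev/gemma-2-2b-it-abliterated
harbor restart mistralrs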

Configuration

Specify extra args via the Harbor CLI:

# See available options
harbor run mistralrs --help

# Get/Set the extra arguments
harbor mistralrs args --no-paged-attn
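
The same getter/setter pattern applies to the other mistralrs options on this page; assuming that, like `args`, an option invoked without a value prints its current setting, you can review the whole configuration before restarting:

# Read the current values back
harbor mistralrs model
harbor mistralrs type
harbor mistralrs arch
harbor mistralrs isq
harbor mistralrs args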