A PowerShell automation to rebuild llama.cpp for a Windows environment. It automates the following steps:
- Fetching and extracting a specific release of OpenBLAS
- Fetching the latest version of llama.cpp
- Fixing the OpenBLAS binding in the CMakeLists.txt
- Rebuilding the binaries with CMake
- Updating the Python dependencies
- Automatically detecting the best BLAS acceleration
This script currently supports OpenBLAS for CPU BLAS acceleration and CUDA for NVIDIA GPU BLAS acceleration.
Download and install the latest versions of the following tools:
- Visual Studio 2022 (the Build Tools are sufficient, see the tip below)
- CMake
- Git
- Miniconda (or another Conda distribution)
- CUDA Toolkit (optional, for NVIDIA GPU acceleration)
Tip
When installing Visual Studio 2022 it is sufficient to just install the Build Tools for Visual Studio 2022 package. Also make sure that Desktop development with C++ is enabled in the installer.
Execute the following in a PowerShell terminal with Administrator privileges to enable the Hardware Accelerated GPU Scheduling feature:
New-ItemProperty `
-Path "HKLM:\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" `
-Name "HwSchMode" `
-Value "2" `
-PropertyType DWORD `
-Force
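You can optionally read the value back to confirm it was set:
# Confirm that hardware accelerated GPU scheduling is enabled (HwSchMode should be 2).
Get-ItemProperty `
-Path "HKLM:\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" `
-Name "HwSchMode"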
Then restart your computer to activate the feature.
Clone the repository to a nice place on your machine via:
git clone --recurse-submodules git@github.com:countzero/windows_llama.cpp.git
Create a new Conda environment for this project with a specific version of Python:
conda create --name llama.cpp python=3.12
To make Conda available in your current shell execute the following:
conda init
Tip
You can always revert this via conda init --reverse.
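Before running the build script, activate the newly created Conda environment:
conda activate llama.cpp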
To build llama.cpp binaries for a Windows environment with the best available BLAS acceleration execute the script:
./rebuild_llama.cpp.ps1
Tip
If PowerShell is not configured to execute script files, allow it by executing the following in an elevated PowerShell: Set-ExecutionPolicy RemoteSigned
Download a large language model (LLM) with weights in the GGUF format into the ./vendor/llama.cpp/models directory. You can for example download the gemma-2-9b-it model in a quantized GGUF format:
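The following sketch assumes the quantization is hosted on Hugging Face under bartowski/gemma-2-9b-it-GGUF; adjust the URL to the source and quantization you actually want:
# Download the quantized model file into the llama.cpp models directory.
Invoke-WebRequest `
-Uri "https://huggingface.co/bartowski/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-IQ4_XS.gguf" `
-OutFile ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf"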
Tip
See the 🤗 Open LLM Leaderboard and LMSYS Chatbot Arena Leaderboard for best in class open source LLMs.
You can easily chat with a specific model by using the .\examples\server.ps1 script:
.\examples\server.ps1 -model ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf"
Note
The script will automatically start the llama.cpp server with an optimal configuration for your machine.
Execute the following to get detailed help on further options of the server script:
Get-Help -Detailed .\examples\server.ps1
You can now chat with the model:
./vendor/llama.cpp/build/bin/Release/llama-cli `
--model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
--ctx-size 8192 `
--threads 16 `
--n-gpu-layers 33 `
--reverse-prompt '[[USER_NAME]]:' `
--prompt-cache "./cache/gemma-2-9b-it-IQ4_XS.gguf.prompt" `
--file "./vendor/llama.cpp/prompts/chat-with-vicuna-v1.txt" `
--color `
--interactive
You can start llama.cpp as a webserver:
./vendor/llama.cpp/build/bin/Release/llama-server `
--model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
--ctx-size 8192 `
--threads 16 `
--n-gpu-layers 33
And then access the llama.cpp web interface at http://127.0.0.1:8080 (the default host and port).
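You can also query the server's HTTP API directly from PowerShell, for example via the /completion endpoint (a minimal sketch assuming the default address shown above):
# Send a single completion request to the running llama.cpp server.
Invoke-RestMethod `
-Uri "http://127.0.0.1:8080/completion" `
-Method Post `
-ContentType "application/json" `
-Body (@{ prompt = "Building a website can be done in 10 simple steps:"; n_predict = 64 } | ConvertTo-Json)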
You can increase the context size of a model with minimal quality loss by setting the RoPE parameters. The formulas for the parameters are as follows:
context_scale = increased_context_size / original_context_size
rope_frequency_scale = 1 / context_scale
rope_frequency_base = 10000 * context_scale
Note
To increase the context size of an openchat-3.6-8b-20240522 model from its original context size of 8192 to 32768 means that the context_scale is 4.0. The rope_frequency_scale will then be 0.25 and the rope_frequency_base equals 40000.
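The calculation can also be scripted; a small helper sketch (not part of the repository):
# Calculate the RoPE parameters for an increased context size.
$originalContextSize = 8192
$increasedContextSize = 32768

$contextScale = $increasedContextSize / $originalContextSize # 4.0
$ropeFrequencyScale = 1 / $contextScale # 0.25
$ropeFrequencyBase = 10000 * $contextScale # 40000

"--rope-freq-scale $ropeFrequencyScale --rope-freq-base $ropeFrequencyBase"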
To extend the context to 32k execute the following:
./vendor/llama.cpp/build/bin/Release/llama-cli `
--model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
--ctx-size 32768 `
--rope-freq-scale 0.25 `
--rope-freq-base 40000 `
--threads 16 `
--n-gpu-layers 33 `
--reverse-prompt '[[USER_NAME]]:' `
--prompt-cache "./cache/gemma-2-9b-it-IQ4_XS.gguf.prompt" `
--file "./vendor/llama.cpp/prompts/chat-with-vicuna-v1.txt" `
--color `
--interactive
You can enforce a specific grammar for the response generation. The following will always return a JSON response:
./vendor/llama.cpp/build/bin/Release/llama-cli `
--model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
--ctx-size 8192 `
--threads 16 `
--n-gpu-layers 33 `
--prompt-cache "./cache/gemma-2-9b-it-IQ4_XS.gguf.prompt" `
--prompt "The scientific classification (Taxonomy) of a Llama: " `
--grammar-file "./vendor/llama.cpp/grammars/json.gbnf" `
--color
Execute the following to measure the perplexity of the GGUF formatted model:
./vendor/llama.cpp/build/bin/Release/llama-perplexity `
--model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
--ctx-size 8192 `
--threads 16 `
--n-gpu-layers 33 `
--file "./vendor/wikitext-2-raw-v1/wikitext-2-raw/wiki.test.raw"
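The wiki.test.raw file referenced above has to be downloaded first. One way to do this (the download URL is an assumption based on the mirror used by the llama.cpp helper scripts):
# Fetch and extract the wikitext-2-raw-v1 dataset into the expected directory.
Invoke-WebRequest `
-Uri "https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip" `
-OutFile ".\vendor\wikitext-2-raw-v1.zip"

Expand-Archive `
-Path ".\vendor\wikitext-2-raw-v1.zip" `
-DestinationPath ".\vendor\wikitext-2-raw-v1"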
You can easily count the tokens of a prompt for a specific model by using the .\examples\count_tokens.ps1 script:
.\examples\count_tokens.ps1 `
-model ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf" `
-file ".\prompts\chat_with_llm.txt"
To inspect the actual tokenization result you can use the -debug flag:
.\examples\count_tokens.ps1 `
-model ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf" `
-prompt "Hello Word!" `
-debug
Note
The script is a simple wrapper for the tokenize.cpp example of the llama.cpp project.
Execute the following to get detailed help on further options of the count_tokens script:
Get-Help -Detailed .\examples\count_tokens.ps1
Every time there is a new release of llama.cpp you can simply execute the script to automatically rebuild everything:
| Command | Description |
| --- | --- |
| ./rebuild_llama.cpp.ps1 | Automatically detects the best BLAS acceleration |
| ./rebuild_llama.cpp.ps1 -blasAccelerator "OFF" | Without any BLAS acceleration |
| ./rebuild_llama.cpp.ps1 -blasAccelerator "OpenBLAS" | With CPU BLAS acceleration |
| ./rebuild_llama.cpp.ps1 -blasAccelerator "CUDA" | With NVIDIA GPU BLAS acceleration |
You can build a specific version of llama.cpp by specifying a git tag or commit:
| Command | Description |
| --- | --- |
| ./rebuild_llama.cpp.ps1 | The latest release |
| ./rebuild_llama.cpp.ps1 -version "b1138" | The tag b1138 |
| ./rebuild_llama.cpp.ps1 -version "1d16309" | The commit 1d16309 |