llamafile v0.8
llamafile lets you distribute and run LLMs with a single file
llamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023. It offers superior performance and binary portability: a single llamafile runs on the stock installs of six OSes without needing to be installed. llamafile goes 2x faster than llama.cpp and 25x faster than ollama for some use cases, such as CPU prompt evaluation. It has a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a shell-scriptable CLI interface which together put you in control of artificial intelligence.
This release further improves performance and introduces support for new models.
- Support for LLaMA3 is now available
- Support for Grok has been introduced
- Support for Mixtral 8x22b has been introduced
- Support for Command-R models has been introduced
- MoE models (e.g. Mixtral, Grok) now go 2-5x faster on CPU 4db03a1
- F16 is now 20% faster on Raspberry Pi 5 (TinyLLaMA 1.1b prompt eval improved 62 -> 75 tok/sec)
- F16 is now 30% faster on Skylake (TinyLLaMA 1.1b prompt eval improved 171 -> 219 tok/sec)
- F16 is now 60% faster on Apple M2 (Mistral 7b prompt eval improved 79 -> 128 tok/sec)
- Add ability to override chat template in web gui when creating llamafiles da5cbe4
- Improve markdown and syntax highlighting in server (#88)
- CPU feature detection has been improved
Downloads
You can download prebuilt llamafiles from:
- https://huggingface.co/jartine (llamafiles quantized and compiled by us)
- https://huggingface.co/models?library=llamafile (llamafiles built by our user community)
Errata
- The new web GUI chat template override feature isn't working as intended. If you want to use LLaMA3 8B, you need to manually copy and paste the chat templates from our README into the llamafile web GUI (see the format sketch after this list).
- The llamafile-quantize program may fail with an assertion error when K-quantizing weights from an F32 converted file. You can work around this by asking llama.cpp's convert.py script to output an FP16 GGUF file, and then running llamafile-quantize on that instead (see the sketch below).
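For orientation on the first point, here is a rough sketch of the LLaMA 3 Instruct prompt layout that the README's templates follow. The authoritative strings to paste into the web GUI are the ones in the README; the {system} and {user} fields below are ordinary Python format placeholders used only for illustration.

```python
# Illustrative sketch of the LLaMA 3 Instruct prompt layout; the exact template
# strings for the web GUI live in the README. {system} and {user} are plain
# Python format fields used only for this example.
LLAMA3_PROMPT = (
    "<|begin_of_text|>"
    "<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

print(LLAMA3_PROMPT.format(system="You are a helpful assistant.",
                           user="Say hello."))
```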
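For the second point, this is a rough sketch of the workaround, assuming llama.cpp's convert.py and the llamafile-quantize binary are reachable from your shell; the model directory, file names, and quantization type are placeholders, not values from this release.

```python
# Rough sketch of the quantization workaround: convert to FP16 first, then
# K-quantize. Paths, file names, and the Q4_K_M type are placeholders; adjust
# them for your model.
import subprocess

MODEL_DIR = "Meta-Llama-3-8B-Instruct"   # hypothetical local copy of the weights
F16_GGUF = "llama3-8b.f16.gguf"

# Step 1: have convert.py emit an FP16 GGUF rather than F32, which sidesteps
# the assertion error described above.
subprocess.run(
    ["python", "convert.py", MODEL_DIR, "--outtype", "f16", "--outfile", F16_GGUF],
    check=True,
)

# Step 2: K-quantize the FP16 file with llamafile-quantize.
subprocess.run(
    ["llamafile-quantize", F16_GGUF, "llama3-8b.Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)
```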