This repository showcases examples of using libraries such as llamafile for efficient deployment of large language models (LLMs) on consumer-grade CPU hardware, with an emphasis on high-throughput, memory-efficient inference.
Introduction • Demo Notebooks • References • Issues • TODOs
Open-source large language models (LLMs) are democratizing a variety of applications, but most of these models still face a fundamental hurdle: they demand large amounts of memory and compute (e.g., GPUs). To address this, a growing number of libraries and frameworks for LLM inference and serving are being developed.
This repository focuses on demonstrating some of these packages, which offer low-latency, high-throughput, and cost-effective inference. Several of the notebooks in this repo demonstrate how to:
- Execute LLMs on CPUs instead of GPU hardware.
- Execute quantized Llama-2 models that are ~4 GB in size.
- Obtain hidden-dimension embeddings from GGUF models.
llamafile lets you distribute and run LLMs with a single executable file. It turns LLM weights into runnable llama.cpp binaries using Cosmopolitan Libc, so one binary executes on six operating systems and can run on CPUs or GPUs. The following notebooks show examples of how to call and execute LLMs using the llamafile library. The files in the llamafile-assets folder were downloaded from here.
8-Core CPU Executing llamafile Command-line Binary with Mistral-7B
1.) llamafile command-line binary: this notebook demonstrates how to execute jartine/mistral-7b.llamafile from the command line and then save the model's output to a text file.
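A minimal sketch of what the notebook does, using Python's `subprocess` to invoke the downloaded binary (the binary name and the `-p`/`-n` flags follow llama.cpp's `main` conventions; check `--help` for the release you downloaded):

```python
import subprocess

# Run the self-contained llamafile binary in command-line mode and
# capture its completion. The prompt and token count are illustrative.
result = subprocess.run(
    ["./mistral-7b.llamafile", "-p", "Explain quantization in one paragraph.", "-n", "256"],
    capture_output=True,
    text=True,
)

# Save the model's output to a text file.
with open("model_output.txt", "w") as f:
    f.write(result.stdout)
```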
2.) llamafile with external model weights: this notebook demonstrates how to execute an LLM downloaded in the .GGUF file format using llamafile-main and then save the model's output to a text file.
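A hedged sketch of the same pattern with external weights: the generic `llamafile-main` launcher is pointed at a locally downloaded `.gguf` file via the llama.cpp-style `-m` flag (the model path below is a placeholder):

```python
import subprocess

# Point llamafile-main at externally downloaded GGUF weights.
result = subprocess.run(
    [
        "./llamafile-assets/llamafile-main",
        "-m", "./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
        "-p", "Write a haiku about CPU inference.",
        "-n", "128",
    ],
    capture_output=True,
    text=True,
)

# Save the model's output to a text file.
with open("gguf_model_output.txt", "w") as f:
    f.write(result.stdout)
```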
LLaMa.cpp (or LLaMa C++) provides a lighter, more portable alternative to heavyweight frameworks. Developed by Georgi Gerganov, it implements Meta's LLaMA architecture in efficient C/C++ and has one of the most dynamic open-source communities around LLM inference [Source].
1.) llama.cpp embeddings: this notebook demonstrates how to get hidden-dimension embeddings from a single pass through a GGUF model. Once the embeddings are available, they can be used for ML/AI techniques such as classification, text similarity, and clustering.
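One way to reproduce this outside the notebook is with the `llama-cpp-python` bindings (an assumption here, not necessarily what the notebook uses; the model path is a placeholder):

```python
from llama_cpp import Llama

# Load the GGUF model with embeddings enabled.
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", embedding=True)

# A single pass through the model yields one float per hidden dimension.
response = llm.create_embedding("LLMs can run efficiently on consumer CPUs.")
vector = response["data"][0]["embedding"]
print(len(vector))  # e.g., 4096 hidden dimensions for a 7B Llama-2 model
```

The resulting vectors can be fed directly into scikit-learn classifiers, cosine-similarity search, or clustering algorithms.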
- File Formats:
- GGML (Georgi Gerganov's Machine Learning)
- GGUF (GPT-Generated Unified Format)
- How do I create a GGUF model file? (see the sketch after this list)
- Download GGUF files from Hugging Face - The Bloke; also see TheBloke (Tom Jobbins).
- Libraries and Frameworks:
- Articles and Blogs:
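As a rough sketch of GGUF creation with llama.cpp's conversion tooling (hedged: the script and binary names vary between llama.cpp releases, and the model paths below are placeholders):

```python
import subprocess

# Convert a local Hugging Face checkpoint to an f16 GGUF file.
# Older llama.cpp releases ship this script as convert.py; newer
# ones as convert_hf_to_gguf.py.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", "./models/my-hf-model",
     "--outfile", "./models/my-model-f16.gguf", "--outtype", "f16"],
    check=True,
)

# Optionally quantize down to ~4-bit for CPU inference (the quantize
# binary is built as part of llama.cpp).
subprocess.run(
    ["./quantize", "./models/my-model-f16.gguf",
     "./models/my-model-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)
```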
This repository will be maintained on a best-effort basis. If you face any issues or want to make improvements, please raise an Issue or submit a Pull Request. 😃
- Feel free to raise an Issue for a feature you would like to see added.