GGUF parser in Python with NumPy-vectorized dequantization of GGML tensors.
- This code has only been tested for the TinyLlama model. It might (or might not) work for other models. If any issues arise, a probable source might be the weird transposition of the key and query weights.
- If you want maximum performance, you should probably use a C implementation instead.
Install NumPy:
pip install numpy
Download the Q4_K_M
model file from https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tree/main
mkdir -p 'data/TinyLlama-1.1B-Chat-v1.0-GGUF'
wget 'https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf?download=true' -O 'data/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf'
Install pygguf:
git clone https://github.com/99991/pygguf.git
cd pygguf
pip install -e .
import gguf
filename = "data/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
with open(filename, "rb") as f:
# Load metadata
info, tensorinfo = gguf.load_gguf(f)
# Print metadata
for key, value in info.items():
print(f"{key:30} {repr(value)[:100]}")
# Load tensors
for name in tensorinfo:
weights = gguf.load_gguf_tensor(f, tensorinfo, name)
print(name, type(weights), weights.shape)
For testing, follow these steps:
- Install required libraries (only required for testing)
pip install tqdm requests safetensors
- Run
python test.py
- This will download the TinyLlama model (safentesors, GGUF) from