GPT-2 inference engine written in Zig. Generation time: ~28ms per token.
- No third-party dependencies besides BLAS (Accelerate or OpenBLAS).
- No memory allocations at runtime (see the buffer sketch below).
- Can run NanoGPT checkpoints.
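The zero-allocation design means every activation buffer is sized for the model's maximum sequence length up front and reused across layers and tokens. A minimal sketch of the idea follows; the dimensions and struct layout are illustrative, not this repo's actual code:

```zig
const std = @import("std");

// Illustrative GPT-2 small dimensions (assumed, not taken from this project).
const n_embd = 768;
const max_seq_len = 1024;

/// All scratch space the forward pass needs, created once at startup.
/// Every layer reads and writes these same buffers, so the hot loop
/// never touches the heap.
const Buffers = struct {
    ln: [max_seq_len * n_embd]f32, // LayerNorm output
    attn: [max_seq_len * n_embd]f32, // attention output
    mlp: [max_seq_len * 4 * n_embd]f32, // MLP hidden activations
};

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    const bufs = try allocator.create(Buffers); // the one startup allocation
    defer allocator.destroy(bufs);
    _ = bufs; // each forward pass would reuse these buffers in place
}
```

The struct is heap-allocated once at startup (it is far too large for the stack); after that, token generation can run without ever calling an allocator.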
Download the GPT-2 checkpoint from OpenAI:
python3 download_weights.py
Build the Zig binary and run it with a prompt to generate completions:
zig build -Doptimize=ReleaseFast
./zig-out/bin/zig_gpt2 "Marcus Aurelius said"
Generate test data by forwarding random tensors through PyTorch ops:
python3 generate_test_data.py
Run the tests, which verify that the Zig ops produce the same output as their PyTorch counterparts:
zig build test
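The real tests compare against tensors dumped by `generate_test_data.py`; that file format isn't shown here. As a self-contained illustration of the same testing style, here is a property test for a softmax op (the `softmax` function and test name are hypothetical):

```zig
const std = @import("std");

/// Numerically stable softmax over a slice, in place:
/// subtract the max before exponentiating to avoid overflow.
fn softmax(x: []f32) void {
    var max: f32 = x[0];
    for (x[1..]) |v| max = @max(max, v);
    var sum: f32 = 0.0;
    for (x) |*v| {
        v.* = @exp(v.* - max);
        sum += v.*;
    }
    for (x) |*v| v.* /= sum;
}

test "softmax output is a valid probability distribution" {
    var logits = [_]f32{ 1.0, 2.0, 3.0, 4.0 };
    softmax(logits[0..]);
    var sum: f32 = 0.0;
    for (logits) |p| {
        try std.testing.expect(p > 0.0); // every probability is positive
        sum += p;
    }
    // probabilities sum to 1 within floating-point tolerance
    try std.testing.expectApproxEqAbs(@as(f32, 1.0), sum, 1e-6);
}
```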
Implementation:
- ✅ Implement basic ops: Embedding, Linear, LayerNorm, GELU, Softmax, CausalSelfAttention (see the LayerNorm sketch after this list).
- ✅ Implement transformer modules: MLP, Transformer block.
- ✅ Implement the full GPT model.
- ✅ Implement sampling from the model.
- ✅ Implement BPE encoding/decoding.
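To give a flavor of the basic ops, here is what a LayerNorm over the last dimension might look like. This is a sketch with an assumed signature, not this repo's actual implementation:

```zig
const std = @import("std");

/// LayerNorm over the last dimension: each row of `dim` elements is
/// normalized to zero mean and unit variance, then scaled by `gamma`
/// and shifted by `beta`. Signature is illustrative.
pub fn layerNorm(x: []f32, dim: usize, gamma: []const f32, beta: []const f32) void {
    const eps: f32 = 1e-5; // PyTorch's default epsilon
    const n: f32 = @floatFromInt(dim);
    var row: usize = 0;
    while (row * dim < x.len) : (row += 1) {
        const r = x[row * dim .. (row + 1) * dim];
        var mean: f32 = 0.0;
        for (r) |v| mean += v;
        mean /= n;
        var variance: f32 = 0.0;
        for (r) |v| variance += (v - mean) * (v - mean);
        variance /= n;
        const inv_std = 1.0 / @sqrt(variance + eps);
        for (r, 0..) |*v, i| {
            v.* = (v.* - mean) * inv_std * gamma[i] + beta[i];
        }
    }
}
```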
Efficiency:
- ✅ Replace custom linear algebra kernels with BLAS.
- ✅ Stream output as each new token is generated.
- ✅ Create a central set of memory buffers and reuse them for each layer; no allocations at runtime.
- ✅ Add KV cache.
- Parallelize softmax and GELU operations (one possible threading approach is sketched below).
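The open parallelization item could take the following shape: split the tensor into contiguous chunks and run an elementwise op such as GELU on each chunk in its own thread (softmax could be split the same way along row boundaries). This is one possible approach, not a committed design; the function names and the thread-count cap are made up:

```zig
const std = @import("std");

/// GPT-2's tanh-approximation GELU, applied to one chunk.
fn geluChunk(x: []f32) void {
    const k: f32 = 0.79788456; // sqrt(2/pi)
    for (x) |*v| {
        const u = v.*;
        v.* = 0.5 * u * (1.0 + std.math.tanh(k * (u + 0.044715 * u * u * u)));
    }
}

/// Apply GELU across up to `n_threads` contiguous chunks in parallel.
/// Elementwise ops have no cross-chunk dependencies, so no locking is needed.
pub fn geluParallel(x: []f32, n_threads: usize) !void {
    var threads: [8]std.Thread = undefined; // arbitrary cap for the sketch
    const n = @min(@max(n_threads, 1), threads.len);
    const chunk = (x.len + n - 1) / n; // ceil(x.len / n)
    var spawned: usize = 0;
    var start: usize = 0;
    while (start < x.len) : (start += chunk) {
        const end = @min(start + chunk, x.len);
        threads[spawned] = try std.Thread.spawn(.{}, geluChunk, .{x[start..end]});
        spawned += 1;
    }
    for (threads[0..spawned]) |t| t.join();
}
```

Spawning threads per call keeps the sketch simple; a real implementation would more likely keep a persistent thread pool so the per-token hot path pays no spawn cost.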