
llama.cl

This is a port of Karpathy's llama2.c to idiomatic Common Lisp.

Why? Two reasons:

  • Because Common Lisp is a fantastic language for experimentation, and this makes it easy to explore LLM techniques
  • To serve as a reference implementation for the Common Lisp community

How to run from Emacs/SLIME/Sly

Prerequisites

We assume you have a working Emacs, Lisp and SLIME/Sly setup. Most of the systems llama requires are in Quicklisp; however, Quicklisp isn't in the greatest of health and its systems haven't been updated since June 2023, so you'll need to get at least binary-types from the repository, and LLA if you want to use BLAS/LAPACK libraries for matrix multiplication. Put them in a location accessible to Quicklisp, like ~/common-lisp.
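
The ~/common-lisp directory is picked up by ASDF's default configuration, so systems placed there are found automatically. If you would rather keep the clones somewhere else, one option (a sketch; the directory name below is only an example) is to add that directory to Quicklisp's local-project list from the REPL:

    ;; Make an extra directory visible to Quicklisp (example path)
    (pushnew (merge-pathnames "lisp-clones/" (user-homedir-pathname))
             ql:*local-project-directories* :test #'equal)
    (ql:register-local-projects)   ; rebuild the local system index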

  1. Get the models, pretrained on the TinyStories dataset, from Karpathy's repo (original instructions):

    wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
    wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
    wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin
  2. Load the file run.lisp into an Emacs buffer

  3. Start SLIME with M-x slime (or Sly with M-x sly)

  4. Load LLA with (ql:quickload :lla) (optional - requires setup)

  5. Load LLAMA with (ql:quickload :llama) from the REPL

  6. Move into the package with (in-package :llama)

  7. Initialise the system with (init #P"stories15M.bin" #P"tokenizer.bin" 32000) (adjust paths if necessary)

  8. Generate a story with: (generate *model* *tokenizer*)

You can experiment with the temperature, prompts and various samplers; see the code for all the options.
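
A plain call uses the defaults shown in step 8. The tuned call below is only an illustration; the keyword arguments GENERATE actually accepts are defined in run.lisp and should be checked there:

    ;; Default sampling
    (generate *model* *tokenizer*)
    ;; Hypothetical keywords, for illustration only -- see run.lisp for the real lambda list
    (generate *model* *tokenizer* :prompt "Once upon a time" :temperature 0.8)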

Performance

My machine has a 3.5 GHz 6-core Intel i7 5930 (256 KB/15 MB cache) and 64 GB of DDR4 RAM; with stories15M I get about 2.5 tok/sec with CCL and 3.7 tok/sec with SBCL.
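
To get comparable figures on your own machine, wrap a generation call in TIME once the model is initialised and divide the number of tokens produced by the elapsed seconds:

    ;; Reports wall-clock time, run time and consing for one generation
    (time (generate *model* *tokenizer*))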

If you want to use BLAS for matrix multiplication, you'll get roughly a 10X speed-up. Make sure LLA is loaded before you load LLAMA; if it is, LLAMA will automatically use the BLAS library.

Using LLA, the numbers are 14.4 tok/sec for CCL and 34.4 tok/sec for SBCL.

Usage notes

Keep an eye on the Lisp heap size: the larger models need more memory than some implementations provide by default, so you may need to start your Lisp with a bigger heap. Also note that initialisation binds the dynamic variables *model* and *tokenizer*, which the other functions use.
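
As a sketch, assuming SBCL and the largest of the TinyStories checkpoints, you might start the Lisp with a larger dynamic space and then initialise as usual (the heap size here is only an example):

    ;; SBCL started with, e.g.: sbcl --dynamic-space-size 4096   (megabytes; example value)
    (init #P"stories110M.bin" #P"tokenizer.bin" 32000)
    (generate *model* *tokenizer*)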

Original README.md

For instructions on converting to/from the .bin format, training and other background, see the original repo.
