llama2-py-inference

Pythonized version of llama2.c

Aydyn Tairov has developed a package written entirely in Python, with no recourse to any external dependency.

The external packages used by this version are limited to numpy: it considerably improves performance compared to pure Python with no dependencies at all, and it also better represents algebraic operations, which is always nice for educational purposes.

The primary purpose of this library is to serve as scaffolding that will eventually help me port llama2.c to Haskell. This has some impact on the coding style: functions are called explicitly, operator overloading is avoided, and type annotations are used as much as possible. The short sketch below illustrates that style.
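As an illustration only (this is a minimal sketch written for this README, not code taken from the repository), a matrix-vector product in that style uses an explicit call to numpy.matmul rather than the overloaded @ operator, and annotates every array:

import numpy
from numpy.typing import NDArray


def matrix_vector_product(matrix: NDArray[numpy.float32],
                          vector: NDArray[numpy.float32]) -> NDArray[numpy.float32]:
    # Explicit function call instead of the overloaded "matrix @ vector" syntax.
    return numpy.matmul(matrix, vector)


weights: NDArray[numpy.float32] = numpy.ones((3, 4), dtype=numpy.float32)
activation: NDArray[numpy.float32] = numpy.arange(4, dtype=numpy.float32)
result: NDArray[numpy.float32] = matrix_vector_product(weights, activation)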

An additional word on coding style: unlike the vast majority of numpy users, I do not adhere to the import numpy as np tradition, because I do not like it. For aesthetic reasons first, and also because aliasing imports is not the generally accepted way of handling external packages in Python.

numpy (and pandas) are both exceptions, only because they are remotely related to the BLAS library (in fact the original purpose of numpy was to provide access to that library), which was written in Fortran in the days of punch cards, when every character was thoroughly counted indeed. So, bottom line: I am not going to shorten numpy into np only because someone wrote the BLAS library in Fortran in the 70s.

Poetry environment

The dependencies and environment are managed using Poetry.

Pre-requisites

You will need to download a model checkpoint, for example one of the TinyStories models from llama2.c:

wget --directory-prefix=data https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin

Important: first run poetry install from the command line so that the modules in the src directory become accessible.

Running llama2

poetry run llama2 data/stories15M.bin 0.8 256 "In that small Swiss town"

For testing purposes, you can set the seed option to some value to always get the same output:

poetry run llama2 --seed=1 data/stories15M.bin 0.8 256 "In that small Swiss town"

Generated output for that particular seed:

<s>
In that small Swiss town, there was a little girl named Lily. She loved to sit in her chair and read her favorite novel. One day, Lily asked her mom if she could help her put some sour candies in her book. 
Her mom said, "Sure, Lily! Let's start by getting some candies." 
Lily went to the kitchen and grabbed a bag of sour candies. She put them in her book and poured the candies into her book. 
Lily's mom smiled and said, "Wow, Lily! You're such a good helper. Did you have fun with your book?" 
Lily replied, "Yes, Mommy! I folded my book and brought some candy to share with you." 
Her mom smiled and said, "That's very kind of you, Lily. You're such a good helper!" 
Lily felt proud of herself and continued to read her books, feeling happy and content.
<s>

Attention architecture

The following description of a single attention head is adapted from Wikipedia:

The Q and K sub-networks of a single "attention head" calculate the soft weights, originating from the word "that". (Encoder-only QKV variant). The sentence is sent through 3 parallel streams (left), which emerge at the end as the context vector (right). The word embedding size is 300 and the neuron count is 100 in each sub-network of the attention head.

The capital letter X denotes a matrix sized 4 × 300, consisting of the embeddings of all four words. The small underlined letter x denotes the embedding vector (sized 300) of the word "that". The attention head includes three (vertically arranged in the illustration) sub-networks, each having 100 neurons with a weight matrix sized 300 × 100.

The asterisk within parentheses "(*)" denotes softmax( qK^T / √100 ), i.e. not yet multiplied by the matrix V. Rescaling by √100 prevents a high variance in qK^T that would let a single word excessively dominate the softmax, resulting in attention to only one word, as a discrete hard max would do.

Notation: the commonly written row-wise softmax formula above assumes that vectors are rows, which contradicts the standard mathematical notation of column vectors. More correctly, we should take the transpose of the context vector and use the column-wise softmax, resulting in

Context = (X V_w)^T × softmax( (K_w X^T) × (x Q_w)^T / √100 )
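To make those dimensions concrete, here is a minimal numpy sketch of a single attention head using the sizes quoted above (4 words, embedding size 300, 100 neurons per sub-network). The weight matrices are random placeholders rather than trained parameters, so only the shapes and the flow of the computation match the description:

import numpy

rng = numpy.random.default_rng(1)

sequence_length = 4      # number of words in the sentence
embedding_size = 300     # size of each word embedding
head_size = 100          # neurons in each of the Q, K, V sub-networks

x = rng.standard_normal((sequence_length, embedding_size))   # X: 4 x 300
w_query = rng.standard_normal((embedding_size, head_size))   # Q_w: 300 x 100
w_key = rng.standard_normal((embedding_size, head_size))     # K_w: 300 x 100
w_value = rng.standard_normal((embedding_size, head_size))   # V_w: 300 x 100

queries = numpy.matmul(x, w_query)   # 4 x 100
keys = numpy.matmul(x, w_key)        # 4 x 100
values = numpy.matmul(x, w_value)    # 4 x 100

# Scaled scores q K^T / sqrt(100), one row per query word.
scores = numpy.matmul(queries, numpy.transpose(keys)) / numpy.sqrt(head_size)

# Row-wise softmax: this is the "(*)" term described above.
weights = numpy.exp(scores - numpy.max(scores, axis=1, keepdims=True))
weights = weights / numpy.sum(weights, axis=1, keepdims=True)

# Context vectors: the soft weights applied to V, one 100-dimensional vector per word.
context = numpy.matmul(weights, values)   # 4 x 100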
