Wrapyfi enables distributing LLaMA (inference only) on multiple GPUs/machines, each with less than 16GB VRAM
- Download the LLaMA weights using the official form below and install this wrapyfi-examples_llama repository inside a conda or virtual env:

  ```
  git clone https://github.com/modular-ml/wrapyfi-examples_llama.git
  cd wrapyfi-examples_llama
  pip install -r requirements.txt
  pip install -e .
  ```
- Install Wrapyfi within the same environment:

  ```
  git clone https://github.com/fabawi/wrapyfi.git
  cd wrapyfi
  pip install .[pyzmq]
  ```
- Install a Linux image of PyTorch with all LLaMA and Wrapyfi dependencies using the provided Dockerfile. From within this repository, run:

  ```
  docker build --force-rm -t wrapyfi_llama .
  ```

- To test it, run the command below. This opens a terminal with all Python dependencies installed:

  ```
  docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it --rm wrapyfi_llama
  ```
Benchmarks (LLaMA 7B: input_sequence=1024, output_sequence=256, batch_size=32):
- LLaMA 7B: 2 GPUs / same machine -> 2x TITAN Xp 12 GB

  GPU ID | Type | CPU Mem. | Power | GPU Mem. | WPS |
  ---|---|---|---|---|---|
  0 | TITAN Xp 12GB | 2.3 GB | 114 W / 250 W | 8.8 GB | 14 |
  1 | TITAN Xp 12GB | 1.7 GB | 103 W / 250 W | 9.1 GB | 14 |
- LLaMA 7B: 4 GPUs / same machine -> 2x TITAN Xp 12 GB, 2x TITAN X 12 GB

  GPU ID | Type | CPU Mem. | Power | GPU Mem. | WPS |
  ---|---|---|---|---|---|
  0 | TITAN Xp 12GB | 2.4 GB | 79 W / 250 W | 5.6 GB | 12 |
  1 | TITAN Xp 12GB | 1.3 GB | 63 W / 250 W | 5.6 GB | 12 |
  2 | TITAN X 12GB | 1.3 GB | 89 W / 250 W | 5.5 GB | 12 |
  3 | TITAN X 12GB | 1.3 GB | 99 W / 250 W | 6.2 GB | 12 |
UPDATE 1: More benchmarks to follow on the 1080 Ti and 3080 Ti, including 8 (4x4) remote GPUs.
UPDATE 2: Much faster than CPU-offloading approaches; uses about 9 GB of VRAM on each card with a batch size of 32.
- Replace all occurrences of <YOUR_IP> and <YOUR_CHECKPOINT_DIRECTORY> before running the scripts.
- Start the Wrapyfi ZeroMQ broker from within the Wrapyfi repo:

  ```
  cd wrapyfi/standalone
  python zeromq_proxy_broker.py --comm_type pubsubpoll
  ```
- Start the first instance of the Wrapyfi-wrapped LLaMA from within this repo and env (order is important, don't start wrapyfi_device_idx=0 before wrapyfi_device_idx=1):

  ```
  CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1 --wrapyfi_total_devices 2
  ```
- Now start the second instance (within this repo and env):

  ```
  CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0 --wrapyfi_total_devices 2
  ```
- Finally, you will see the output on both terminals.
To run on different machines, the broker must be started on an IP reachable by all of them. Bind the ZeroMQ broker to that IP and pass it to both model instances through the WRAPYFI_ZEROMQ_SOCKET_IP environment variable, e.g.:
```
### (replace 10.0.0.101 with <YOUR_IP>) ###

# broker (modified)
python zeromq_proxy_broker.py --socket_ip 10.0.0.101 --comm_type pubsubpoll

# first instance (modified)
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1 --wrapyfi_total_devices 2

# second instance (modified)
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0 --wrapyfi_total_devices 2
```
To run the model on more machines, make sure that the number of transformer layers (32 for 7B, 40 for 13B, etc.) is divisible by wrapyfi_total_devices. To run on 4 machines, make sure CUDA_VISIBLE_DEVICES is set to the correct GPU for each, and execute these commands in order (a sketch of the resulting layer split follows the commands):
```
### (replace 10.0.0.101 with <YOUR_IP>) ###

# broker (modified)
python zeromq_proxy_broker.py --socket_ip 10.0.0.101 --comm_type pubsubpoll

# first instance (on machine or device 1)
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 3 --wrapyfi_total_devices 4

# second instance (on machine or device 2)
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 2 --wrapyfi_total_devices 4

# third instance (on machine or device 3)
CUDA_VISIBLE_DEVICES="2" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --master_port=29504 --nproc_per_node 1 example.py --ckpt_dir <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1 --wrapyfi_total_devices 4

# fourth instance (on machine or device 4)
CUDA_VISIBLE_DEVICES="3" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --master_port=29505 --nproc_per_node 1 example.py --ckpt_dir <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0 --wrapyfi_total_devices 4
```
Running larger variants of LLaMA requires a few extra modifications. First off, LLaMA ships its larger checkpoints pre-sharded for model parallelism, splitting the keys, values, and queries into predefined chunks (MP = 2 in the case of 13B, meaning it expects consolidated.00.pth and consolidated.01.pth). The current solution is to reshard the files into a single checkpoint. This solution is not ideal, since you'd need a lot of RAM (about 61 GB) when loading the weights initially. I am looking into better ways of splitting the checkpoint so that the transformer blocks are split instead of parallel parts of the model.
For now, reshard into one file (increase your swap space if you don't have enough RAM). Note that this does not affect GPU VRAM: the model only loads the weights required for the specific --wrapyfi_device_idx. For example, with --wrapyfi_total_devices 4, each instance loads approximately a quarter of the model.
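To make the partial loading concrete, here is one way a consolidated checkpoint could be filtered down to a single instance's transformer blocks before loading. The `layers.<i>.` key prefix follows LLaMA's checkpoint naming, but the selection logic is a sketch rather than the exact code this repository uses, and the path in the usage comment is only an example.

```python
# Sketch: keep only the tensors a given instance needs from a consolidated checkpoint.
# Assumption: transformer-block keys start with "layers.<i>."; everything else
# (embeddings, norms, output head) is treated as shared for this illustration.
import re
import torch

def filter_state_dict(ckpt_path: str, layer_ids: range) -> dict:
    state_dict = torch.load(ckpt_path, map_location="cpu")
    wanted = set(layer_ids)
    kept = {}
    for key, tensor in state_dict.items():
        match = re.match(r"layers\.(\d+)\.", key)
        if match is None or int(match.group(1)) in wanted:
            kept[key] = tensor
    return kept

# Example (illustrative path): the instance owning layers 0-9 of the 13B model
# (40 layers split across 4 instances).
# partial = filter_state_dict(
#     "<YOUR_CHECKPOINT_DIRECTORY>/checkpoints/13B_resharded/consolidated.00.pth",
#     range(0, 10),
# )
```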
To reshard your model, download the reshard.py script into your LLaMA checkpoints directory (the parent directory containing 7B, 13B, 30B, and 65B). Now run the following commands:

```
mkdir 13B_resharded
python reshard.py 1 ./13B/ ./13B_resharded/
```
This will take a while, but once done, use the new 13B_resharded directory when setting --ckpt_dir, and otherwise follow the same steps as in the 7B examples above.
For larger models it should theoretically be the same; I just haven't tried them out yet.
This repository is intended as a minimal, hackable and readable example to load LLaMA (arXiv) models and run inference. In order to download the checkpoints and tokenizer, fill out this Google form.
In a conda env with PyTorch / CUDA available, run:

```
pip install -r requirements.txt
```

Then, in this repository:

```
pip install -e .
```
Once your request is approved, you will receive links to download the tokenizer and model files.
Edit the download.sh script with the signed URL provided in the email to download the model weights and tokenizer.
The provided example.py can be run on a single- or multi-GPU node with torchrun and will output completions for two pre-defined prompts. Using TARGET_FOLDER as defined in download.sh:

```
torchrun --nproc_per_node MP example.py --ckpt_dir $TARGET_FOLDER/model_size --tokenizer_path $TARGET_FOLDER/tokenizer.model
```
Different models require different MP (model-parallel) values:
Model | MP |
---|---|
7B | 1 |
13B | 2 |
33B | 4 |
65B | 8 |
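The MP value matches the number of consolidated.*.pth shards in a model's checkpoint directory, and torchrun spawns one process per shard. As a quick sanity check, assuming the standard checkpoint layout, you can count the shards yourself:

```python
# Sketch: infer the MP value from the number of checkpoint shards.
# Assumes the standard LLaMA layout of consolidated.XX.pth files per model directory.
from pathlib import Path

def infer_mp(ckpt_dir: str) -> int:
    shards = sorted(Path(ckpt_dir).glob("consolidated.*.pth"))
    if not shards:
        raise FileNotFoundError(f"no consolidated.*.pth shards found in {ckpt_dir}")
    return len(shards)

# e.g. infer_mp("<TARGET_FOLDER>/13B") should return 2, matching --nproc_per_node 2
```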
- 1. The download.sh script doesn't work on default bash in MacOS X
- 2. Generations are bad!
- 3. CUDA Out of memory errors
- 4. Other languages
For the model card, see MODEL_CARD.md.
For the license, see the LICENSE file.