SwiftTransformer is a tiny yet powerful implementation of the inference infrastructure for transformer model families. It aims to provide an easy-to-use framework for researchers to try out their ideas and iterate quickly, while also supporting popular features like model/pipeline parallelism, FlashAttention, Continuous Batching, and PagedAttention, so it serves as a solid foundation for research prototypes. Currently, DistServe and FastServe use SwiftTransformer as their execution backend.
It has the following advantages:
- Tiny. It contains only the code essential for LLM inference, so you can get your hands on it and experiment with your research ideas without much effort. In fact, this project was launched after the author tried to implement a research prototype on top of FasterTransformer.
- Efficient. It is written in C++ and adopts custom CUDA kernels from xformers for performance. It also supports features like model/pipeline parallelism, FlashAttention, Continuous Batching and PagedAttention.
- Easy-to-use. It provides PyTorch bindings for easy integration with Python, so you can easily build your own prototype in Python on top of it.
- Well-documented. It has detailed documentation for researchers to hack around easily.
NOTE: For users who want to run LLM inference off-the-shelf, please refer to the higher-level LLM serving systems written in Python on top of SwiftTransformer (such as DistServe and FastServe). They all contain detailed documentation about environment setup.
If you want to build your own project on top of SwiftTransformer, follow these steps:

```bash
# set up and activate the conda environment
conda env create -f environment.yml && conda activate SwiftTransformer

# build SwiftTransformer
cmake -B build && cmake --build build -j$(nproc)
```
If everything works fine, you should see `libst_pybinding.so` under the `SwiftTransformer/build/lib` directory. You can load this dynamic library in your Python project.
We provide a simple example to run the OPT-1.3B model. Again, if you want to run LLM inference off-the-shelf, please see DistServe and FastServe.
- Download the tokenizer.

  ```bash
  mkdir models
  wget https://raw.githubusercontent.com/facebookresearch/metaseq/main/projects/OPT/assets/gpt2-merges.txt -O models/gpt2-merges.txt
  wget https://raw.githubusercontent.com/facebookresearch/metaseq/main/projects/OPT/assets/gpt2-vocab.json -O models/gpt2-vocab.json
  ```
- Download the OPT-1.3B model weights.

  ```bash
  wget https://dl.fbaipublicfiles.com/opt/v1_20230405/1.3b/reshard-model_part-0.pt -O models/opt-1.3b.pt
  ```

  Note: Please do not choose OPT-350M, since its architecture is different from the others.
- Convert the weight format. The weight file is stored in `.pt` format (generated by `torch.save()`), which cannot be loaded by LibTorch, so we need to convert it. Use

  ```bash
  python3 scripts/converter.py --input <path/to/your/downloaded/model> --output <path/to/converted/weights> --dtype <datatype (fp16 or fp32)> --model <modelname (opt or llama2)>
  ```

  For example:

  ```bash
  python3 scripts/converter.py --input models/opt-1.3b.pt --output models/opt-1.3b-conv-fp16.pt --dtype fp16 --model opt
  ```
- Prepare your input. Use

  ```bash
  python3 scripts/encode_input.py <path/to/vocab.json> <path/to/merges.txt>
  ```

  to encode your input. This script reads your requests from stdin (one per line) and writes the encoded input to stdout.

  ```bash
  mkdir inputs
  printf "Life blooms like a flower. Far away or by the road. Waiting for the one, to\nA quick brown fox\nArtificial intelligence is\nTo be or not to be," > inputs/input1_plain.txt
  python3 scripts/encode_input.py models/gpt2-vocab.json models/gpt2-merges.txt < inputs/input1_plain.txt > inputs/input1_encoded.txt
  ```
- Run the model.

  ```bash
  build/bin/run_opt models/opt-1.3b-conv-fp16.pt 1.3b models/gpt2-vocab.json fp16 inputs/input1_encoded.txt
  ```
We provide various unit tests to verify the correctness of the model's components. To run the tests, compile the project and then execute `bin/unittest_XXX` in the `build` directory.
Currently, the code is organized as follows:
```
src
├── csrc
│   ├── kernel
│   ├── layer
│   ├── model
│   ├── pybinding.cc
│   └── util
├── examples
│   ├── benchmark_all_input_same.cc
│   ├── CMakeLists.txt
│   ├── lib
│   └── run_gpt.cc
└── unittest
    ├── kernel
    ├── layer
    ├── model
    ├── unittest_torch_utils.h
    ├── unittest_utils.h
    └── util
```
The `csrc` folder contains the core implementation of the model, including every kernel, layer, and model.

The `unittest` folder contains unit tests for the components in `csrc`. The `kernel`, `layer`, `model`, and `util` folders under the `unittest` folder contain the tests for the corresponding components. For example, `src/unittest/layer/attention.cc` contains the unit test for the `Attention` layer, which is implemented in `src/csrc/layer/attention.cc`.
Note for VS Code users: If you encounter `#include errors detected. Please update your includePath.`, you may need to update the include path in `.vscode/c_cpp_properties.json`.
- Well-documented. We strongly believe that a well-documented codebase boosts research efficiency, so we try our best to document every function and class. Typically we explain the purpose of a function and the meaning of its arguments right before its implementation in the `.cc` file.
- POP-styled design. Unlike FasterTransformer, which adopts an object-oriented programming (OOP) design, we adopt a more procedure-oriented programming (POP) style. We believe POP is better suited to research projects, since it is easier to extend and modify the code. Ask why we need OOP, and the answer is usually "to hide the details"; in research projects, however, we need to know, and alter, the details. Therefore all kernels and layers are implemented in POP style, as illustrated in the first sketch after this list.
- Extensive unit tests. Every kernel and layer is paired with a unit test. We believe unit tests are essential for research projects, since they help us verify the correctness of our implementation. We use googletest as our unit-test framework. With `TYPED_TEST` from googletest, we can test our kernels and layers with different data types (e.g. `float` and `half`) without writing redundant code; see the second sketch after this list.
- LibTorch for reference in unit tests. For the "reference" part of the unit tests, we use LibTorch to implement the same kernel or layer. LibTorch is well-tested, so we can use it as a reference to verify the correctness of our implementation.
- Raw pointers instead of `at::Tensor`. We prefer raw C pointers over `at::Tensor` (the tensor class provided by LibTorch, the C++ frontend of PyTorch), since we need fine-grained control over the memory layout. The kernel sketch below illustrates this as well.
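
To make the POP and raw-pointer points concrete, here is a minimal sketch of what a kernel written in this style looks like. It is illustrative only: the kernel name, signature, and launch configuration are made up for this example and are not the actual SwiftTransformer API.

```cpp
// Illustrative sketch only: add_bias / add_bias_kernel are hypothetical
// names, not real SwiftTransformer functions.
#include <cuda_runtime.h>

// POP style: the kernel is a free template function that takes raw device
// pointers and explicit sizes, instead of a class wrapping at::Tensor.
template <typename T>
__global__ void add_bias_kernel(T* __restrict__ out, const T* __restrict__ in,
                                const T* __restrict__ bias,
                                int num_tokens, int hidden_size) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < num_tokens * hidden_size) {
    out[idx] = in[idx] + bias[idx % hidden_size];
  }
}

// The host-side launcher is also a free function, so callers see exactly how
// memory is laid out ([num_tokens, hidden_size], row-major, contiguous) and
// which CUDA stream the kernel runs on.
template <typename T>
void add_bias(T* out, const T* in, const T* bias,
              int num_tokens, int hidden_size, cudaStream_t stream) {
  int total = num_tokens * hidden_size;
  int block = 256;
  int grid = (total + block - 1) / block;
  add_bias_kernel<T><<<grid, block, 0, stream>>>(out, in, bias,
                                                 num_tokens, hidden_size);
}
```

Because nothing is hidden behind a class, changing the memory layout or fusing another operation into such a kernel is a local edit to one function.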
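Below is a minimal, self-contained sketch of how `TYPED_TEST` instantiates one test body for several data types. It is not taken from the actual test suite (the real tests compare CUDA kernels against a LibTorch reference); the test and type names are hypothetical, and it uses `float`/`double` so it builds without CUDA headers.

```cpp
// Hypothetical typed test; link against gtest_main to get a main().
#include <gtest/gtest.h>
#include <vector>

template <typename T>
class AddBiasTest : public ::testing::Test {};

// The test body below is compiled and run once per listed type.
using TestTypes = ::testing::Types<float, double>;
TYPED_TEST_SUITE(AddBiasTest, TestTypes);

TYPED_TEST(AddBiasTest, MatchesReference) {
  std::vector<TypeParam> in{1, 2, 3};
  std::vector<TypeParam> bias{10, 10, 10};
  for (size_t i = 0; i < in.size(); ++i) {
    // In the real tests, the expected value would come from a LibTorch
    // implementation of the same operation.
    EXPECT_EQ(in[i] + bias[i], static_cast<TypeParam>(11 + i));
  }
}
```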