Sarathi-Serve is a high througput and low-latency LLM serving framework. Please refer to our OSDI'24 paper for more details.
Sarathi-Serve has been tested with CUDA 12.3 on H100 and A100 GPUs.
git clone git@github.com:microsoft/sarathi-serve.git
Setup mamba if you don't already have it,
wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
bash Mambaforge-Linux-x86_64.sh # follow the instructions from there
Create a Python 3.10 environment,
mamba create -p ./env python=3.10
pip install -e . --extra-index-url https://flashinfer.ai/whl/cu121/torch2.3/
Refer to readmes in individual folders corresponding to each figure in osdi-experiments
.
If you use our work, please consider citing our paper:
@article{agrawal2024taming,
title={Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve},
author={Agrawal, Amey and Kedia, Nitin and Panwar, Ashish and Mohan, Jayashree and Kwatra, Nipun and Gulavani, Bhargav S and Tumanov, Alexey and Ramjee, Ramachandran},
journal={Proceedings of 18th USENIX Symposium on Operating Systems Design and Implementation, 2024, Santa Clara},
year={2024}
}
This repository originally started as a fork of the vLLM project. Sarathi-Serve is a research prototype and does not have complete feature parity with open-source vLLM. We have only retained the most critical features and adopted the codebase for faster research iterations.