# First Official Release
ScaleLLM is a high-performance inference system for large language models, designed for production environments. It supports the most popular open-source models, including Llama2, Bloom, GPT-NeoX, and more.
- High Performance: ScaleLLM is optimized for high-performance LLM inference.
- Tensor Parallelism: Utilizes tensor parallelism for efficient model execution.
- OpenAI-Compatible API: An efficient Golang REST API server that is compatible with the OpenAI API.
- Hugging Face Model Integration: Seamless integration with most popular Hugging Face models.
- Customizable: Offers flexibility for customization to meet your specific needs.
- Production Ready: Designed to be deployed in production environments.
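Because the server exposes an OpenAI-compatible API, any OpenAI-style HTTP client can talk to it. Below is a minimal sketch of an OpenAI-style chat completion request body; the base URL, port, and endpoint path are assumptions for illustration, not official defaults — check your deployment's configuration.

```python
import json

# Hypothetical base URL for a local deployment (assumption, not an
# official ScaleLLM default).
BASE_URL = "http://localhost:8080/v1"

# Build an OpenAI-style chat completion request body.
payload = {
    "model": "meta-llama/Llama-2-7b",  # any supported HF model id
    "messages": [
        {"role": "user", "content": "Hello, ScaleLLM!"},
    ],
    "temperature": 0.7,
    "max_tokens": 64,
}

# Serialize to JSON, ready to POST to f"{BASE_URL}/chat/completions".
body = json.dumps(payload)
print(body)
```

You could send this body with `curl -X POST -H "Content-Type: application/json" -d @body.json` against the chat completions endpoint, or point an existing OpenAI client library at the server's base URL.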
## Supported Models
| Models | Tensor Parallel | Quantization | HF Model Examples |
|---|---|---|---|
| Llama2 | Yes | Yes | meta-llama/Llama-2-7b, TheBloke/Llama-2-13B-chat-GPTQ, TheBloke/Llama-2-70B-AWQ |
| Aquila | Yes | Yes | BAAI/Aquila-7B, BAAI/AquilaChat-7B |
| Bloom | Yes | Yes | bigscience/bloom |
| GPT-J | Yes | Yes | EleutherAI/gpt-j-6b |
| GPT-NeoX | Yes | -- | EleutherAI/gpt-neox-20b |
| GPT2 | Yes | -- | gpt2 |
| InternLM | Yes | Yes | internlm/internlm-7b |
| Mistral | Yes | Yes | mistralai/Mistral-7B-v0.1 |
| MPT | Yes | Yes | mosaicml/mpt-30b |