
first official release

@guocuimi released this on 06 Nov 23:22

ScaleLLM is a high-performance inference system for large language models, designed for production environments. It supports the most popular open-source models, including Llama2, Bloom, GPT-NeoX, and more.

Key Features

  • High Performance: Optimized for fast, efficient LLM inference.
  • Tensor Parallelism: Utilizes tensor parallelism for efficient model execution.
  • OpenAI-compatible API: An efficient Golang REST API server compatible with the OpenAI API (see the example request after this list).
  • Hugging Face Models Integration: Seamless integration with most popular Hugging Face models.
  • Customizable: Offers flexibility for customization to meet your specific needs.
  • Production Ready: Designed to be deployed in production environments.
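
Because the REST server follows the OpenAI API, standard OpenAI-style requests work against it. The snippet below is a minimal sketch, assuming a locally running server on port 8080 and the Llama2 checkpoint from the table below; the actual host, port, endpoint paths, and model identifier depend on your deployment.

```python
# Minimal sketch: querying an OpenAI-compatible completions endpoint.
# The base URL, port, and model name are illustrative assumptions,
# not values documented in this release.
import requests

BASE_URL = "http://localhost:8080/v1"  # assumed address of a local ScaleLLM server

response = requests.post(
    f"{BASE_URL}/completions",
    json={
        "model": "meta-llama/Llama-2-7b",  # one of the supported HF models
        "prompt": "Explain tensor parallelism in one sentence.",
        "max_tokens": 64,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```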

Supported Models

| Models   | Tensor Parallel | Quantization | HF Model Examples |
| -------- | --------------- | ------------ | ----------------- |
| Llama2   | Yes | Yes | meta-llama/Llama-2-7b, TheBloke/Llama-2-13B-chat-GPTQ, TheBloke/Llama-2-70B-AWQ |
| Aquila   | Yes | Yes | BAAI/Aquila-7B, BAAI/AquilaChat-7B |
| Bloom    | Yes | Yes | bigscience/bloom |
| GPT_j    | Yes | Yes | EleutherAI/gpt-j-6b |
| GPT_NeoX | Yes | --  | EleutherAI/gpt-neox-20b |
| GPT2     | Yes | --  | gpt2 |
| InternLM | Yes | Yes | internlm/internlm-7b |
| Mistral  | Yes | Yes | mistralai/Mistral-7B-v0.1 |
| MPT      | Yes | Yes | mosaicml/mpt-30b |
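
The HF model examples listed above can be fetched with standard Hugging Face tooling before serving them. The snippet below is a minimal sketch using huggingface_hub to download one of the listed checkpoints; the local directory is an illustrative assumption, and pointing ScaleLLM at the downloaded weights is not covered here.

```python
# Minimal sketch: downloading one of the supported HF checkpoints locally.
# The local_dir path is an illustrative assumption, not a ScaleLLM requirement.
from huggingface_hub import snapshot_download

model_path = snapshot_download(
    repo_id="gpt2",              # any model from the table above
    local_dir="./models/gpt2",   # assumed local download location
)
print(f"Model files downloaded to: {model_path}")
```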