Friendli Model Optimizer (FMO) for supercharging generative AI serving 🚀

Overview

Friendli Model Optimizer (FMO) is a tool that provides model optimizations for efficient generative AI serving with Friendli Engine. The optimizations improve generative AI serving performance without compromising task accuracy.

FMO is designed to work with Hugging Face pretrained models, which can be loaded using 'PreTrainedModel.from_pretrained()'.
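For reference, any checkpoint that loads through the standard Transformers API can be passed to FMO by Hugging Face model name or local path. A minimal sketch (the model name here is only an example):

from transformers import AutoModelForCausalLM

# Any checkpoint loadable this way is a valid input for FMO.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")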

FMO offers a pedantic level setting, which controls the trade-off between accuracy and processing time. Higher pedantic levels produce a more accurate model but can increase the time required to generate the quantized model, and may sometimes slow down inference. Lower pedantic levels allow for faster quantization, though they may reduce model accuracy. Each quantization mode supports a different range of pedantic levels.

Note

The list of Hugging Face model architectures that can be optimized with FMO is specified in Supported Features & Model Architecture.

Note

Currently, FMO supports Python 3.8 to Python 3.11.

Table of Contents

  • Quick Installation
  • Supported Features & Model Architecture
  • User Guides
  • How to serve an optimized model with Friendli Engine?

Quick Installation

pip install friendli-model-optimizer

Supported Features & Model Architecture

FMO currently supports the following PTQ (Post-Training Quantization) techniques:

FP8

FP8 is an 8-bit floating-point format that offers a higher dynamic range than INT8, making it better suited for quantizing both weights and activations. This leads to increased throughput and reduced latency while maintaining high output quality with minimal degradation.

FP8 supports pedantic levels 0 to 2. Defaults to 1.

Important

FP8 is only supported by NVIDIA Ada, Hopper, and Blackwell GPU architectures.
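If you are unsure whether your GPU belongs to one of these architectures, a quick check (illustrative only, not part of FMO) is the CUDA compute capability: Ada corresponds to 8.9 and Hopper to 9.0, so FP8 kernels require a capability of 8.9 or higher.

import torch

# Ada = (8, 9), Hopper = (9, 0), Blackwell = (10, x); FP8 needs >= (8, 9).
major, minor = torch.cuda.get_device_capability(0)
print("FP8-capable GPU:", (major, minor) >= (8, 9))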

Note

For now, we only support the E4M3 (4-bit exponent and 3-bit mantissa) encoding format.
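As a numerical illustration of the E4M3 format (assuming a recent PyTorch build that ships the float8_e4m3fn dtype), you can inspect its range directly:

import torch

# E4M3 keeps a wide dynamic range in 8 bits; its largest finite value is 448.
info = torch.finfo(torch.float8_e4m3fn)
print(info.max, info.min, info.eps)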

Supported Model Architectures for FP8 Quantization

  • LlamaForCausalLM
  • MistralForCausalLM
  • CohereForCausalLM
  • Qwen2ForCausalLM
  • Gemma2ForCausalLM
  • Phi3ForCausalLM
  • MptForCausalLM
  • ArcticForCausalLM
  • MixtralForCausalLM

Note

Currently, Phi3ForCausalLM, MptForCausalLM, ArcticForCausalLM, and MixtralForCausalLM only support pedantic level 0. Please add --pedantic-level 0 to the command line.

INT8

INT8 Quantization represents weights and activations using the INT8 format with acceptable accuracy drops. Friendli Engine enables dynamic activation scaling, where scales are computed on the fly during runtime. Thus, FMO only quantizes model weights, and Friendli Engine will load the quantized weights.

INT8 supports pedantic levels 0 to 1. Defaults to 1.
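Conceptually, weight-only INT8 quantization maps each weight row to 8-bit integers with a per-channel scale; the sketch below only illustrates the idea and is not FMO's exact algorithm:

import torch

def quantize_weight_int8(weight: torch.Tensor):
    # Symmetric per-output-channel quantization: scale each row so that its
    # largest absolute value maps to 127, then round and clamp to int8.
    scale = weight.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale  # Activation scales are computed at runtime by the engine.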

Supported Model Architectures for INT8 Quantization

  • LlamaForCausalLM
  • MistralForCausalLM
  • CohereForCausalLM
  • Qwen2ForCausalLM
  • Gemma2ForCausalLM

User Guides

You can run the quantization process with the command below:

fmo quantize \
--model-name-or-path $MODEL_NAME_OR_PATH \
--output-dir $OUTPUT_DIR \
--mode $QUANTIZATION_SCHEME \
--pedantic-level $PEDANTIC_LEVEL \
--device $DEVICE \
--offload

The command-line arguments mean:

  • model-name-or-path: Hugging Face pretrained model name or directory path of the saved model checkpoint.
  • output-dir: Directory path to save the quantized checkpoint and related configurations.
  • mode: Quantization technique to apply. You can use fp8 or int8.
  • pedantic-level: Represents the accuracy-latency trade-off. A higher pedantic level ensures a more accurate representation of the model but increases the quantization processing time. Defaults to 1.
  • device: Device to run the quantization process. Defaults to "cuda:0".
  • offload: When enabled, this option significantly reduces GPU memory usage by offloading model layers onto CPU RAM. Defaults to False.

Example: Run FP8 quantization with Meta-Llama-3.1-8B-Instruct

export MODEL_NAME_OR_PATH="meta-llama/Meta-Llama-3.1-8B-Instruct"
export OUTPUT_DIR="./"
export QUANTIZATION_SCHEME=fp8
export PEDANTIC_LEVEL=1
export DEVICE="cuda:0"

fmo quantize \
--model-name-or-path $MODEL_NAME_OR_PATH \
--output-dir $OUTPUT_DIR \
--mode $QUANTIZATION_SCHEME \
--pedantic-level $PEDANTIC_LEVEL \
--device $DEVICE \
--offload

If the command runs successfully, the quantization progress is printed to the console.

How to serve an optimized model with Friendli Engine?

Once your optimized model is ready, you can serve the model with Friendli Engine.
Please check out our official documentation to learn more!