Friendli Model Optimizer (FMO) is a tool that provides model optimizations for efficient generative AI serving with Friendli Engine. The optimizations improve generative AI serving performance without compromising task accuracy.
FMO is designed to work with Hugging Face pretrained models, which can be loaded using `PreTrainedModel.from_pretrained()`.
FMO offers a pedantic level setting, which controls the trade-off between accuracy and processing time. Higher pedantic levels produce a more accurate model but can increase the time required to generate quantized models, and may sometimes slow down inference. Lower pedantic levels allow for faster quantization, though they may reduce model accuracy. Each quantization mode supports a different range of pedantic levels.
Note
The list of Hugging Face model architectures that can be optimized with FMO is specified in Supported Features & Model Architecture.
Note
Currently, FMO supports Python 3.8 to Python 3.11.
pip install friendli-model-optimizer
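Before running the install command above, you can guard against an unsupported interpreter. A minimal sketch (the `fmo_python_supported` helper is ours for illustration, not part of FMO):

```python
import sys

def fmo_python_supported(major, minor):
    """True if (major, minor) falls in FMO's supported range, Python 3.8-3.11."""
    return (3, 8) <= (major, minor) <= (3, 11)

# Check the running interpreter before attempting the pip install:
ok = fmo_python_supported(*sys.version_info[:2])
```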
FMO currently supports the following PTQ (Post-Training Quantization) techniques:
FP8 is an 8-bit floating-point format that offers a higher dynamic range than INT8, making it better suited for quantizing both weights and activations. This leads to increased throughput and reduced latency while maintaining high output quality with minimal degradation.
FP8 supports pedantic levels 0-2. Defaults to 1.
Important
FP8 is only supported by NVIDIA Ada, Hopper, and Blackwell GPU architectures.
Note
For now, we only support the E4M3 (4-bit exponent and 3-bit mantissa) encoding format.
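To make the dynamic-range point concrete, here is a small decoder for the E4M3 format described above, a sketch assuming the OCP FP8 convention (exponent bias 7, no infinities, a single NaN encoding); note that the largest finite E4M3 value, 448, far exceeds INT8's maximum of 127:

```python
def e4m3_decode(bits):
    """Decode one FP8 E4M3 byte: 1 sign, 4 exponent, 3 mantissa bits.

    Assumes the OCP FP8 convention for illustration: exponent bias 7,
    no infinities, and NaN only at the all-ones magnitude pattern.
    """
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp = (bits >> 3) & 0xF
    man = bits & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")                         # single NaN encoding
    if exp == 0:
        return sign * (man / 8) * 2.0 ** -6         # subnormal range
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)  # normal range

# Largest finite E4M3 value: exponent 1111, mantissa 110 -> 1.75 * 2**8
assert e4m3_decode(0b0_1111_110) == 448.0
```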
LlamaForCausalLM
MistralForCausalLM
CohereForCausalLM
Qwen2ForCausalLM
Gemma2ForCausalLM
Phi3ForCausalLM
MptForCausalLM
ArcticForCausalLM
MixtralForCausalLM
Note
Currently, Phi3ForCausalLM, MptForCausalLM, ArcticForCausalLM, and MixtralForCausalLM only support pedantic level 0. Please add --pedantic-level 0 to the command line.
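The supported-architecture table and the pedantic-level note above can be combined into a quick pre-flight check. The helper below is hypothetical (the sets are transcribed from this page, not read from FMO); in practice a checkpoint's architecture name can be found in its Hugging Face config, e.g. via `AutoConfig.from_pretrained(name).architectures`:

```python
# Architectures listed on this page as FP8-quantizable, and the subset
# that (per the note above) currently only supports pedantic level 0.
FP8_SUPPORTED = {
    "LlamaForCausalLM", "MistralForCausalLM", "CohereForCausalLM",
    "Qwen2ForCausalLM", "Gemma2ForCausalLM", "Phi3ForCausalLM",
    "MptForCausalLM", "ArcticForCausalLM", "MixtralForCausalLM",
}
PEDANTIC_LEVEL_0_ONLY = {
    "Phi3ForCausalLM", "MptForCausalLM",
    "ArcticForCausalLM", "MixtralForCausalLM",
}

def fp8_pedantic_levels(architecture):
    """Pedantic levels available for FP8, or None if unsupported."""
    if architecture not in FP8_SUPPORTED:
        return None
    if architecture in PEDANTIC_LEVEL_0_ONLY:
        return [0]
    return [0, 1, 2]
```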
INT8 Quantization represents weights and activations using the INT8 format with acceptable accuracy drops. Friendli Engine enables dynamic activation scaling, where scales are computed on the fly during runtime. Thus, FMO only quantizes model weights, and Friendli Engine will load the quantized weights.
INT8 supports pedantic levels 0-1. Defaults to 1.
LlamaForCausalLM
MistralForCausalLM
CohereForCausalLM
Qwen2ForCausalLM
Gemma2ForCausalLM
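The weight-only INT8 scheme described above can be sketched as symmetric per-tensor quantization. This is an illustration of the general technique, not FMO's actual implementation:

```python
def int8_quantize(weights):
    """Symmetric per-tensor INT8 weight quantization (illustrative sketch).

    The largest weight magnitude maps to 127; everything else scales down
    proportionally. Activations are not touched here -- the serving engine
    scales them dynamically at runtime.
    """
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def int8_dequantize(quantized, scale):
    """Recover approximate float weights from INT8 values and the scale."""
    return [q * scale for q in quantized]

weights = [0.02, -1.27, 0.9]
q, scale = int8_quantize(weights)
restored = int8_dequantize(q, scale)
# Round-trip error is bounded by half a quantization step:
assert all(abs(w - r) <= scale / 2 + 1e-9 for w, r in zip(weights, restored))
```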
You can run the quantization processes with the command below:
fmo quantize \
--model-name-or-path $MODEL_NAME_OR_PATH \
--output-dir $OUTPUT_DIR \
--mode $QUANTIZATION_SCHEME \
--pedantic-level $PEDANTIC_LEVEL \
--device $DEVICE \
--offload
The command line arguments mean:
- `model-name-or-path`: Hugging Face pretrained model name or directory path of the saved model checkpoint.
- `output-dir`: Directory path to save the quantized checkpoint and related configurations.
- `mode`: Quantization technique to apply. You can use `fp8` or `int8`.
- `pedantic-level`: Represents the accuracy-latency trade-off. A higher pedantic level ensures a more accurate representation of the model, but increases the quantization processing time. Defaults to 1.
- `device`: Device to run the quantization process. Defaults to "cuda:0".
- `offload`: When enabled, significantly reduces GPU memory usage by offloading model layers onto CPU RAM. Defaults to False.
export MODEL_NAME_OR_PATH="meta-llama/Meta-Llama-3.1-8B-Instruct"
export OUTPUT_DIR="./"
export QUANTIZATION_SCHEME=fp8
export PEDANTIC_LEVEL=1
export DEVICE="cuda:0"
fmo quantize \
--model-name-or-path $MODEL_NAME_OR_PATH \
--output-dir $OUTPUT_DIR \
--mode $QUANTIZATION_SCHEME \
--pedantic-level $PEDANTIC_LEVEL \
--device $DEVICE \
--offload
If the command runs successfully, you will see the progress of the quantization as shown in the screenshot below:
Once your optimized model is ready, you can serve the model with Friendli Engine.
Please check out our official documentation to learn more!