Friendli Model Optimizer (FMO) is a tool that provides model optimizations for efficient generative AI serving with Friendli Engine. The optimizations improve generative AI serving performance without compromising task accuracy.
FMO is designed to work with Hugging Face pretrained models, which can be loaded using `PreTrainedModel.from_pretrained()`.
FMO offers a pedantic level setting, which controls the trade-off between accuracy and processing time. Higher pedantic levels produce a more accurate model but can increase the time required to generate quantized models, and may sometimes slow down inference. Lower pedantic levels allow for faster quantization, though they may reduce model accuracy. Each quantization mode supports a different range of pedantic levels.
Note
The list of Hugging Face model architectures that can be optimized with FMO is specified in Supported Features & Model Architecture.
Note
Currently, FMO supports Python 3.8 to Python 3.11.
pip install friendli-model-optimizer
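Before running the install command above, you can guard against an unsupported interpreter. A minimal sketch (the `fmo_python_supported` helper is ours for illustration, not part of FMO):

```python
import sys

def fmo_python_supported(major, minor):
    """True if (major, minor) falls in FMO's supported range, Python 3.8-3.11."""
    return (3, 8) <= (major, minor) <= (3, 11)

# Check the running interpreter before attempting the pip install:
ok = fmo_python_supported(*sys.version_info[:2])
```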
FMO currently supports the following PTQ (Post-Training Quantization) techniques:
FP8 is an 8-bit floating-point format that offers a higher dynamic range than INT8, making it better suited for quantizing both weights and activations. This leads to increased throughput and reduced latency while maintaining high output quality with minimal degradation.
FP8 supports pedantic levels 0-2. Defaults to 1.
Important
FP8 is only supported by NVIDIA Ada, Hopper, and Blackwell GPU architectures.
Note
For now, we only support the E4M3 (4-bit exponent and 3-bit mantissa) encoding format.
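To make the dynamic-range point concrete, here is a small decoder for the E4M3 format described above, a sketch assuming the OCP FP8 convention (exponent bias 7, no infinities, a single NaN encoding); note that the largest finite E4M3 value, 448, far exceeds INT8's maximum of 127:

```python
def e4m3_decode(bits):
    """Decode one FP8 E4M3 byte: 1 sign, 4 exponent, 3 mantissa bits.

    Assumes the OCP FP8 convention for illustration: exponent bias 7,
    no infinities, and NaN only at the all-ones magnitude pattern.
    """
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp = (bits >> 3) & 0xF
    man = bits & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")                         # single NaN encoding
    if exp == 0:
        return sign * (man / 8) * 2.0 ** -6         # subnormal range
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)  # normal range

# Largest finite E4M3 value: exponent 1111, mantissa 110 -> 1.75 * 2**8
assert e4m3_decode(0b0_1111_110) == 448.0
```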
LlamaForCausalLM
MistralForCausalLM
CohereForCausalLM
Qwen2ForCausalLM
Gemma2ForCausalLM
Phi3ForCausalLM
MptForCausalLM
ArcticForCausalLM
MixtralForCausalLM
Note
Currently, Phi3ForCausalLM, MptForCausalLM, ArcticForCausalLM, and MixtralForCausalLM only support pedantic level 0. Please add --pedantic-level 0 to the command line.
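The supported-architecture table and the pedantic-level note above can be combined into a quick pre-flight check. The helper below is hypothetical (the sets are transcribed from this page, not read from FMO); in practice a checkpoint's architecture name can be found in its Hugging Face config, e.g. via `AutoConfig.from_pretrained(name).architectures`:

```python
# Architectures listed on this page as FP8-quantizable, and the subset
# that (per the note above) currently only supports pedantic level 0.
FP8_SUPPORTED = {
    "LlamaForCausalLM", "MistralForCausalLM", "CohereForCausalLM",
    "Qwen2ForCausalLM", "Gemma2ForCausalLM", "Phi3ForCausalLM",
    "MptForCausalLM", "ArcticForCausalLM", "MixtralForCausalLM",
}
PEDANTIC_LEVEL_0_ONLY = {
    "Phi3ForCausalLM", "MptForCausalLM",
    "ArcticForCausalLM", "MixtralForCausalLM",
}

def fp8_pedantic_levels(architecture):
    """Pedantic levels available for FP8, or None if unsupported."""
    if architecture not in FP8_SUPPORTED:
        return None
    if architecture in PEDANTIC_LEVEL_0_ONLY:
        return [0]
    return [0, 1, 2]
```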
INT8 Quantization represents weights and activations using the INT8 format with acceptable accuracy drops. Friendli Engine enables dynamic activation scaling, where scales are computed on the fly during runtime. Thus, FMO only quantizes model weights, and Friendli Engine will load the quantized weights.
INT8 supports pedantic levels 0-1. Defaults to 1.
LlamaForCausalLM
MistralForCausalLM
CohereForCausalLM
Qwen2ForCausalLM
Gemma2ForCausalLM
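The weight-only INT8 scheme described above can be sketched as symmetric per-tensor quantization. This is an illustration of the general technique, not FMO's actual implementation:

```python
def int8_quantize(weights):
    """Symmetric per-tensor INT8 weight quantization (illustrative sketch).

    The largest weight magnitude maps to 127; everything else scales down
    proportionally. Activations are not touched here -- the serving engine
    scales them dynamically at runtime.
    """
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def int8_dequantize(quantized, scale):
    """Recover approximate float weights from INT8 values and the scale."""
    return [q * scale for q in quantized]

weights = [0.02, -1.27, 0.9]
q, scale = int8_quantize(weights)
restored = int8_dequantize(q, scale)
# Round-trip error is bounded by half a quantization step:
assert all(abs(w - r) <= scale / 2 + 1e-9 for w, r in zip(weights, restored))
```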
You can run the quantization processes with the command below:
fmo quantize \
--model-name-or-path $MODEL_NAME_OR_PATH \
--output-dir $OUTPUT_DIR \
--mode $QUANTIZATION_SCHEME \
--pedantic-level $PEDANTIC_LEVEL \
--device $DEVICE \
--offload
The command line arguments mean:
- `model-name-or-path`: Hugging Face pretrained model name or directory path of the saved model checkpoint.
- `output-dir`: Directory path to save the quantized checkpoint and related configurations.
- `mode`: Quantization technique to apply. You can use `fp8` or `int8`.
- `pedantic-level`: Represents the accuracy-latency trade-off. A higher pedantic level ensures a more accurate representation of the model, but increases the quantization processing time. Defaults to 1.
- `device`: Device to run the quantization process. Defaults to "cuda:0".
- `offload`: When enabled, significantly reduces GPU memory usage by offloading model layers onto CPU RAM. Defaults to False.
export MODEL_NAME_OR_PATH="meta-llama/Meta-Llama-3.1-8B-Instruct"
export OUTPUT_DIR="./"
export QUANTIZATION_SCHEME=fp8
export PEDANTIC_LEVEL=1
export DEVICE="cuda:0"
fmo quantize \
--model-name-or-path $MODEL_NAME_OR_PATH \
--output-dir $OUTPUT_DIR \
--mode $QUANTIZATION_SCHEME \
--pedantic-level $PEDANTIC_LEVEL \
--device $DEVICE \
--offload
If the command runs successfully, you will see the progress of the quantization as shown in the screenshot below:
Once your optimized model is ready, you can serve the model with Friendli Engine.
Please check out our official documentation to learn more!