# Optimization Strategies

## ONNX Runtime

ONNX Runtime is a high-performance inference engine for models in the ONNX (Open Neural Network Exchange) format, allowing for faster and more efficient model execution. If an ONNX version of a model is available, using it can substantially speed up the scanner.

To leverage ONNX Runtime, you must first install the appropriate package:

```bash
pip install "llm-guard[onnxruntime]"      # for CPU instances
pip install "llm-guard[onnxruntime-gpu]"  # for GPU instances
```

Activate ONNX Runtime by initializing your scanner with the `use_onnx` parameter set to `True`:

```python
from llm_guard.input_scanners import Code

scanner = Code(languages=["PHP"], use_onnx=True)
```
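On GPU instances, it is worth confirming that ONNX Runtime actually sees the GPU before wiring it into a scanner. A quick check using only the public `onnxruntime` API:

```python
import onnxruntime

# "CUDAExecutionProvider" should appear in this list when
# onnxruntime-gpu is installed and a compatible GPU/driver is present;
# otherwise inference silently falls back to CPU.
print(onnxruntime.get_available_providers())
```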

## ONNX Runtime with Quantization

Although not built into the library, you can use quantized or optimized versions of the models. Quantization does not always improve latency, but it usually reduces the model size.
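As a minimal sketch, here is dynamic quantization using ONNX Runtime's own tooling; the file paths are placeholders for your exported model, not names from llm-guard:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize the exported model's weights to int8.
# "model.onnx" and "model.quant.onnx" are placeholder paths.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model.quant.onnx",
    weight_type=QuantType.QInt8,
)
```

Benchmark the quantized file against the original: int8 weights shrink the file, but may not speed up every runtime/hardware combination.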

## Enabling Low CPU/Memory Usage

To minimize CPU and memory usage:

```python
from llm_guard.input_scanners.code import Code, DEFAULT_MODEL

# Forward the low-memory flag to the underlying model loader.
DEFAULT_MODEL.kwargs["low_cpu_mem_usage"] = True
scanner = Code(languages=["PHP"], model=DEFAULT_MODEL)
```
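This appears to map to the `low_cpu_mem_usage` flag of Hugging Face `transformers` (assuming, as the `kwargs` name suggests, that llm-guard forwards these keyword arguments to `from_pretrained`). The direct `transformers` equivalent looks like this:

```python
from transformers import AutoModelForSequenceClassification

# low_cpu_mem_usage=True loads weights directly into place instead of
# first materializing a randomly initialized copy of the model, which
# lowers peak host memory during loading.
model = AutoModelForSequenceClassification.from_pretrained(
    "some-org/some-model",  # placeholder model id, not from llm-guard
    low_cpu_mem_usage=True,
)
```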

For an in-depth understanding of this feature and its impact on large model handling, refer to Hugging Face's Large Model Loading documentation.

## Use smaller models

For certain scanners, smaller model variants are available. These variants offer reduced latency without significantly compromising accuracy; a sketch of switching a scanner to one follows.
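As an illustration only, building on the `DEFAULT_MODEL` pattern shown above, pointing the model definition at a smaller checkpoint might look like the following. The `path` attribute and the checkpoint name are assumptions, not names taken from llm-guard's docs; check each scanner's documentation for the variants it actually supports:

```python
from llm_guard.input_scanners.code import Code, DEFAULT_MODEL

# Hypothetical: swap in a smaller checkpoint. The attribute name
# "path" and the model id below are illustrative assumptions.
DEFAULT_MODEL.path = "your-org/smaller-code-model"
scanner = Code(languages=["PHP"], model=DEFAULT_MODEL)
```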

## PyTorch hacks

To speed up warm compile times when using `torch.compile`:

```python
import torch
import torch._inductor.config

# Allow TF32 matmuls on Ampere+ GPUs (faster, slightly lower precision).
torch.set_float32_matmul_precision("high")

# Cache compiled FX graphs on disk so repeated runs skip recompilation.
torch._inductor.config.fx_graph_cache = True
```
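These settings only take effect once a model is actually compiled. A minimal sketch, using a toy module that is illustrative and not part of llm-guard:

```python
import torch

# A toy module standing in for a scanner's model.
model = torch.nn.Linear(768, 2)

# With fx_graph_cache enabled, subsequent runs of this compile step
# can reuse the cached Inductor graph instead of recompiling from
# scratch, which is what shortens warm compile times.
compiled = torch.compile(model)
out = compiled(torch.randn(1, 768))
```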