ONNX (Open Neural Network Exchange) is an open format for machine learning models, and ONNX Runtime provides a high-performance engine for executing them, allowing for faster and more efficient inference. If an ONNX version of a model is available, it can substantially speed up the scanner.
To leverage ONNX Runtime, you must first install the appropriate package:
```bash
pip install llm-guard[onnxruntime]      # for CPU instances
pip install llm-guard[onnxruntime-gpu]  # for GPU instances
```
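After installation, you can sanity-check which execution providers are available by querying onnxruntime directly (this is a generic onnxruntime call, not part of llm-guard):

```python
import onnxruntime as ort

# Prints e.g. ["CPUExecutionProvider"], or includes "CUDAExecutionProvider" on GPU installs
print(ort.get_available_providers())
```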
Activate ONNX by initializing your scanner with the use_onnx parameter set to True:
```python
from llm_guard.input_scanners import Code

scanner = Code(languages=["PHP"], use_onnx=True)
```
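The scanner is then used exactly like the non-ONNX version. Assuming the standard llm-guard scan interface, a call looks like this:

```python
prompt = "Write a PHP function that echoes user input."

# scan() returns the sanitized prompt, a validity flag, and a risk score
sanitized_prompt, is_valid, risk_score = scanner.scan(prompt)
print(is_valid, risk_score)
```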
Quantized or optimized model variants are not built into the library, but you can supply them yourself. Quantization does not always improve latency, but it can significantly reduce model size.
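As an illustration, a quantized ONNX model could be produced with Hugging Face Optimum and then referenced from your scanner's model configuration. This is a minimal sketch using Optimum (a separate library, not an llm-guard API), and the paths are placeholders:

```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Dynamic INT8 quantization of an already-exported ONNX model (paths are placeholders)
quantizer = ORTQuantizer.from_pretrained("path/to/exported-onnx-model")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="path/to/quantized-model", quantization_config=qconfig)
```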
To minimize CPU and memory usage:
```python
from llm_guard.input_scanners.code import Code, DEFAULT_MODEL

# Forward low_cpu_mem_usage to the underlying transformers model loader
DEFAULT_MODEL.kwargs["low_cpu_mem_usage"] = True

scanner = Code(languages=["PHP"], model=DEFAULT_MODEL)
```
For an in-depth understanding of this option and its impact on loading large models, refer to the Hugging Face Large Model Loading documentation.
For certain scanners, smaller model variants are available. These variants offer reduced latency and memory footprint without significantly compromising accuracy.
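As a sketch of how a smaller variant could be wired in, assuming llm_guard.model.Model is the dataclass used for model configuration and using a placeholder checkpoint name:

```python
from llm_guard.model import Model
from llm_guard.input_scanners import Code

# Placeholder checkpoint name; substitute the smaller variant documented for your scanner
small_model = Model(path="your-org/smaller-code-detection-model")

scanner = Code(languages=["PHP"], model=small_model)
```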
To speed up warm compile times when using torch.compile:
```python
import torch
import torch._inductor.config

# Allow TF32 matmuls for faster float32 matrix multiplication on supported GPUs
torch.set_float32_matmul_precision("high")

# Cache compiled FX graphs on disk so warm runs of torch.compile start faster
torch._inductor.config.fx_graph_cache = True
```
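These settings only take effect if the model is actually compiled. A minimal, generic sketch of compiling a Hugging Face model (independent of llm-guard, with a placeholder checkpoint name):

```python
import torch
from transformers import AutoModelForSequenceClassification

# Placeholder checkpoint; compiling the forward pass lets later runs reuse cached graphs
model = AutoModelForSequenceClassification.from_pretrained("your-org/your-model")
model = torch.compile(model)
```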