ONNX (Open Neural Network Exchange) is an open format for machine learning models, and ONNX Runtime provides a high-performance engine for executing them, allowing for faster and more efficient inference. If an ONNX version of a model is available, it can substantially speed up the scanner.
To leverage ONNX Runtime, you must first install the appropriate package:
```bash
pip install llm-guard[onnxruntime]      # for CPU instances
pip install llm-guard[onnxruntime-gpu]  # for GPU instances
```
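After installation, you can sanity-check which execution providers are available by querying onnxruntime directly (this is a generic onnxruntime call, not part of llm-guard):

```python
import onnxruntime as ort

# Prints e.g. ["CPUExecutionProvider"], or includes "CUDAExecutionProvider" on GPU installs
print(ort.get_available_providers())
```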
Activate ONNX by initializing your scanner with the use_onnx parameter set to True:
```python
from llm_guard.input_scanners import Code

scanner = Code(languages=["PHP"], use_onnx=True)
```
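The scanner is then used exactly like the non-ONNX version. Assuming the standard llm-guard scan interface, a call looks like this:

```python
prompt = "Write a PHP function that echoes user input."

# scan() returns the sanitized prompt, a validity flag, and a risk score
sanitized_prompt, is_valid, risk_score = scanner.scan(prompt)
print(is_valid, risk_score)
```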
Quantized or optimized model variants are not built into the library, but you can supply them yourself. Quantization does not always improve latency, but it can significantly reduce model size.
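As an illustration, a quantized ONNX model could be produced with Hugging Face Optimum and then referenced from your scanner's model configuration. This is a minimal sketch using Optimum (a separate library, not an llm-guard API), and the paths are placeholders:

```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Dynamic INT8 quantization of an already-exported ONNX model (paths are placeholders)
quantizer = ORTQuantizer.from_pretrained("path/to/exported-onnx-model")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="path/to/quantized-model", quantization_config=qconfig)
```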
To minimize CPU and memory usage:
```python
from llm_guard.input_scanners.code import Code, DEFAULT_MODEL

# Forward low_cpu_mem_usage to the underlying transformers model loader
DEFAULT_MODEL.kwargs["low_cpu_mem_usage"] = True

scanner = Code(languages=["PHP"], model=DEFAULT_MODEL)
```
For an in-depth understanding of this option and its impact on loading large models, refer to the Hugging Face Large Model Loading documentation.
For certain scanners, smaller model variants are available. These variants offer reduced latency and memory footprint without significantly compromising accuracy.
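As a sketch of how a smaller variant could be wired in, assuming llm_guard.model.Model is the dataclass used for model configuration and using a placeholder checkpoint name:

```python
from llm_guard.model import Model
from llm_guard.input_scanners import Code

# Placeholder checkpoint name; substitute the smaller variant documented for your scanner
small_model = Model(path="your-org/smaller-code-detection-model")

scanner = Code(languages=["PHP"], model=small_model)
```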
To speed up warm compile times when using torch.compile:
```python
import torch
import torch._inductor.config

# Allow TF32 matmuls for faster float32 matrix multiplication on supported GPUs
torch.set_float32_matmul_precision("high")

# Cache compiled FX graphs on disk so warm runs of torch.compile start faster
torch._inductor.config.fx_graph_cache = True
```
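These settings only take effect if the model is actually compiled. A minimal, generic sketch of compiling a Hugging Face model (independent of llm-guard, with a placeholder checkpoint name):

```python
import torch
from transformers import AutoModelForSequenceClassification

# Placeholder checkpoint; compiling the forward pass lets later runs reuse cached graphs
model = AutoModelForSequenceClassification.from_pretrained("your-org/your-model")
model = torch.compile(model)
```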