Measuring inference speed metrics for hosted and local LLM #822
Labels
ai-platform: model hosts and APIs
llm: Large Language Models
llm-benchmarks: testing and benchmarking large language models
llm-completions: large language models for completion tasks, e.g. copilot
llm-evaluation: Evaluating Large Language Models' performance and behavior through human-written evaluation sets
llm-function-calling: Function Calling with Large Language Models
llm-inference-engines: Software to run inference on large language models
llm-quantization: All about Quantized LLM models and serving
llm-serving-optimisations: Tips, tricks and tools to speed up inference of large language models
Measuring inference speed metrics for hosted and local LLM
Snippet
GenAI-Perf is a command-line tool for measuring the throughput and latency of generative AI models as served through an inference server. For large language models (LLMs), GenAI-Perf provides metrics such as output token throughput, time to first token, inter-token latency, and request throughput. For a full list of metrics, see the Metrics section.
Users specify a model name, an inference server URL, the type of inputs to use (synthetic or from a dataset), and the type of load to generate (number of concurrent requests, request rate).
GenAI-Perf generates the specified load, measures the performance of the inference server, and reports the metrics in a simple table as console output. The tool also logs all results in a CSV file that can be used to derive additional metrics and visualizations. The inference server must already be running when GenAI-Perf is run.
[!Note] GenAI-Perf is currently in early release and under rapid development. While we will try to remain consistent, command line options and functionality are subject to change as the tool matures.
Installation
Triton SDK Container
Available starting with the 24.03 release of the Triton Server SDK container.
Run the Triton Inference Server SDK docker container:
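The docker command itself was not captured in this snippet; a minimal sketch, assuming the standard NGC image name and the 24.03 tag mentioned above:

```bash
# Launch the Triton SDK container (tag is an assumption; check NGC for current releases)
export RELEASE="24.03"
docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
```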
Run GenAI-Perf:
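The command was stripped from the snippet; a minimal sketch of an invocation, with gpt2 and localhost:8001 as placeholder model and endpoint:

```bash
# Basic profiling run against a Triton endpoint (model name and URL are placeholders)
genai-perf \
  -m gpt2 \
  --service-kind triton \
  --backend tensorrtllm \
  --concurrency 1 \
  --url localhost:8001
```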
From Source
This method requires that Perf Analyzer is installed in your development environment and that you have at least Python 3.10 installed. To build Perf Analyzer from source, see here.
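The install command is not included here; around the 24.03 release GenAI-Perf lived in the Triton client repository, so a plausible source install looks like the following (branch and subdirectory path are assumptions; consult the linked build instructions):

```bash
# Assumed repo layout: genai-perf under perf_analyzer in the Triton client repo
pip install "git+https://github.com/triton-inference-server/client.git@r24.03#subdirectory=src/c++/perf_analyzer/genai-perf"
```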
Run GenAI-Perf:
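Once installed, a quick sanity check that the entry point is on your PATH:

```bash
# Print the CLI's options to confirm the install
genai-perf --help
```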
Quick Start
Measuring Throughput and Latency of GPT2 using Triton + TensorRT-LLM
Running GPT2 on Triton Inference Server using TensorRT-LLM
See instructions
Running GenAI-Perf
Run Triton Inference Server SDK container:
Run GenAI-Perf:
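The exact command was stripped from the snippet; a sketch of a GPT2 profiling run using the Model Inputs options documented below (the tokenizer choice and load values are illustrative assumptions):

```bash
# Profile GPT2 served via Triton + TensorRT-LLM with synthetic prompts
genai-perf \
  -m gpt2 \
  --service-kind triton \
  --backend tensorrtllm \
  --num-prompts 100 \
  --random-seed 123 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 100 \
  --output-tokens-mean-deterministic \
  --tokenizer gpt2 \
  --concurrency 1 \
  --url localhost:8001
```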
Example output:
See Tutorial for additional examples.
Model Inputs
GenAI-Perf supports model input prompts from either synthetically generated inputs or from the HuggingFace OpenOrca or CNN_DailyMail datasets. This is specified using the --input-dataset CLI option.

When the input is synthetic, you can specify the following options:

--num-prompts <int>: The number of unique prompts to generate as stimulus, >= 1.
--synthetic-input-tokens-mean <int>: The mean number of tokens in the generated prompts when the prompt source is synthetic, >= 1.
--synthetic-input-tokens-stddev <int>: The standard deviation of the number of tokens in the generated prompts when the prompt source is synthetic, >= 0.
--random-seed <int>: The seed used to generate random values, >= 0.

When the dataset comes from HuggingFace, you can specify the following options:

--dataset {openorca,cnn_dailymail}: The HuggingFace dataset to use for benchmarking.
--num-prompts <int>: The number of unique prompts to use as stimulus, >= 1.

For any dataset, you can specify the following options:

--output-tokens-mean <int>: The mean number of tokens in each output. Ensure the --tokenizer value is set correctly, >= 1.
--output-tokens-stddev <int>: The standard deviation of the number of tokens in each output. This is only used when --output-tokens-mean is provided, >= 1.
--output-tokens-mean-deterministic: When using --output-tokens-mean, this flag can be set to improve precision by setting the minimum number of tokens equal to the requested number of tokens. This is currently supported with the Triton service-kind. Note that there is still some variability in the requested number of output tokens, but GenAI-Perf makes a best effort with your model to get the right number of output tokens.

You can optionally set additional model inputs with the following option:

--extra-inputs <input_name>:<value>: An additional input for use with the model with a singular value, such as stream:true or max_tokens:5. This flag can be repeated to supply multiple extra inputs. A sketch combining several of these options follows this list.
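As promised above, a hypothetical run combining dataset input, output-token controls, and extra inputs (model name, URL, and values are placeholders):

```bash
# Benchmark with HuggingFace OpenOrca prompts, passing extra model inputs
genai-perf \
  -m gpt2 \
  --service-kind triton \
  --backend tensorrtllm \
  --dataset openorca \
  --num-prompts 50 \
  --output-tokens-mean 128 \
  --output-tokens-stddev 16 \
  --tokenizer gpt2 \
  --extra-inputs stream:true \
  --extra-inputs max_tokens:128 \
  --url localhost:8001
```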
Metrics
GenAI-Perf collects a diverse set of metrics that capture the performance of the inference server.
Suggested labels
None