Measuring inference speed metrics for hosted and local LLM #822

Open · ShellLM opened this issue May 1, 2024 · 1 comment

ShellLM commented May 1, 2024

Measuring inference speed metrics for hosted and local LLM

Snippet

GenAI-Perf is a command line tool for measuring the throughput and latency of generative AI models as served through an inference server. For large language models (LLMs), GenAI-Perf provides metrics such as output token throughput, time to first token, inter token latency, and request throughput. For a full list of metrics, please see the Metrics section.

Users specify a model name, an inference server URL, the type of inputs to use (synthetic or from dataset), and the type of load to generate (number of concurrent requests, request rate).

GenAI-Perf generates the specified load, measures the performance of the inference server, and reports the metrics in a simple table as console output. The tool also logs all results in a CSV file that can be used to derive additional metrics and visualizations. The inference server must already be running when GenAI-Perf is run.

Note: GenAI-Perf is currently in early release and under rapid development. While we will try to remain consistent, command line options and functionality are subject to change as the tool matures.

Installation

Triton SDK Container

Available starting with the 24.03 release of the Triton Server SDK container.

Run the Triton Inference Server SDK docker container:

export RELEASE="mm.yy" # e.g. export RELEASE="24.03"

docker run -it --net=host --gpus=all  nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
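
By default, any files GenAI-Perf writes (for example the CSV results and the --profile-export-file output) stay inside the container. One way to keep them on the host is to bind-mount a working directory; this is only a sketch, not a GenAI-Perf requirement, and the /workspace path is an arbitrary choice:

export RELEASE="mm.yy" # e.g. export RELEASE="24.03"

docker run -it --net=host --gpus=all \
  -v "$(pwd)":/workspace --workdir /workspace \
  nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk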

Run GenAI-Perf:

genai-perf --help

From Source

This method requires that Perf Analyzer is installed in your development environment and that you have at least Python 3.10 installed. To build Perf Analyzer from source, see here.

export RELEASE="mm.yy" # e.g. export RELEASE="24.03"

pip install "git+https://github.com/triton-inference-server/client.git@r${RELEASE}#subdirectory=src/c++/perf_analyzer/genai-perf"

Run GenAI-Perf:

genai-perf --help

Quick Start

Measuring Throughput and Latency of GPT2 using Triton + TensorRT-LLM

Running GPT2 on Triton Inference Server using TensorRT-LLM

See instructions

Running GenAI-Perf

Run Triton Inference Server SDK container:

export RELEASE="mm.yy" # e.g. export RELEASE="24.03"

docker run -it --net=host --rm --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

Run GenAI-Perf:

genai-perf \
  -m gpt2 \
  --service-kind triton \
  --backend tensorrtllm \
  --prompt-source synthetic \
  --num-prompts 100 \
  --random-seed 123 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --streaming \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --output-tokens-mean-deterministic \
  --tokenizer hf-internal-testing/llama-tokenizer \
  --concurrency 1 \
  --measurement-interval 4000 \
  --profile-export-file my_profile_export.json \
  --url localhost:8001

Example output:

                                                  LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃                Statistic ┃         avg ┃         min ┃         max ┃         p99 ┃         p90 ┃         p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Time to first token (ns) │  13,266,974 │  11,818,732 │  18,351,779 │  16,513,479 │  13,741,986 │  13,544,376 │
│ Inter token latency (ns) │   2,069,766 │      42,023 │  15,307,799 │   3,256,375 │   3,020,580 │   2,090,930 │
│     Request latency (ns) │ 223,532,625 │ 219,123,330 │ 241,004,192 │ 238,198,306 │ 229,676,183 │ 224,715,918 │
│         Num output token │         104 │         100 │         129 │         128 │         109 │         105 │
│          Num input token │         199 │         199 │         199 │         199 │         199 │         199 │
└──────────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
Output token throughput (per sec): 460.42
Request throughput (per sec): 4.44

See Tutorial for additional examples.
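
To see how latency and throughput behave under heavier load, one approach (a sketch that only reuses flags already shown above; the concurrency values and export file names are arbitrary) is to repeat the quick-start run at several concurrency levels and keep a separate profile export per run:

for C in 1 2 4 8; do
  genai-perf \
    -m gpt2 \
    --service-kind triton \
    --backend tensorrtllm \
    --prompt-source synthetic \
    --num-prompts 100 \
    --random-seed 123 \
    --synthetic-input-tokens-mean 200 \
    --synthetic-input-tokens-stddev 0 \
    --streaming \
    --output-tokens-mean 100 \
    --output-tokens-stddev 0 \
    --output-tokens-mean-deterministic \
    --tokenizer hf-internal-testing/llama-tokenizer \
    --concurrency ${C} \
    --measurement-interval 4000 \
    --profile-export-file profile_concurrency_${C}.json \
    --url localhost:8001
done

Comparing the per-run tables then shows where output token throughput stops scaling as concurrency increases.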

Model Inputs

GenAI-Perf can draw model input prompts either from synthetically generated inputs or from the HuggingFace OpenOrca or CNN_DailyMail datasets. This is specified using the --input-dataset CLI option.

When the dataset is synthetic, you can specify the following options:

  • --num-prompts <int>: The number of unique prompts to generate as stimulus, >= 1.
  • --synthetic-input-tokens-mean <int>: The mean number of tokens in the generated prompts when prompt-source is synthetic, >= 1.
  • --synthetic-input-tokens-stddev <int>: The standard deviation of the number of tokens in the generated prompts when prompt-source is synthetic, >= 0.
  • --random-seed <int>: The seed used to generate random values, >= 0.
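
For example, a sketch of a run with variable-length synthetic prompts (it reuses the gpt2-on-Triton setup from the quick start and only changes the input-token statistics; adjust the model, tokenizer, and URL to your own deployment):

genai-perf \
  -m gpt2 \
  --service-kind triton \
  --backend tensorrtllm \
  --prompt-source synthetic \
  --num-prompts 100 \
  --random-seed 123 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 50 \
  --tokenizer hf-internal-testing/llama-tokenizer \
  --concurrency 1 \
  --url localhost:8001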

When the dataset is coming from HuggingFace, you can specify the following options:

  • --dataset {openorca,cnn_dailymail}: HuggingFace dataset to use for benchmarking.
  • --num-prompts <int>: The number of unique prompts to generate as stimulus, >= 1.

For any dataset, you can specify the following options:

  • --output-tokens-mean <int>: The mean number of tokens in each output. Ensure the --tokenizer value is set correctly, >= 1.
  • --output-tokens-stddev <int>: The standard deviation of the number of tokens in each output. This is only used when --output-tokens-mean is provided, >= 1.
  • --output-tokens-mean-deterministic: When using --output-tokens-mean, this flag can be set to improve precision by setting the minimum number of tokens equal to the requested number of tokens. This is currently supported with the Triton service-kind. Note that there is still some variability in the requested number of output tokens, but GenAI-Perf makes a best-effort attempt with your model to produce the right number of output tokens.

You can optionally set additional model inputs with the following option:

  • --extra-inputs <input_name>:<value>: An additional input for use with the model with a singular value, such as stream:true or max_tokens:5. This flag can be repeated to supply multiple extra inputs.
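
Putting these input options together, a dataset-backed run with extra model inputs might look like the sketch below. It assumes the gpt2-on-Triton setup from the quick start; the exact flag for selecting a dataset instead of synthetic prompts may differ between GenAI-Perf versions (--dataset vs. --input-dataset), so check genai-perf --help for your install:

genai-perf \
  -m gpt2 \
  --service-kind triton \
  --backend tensorrtllm \
  --dataset openorca \
  --num-prompts 100 \
  --tokenizer hf-internal-testing/llama-tokenizer \
  --extra-inputs max_tokens:5 \
  --extra-inputs stream:true \
  --concurrency 1 \
  --url localhost:8001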

Metrics

GenAI-Perf collects a diverse set of metrics that capture the performance of the inference server.

Metric | Description | Aggregations
Time to First Token | Time between when a request is sent and when its first response is received; one value per request in benchmark | Avg, min, max, p99, p90, p75
Inter Token Latency | Time between intermediate responses for a single request, divided by the number of generated tokens of the latter response; one value per response per request in benchmark | Avg, min, max, p99, p90, p75
Request Latency | Time between when a request is sent and when its final response is received; one value per request in benchmark | Avg, min, max, p99, p90, p75
Number of Output Tokens | Total number of output tokens of a request; one value per request in benchmark | Avg, min, max, p99, p90, p75
Output Token Throughput | Total number of output tokens from benchmark divided by benchmark duration | None; one value per benchmark
Request Throughput | Number of final responses from benchmark divided by benchmark duration | None; one value per benchmark
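
As a rough consistency check on the example run above (an approximation, since the official definition divides total output tokens by the benchmark duration rather than multiplying per-request averages):

# avg output tokens per request (104) x request throughput (4.44 req/s)
awk 'BEGIN { printf "approx output token throughput: %.1f tokens/sec\n", 104 * 4.44 }'
# prints about 461.8, close to the reported 460.42 tokens/sec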

Suggested labels

None

ShellLM commented May 1, 2024

Related content

#408 similarity score: 0.89
#690 similarity score: 0.88
#649 similarity score: 0.87
#324 similarity score: 0.86
#498 similarity score: 0.86
#811 similarity score: 0.86
