Replies: 7 comments 30 replies
-
@JohannesGaessler @slaren, as the main contributors to the CUDA backend, feel free to highlight or amend any hypothesis. Thanks a lot for your impressive work here.
-
I have the same objective. What data/prompt did you use, and can I use the same with llama-bench?
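For reference, a generic `llama-bench` invocation looks roughly like the sketch below; the model file and token counts are placeholders, not the data/prompt used in this thread (llama-bench measures PP/TG directly, without an HTTP client in the loop):

```bash
# Hypothetical llama-bench run (placeholders, not this thread's setup):
#   -p 512  : prompt length in tokens (prompt processing, PP)
#   -n 128  : number of tokens to generate (token generation, TG)
#   -ngl 99 : offload all layers to the GPU
./llama-bench -m ./models/model-f16.gguf -p 512 -n 128 -ngl 99
```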
-
@phymbert Can you please add updated test results? Based on the details in the linked threads, inference server performance has been significantly improved. It would be nice to compare the before and after results here.
-
Hello, I'm new to this area and recently wanted to measure llama.cpp throughput on an A100 GPU. Did you measure it by starting the llama.cpp server and then sending requests to it? If so, can you point me to the code you used to measure these metrics?
-
Use an AI to develop a load-testing script by showing it an example of how you use curl.
I have a repo for this: https://github.com/SolidRusT/srt-inference-perf
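As a starting point, a minimal curl-based load test might look like the sketch below (host, prompt, and request count are placeholders, not the setup from this thread):

```bash
#!/usr/bin/env bash
# Minimal load-test sketch: fire N concurrent requests at a llama-server
# OpenAI-compatible endpoint and report the wall-clock time.
HOST=${HOST:-http://localhost:8080}
N=${N:-32}

start=$(date +%s)
for i in $(seq "$N"); do
  curl -s "$HOST/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"Tell me a short story."}],"max_tokens":128}' \
    -o "resp_$i.json" &
done
wait
end=$(date +%s)
echo "Completed $N requests in $((end - start)) s"
```

Per-request latencies and token counts can then be pulled from each `resp_$i.json` to build throughput numbers.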
> Start llama-server with `--metrics`, then hit the [`/metrics` endpoint](https://github.com/ggerganov/llama.cpp/tree/master/examples/server).
>
> Sorry, I have another query for you. How do I send all the requests in the dataset to the llama.cpp server? With curl commands, one by one?
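For completeness, a hedged sketch of the `--metrics` route mentioned in the quote above (binary name, model path, and port are placeholders):

```bash
# Start the server with the Prometheus-compatible /metrics endpoint enabled,
# then scrape it once the model has loaded.
./llama-server --model model.gguf --port 8080 --metrics &
sleep 10   # crude wait for model load; poll /health in a real script
curl -s http://localhost:8080/metrics
```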
-
Would it be possible for you to create a CPU-only benchmark, since vLLM now supports CPU?
-
FYI: *LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators*. This paper includes some benchmarks of llama.cpp.
-
### Performance and improvement areas
The objective of this thread is to gather `llama.cpp` performance 📈 and improvement ideas 💡 against other popular LLM inference frameworks (vLLM, huggingface text-generation-inference), especially on the `CUDA` backend. Let's try to fill the gap 🚀.
I have run a couple of benchmarks against the OpenAI `/chat/completions` endpoint, from the client point of view, using JMeter on 2x A100 with `mixtral8x7b` and a fine-tuned `llama70b` model.

**Note 1**: from the client point of view, it is not possible to get accurate PP (prompt processing) and TG (token generation) rates because, first, you need streaming enabled, and then PP will always include one generated token. So it is easier to compare the total tokens per transaction reported in `completions.usage`, as illustrated below.
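For illustration, the per-request token accounting can be read with plain curl and `jq` (endpoint, prompt, and `jq` availability are assumptions; this is not the JMeter setup used for the numbers below):

```bash
# Send one chat completion and print the token counts from completions.usage.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":64}' \
  | jq '.usage'
# -> { "prompt_tokens": ..., "completion_tokens": ..., "total_tokens": ... }
```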
**Note 2**: from a server performance-testing point of view, we generally consider the following metrics (a small post-processing sketch follows the list):
- `iterations`: total requests successfully completed during the test
- `prompt tokens`: average prompt tokens per request; the same for a given iteration count across all tests
- `generated tokens`: average generated tokens per request
- `RPM`: requests per minute
- `latency`: duration of the HTTP request, in seconds
- `PP+TG`: total tokens the HTTP clients send and receive per second
- `errors`: number of requests in error during the test, from the client point of view; this can be an HTTP timeout or a closed connection, and it is not necessarily caused by the server
#### Context size
The transaction tokens context here is:
#### Results
##### llama70b @ eedd42e
**llama.cpp configuration**

```bash
server --model myllama70b-f16-00001-of-00010.gguf \
    --ctx-size 32768 \
    --n-predict 4096 \
    --n-gpu-layers 81 \
    --batch-size 4096 \
    --ubatch-size 256 \
    --parallel 1|32 \
    --metrics \
    --log-format text
```
**vLLM configuration**
**TGI configuration**
Please note how easy it is:
##### mixtral8x7b
**llama.cpp configuration @ 137fbb8**

```bash
server --model mixtral-8x7b-instruct-f16-00001-of-00010.gguf \
    --ctx-size 131072 \
    --n-predict 4096 \
    --n-gpu-layers 33 \
    --batch-size 4096 \
    --ubatch-size 256 \
    --parallel 1|32 \
    --metrics \
    --log-format text
```
Magically, the vLLM and TGI configurations are unchanged.
#### Areas of improvement for `llama.cpp`

Please @ggerganov edit at will.