
benchmarks? #34

Closed
ghost opened this issue Mar 12, 2023 · 56 comments
Labels: documentation (Improvements or additions to documentation), question (Further information is requested), stale

Comments

ghost commented Mar 12, 2023

Where are the benchmarks for various hardware, e.g. Apple Silicon?

wizd commented Mar 12, 2023

M1 with 7B model: 94.24 ms per token
M1 with 13B model: 202.18 ms per token

Speed measured with the command-line option -t 4. With -t 8, the speed halves.

@ggerganov added the "question" and "documentation" labels Mar 12, 2023
ElRoberto538 commented Mar 12, 2023

Using the command-line option -t 8. Note this is in a VM assigned 42 of the server's 44 logical cores, with other services running on the server.
AMD EPYC 7443P, 7B: 89.39 ms per token

MLTQ commented Mar 12, 2023

M1 Pro 32GB, 30B model:

main: mem per token = 43387780 bytes
main: load time = 10701.85 ms
main: sample time = 279.92 ms
main: predict time = 37065.80 ms / 226.01 ms per token
main: total time = 51992.27 ms

@diimdeep

MacBook Pro 2013, Intel i5, 2 cores, 8 GB RAM
7B 4-bit model
main: mem per token = 14335844 bytes
main: load time = 8224.30 ms
main: sample time = 1918.08 ms
main: predict time = 308737.91 ms / 604.18 ms per token
main: total time = 331646.62 ms

Thank you for this awesome project.

neuhaus commented Mar 12, 2023

Ryzen 7 3700X, 128GB RAM @ 3200, llama.cpp numbers:

$ ./main -m models/7B/ggml-model-q4_0.bin -t 8 -n 128
main: mem per token = 14434244 bytes
main:     load time =  1270.15 ms
main:   sample time =   325.76 ms
main:  predict time = 15147.15 ms / 117.42 ms per token
main:    total time = 17077.88 ms

$ ./main -m models/13B/ggml-model-q4_0.bin -t 8 -n 128
main: mem per token = 22439492 bytes
main:     load time =  2946.00 ms
main:   sample time =    86.11 ms
main:  predict time =  7358.48 ms / 216.43 ms per token
main:    total time = 11019.28 ms

$ ./main -m models/30B/ggml-model-q4_0.bin -t 8 -n 128
main: mem per token = 43387780 bytes
main:     load time =  6666.53 ms
main:   sample time =   332.71 ms
main:  predict time = 68779.27 ms / 533.17 ms per token
main:    total time = 77333.97 ms

$ ./main -m models/65B/ggml-model-q4_0.bin -t 8 -n 128
main: mem per token = 70897348 bytes
main:     load time = 14010.35 ms
main:   sample time =   335.09 ms
main:  predict time = 140527.48 ms / 1089.36 ms per token
main:    total time = 157951.48 ms

With the 30B model, an RTX 3090 manages 15 tokens/s using text-generation-webui.

MarkSchmidty commented Mar 13, 2023

llama.cpp on a Samsung S22 Ultra: 1.2 tokens/s running 4 threads.

The S22 obviously has a more powerful processor. But I do not think it is 12 times more powerful. It's likely you could get much faster speeds on the Pi.

I'd be willing to bet that the bottleneck is not the processor.

Reposting the 1.2 token/second Samsung S22 Ultra result here. (Originally posted in #58)

ItsLogic commented Mar 13, 2023

I must say, I was surprised this runs on my phone at all. Here are my results on a Snapdragon 8+ Gen 1 for 4-bit 7B:
[screenshot of results]

And here are results for my desktop with a 13900K and 64GB DDR5.
4-bit quant 7B

main: mem per token = 14434244 bytes
main:     load time =   609.88 ms
main:   sample time =    36.60 ms
main:  predict time =  9487.02 ms / 71.33 ms per token
main:    total time = 10341.46 ms

full precision 7B

main: mem per token = 14434244 bytes
main:     load time = 26905.18 ms
main:   sample time =    37.78 ms
main:  predict time = 23033.74 ms / 173.19 ms per token
main:    total time = 50204.95 ms

4bit quant 65B

main: mem per token = 70897348 bytes
main:     load time = 83233.36 ms
main:   sample time =    36.90 ms
main:  predict time = 86000.03 ms / 646.62 ms per token
main:    total time = 172458.39 ms

Edit:
Did something really stupid and ran 4-bit 13B on my phone. TL;DR: it's slow, don't (unless you have lots of RAM).
My phone has 12GB of RAM and 7GB of manually added swap. I had to run it through an adb root shell instead of Termux, as the Android memory manager would kill Termux as soon as the model started to load. The downside to this approach is that everything else on my phone gets killed, meaning I couldn't even get the screen to turn on while inference was running.

main: mem per token = 22357508 bytes
main:     load time = 29320.15 ms
main:   sample time =  2254.09 ms
main:  predict time = 5227881.50 ms / 39307.38 ms per token
main:    total time = 5335562.00 ms

totoCZ commented Mar 16, 2023

Here is my quick look at 2x Intel Xeon Gold 5120 @ 2.20GHz, -march=native.

7B

main: mem per token = 14762244 bytes
main:     load time =  3378.15 ms
main:   sample time =    15.87 ms
main:  predict time =  4494.55 ms / 115.24 ms per token
main:    total time =  8328.48 ms

7B fp16

main: mem per token = 14532644 bytes
main:     load time = 27977.19 ms
main:   sample time =    24.71 ms
main:  predict time =  9378.29 ms / 275.83 ms per token
main:    total time = 38135.22 ms

13B

main: mem per token = 22562468 bytes
main:     load time = 16860.55 ms
main:   sample time =   170.45 ms
main:  predict time = 56121.11 ms / 308.36 ms per token
main:    total time = 74377.55 ms

13B fp16

main: mem per token = 22562468 bytes
main:     load time = 64448.62 ms
main:   sample time =   129.29 ms
main:  predict time = 61505.41 ms / 455.60 ms per token
main:    total time = 127347.54 ms

30B

main: mem per token = 43547620 bytes
main:     load time = 51269.82 ms
main:   sample time =    49.77 ms
main:  predict time = 41543.11 ms / 585.11 ms per token
main:    total time = 95383.98 ms

65B

main: mem per token = 71553028 bytes
main:     load time = 99438.78 ms
main:   sample time =    44.94 ms
main:  predict time = 69203.49 ms / 1017.70 ms per token
main:    total time = 218532.06 ms

This is with 14 of the 28 threads.
Running with 56 threads slows it down, probably due to NUMA.
I think 115 ms is still a good result for this CPU.

So if anyone like me was wondering whether having a million cores in a server CPU gets you a fast 65B model: the answer is no.

MarkSchmidty commented Mar 16, 2023

So if anyone like me was wondering whether having a million cores in a server CPU gets you a fast 65B model...

It's clear by now that llama.cpp speed mostly depends on maximum single-core performance for comparisons within the same CPU architecture, up to the point where all CPUs of that architecture perform approximately the same. Beyond that point, memory bandwidth and memory-bus chokepoints appear to be the major bottlenecks.

Using more cores can slow things down for two reasons:

  1. More memory-bus congestion from moving bits between more places. llama.cpp is well written and easily maxes out the memory bus even on moderately powerful systems.
  2. Reducing your effective maximum single-core performance to that of your slowest cores. This is usually the primary culprit on 4- or 6-core devices (mostly phones), which often have 2 performance cores and 2-4 balanced and/or "efficiency" cores.

With these lessons in mind, it would be good to see benchmark results from anyone who manages to find a yet-unknown optimization in their configuration, OS environment, or hardware environment.
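
As a rough sanity check of the memory-bandwidth argument, here is a back-of-the-envelope sketch (my own illustration, with assumed numbers rather than measurements from this thread): if each generated token has to stream essentially the whole quantized weight file from RAM once, then peak bandwidth divided by model size gives an upper bound on tokens per second.

```cpp
// Back-of-the-envelope ceiling on generation speed for a memory-bound model.
// The numbers below are assumptions for illustration only: ~3.9 GB for a
// 7B q4_0 GGML file, ~51 GB/s theoretical peak for dual-channel DDR4-3200.
#include <cstdio>

int main() {
    const double model_bytes   = 3.9e9;   // weights streamed per token (assumed)
    const double bandwidth_bps = 51e9;    // peak memory bandwidth in bytes/s (assumed)

    const double max_tok_per_s = bandwidth_bps / model_bytes;
    std::printf("upper bound: %.1f tokens/s (%.0f ms/token)\n",
                max_tok_per_s, 1000.0 / max_tok_per_s);
    return 0;
}
```

On that estimate, a dual-channel DDR4 desktop tops out around 13 tokens/s for 7B q4_0 no matter how many cores it has, which is roughly the ballpark of the desktop numbers reported above.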

hanvyj commented Mar 19, 2023

How are you getting such good performance?

I'm running an i7-10750H with 32GB RAM, using -m ./models/7B/ggml-model-f16.bin -t 12 -n 128

main: mem per token = 14499844 bytes
main:     load time =  8892.24 ms
main:   sample time =  1988.34 ms
main:  predict time = 270018.50 ms / 2093.17 ms per token
main:    total time = 287685.50 ms

2+ s per token! I get similar results with the 4-bit quant, if not worse.

Edit: Running with -m ./models/7B/ggml-model-q4_0.bin -t 12 -n 128

main: mem per token = 14499844 bytes
main:     load time =  1631.32 ms
main:   sample time =  1513.06 ms
main:  predict time = 574477.00 ms / 6047.13 ms per token
main:    total time = 596436.75 ms

@Green-Sky (Collaborator)

How are you getting such good performance?

I'm running an i7 10750h 32gig ram with -m ./models/7B/ggml-model-f16.bin -t 12 -n 128

Try:

  • fewer threads: your CPU seems to have only 6 real cores, and llama.cpp seems to scale poorly with extra threads (see the sketch below)
  • tell us your system_info line for more context, e.g.: system_info: n_threads = 8 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
  • make sure you compile with all optimizations
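
To illustrate the thread-count point, here is a small standalone check (my own sketch, not part of llama.cpp; the halving assumes 2-way SMT, which you should verify against your CPU's spec):

```cpp
// Sketch: estimate a sensible -t value from the logical thread count.
// std::thread::hardware_concurrency() reports *logical* threads, so on an
// SMT/Hyper-Threading part like the i7-10750H (6 cores / 12 threads) it
// returns 12; starting 12 compute-heavy threads oversubscribes the 6
// physical cores.
#include <cstdio>
#include <thread>

int main() {
    unsigned logical = std::thread::hardware_concurrency();
    if (logical == 0) logical = 1;                            // value may be unavailable
    unsigned physical_guess = logical > 1 ? logical / 2 : 1;  // assumes 2-way SMT
    std::printf("logical threads: %u, guessed physical cores: %u\n",
                logical, physical_guess);
    std::printf("try -t %u (or fewer) first\n", physical_guess);
    return 0;
}
```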

xportz commented Apr 10, 2023

How do you guys get these benchmarking results? When I CTRL+C out of the program, I get no output. Thanks.

jon-chuang (Contributor) commented Apr 11, 2023

Just curious, why was llama.cpp created when you can run the models on onnxruntime with the CPU backend? Could someone compare relative performance at the same quantization level, and also show the perplexity over a validation set?

I guess people just prefer the no-dependency route? But it seems like reinventing the wheel, or reimplementing already-optimized code.

https://onnxruntime.ai/docs/build/inferencing.html

EDIT: I guess one significant advantage is 4-bit quantization, which yields significant memory savings over 8-bit. But how does this affect perplexity?

@MarkSchmidty

Just curious, why was llama.cpp invented when you can run the models on onnxruntime with CPU backend? Could someone make a comparison of relative performance at the same quantization level and also show the perplexity over a validation set?

I guess people just prefer the no-dependency route? But it seems like reinventing the wheel or reimplementing optimized code?

onnxruntime.ai/docs/build/inferencing.html

EDIT: I guess one significant advantage is 4-bit quantization which results in significant memory savings over 8-bit. But how does this affect perplexity?

The effect of 4bit on perplexity is negligible thanks to GPTQ quantization, act order, and binning. 

4bit is twice as fast as 8bit because llama.cpp is efficient enough to be memory bound, not compute bound, even on modest processors. I have not seen comparisons of ONNX CPU speeds to llama.cpp for the same quantization level, but Hugging Face Transformers is roughly 20x slower than llama.cpp. I suspect ONNX is about as efficient as HF Transformers.

clulece commented Apr 12, 2023

4bit is twice as fast as 8bit because llama.cpp is efficient enough to be memory bound, not compute bound, even on modest processors. I have not seen comparisons of ONNX CPU speeds to llama.cpp for the same quantization level, but Hugging Face Transformers is roughly 20x slower than llama.cpp. I suspect ONNX is about as efficient as HF Transformers.

How important is CPU cache size to llama.cpp's performance? Do llama.cpp's memory access patterns cause the cache to be evicted often? (Naive me assumes yes, but I really don't know.)

@jon-chuang (Contributor)

How important is CPU cache size to llama.cpp's performance?

A: it doesn't seem super important: #778

@ridwanarf25

How do you guys get these benchmarking results? When I CTRL+C out of the program, I get no output. Thanks.

I think you can do it with the --mtest parameter.

@raghav-deepsource

Wish me luck, I'm running 65B with 6 cores and 32 GB of RAM.

@Green-Sky (Collaborator)

@raghav-deepsource Luck is what you need: you need at least ~60 GB of RAM for the 65B model. :)

raghav-deepsource commented Apr 19, 2023

Got it chugging at about 30 seconds per token with "recite the alphabet backwards". Interestingly, my memory usage didn't go up by much. Feels like the code may be paging the weights into memory on demand to reduce usage, or something.

ai-rex commented Apr 21, 2023

CPU: E5-2680v4 MEM: 64GB

$ ./build/bin/Release/main.exe -m ./models/65B/ggml-model-q4_0.bin -t 14 -n 128

system_info: n_threads = 14 / 28 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

llama_print_timings: load time = 22915.12 ms
llama_print_timings: sample time = 76.15 ms / 128 runs ( 0.59 ms per run)
llama_print_timings: prompt eval time = 4425.61 ms / 2 tokens ( 2212.81 ms per token)
llama_print_timings: eval time = 176678.85 ms / 127 runs ( 1391.17 ms per run)
llama_print_timings: total time = 199672.21 ms

ai-rex commented Apr 21, 2023

$ ./build/bin/Release/main.exe -m ./models/llama-7B-ggml-int4/ggml-model-q4_0.bin -t 14 -n 128

system_info: n_threads = 14 / 28 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

llama_print_timings: load time = 2677.89 ms
llama_print_timings: sample time = 75.61 ms / 128 runs ( 0.59 ms per run)
llama_print_timings: prompt eval time = 225.42 ms / 2 tokens ( 112.71 ms per token)
llama_print_timings: eval time = 19808.81 ms / 127 runs ( 155.97 ms per run)
llama_print_timings: total time = 22564.25 ms

kiratp commented May 1, 2023

M1 Max, maxed GPU, 64 GB.

Note that M1 Pro vs. Max matters beyond core count here, since the memory bandwidth doubles: 200 GB/s -> 400 GB/s.

10 or so Safari tabs in the background, ~6-10% idle CPU consumption observed before the start of the test.

Script: https://gist.github.com/kiratp/18826c1c085acf732f480e726b32686c (adapted from @KASR's script https://gist.github.com/KASR/dc3dd7f920f57013486583af7e3725f1#file-benchmark_threads_llama_cpp-py)

cmd = "./main \
     --seed 147369852 \
     --threads {threads} \
     --n_predict 128 \
     --model ./models/7B/ggml-model-q4_0.bin \
     --top_k 40 \
     --top_p 0.9 \
     --temp 0.5 \
     --repeat_last_n 64 \
     --repeat_penalty 1.1 \
     -p \"Write a funny joke:\" \
     --ignore-eos"
Running with 1 threads...
	 1 threads | run 1/3 | current token time 199.07 ms - eval time 24809.17 ms - prompt eval time 1592.53 ms
	 1 threads | run 2/3 | current token time 198.85 ms - eval time 24866.71 ms - prompt eval time 1590.83 ms
	 1 threads | run 3/3 | current token time 198.93 ms - eval time 24866.36 ms - prompt eval time 1591.47 ms
Running with 2 threads...
	 2 threads | run 1/3 | current token time 102.17 ms - eval time 12880.66 ms - prompt eval time 817.39 ms
	 2 threads | run 2/3 | current token time 102.09 ms - eval time 12880.23 ms - prompt eval time 816.71 ms
	 2 threads | run 3/3 | current token time 102.05 ms - eval time 12888.98 ms - prompt eval time 816.39 ms
Running with 3 threads...
	 3 threads | run 1/3 | current token time 71.74 ms - eval time 8931.11 ms - prompt eval time 573.96 ms
	 3 threads | run 2/3 | current token time 71.65 ms - eval time 8948.05 ms - prompt eval time 573.17 ms
	 3 threads | run 3/3 | current token time 71.31 ms - eval time 8933.5 ms - prompt eval time 570.51 ms
Running with 4 threads...
	 4 threads | run 1/3 | current token time 54.97 ms - eval time 6944.32 ms - prompt eval time 439.75 ms
	 4 threads | run 2/3 | current token time 54.81 ms - eval time 7153.19 ms - prompt eval time 438.51 ms
	 4 threads | run 3/3 | current token time 54.75 ms - eval time 7073.57 ms - prompt eval time 437.97 ms
Running with 5 threads...
	 5 threads | run 1/3 | current token time 46.04 ms - eval time 6177.01 ms - prompt eval time 368.34 ms
	 5 threads | run 2/3 | current token time 46.33 ms - eval time 6168.68 ms - prompt eval time 370.61 ms
	 5 threads | run 3/3 | current token time 47.62 ms - eval time 6172.55 ms - prompt eval time 380.94 ms
Running with 6 threads...
	 6 threads | run 1/3 | current token time 39.43 ms - eval time 5563.91 ms - prompt eval time 315.41 ms
	 6 threads | run 2/3 | current token time 39.38 ms - eval time 5543.76 ms - prompt eval time 315.03 ms
	 6 threads | run 3/3 | current token time 39.42 ms - eval time 5599.16 ms - prompt eval time 315.39 ms
Running with 7 threads...
	 7 threads | run 1/3 | current token time 34.34 ms - eval time 5676.61 ms - prompt eval time 274.74 ms
	 7 threads | run 2/3 | current token time 34.48 ms - eval time 5688.08 ms - prompt eval time 275.81 ms
	 7 threads | run 3/3 | current token time 34.19 ms - eval time 5681.7 ms - prompt eval time 273.52 ms
Running with 8 threads...
	 8 threads | run 1/3 | current token time 33.95 ms - eval time 5394.02 ms - prompt eval time 271.57 ms
	 8 threads | run 2/3 | current token time 33.29 ms - eval time 5358.99 ms - prompt eval time 266.32 ms
	 8 threads | run 3/3 | current token time 32.22 ms - eval time 5311.68 ms - prompt eval time 257.74 ms
Running with 9 threads...
	 9 threads | run 1/3 | current token time 87.65 ms - eval time 15074.75 ms - prompt eval time 701.22 ms
	 9 threads | run 2/3 | current token time 88.11 ms - eval time 13013.74 ms - prompt eval time 704.86 ms
	 9 threads | run 3/3 | current token time 85.37 ms - eval time 12599.68 ms - prompt eval time 682.97 ms
Running with 10 threads...
	 10 threads | run 1/3 | current token time 114.17 ms - eval time 17767.65 ms - prompt eval time 913.38 ms
	 10 threads | run 2/3 | current token time 107.66 ms - eval time 17790.2 ms - prompt eval time 861.27 ms
	 10 threads | run 3/3 | current token time 103.85 ms - eval time 16773.97 ms - prompt eval time 830.81 ms

[chart: Llama scaling]

@rankaiyx (Contributor)

i3-9100:
On the same platform, AVX2 is 1.4 times faster than AVX.

@rankaiyx (Contributor)

Here is my quick look at 2x Intel Xeon Gold 5120 @ 2.20GHz, march=native

7B

main: mem per token = 14762244 bytes
main:     load time =  3378.15 ms
main:   sample time =    15.87 ms
main:  predict time =  4494.55 ms / 115.24 ms per token
main:    total time =  8328.48 ms

7B fp16

main: mem per token = 14532644 bytes
main:     load time = 27977.19 ms
main:   sample time =    24.71 ms
main:  predict time =  9378.29 ms / 275.83 ms per token
main:    total time = 38135.22 ms

13B

main: mem per token = 22562468 bytes
main:     load time = 16860.55 ms
main:   sample time =   170.45 ms
main:  predict time = 56121.11 ms / 308.36 ms per token
main:    total time = 74377.55 ms

13B fp16

main: mem per token = 22562468 bytes
main:     load time = 64448.62 ms
main:   sample time =   129.29 ms
main:  predict time = 61505.41 ms / 455.60 ms per token
main:    total time = 127347.54 ms

30B

main: mem per token = 43547620 bytes
main:     load time = 51269.82 ms
main:   sample time =    49.77 ms
main:  predict time = 41543.11 ms / 585.11 ms per token
main:    total time = 95383.98 ms

65B

main: mem per token = 71553028 bytes
main:     load time = 99438.78 ms
main:   sample time =    44.94 ms
main:  predict time = 69203.49 ms / 1017.70 ms per token
main:    total time = 218532.06 ms

This is with 14 / 28 threads. Running with 56 threads slows it down, probably NUMA. I think 115ms is still a good result for this CPU.

So if anyone like me was wondering, does having a million cores in a server CPU give you a 65B model? The answer is no.

For the 56-thread case across the 2 NUMA nodes, can you try running it the following way?

$ numactl --interleave=0-1 ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 56

Reference link:
#1437

zrm (Contributor) commented May 16, 2023

Memory latency seems to have a significant effect. >15% difference between dual channel DDR4-3200 at CL22 vs. CL16.

zrm (Contributor) commented May 16, 2023

$ numactl --interleave=0-1 ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 56

If you previously ran it the other way you have to drop the page cache once first (or reboot):

# echo 3 > /proc/sys/vm/drop_caches

I'll probably post something that can do better than 'numactl --interleave=all' at some point, but for now that still gives better performance with the existing code.

ghost commented May 16, 2023

Since everyone's showing off their fancy machines, here's my 16GB (DDR3) i5-6500 with Linux/OpenBLAS.

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

7B q4_0:

llama_print_timings:        load time = 25853.83 ms
llama_print_timings:      sample time =    68.77 ms /   100 runs   (    0.69 ms per token)
llama_print_timings: prompt eval time = 25546.94 ms /   353 tokens (   72.37 ms per token)
llama_print_timings:        eval time = 25116.73 ms /    99 runs   (  253.70 ms per token)
llama_print_timings:       total time = 51069.18 ms

13B q5_0:

llama_print_timings:        load time = 47995.62 ms
llama_print_timings:      sample time =    68.99 ms /   100 runs   (    0.69 ms per token)
llama_print_timings: prompt eval time = 47271.48 ms /   353 tokens (  133.91 ms per token)
llama_print_timings:        eval time = 49603.13 ms /    99 runs   (  501.04 ms per token)
llama_print_timings:       total time = 97697.80 ms

Memory bandwidth: 20.023501 GB/s

Personally I care much more about the prompt eval time as my computer takes several minutes to ingest a 2k token prompt once it fills up the context and needs to rotate the buffer.

aseok commented May 31, 2023

Any benchmarks available for an Asus B250 Mining Expert with multiple AMD RX 570s? Any compatibility issues? Is this type of hardware configuration suitable for running inference?

clulece commented Jun 3, 2023

I've recently benchmarked a 7950X-based system (I was looking into how performance changes with thread count vs. full instances vs. RAM speed). The resulting data can be browsed interactively here: https://clulece.github.io/llamma-cpu-based-performance.

@valyagolev

7B q4_0:

system_info: n_threads = 8 / 16 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000

...

llama_print_timings:        load time =  1657.50 ms
llama_print_timings:      sample time =   124.92 ms /   128 runs   (    0.98 ms per token)
llama_print_timings: prompt eval time =  1192.14 ms /    14 tokens (   85.15 ms per token)
llama_print_timings:        eval time = 14527.86 ms /   127 runs   (  114.39 ms per token)
llama_print_timings:       total time = 16324.63 ms

on Hetzner Cloud Arm64 (Ampere), 16 vCPUs

ghost mentioned this issue Jun 5, 2023
aseok commented Jun 16, 2023

Poco F3, (8+5) GB, llama-7b.ggmlv3.q2_K.bin, OpenCL: ~1.48 t/s
system_info: n_threads = 8 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 512, n_keep = 0
llama_print_timings: load time = 808.72 ms
llama_print_timings: sample time = 678.71 ms / 391 runs ( 1.74 ms per token)
llama_print_timings: prompt eval time = 3468.61 ms / 8 tokens ( 433.58 ms per token)
llama_print_timings: eval time = 262932.38 ms / 390 runs ( 674.19 ms per token)
llama_print_timings: total time = 267274.42 ms

kiratp commented Jul 21, 2023

Anyone have access to a Sapphire Rapids system and can test with MKL (on those sweet AMX units)? Intel is claiming 2048 INT8 operations per cycle per core.

emr-cc commented Aug 1, 2023

llama-2-7b-chat.ggmlv3.q2_K.bin

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 |   NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

llama_print_timings:        load time =   873.88 ms
llama_print_timings:      sample time =     4.48 ms /    12 runs   (    0.37 ms per token,  2679.17 tokens per second)
llama_print_timings: prompt eval time =   933.12 ms /     8 tokens (  116.64 ms per token,     8.57 tokens per second)
llama_print_timings:        eval time =   722.88 ms /    11 runs   (   65.72 ms per token,    15.22 tokens per second)
llama_print_timings:       total time =  1662.61 ms

Running on a Ryzen 9 6900HX, 12GB 6850M XT, with 32GB of RAM.

ghost commented Aug 4, 2023

@ggerganov just to torture you a little more before yours arrives, here are benchmarks for M2 Ultra 76c GPU / 192 GB RAM as of 8183159 with only 16 CPU threads:

CPU only:
7B q4_0: prompt eval 74.3 tokens/s, inference 42 tokens/s
7B f16: prompt eval 25 tokens/s, inference 16.7 tokens/s
65B q4_0: prompt eval 7.2 tokens/s, inference 5.71 tokens/s
65B f16: prompt eval 1.6 tokens/s, inference 1.9 tokens/s

GPU only (-ngl 1):
7B q4_0: prompt eval 72.5 tokens/s, inference 83 tokens/s
7B f16: prompt eval 10.2 tokens/s, inference 26.9 tokens/s
65B q4_0: prompt eval 8.74 tokens/s, inference 14.2 tokens/s
65B f16: prompt eval 0.9 tokens/s, inference 3.5 tokens/s

Oh, also, the M2 GPU coil whine is audible when doing inference, but the fan doesn't turn up at all.

grigio commented Aug 9, 2023

llama2_7b_chat_uncensored.ggmlv3.q2_K.bin

system_info: n_threads = 6 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

7B q2_K

llama_print_timings:        load time =   327,71 ms
llama_print_timings:      sample time =    12,21 ms /    32 runs   (    0,38 ms per token,  2620,80 tokens per second)
llama_print_timings: prompt eval time =   306,55 ms /    12 tokens (   25,55 ms per token,    39,15 tokens per second)
llama_print_timings:        eval time =  1950,89 ms /    31 runs   (   62,93 ms per token,    15,89 tokens per second)
llama_print_timings:       total time =  2274,46 ms

13B q4_K_S

llama_print_timings:        load time =  4703,91 ms
llama_print_timings:      sample time =   210,78 ms /   553 runs   (    0,38 ms per token,  2623,60 tokens per second)
llama_print_timings: prompt eval time = 12411,64 ms /   269 tokens (   46,14 ms per token,    21,67 tokens per second)
llama_print_timings:        eval time = 86960,41 ms /   551 runs   (  157,82 ms per token,     6,34 tokens per second)
llama_print_timings:       total time = 99669,44 ms

30B q2_K

llama_print_timings:        load time =  8828,89 ms
llama_print_timings:      sample time =    62,47 ms /   168 runs   (    0,37 ms per token,  2689,25 tokens per second)
llama_print_timings: prompt eval time =  1451,25 ms /    12 tokens (  120,94 ms per token,     8,27 tokens per second)
llama_print_timings:        eval time = 50421,39 ms /   167 runs   (  301,92 ms per token,     3,31 tokens per second)
llama_print_timings:       total time = 51960,64 ms

30B q4_K_S

llama_print_timings:        load time = 15271.31 ms
llama_print_timings:      sample time =     4.35 ms /    12 runs   (    0.36 ms per token,  2758.62 tokens per second)
llama_print_timings: prompt eval time = 15271.28 ms /    75 tokens (  203.62 ms per token,     4.91 tokens per second)
llama_print_timings:        eval time =  4426.11 ms /    11 runs   (  402.37 ms per token,     2.49 tokens per second)
llama_print_timings:       total time = 19718.30 ms
Output generated in 20.04 seconds (0.55 tokens/s, 11 tokens, context 75, seed 2103820326)

70B q4_K_S platypus2-70b-instruct.gguf.q4_K_S.bin

llama_print_timings:        load time = 30837.91 ms
llama_print_timings:      sample time =     8.26 ms /    22 runs   (    0.38 ms per token,  2664.41 tokens per second)
llama_print_timings: prompt eval time =  7082.51 ms /    23 tokens (  307.94 ms per token,     3.25 tokens per second)
llama_print_timings:        eval time = 18292.52 ms /    21 runs   (  871.07 ms per token,     1.15 tokens per second)
llama_print_timings:       total time = 25388.38 ms

AMD Ryzen 7 7700, 48GB RAM, on Linux.

slaren (Collaborator) commented Aug 12, 2023

13900k, DDR5 6400:

7B q4_0 gen 128:

| Threads |  ms/t |   t/s |
| ------: | ----: | ----: |
|       4 | 79.08 | 12.65 |
|       8 | 54.74 | 18.27 |
|      16 | 63.86 | 15.66 |
|      24 | 59.84 | 16.71 |
|      32 | 62.96 | 15.88 |

7B q4_0 pp 512:

| Threads |  ms/t |   t/s |
| ------: | ----: | ----: |
|       4 | 44.56 | 22.44 |
|       8 | 29.70 | 33.67 |
|      16 | 27.23 | 36.73 |
|      24 | 21.26 | 47.04 |
|      32 | 18.11 | 55.21 |

kiratp commented Aug 19, 2023

GCP instance: c3-highcpu-44

Compiled with:
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_CXX_FLAGS="-Ofast -mSAPPHIRERAPIDS -xSAPPHIRERAPIDS -qopt-zmm-usage=high -mno-shstk"

./build/bin/main --model /usr/src/models/<llama2 merged LORA model>.ggml.q4_k_m.bin --ctx-size 4096 --threads 42 -eps 1e-5 -p "Building a website can be done in 10 simple steps:"

llama_print_timings:        load time =   217.25 ms
llama_print_timings:      sample time =   300.20 ms /   491 runs   (    0.61 ms per token,  1635.58 tokens per second)
llama_print_timings: prompt eval time =   215.73 ms /    14 tokens (   15.41 ms per token,    64.90 tokens per second)
llama_print_timings:        eval time = 22698.24 ms /   490 runs   (   46.32 ms per token,    21.59 tokens per second)
llama_print_timings:       total time = 23328.41 ms

22 threads (the hardware core count) is slower at around 18-19 t/sec

Q8_0

llama_print_timings:        load time =   367.61 ms
llama_print_timings:      sample time =   292.28 ms /   477 runs   (    0.61 ms per token,  1631.97 tokens per second)
llama_print_timings: prompt eval time =   232.65 ms /    14 tokens (   16.62 ms per token,    60.18 tokens per second)
llama_print_timings:        eval time = 33848.05 ms /   476 runs   (   71.11 ms per token,    14.06 tokens per second)
llama_print_timings:       total time = 34484.84 ms

kiratp commented Aug 19, 2023

This is on a GCP t2d-standard-32 instance: 32 Milan cores with SMT/HT turned off, so 1 core = 1 physical core - https://cloud.google.com/compute/docs/general-purpose-machines#t2d_machines

root@ml-perf-testing:/usr/src/app/llama.cpp# ./llama-bench --model /usr/src/models/<llama2 7B merged lora>.ggml.q4_k_m.bin -t 2,8,16,22,28,30,31,32
| model                          | backend    |  n_threads | test       |             t/s |
| ------------------------------ | ---------- | ---------: | ---------- | --------------: |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          2 | pp 512     |    34.75 ± 1.77 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          8 | pp 512     |    34.44 ± 1.71 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         16 | pp 512     |    34.57 ± 1.11 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         22 | pp 512     |    33.10 ± 0.98 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         28 | pp 512     |    29.34 ± 5.15 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         30 | pp 512     |    23.68 ± 1.64 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         31 | pp 512     |    25.66 ± 2.93 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         32 | pp 512     |    32.49 ± 2.29 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          2 | tg 128     |     4.43 ± 1.70 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          8 | tg 128     |    17.72 ± 1.97 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         16 | tg 128     |    21.51 ± 0.52 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         22 | tg 128     |    21.18 ± 1.47 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         28 | tg 128     |    19.85 ± 0.79 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         30 | tg 128     |    17.71 ± 3.40 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         31 | tg 128     |    21.59 ± 1.88 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         32 | tg 128     |    17.96 ± 4.88 |

build: 1f0bccb (1007)

This line from llama.cpp seems to explain the pp speed curve:

    // for big prompts, if BLAS is enabled, it is better to use only one thread
    // otherwise, the threads are spin-lock waiting for the BLAS calls and are degrading the performance
    n_threads = N >= 32 && ggml_cpu_has_blas() && !ggml_cpu_has_gpublas() ? 1 : n_threads;

AlessandroSpallina commented Aug 24, 2023

Here are my results, I'm using CPU only:

[chart: cpu_benchmark_general]
[chart: cpu_benchmark_detail]

So based on these data, should I allocate 46 threads for a chatbot use case?

tijszwinkels commented Oct 10, 2023

Currently, I can get a MacBook Pro with an M1 Max for similar money to an M2 Pro.
Both have the same number of tensor cores, but the M2 Pro cores should be faster. However, the M1 Max has twice the memory bandwidth.

Following the discussion up here, which would be better for LLaMA inference?

@myname36

@ggerganov just to torture you a little more before yours arrives, here are benchmarks for M2 Ultra 76c GPU / 192 GB RAM as of 8183159 with only 16 CPU threads:

CPU only: 7B q4_0: prompt eval 74.3 tokens/s, inference 42 tokens/s 7B f16: prompt eval 25 tokens/s, inference 16.7 tokens/s 65B q4_0: prompt eval 7.2 tokens/s, inference 5.71 tokens/s 65B f16: prompt eval 1.6 tokens/s, inference 1.9 tokens/s

GPU only (-ngl 1): 7B q4_0: prompt eval 72.5 tokens/s, inference 83 tokens/s 7B f16: prompt eval 10.2 tokens/s, inference 26.9 tokens/s 65B q4_0: prompt eval 8.74 tokens/s, inference 14.2 tokens/s 65B f16: prompt eval 0.9 tokens/s, inference 3.5 tokens/s

Oh, also, the M2 GPU coil whine is audible when doing inference, but the fan doesn't turn up at all.

Wow, am I missing something here? I have an i7-12700H, 64GB RAM, and an RTX 3050 with 4GB VRAM, and I don't get nearly half this performance. How are you getting these crazy results?

ghost commented Oct 11, 2023

@myname36: M2 Ultra has the GPU equivalent of a 3070 with >100 GB of VRAM and no need to copy over PCIe from CPU to GPU.

xucian commented Nov 5, 2023

M2 Ultra

Damn. Could you hook up a few of these to run the 70B? It seems that at this pace Apple Silicon will eventually dominate the ML hardware market.

Nan-Do commented Nov 24, 2023

Processor: Snapdragon 870 / 8GB of RAM
make flags: make -j 8 (compiled on Termux; tried to use CLBlast with the Vulkan backend as there is no native one; OpenCL apps like clinfo and clpeak work fine, but running llama.cpp fails)
Model: zephyr-7b-beta.Q4_K_M.gguf
System info:
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
Log result:

llama_print_timings:      sample time =      25.51 ms /    84 runs   (    0.30 ms per token,  3292.70 tokens per second)
llama_print_timings: prompt eval time =     931.67 ms /     6 tokens (  155.28 ms per token,     6.44 tokens per second)
llama_print_timings:        eval time =   17652.66 ms /    83 runs   (  212.68 ms per token,     4.70 tokens per second)
llama_print_timings:       total time =   18634.25 ms

Pretty cool result for a mobile CPU from 4-5 generations ago; the model is totally usable at ~5 tokens per second.

@jacooooooooool

Alright, so I got GPT-4 to write me a C equivalent. I am not sure as to its quality, but cursory analysis seems to indicate that it is correct, though I think there is a bunch of overhead in the call to pthread_create. Same repo: https://github.com/kiratp/memory-bandwidth

Same M1:

Thread 6 completed. Accumulated value: 499971380.906237
Thread 0 completed. Accumulated value: 500034749.012463
Thread 1 completed. Accumulated value: 499991005.713216
Thread 3 completed. Accumulated value: 500045810.595650
Thread 5 completed. Accumulated value: 500009162.631447
Thread 2 completed. Accumulated value: 500017471.449399
Elapsed time: 1.029004 seconds
Memory bandwidth: 62.196065 GB/s

Same Threadripper:

<Snipping 64 thread readouts>
Thread 20 completed. Accumulated value: 499997710.298199
Elapsed time: 6.650833 seconds
Memory bandwidth: 76.982839 GB/s

------------------------------ My tests --------------------------
For AMD Threadripper processors, I recommend compiling with clang; it makes a difference.

Compilation with gcc & clang on Linux Ubuntu 22
Instructions: https://www.amd.com/content/dam/amd/en/documents/txt/aocc-4.0.0-readme.txt

~/Downloads/memory-bandwidth/c$ gcc memory_bandwidth.c -o memorygcc
~/Downloads/memory-bandwidth/c$ clang memory_bandwidth.c -o memoryclang


~/Downloads/memory-bandwidth/c$ ./memorygcc
Thread 5 completed. Accumulated value: 499996813.458786
Thread 6 completed. Accumulated value: 499960337.415465
Thread 2 completed. Accumulated value: 499968170.145612
Thread 3 completed. Accumulated value: 500054010.424205
Thread 7 completed. Accumulated value: 499965792.708528
Thread 4 completed. Accumulated value: 500011268.358132
Thread 1 completed. Accumulated value: 499988091.667167
Thread 0 completed. Accumulated value: 499986819.865105
Elapsed time: 3.632777 seconds
Memory bandwidth: 17.617375 GB/s

~/Downloads/memory-bandwidth/c$ ./memoryclang
Thread 2 completed. Accumulated value: 499986598.725620
Thread 5 completed. Accumulated value: 499954520.078316
Thread 1 completed. Accumulated value: 499998191.870599
Thread 0 completed. Accumulated value: 499977712.885238
Thread 6 completed. Accumulated value: 499955904.518224
Thread 4 completed. Accumulated value: 499945673.589080
Thread 3 completed. Accumulated value: 499984568.362708
Thread 7 completed. Accumulated value: 500036217.412915
Elapsed time: 2.171946 seconds
Memory bandwidth: 29.466667 GB/s
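
For anyone who wants to reproduce this kind of number without the linked repo, here is a minimal sketch of a multi-threaded read-bandwidth test in the same spirit (my own illustration, not the code discussed above; buffer sizes and the hardware_concurrency fallback are arbitrary choices):

```cpp
// Sketch of a multi-threaded memory read-bandwidth test: each thread sums a
// large private array of doubles, and aggregate bytes-read / wall-time gives GB/s.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::size_t n_threads = std::thread::hardware_concurrency();
    if (n_threads == 0) n_threads = 4;                        // fallback if unknown
    const std::size_t total_doubles = std::size_t(1) << 29;   // ~4 GiB of data overall
    const std::size_t per_thread    = total_doubles / n_threads;

    // One private buffer per thread so each thread streams its own memory.
    std::vector<std::vector<double>> data(n_threads,
                                          std::vector<double>(per_thread, 1.0));
    std::vector<double> sums(n_threads, 0.0);

    const auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < n_threads; ++i) {
        workers.emplace_back([&, i] {
            sums[i] = std::accumulate(data[i].begin(), data[i].end(), 0.0);
        });
    }
    for (auto &w : workers) w.join();
    const auto t1 = std::chrono::steady_clock::now();

    const double secs  = std::chrono::duration<double>(t1 - t0).count();
    const double bytes = double(n_threads) * double(per_thread) * sizeof(double);
    std::printf("checksum: %.0f\n", std::accumulate(sums.begin(), sums.end(), 0.0));
    std::printf("Elapsed time: %f seconds\n", secs);
    std::printf("Memory bandwidth: %f GB/s\n", bytes / secs / 1e9);
    return 0;
}
```

As the gcc vs. clang numbers above show, compiler and flags matter a lot for a loop like this, so build it with optimizations and thread support enabled, e.g. g++ -O3 -pthread bandwidth.cpp.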

github-actions bot added the stale label Mar 25, 2024
github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 9, 2024
drosanda commented Dec 15, 2024

base command

~/llama.cpp/build/bin/llama-cli -m ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf -p "what is the meaning of life?" -n 128

Orange Pi 3B: Rockchip RK3566 quad-core 64-bit processor | 4 GB LPDDR4

llama_perf_sampler_print:    sampling time =     170.97 ms /   136 runs   (    1.26 ms per token,   795.48 tokens per second)
llama_perf_context_print:        load time =    4450.58 ms
llama_perf_context_print: prompt eval time =    4245.68 ms /     8 tokens (  530.71 ms per token,     1.88 tokens per second)
llama_perf_context_print:        eval time =  124973.60 ms /   127 runs   (  984.04 ms per token,     1.02 tokens per second)
llama_perf_context_print:       total time =  129561.47 ms /   135 tokens

intel core i5 13400F | 64GB DDR5

llama_perf_sampler_print:    sampling time =       8.69 ms /   137 runs   (    0.06 ms per token, 15772.51 tokens per second)
llama_perf_context_print:        load time =    1397.88 ms
llama_perf_context_print: prompt eval time =      88.24 ms /     9 tokens (    9.80 ms per token,   102.00 tokens per second)
llama_perf_context_print:        eval time =    2226.40 ms /   127 runs   (   17.53 ms per token,    57.04 tokens per second)
llama_perf_context_print:       total time =    2343.64 ms /   136 tokens

intel core i5 6300u | 16GB DDR4 | WSL

llama_perf_sampler_print:    sampling time =      22.59 ms /   137 runs   (    0.16 ms per token,  6064.63 tokens per second)
llama_perf_context_print:        load time =     920.13 ms
llama_perf_context_print: prompt eval time =     339.01 ms /     9 tokens (   37.67 ms per token,    26.55 tokens per second)
llama_perf_context_print:        eval time =    8976.43 ms /   127 runs   (   70.68 ms per token,    14.15 tokens per second)
llama_perf_context_print:       total time =    9384.25 ms /   136 tokens
