
benchmarks? #34

Closed
ghost opened this issue Mar 12, 2023 · 56 comments
Labels: documentation (Improvements or additions to documentation), question (Further information is requested), stale

Comments

ghost commented Mar 12, 2023

Where are the benchmarks for various hardware, e.g. Apple Silicon?

wizd commented Mar 12, 2023

M1 with 7B model: 94.24 ms per token
M1 with 13B model: 202.18 ms per token

Speed measured with the command-line option -t 4. With -t 8, the speed halves.

@ggerganov added the "question" and "documentation" labels Mar 12, 2023
ElRoberto538 commented Mar 12, 2023

Using the command-line option -t 8. Note this is in a VM assigned 42 of the server's 44 logical cores, with other services running on the server.
AMD EPYC 7443P, 7B: 89.39 ms per token

MLTQ commented Mar 12, 2023

M1 Pro 32GB, 30B model:

main: mem per token = 43387780 bytes
main: load time = 10701.85 ms
main: sample time = 279.92 ms
main: predict time = 37065.80 ms / 226.01 ms per token
main: total time = 51992.27 ms

@diimdeep

MacBook Pro 2013, Intel i5, 2 cores, 8 GB RAM
7B 4-bit model
main: mem per token = 14335844 bytes
main: load time = 8224.30 ms
main: sample time = 1918.08 ms
main: predict time = 308737.91 ms / 604.18 ms per token
main: total time = 331646.62 ms

Thank you for this awesome project.

neuhaus commented Mar 12, 2023

Ryzen 7 3700X, 128GB RAM @ 3200, llama.cpp numbers:

$ ./main -m models/7B/ggml-model-q4_0.bin -t 8 -n 128
main: mem per token = 14434244 bytes
main:     load time =  1270.15 ms
main:   sample time =   325.76 ms
main:  predict time = 15147.15 ms / 117.42 ms per token
main:    total time = 17077.88 ms

$ ./main -m models/13B/ggml-model-q4_0.bin -t 8 -n 128
main: mem per token = 22439492 bytes
main:     load time =  2946.00 ms
main:   sample time =    86.11 ms
main:  predict time =  7358.48 ms / 216.43 ms per token
main:    total time = 11019.28 ms

$ ./main -m models/30B/ggml-model-q4_0.bin -t 8 -n 128
main: mem per token = 43387780 bytes
main:     load time =  6666.53 ms
main:   sample time =   332.71 ms
main:  predict time = 68779.27 ms / 533.17 ms per token
main:    total time = 77333.97 ms

$ ./main -m models/65B/ggml-model-q4_0.bin -t 8 -n 128
main: mem per token = 70897348 bytes
main:     load time = 14010.35 ms
main:   sample time =   335.09 ms
main:  predict time = 140527.48 ms / 1089.36 ms per token
main:    total time = 157951.48 ms

With the 30B model, an RTX 3090 manages 15 tokens/s using text-generation-webui.

MarkSchmidty commented Mar 13, 2023

llama.cpp on a Samsung S22 Ultra: 1.2 tokens/s running 4 threads.

The S22 obviously has a more powerful processor. But I do not think it is 12 times more powerful. It's likely you could get much faster speeds on the Pi.

I'd be willing to bet that the bottleneck is not the processor.

Reposting the 1.2 token/second Samsung S22 Ultra result here. (Originally posted in #58)

ItsLogic commented Mar 13, 2023

I must say, I was surprised this runs on my phone at all. Here are my results on a Snapdragon 8+ Gen 1 for 4-bit 7B:
[screenshot of results]

And here are results for my desktop with a 13900K and 64GB DDR5.
4-bit quant 7B

main: mem per token = 14434244 bytes
main:     load time =   609.88 ms
main:   sample time =    36.60 ms
main:  predict time =  9487.02 ms / 71.33 ms per token
main:    total time = 10341.46 ms

full precision 7B

main: mem per token = 14434244 bytes
main:     load time = 26905.18 ms
main:   sample time =    37.78 ms
main:  predict time = 23033.74 ms / 173.19 ms per token
main:    total time = 50204.95 ms

4bit quant 65B

main: mem per token = 70897348 bytes
main:     load time = 83233.36 ms
main:   sample time =    36.90 ms
main:  predict time = 86000.03 ms / 646.62 ms per token
main:    total time = 172458.39 ms

Edit:
Did something really stupid and ran 4-bit 13B on my phone. TL;DR: it's slow, don't (unless you have lots of RAM).
My phone has 12GB of RAM and 7GB of manually added swap. I had to run it through an adb root shell instead of Termux, as the Android memory manager would kill Termux as soon as the model started to load. The downside to this approach is that everything else on my phone gets killed, meaning I couldn't even get the screen to turn on while inference was running.

main: mem per token = 22357508 bytes
main:     load time = 29320.15 ms
main:   sample time =  2254.09 ms
main:  predict time = 5227881.50 ms / 39307.38 ms per token
main:    total time = 5335562.00 ms

totoCZ commented Mar 16, 2023

Here is my quick look at 2x Intel Xeon Gold 5120 @ 2.20GHz, -march=native.

7B

main: mem per token = 14762244 bytes
main:     load time =  3378.15 ms
main:   sample time =    15.87 ms
main:  predict time =  4494.55 ms / 115.24 ms per token
main:    total time =  8328.48 ms

7B fp16

main: mem per token = 14532644 bytes
main:     load time = 27977.19 ms
main:   sample time =    24.71 ms
main:  predict time =  9378.29 ms / 275.83 ms per token
main:    total time = 38135.22 ms

13B

main: mem per token = 22562468 bytes
main:     load time = 16860.55 ms
main:   sample time =   170.45 ms
main:  predict time = 56121.11 ms / 308.36 ms per token
main:    total time = 74377.55 ms

13B fp16

main: mem per token = 22562468 bytes
main:     load time = 64448.62 ms
main:   sample time =   129.29 ms
main:  predict time = 61505.41 ms / 455.60 ms per token
main:    total time = 127347.54 ms

30B

main: mem per token = 43547620 bytes
main:     load time = 51269.82 ms
main:   sample time =    49.77 ms
main:  predict time = 41543.11 ms / 585.11 ms per token
main:    total time = 95383.98 ms

65B

main: mem per token = 71553028 bytes
main:     load time = 99438.78 ms
main:   sample time =    44.94 ms
main:  predict time = 69203.49 ms / 1017.70 ms per token
main:    total time = 218532.06 ms

This is with 14 of the 28 threads.
Running with 56 threads slows it down, probably due to NUMA.
I think 115 ms is still a good result for this CPU.

So if anyone like me was wondering whether having a million cores in a server CPU gets you a fast 65B model: the answer is no.

MarkSchmidty commented Mar 16, 2023

So if anyone like me was wondering whether having a million cores in a server CPU gets you a fast 65B model...

It's clear by now that llama.cpp speed mostly depends on maximum single-core performance for comparisons within the same CPU architecture, up to the point where all CPUs of that architecture perform approximately the same. Beyond that point, memory bandwidth and memory-bus chokepoints appear to be the major bottlenecks.

Using more cores can slow things down for two reasons:

  1. More memory-bus congestion from moving bits between more places. llama.cpp is well written and easily maxes out the memory bus even on moderately powerful systems.
  2. Reducing your effective maximum single-core performance to that of your slowest cores. This is usually the primary culprit on 4- or 6-core devices (mostly phones), which often have 2 performance cores and 2-4 balanced and/or "efficiency" cores.

With these lessons in mind, it would be good to see benchmark results from anyone who manages to find a yet-unknown optimization in their configuration, OS environment, or hardware environment.
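
As a rough sanity check of the memory-bandwidth argument, here is a back-of-the-envelope sketch (my own illustration, with assumed numbers rather than measurements from this thread): if each generated token has to stream essentially the whole quantized weight file from RAM once, then peak bandwidth divided by model size gives an upper bound on tokens per second.

```cpp
// Back-of-the-envelope ceiling on generation speed for a memory-bound model.
// The numbers below are assumptions for illustration only: ~3.9 GB for a
// 7B q4_0 GGML file, ~51 GB/s theoretical peak for dual-channel DDR4-3200.
#include <cstdio>

int main() {
    const double model_bytes   = 3.9e9;   // weights streamed per token (assumed)
    const double bandwidth_bps = 51e9;    // peak memory bandwidth in bytes/s (assumed)

    const double max_tok_per_s = bandwidth_bps / model_bytes;
    std::printf("upper bound: %.1f tokens/s (%.0f ms/token)\n",
                max_tok_per_s, 1000.0 / max_tok_per_s);
    return 0;
}
```

On that estimate, a dual-channel DDR4 desktop tops out around 13 tokens/s for 7B q4_0 no matter how many cores it has, which is roughly the ballpark of the desktop numbers reported above.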

hanvyj commented Mar 19, 2023

How are you getting such good performance?

I'm running an i7-10750H with 32GB RAM, using -m ./models/7B/ggml-model-f16.bin -t 12 -n 128

main: mem per token = 14499844 bytes
main:     load time =  8892.24 ms
main:   sample time =  1988.34 ms
main:  predict time = 270018.50 ms / 2093.17 ms per token
main:    total time = 287685.50 ms

2+ s per token! I get similar results with the 4-bit quant, if not worse.

Edit: Running with -m ./models/7B/ggml-model-q4_0.bin -t 12 -n 128

main: mem per token = 14499844 bytes
main:     load time =  1631.32 ms
main:   sample time =  1513.06 ms
main:  predict time = 574477.00 ms / 6047.13 ms per token
main:    total time = 596436.75 ms

@Green-Sky (Collaborator)

How are you getting such good performance?

I'm running an i7 10750h 32gig ram with -m ./models/7B/ggml-model-f16.bin -t 12 -n 128

Try:

  • fewer threads: your CPU seems to have only 6 real cores, and llama.cpp seems to scale poorly with extra threads (see the sketch below)
  • tell us your system_info line for more context, e.g.: system_info: n_threads = 8 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
  • make sure you compile with all optimizations
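
To illustrate the thread-count point, here is a small standalone check (my own sketch, not part of llama.cpp; the halving assumes 2-way SMT, which you should verify against your CPU's spec):

```cpp
// Sketch: estimate a sensible -t value from the logical thread count.
// std::thread::hardware_concurrency() reports *logical* threads, so on an
// SMT/Hyper-Threading part like the i7-10750H (6 cores / 12 threads) it
// returns 12; starting 12 compute-heavy threads oversubscribes the 6
// physical cores.
#include <cstdio>
#include <thread>

int main() {
    unsigned logical = std::thread::hardware_concurrency();
    if (logical == 0) logical = 1;                            // value may be unavailable
    unsigned physical_guess = logical > 1 ? logical / 2 : 1;  // assumes 2-way SMT
    std::printf("logical threads: %u, guessed physical cores: %u\n",
                logical, physical_guess);
    std::printf("try -t %u (or fewer) first\n", physical_guess);
    return 0;
}
```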

xportz commented Apr 10, 2023

How do you guys get these benchmarking results? When I CTRL+C out of the program, I get no output. Thanks.

jon-chuang (Contributor) commented Apr 11, 2023

Just curious, why was llama.cpp created when you can run the models on onnxruntime with the CPU backend? Could someone compare relative performance at the same quantization level, and also show the perplexity over a validation set?

I guess people just prefer the no-dependency route? But it seems like reinventing the wheel, or reimplementing already-optimized code.

https://onnxruntime.ai/docs/build/inferencing.html

EDIT: I guess one significant advantage is 4-bit quantization, which yields significant memory savings over 8-bit. But how does this affect perplexity?

@MarkSchmidty

Just curious, why was llama.cpp invented when you can run the models on onnxruntime with CPU backend? Could someone make a comparison of relative performance at the same quantization level and also show the perplexity over a validation set?

I guess people just prefer the no-dependency route? But it seems like reinventing the wheel or reimplementing optimized code?

onnxruntime.ai/docs/build/inferencing.html

EDIT: I guess one significant advantage is 4-bit quantization which results in significant memory savings over 8-bit. But how does this affect perplexity?

The effect of 4bit on perplexity is negligible thanks to GPTQ quantization, act order, and binning. 

4bit is twice as fast as 8bit because llama.cpp is efficient enough to be memory bound, not compute bound, even on modest processors. I have not seen comparisons of ONNX CPU speeds to llama.cpp for the same quantization level, but Hugging Face Transformers is roughly 20x slower than llama.cpp. I suspect ONNX is about as efficient as HF Transformers.

clulece commented Apr 12, 2023

4bit is twice as fast as 8bit because llama.cpp is efficient enough to be memory bound, not compute bound, even on modest processors. I have not seen comparisons of ONNX CPU speeds to llama.cpp for the same quantization level, but Hugging Face Transformers is roughly 20x slower than llama.cpp. I suspect ONNX is about as efficient as HF Transformers.

How important is CPU cache size to llama.cpp's performance? Do llama.cpp's memory access patterns cause the cache to be evicted often? (Naive me assumes yes, but I really don't know.)

@jon-chuang (Contributor)

How important is CPU cache size to llama.cpp's performance?

A: it doesn't seem super important: #778

@ridwanarf25

How do you guys get these benchmarking results? When I CTRL+C out of the program, I get no output. Thanks.

I think you can do it with the --mtest parameter.

@raghav-deepsource

Wish me luck, I'm running 65B with 6 cores and 32 GB of RAM.

@Green-Sky (Collaborator)

@raghav-deepsource Luck is what you need: you need at least ~60 GB of RAM for the 65B model. :)

raghav-deepsource commented Apr 19, 2023

Got it chugging at about 30 seconds per token with "recite the alphabet backwards". Interestingly, my memory usage didn't go up by much. Feels like the code may be paging the weights into memory on demand to reduce usage, or something.

ai-rex commented Apr 21, 2023

CPU: E5-2680v4 MEM: 64GB

$ ./build/bin/Release/main.exe -m ./models/65B/ggml-model-q4_0.bin -t 14 -n 128

system_info: n_threads = 14 / 28 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

llama_print_timings: load time = 22915.12 ms
llama_print_timings: sample time = 76.15 ms / 128 runs ( 0.59 ms per run)
llama_print_timings: prompt eval time = 4425.61 ms / 2 tokens ( 2212.81 ms per token)
llama_print_timings: eval time = 176678.85 ms / 127 runs ( 1391.17 ms per run)
llama_print_timings: total time = 199672.21 ms

ai-rex commented Apr 21, 2023

$ ./build/bin/Release/main.exe -m ./models/llama-7B-ggml-int4/ggml-model-q4_0.bin -t 14 -n 128

system_info: n_threads = 14 / 28 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

llama_print_timings: load time = 2677.89 ms
llama_print_timings: sample time = 75.61 ms / 128 runs ( 0.59 ms per run)
llama_print_timings: prompt eval time = 225.42 ms / 2 tokens ( 112.71 ms per token)
llama_print_timings: eval time = 19808.81 ms / 127 runs ( 155.97 ms per run)
llama_print_timings: total time = 22564.25 ms

kiratp commented May 1, 2023

M1 Max, maxed GPU, 64 GB.

Note that M1 Pro vs. Max matters beyond core count here, since the memory bandwidth doubles: 200 GB/s -> 400 GB/s.

10 or so Safari tabs in the background, ~6-10% idle CPU consumption observed before the start of the test.

Script: https://gist.github.com/kiratp/18826c1c085acf732f480e726b32686c (adapted from @KASR's script https://gist.github.com/KASR/dc3dd7f920f57013486583af7e3725f1#file-benchmark_threads_llama_cpp-py)

cmd = "./main \
     --seed 147369852 \
     --threads {threads} \
     --n_predict 128 \
     --model ./models/7B/ggml-model-q4_0.bin \
     --top_k 40 \
     --top_p 0.9 \
     --temp 0.5 \
     --repeat_last_n 64 \
     --repeat_penalty 1.1 \
     -p \"Write a funny joke:\" \
     --ignore-eos"
Running with 1 threads...
	 1 threads | run 1/3 | current token time 199.07 ms - eval time 24809.17 ms - prompt eval time 1592.53 ms
	 1 threads | run 2/3 | current token time 198.85 ms - eval time 24866.71 ms - prompt eval time 1590.83 ms
	 1 threads | run 3/3 | current token time 198.93 ms - eval time 24866.36 ms - prompt eval time 1591.47 ms
Running with 2 threads...
	 2 threads | run 1/3 | current token time 102.17 ms - eval time 12880.66 ms - prompt eval time 817.39 ms
	 2 threads | run 2/3 | current token time 102.09 ms - eval time 12880.23 ms - prompt eval time 816.71 ms
	 2 threads | run 3/3 | current token time 102.05 ms - eval time 12888.98 ms - prompt eval time 816.39 ms
Running with 3 threads...
	 3 threads | run 1/3 | current token time 71.74 ms - eval time 8931.11 ms - prompt eval time 573.96 ms
	 3 threads | run 2/3 | current token time 71.65 ms - eval time 8948.05 ms - prompt eval time 573.17 ms
	 3 threads | run 3/3 | current token time 71.31 ms - eval time 8933.5 ms - prompt eval time 570.51 ms
Running with 4 threads...
	 4 threads | run 1/3 | current token time 54.97 ms - eval time 6944.32 ms - prompt eval time 439.75 ms
	 4 threads | run 2/3 | current token time 54.81 ms - eval time 7153.19 ms - prompt eval time 438.51 ms
	 4 threads | run 3/3 | current token time 54.75 ms - eval time 7073.57 ms - prompt eval time 437.97 ms
Running with 5 threads...
	 5 threads | run 1/3 | current token time 46.04 ms - eval time 6177.01 ms - prompt eval time 368.34 ms
	 5 threads | run 2/3 | current token time 46.33 ms - eval time 6168.68 ms - prompt eval time 370.61 ms
	 5 threads | run 3/3 | current token time 47.62 ms - eval time 6172.55 ms - prompt eval time 380.94 ms
Running with 6 threads...
	 6 threads | run 1/3 | current token time 39.43 ms - eval time 5563.91 ms - prompt eval time 315.41 ms
	 6 threads | run 2/3 | current token time 39.38 ms - eval time 5543.76 ms - prompt eval time 315.03 ms
	 6 threads | run 3/3 | current token time 39.42 ms - eval time 5599.16 ms - prompt eval time 315.39 ms
Running with 7 threads...
	 7 threads | run 1/3 | current token time 34.34 ms - eval time 5676.61 ms - prompt eval time 274.74 ms
	 7 threads | run 2/3 | current token time 34.48 ms - eval time 5688.08 ms - prompt eval time 275.81 ms
	 7 threads | run 3/3 | current token time 34.19 ms - eval time 5681.7 ms - prompt eval time 273.52 ms
Running with 8 threads...
	 8 threads | run 1/3 | current token time 33.95 ms - eval time 5394.02 ms - prompt eval time 271.57 ms
	 8 threads | run 2/3 | current token time 33.29 ms - eval time 5358.99 ms - prompt eval time 266.32 ms
	 8 threads | run 3/3 | current token time 32.22 ms - eval time 5311.68 ms - prompt eval time 257.74 ms
Running with 9 threads...
	 9 threads | run 1/3 | current token time 87.65 ms - eval time 15074.75 ms - prompt eval time 701.22 ms
	 9 threads | run 2/3 | current token time 88.11 ms - eval time 13013.74 ms - prompt eval time 704.86 ms
	 9 threads | run 3/3 | current token time 85.37 ms - eval time 12599.68 ms - prompt eval time 682.97 ms
Running with 10 threads...
	 10 threads | run 1/3 | current token time 114.17 ms - eval time 17767.65 ms - prompt eval time 913.38 ms
	 10 threads | run 2/3 | current token time 107.66 ms - eval time 17790.2 ms - prompt eval time 861.27 ms
	 10 threads | run 3/3 | current token time 103.85 ms - eval time 16773.97 ms - prompt eval time 830.81 ms

[chart: Llama scaling]

@rankaiyx (Contributor)

i3-9100:
On the same platform, AVX2 is 1.4 times faster than AVX.

@rankaiyx (Contributor)

Here is my quick look at 2x Intel Xeon Gold 5120 @ 2.20GHz, march=native

7B

main: mem per token = 14762244 bytes
main:     load time =  3378.15 ms
main:   sample time =    15.87 ms
main:  predict time =  4494.55 ms / 115.24 ms per token
main:    total time =  8328.48 ms

7B fp16

main: mem per token = 14532644 bytes
main:     load time = 27977.19 ms
main:   sample time =    24.71 ms
main:  predict time =  9378.29 ms / 275.83 ms per token
main:    total time = 38135.22 ms

13B

main: mem per token = 22562468 bytes
main:     load time = 16860.55 ms
main:   sample time =   170.45 ms
main:  predict time = 56121.11 ms / 308.36 ms per token
main:    total time = 74377.55 ms

13B fp16

main: mem per token = 22562468 bytes
main:     load time = 64448.62 ms
main:   sample time =   129.29 ms
main:  predict time = 61505.41 ms / 455.60 ms per token
main:    total time = 127347.54 ms

30B

main: mem per token = 43547620 bytes
main:     load time = 51269.82 ms
main:   sample time =    49.77 ms
main:  predict time = 41543.11 ms / 585.11 ms per token
main:    total time = 95383.98 ms

65B

main: mem per token = 71553028 bytes
main:     load time = 99438.78 ms
main:   sample time =    44.94 ms
main:  predict time = 69203.49 ms / 1017.70 ms per token
main:    total time = 218532.06 ms

This is with 14 / 28 threads. Running with 56 threads slows it down, probably NUMA. I think 115ms is still a good result for this CPU.

So if anyone like me was wondering, does having a million cores in a server CPU give you a 65B model? The answer is no.

For the 56-thread case across the 2 NUMA nodes, can you try running it the following way?

$ numactl --interleave=0-1 ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 56

Reference link:
#1437

zrm (Contributor) commented May 16, 2023

Memory latency seems to have a significant effect. >15% difference between dual channel DDR4-3200 at CL22 vs. CL16.

zrm (Contributor) commented May 16, 2023

$ numactl --interleave=0-1 ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 56

If you previously ran it the other way you have to drop the page cache once first (or reboot):

# echo 3 > /proc/sys/vm/drop_caches

I'll probably post something that can do better than 'numactl --interleave=all' at some point, but for now that still gives better performance with the existing code.

ghost commented May 16, 2023

Since everyone's showing off their fancy machines, here's my 16GB (DDR3) i5-6500 with Linux/OpenBLAS.

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

7B q4_0:

llama_print_timings:        load time = 25853.83 ms
llama_print_timings:      sample time =    68.77 ms /   100 runs   (    0.69 ms per token)
llama_print_timings: prompt eval time = 25546.94 ms /   353 tokens (   72.37 ms per token)
llama_print_timings:        eval time = 25116.73 ms /    99 runs   (  253.70 ms per token)
llama_print_timings:       total time = 51069.18 ms

13B q5_0:

llama_print_timings:        load time = 47995.62 ms
llama_print_timings:      sample time =    68.99 ms /   100 runs   (    0.69 ms per token)
llama_print_timings: prompt eval time = 47271.48 ms /   353 tokens (  133.91 ms per token)
llama_print_timings:        eval time = 49603.13 ms /    99 runs   (  501.04 ms per token)
llama_print_timings:       total time = 97697.80 ms

Memory bandwidth: 20.023501 GB/s

Personally I care much more about the prompt eval time as my computer takes several minutes to ingest a 2k token prompt once it fills up the context and needs to rotate the buffer.

aseok commented May 31, 2023

Any benchmarks available for an Asus B250 Mining Expert with multiple AMD RX 570s? Any compatibility issues? Is this type of hardware configuration suitable for running inference?

clulece commented Jun 3, 2023

I've recently benchmarked a 7950X-based system (I was looking into how performance changes with thread count vs. full instances vs. RAM speed). The resulting data can be browsed interactively here: https://clulece.github.io/llamma-cpu-based-performance.

@valyagolev

7B q4_0:

system_info: n_threads = 8 / 16 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000

...

llama_print_timings:        load time =  1657.50 ms
llama_print_timings:      sample time =   124.92 ms /   128 runs   (    0.98 ms per token)
llama_print_timings: prompt eval time =  1192.14 ms /    14 tokens (   85.15 ms per token)
llama_print_timings:        eval time = 14527.86 ms /   127 runs   (  114.39 ms per token)
llama_print_timings:       total time = 16324.63 ms

on Hetzner Cloud Arm64 (Ampere), 16 vCPUs

ghost mentioned this issue Jun 5, 2023
aseok commented Jun 16, 2023

Poco F3, (8+5) GB, llama-7b.ggmlv3.q2_K.bin, OpenCL: ~1.48 t/s
system_info: n_threads = 8 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 512, n_keep = 0
llama_print_timings: load time = 808.72 ms
llama_print_timings: sample time = 678.71 ms / 391 runs ( 1.74 ms per token)
llama_print_timings: prompt eval time = 3468.61 ms / 8 tokens ( 433.58 ms per token)
llama_print_timings: eval time = 262932.38 ms / 390 runs ( 674.19 ms per token)
llama_print_timings: total time = 267274.42 ms

kiratp commented Jul 21, 2023

Anyone have access to a Sapphire Rapids system and can test with MKL (on those sweet AMX units)? Intel is claiming 2048 INT8 operations per cycle per core.

emr-cc commented Aug 1, 2023

llama-2-7b-chat.ggmlv3.q2_K.bin

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 |   NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

llama_print_timings:        load time =   873.88 ms
llama_print_timings:      sample time =     4.48 ms /    12 runs   (    0.37 ms per token,  2679.17 tokens per second)
llama_print_timings: prompt eval time =   933.12 ms /     8 tokens (  116.64 ms per token,     8.57 tokens per second)
llama_print_timings:        eval time =   722.88 ms /    11 runs   (   65.72 ms per token,    15.22 tokens per second)
llama_print_timings:       total time =  1662.61 ms

Running on a Ryzen 9 6900HX, 12GB 6850M XT, with 32GB of RAM.

ghost commented Aug 4, 2023

@ggerganov just to torture you a little more before yours arrives, here are benchmarks for M2 Ultra 76c GPU / 192 GB RAM as of 8183159 with only 16 CPU threads:

CPU only:
7B q4_0: prompt eval 74.3 tokens/s, inference 42 tokens/s
7B f16: prompt eval 25 tokens/s, inference 16.7 tokens/s
65B q4_0: prompt eval 7.2 tokens/s, inference 5.71 tokens/s
65B f16: prompt eval 1.6 tokens/s, inference 1.9 tokens/s

GPU only (-ngl 1):
7B q4_0: prompt eval 72.5 tokens/s, inference 83 tokens/s
7B f16: prompt eval 10.2 tokens/s, inference 26.9 tokens/s
65B q4_0: prompt eval 8.74 tokens/s, inference 14.2 tokens/s
65B f16: prompt eval 0.9 tokens/s, inference 3.5 tokens/s

Oh, also, the M2 GPU coil whine is audible when doing inference, but the fan doesn't turn up at all.

grigio commented Aug 9, 2023

llama2_7b_chat_uncensored.ggmlv3.q2_K.bin

system_info: n_threads = 6 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

7B q2_K

llama_print_timings:        load time =   327,71 ms
llama_print_timings:      sample time =    12,21 ms /    32 runs   (    0,38 ms per token,  2620,80 tokens per second)
llama_print_timings: prompt eval time =   306,55 ms /    12 tokens (   25,55 ms per token,    39,15 tokens per second)
llama_print_timings:        eval time =  1950,89 ms /    31 runs   (   62,93 ms per token,    15,89 tokens per second)
llama_print_timings:       total time =  2274,46 ms

13B q4_K_S

llama_print_timings:        load time =  4703,91 ms
llama_print_timings:      sample time =   210,78 ms /   553 runs   (    0,38 ms per token,  2623,60 tokens per second)
llama_print_timings: prompt eval time = 12411,64 ms /   269 tokens (   46,14 ms per token,    21,67 tokens per second)
llama_print_timings:        eval time = 86960,41 ms /   551 runs   (  157,82 ms per token,     6,34 tokens per second)
llama_print_timings:       total time = 99669,44 ms

30B q2_K

llama_print_timings:        load time =  8828,89 ms
llama_print_timings:      sample time =    62,47 ms /   168 runs   (    0,37 ms per token,  2689,25 tokens per second)
llama_print_timings: prompt eval time =  1451,25 ms /    12 tokens (  120,94 ms per token,     8,27 tokens per second)
llama_print_timings:        eval time = 50421,39 ms /   167 runs   (  301,92 ms per token,     3,31 tokens per second)
llama_print_timings:       total time = 51960,64 ms

30B q4_K_S

llama_print_timings:        load time = 15271.31 ms
llama_print_timings:      sample time =     4.35 ms /    12 runs   (    0.36 ms per token,  2758.62 tokens per second)
llama_print_timings: prompt eval time = 15271.28 ms /    75 tokens (  203.62 ms per token,     4.91 tokens per second)
llama_print_timings:        eval time =  4426.11 ms /    11 runs   (  402.37 ms per token,     2.49 tokens per second)
llama_print_timings:       total time = 19718.30 ms
Output generated in 20.04 seconds (0.55 tokens/s, 11 tokens, context 75, seed 2103820326)

70B q4_K_S platypus2-70b-instruct.gguf.q4_K_S.bin

llama_print_timings:        load time = 30837.91 ms
llama_print_timings:      sample time =     8.26 ms /    22 runs   (    0.38 ms per token,  2664.41 tokens per second)
llama_print_timings: prompt eval time =  7082.51 ms /    23 tokens (  307.94 ms per token,     3.25 tokens per second)
llama_print_timings:        eval time = 18292.52 ms /    21 runs   (  871.07 ms per token,     1.15 tokens per second)
llama_print_timings:       total time = 25388.38 ms

AMD Ryzen 7 7700, 48GB RAM, on Linux.

slaren (Collaborator) commented Aug 12, 2023

13900k, DDR5 6400:

7B q4_0 gen 128:

| Threads |  ms/t |   t/s |
| ------: | ----: | ----: |
|       4 | 79.08 | 12.65 |
|       8 | 54.74 | 18.27 |
|      16 | 63.86 | 15.66 |
|      24 | 59.84 | 16.71 |
|      32 | 62.96 | 15.88 |

7B q4_0 pp 512:

| Threads |  ms/t |   t/s |
| ------: | ----: | ----: |
|       4 | 44.56 | 22.44 |
|       8 | 29.70 | 33.67 |
|      16 | 27.23 | 36.73 |
|      24 | 21.26 | 47.04 |
|      32 | 18.11 | 55.21 |

kiratp commented Aug 19, 2023

GCP instance: c3-highcpu-44

Compiled with:
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_CXX_FLAGS="-Ofast -mSAPPHIRERAPIDS -xSAPPHIRERAPIDS -qopt-zmm-usage=high -mno-shstk"

./build/bin/main --model /usr/src/models/<llama2 merged LORA model>.ggml.q4_k_m.bin --ctx-size 4096 --threads 42 -eps 1e-5 -p "Building a website can be done in 10 simple steps:"

llama_print_timings:        load time =   217.25 ms
llama_print_timings:      sample time =   300.20 ms /   491 runs   (    0.61 ms per token,  1635.58 tokens per second)
llama_print_timings: prompt eval time =   215.73 ms /    14 tokens (   15.41 ms per token,    64.90 tokens per second)
llama_print_timings:        eval time = 22698.24 ms /   490 runs   (   46.32 ms per token,    21.59 tokens per second)
llama_print_timings:       total time = 23328.41 ms

22 threads (the hardware core count) is slower at around 18-19 t/sec

Q8_0

llama_print_timings:        load time =   367.61 ms
llama_print_timings:      sample time =   292.28 ms /   477 runs   (    0.61 ms per token,  1631.97 tokens per second)
llama_print_timings: prompt eval time =   232.65 ms /    14 tokens (   16.62 ms per token,    60.18 tokens per second)
llama_print_timings:        eval time = 33848.05 ms /   476 runs   (   71.11 ms per token,    14.06 tokens per second)
llama_print_timings:       total time = 34484.84 ms

kiratp commented Aug 19, 2023

This is on a GCP t2d-standard-32 instance: 32 Milan cores with SMT/HT turned off, so 1 core = 1 physical core - https://cloud.google.com/compute/docs/general-purpose-machines#t2d_machines

root@ml-perf-testing:/usr/src/app/llama.cpp# ./llama-bench --model /usr/src/models/<llama2 7B merged lora>.ggml.q4_k_m.bin -t 2,8,16,22,28,30,31,32
| model                          | backend    |  n_threads | test       |             t/s |
| ------------------------------ | ---------- | ---------: | ---------- | --------------: |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          2 | pp 512     |    34.75 ± 1.77 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          8 | pp 512     |    34.44 ± 1.71 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         16 | pp 512     |    34.57 ± 1.11 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         22 | pp 512     |    33.10 ± 0.98 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         28 | pp 512     |    29.34 ± 5.15 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         30 | pp 512     |    23.68 ± 1.64 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         31 | pp 512     |    25.66 ± 2.93 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         32 | pp 512     |    32.49 ± 2.29 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          2 | tg 128     |     4.43 ± 1.70 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          8 | tg 128     |    17.72 ± 1.97 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         16 | tg 128     |    21.51 ± 0.52 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         22 | tg 128     |    21.18 ± 1.47 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         28 | tg 128     |    19.85 ± 0.79 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         30 | tg 128     |    17.71 ± 3.40 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         31 | tg 128     |    21.59 ± 1.88 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         32 | tg 128     |    17.96 ± 4.88 |

build: 1f0bccb (1007)

This line from llama.cpp seems to explain the pp speed curve:

    // for big prompts, if BLAS is enabled, it is better to use only one thread
    // otherwise, the threads are spin-lock waiting for the BLAS calls and are degrading the performance
    n_threads = N >= 32 && ggml_cpu_has_blas() && !ggml_cpu_has_gpublas() ? 1 : n_threads;

AlessandroSpallina commented Aug 24, 2023

Here are my results, I'm using CPU only:

[chart: cpu_benchmark_general]
[chart: cpu_benchmark_detail]

So based on these data, should I allocate 46 threads for a chatbot use case?

tijszwinkels commented Oct 10, 2023

Currently, I can get a MacBook Pro with an M1 Max for similar money to an M2 Pro.
Both have the same number of tensor cores, but the M2 Pro cores should be faster. However, the M1 Max has twice the memory bandwidth.

Following the discussion up here, which would be better for LLaMA inference?

@myname36

@ggerganov just to torture you a little more before yours arrives, here are benchmarks for M2 Ultra 76c GPU / 192 GB RAM as of 8183159 with only 16 CPU threads:

CPU only: 7B q4_0: prompt eval 74.3 tokens/s, inference 42 tokens/s 7B f16: prompt eval 25 tokens/s, inference 16.7 tokens/s 65B q4_0: prompt eval 7.2 tokens/s, inference 5.71 tokens/s 65B f16: prompt eval 1.6 tokens/s, inference 1.9 tokens/s

GPU only (-ngl 1): 7B q4_0: prompt eval 72.5 tokens/s, inference 83 tokens/s 7B f16: prompt eval 10.2 tokens/s, inference 26.9 tokens/s 65B q4_0: prompt eval 8.74 tokens/s, inference 14.2 tokens/s 65B f16: prompt eval 0.9 tokens/s, inference 3.5 tokens/s

Oh, also, the M2 GPU coil whine is audible when doing inference, but the fan doesn't turn up at all.

Wow, am I missing something here? I have an i7-12700H, 64GB RAM, and an RTX 3050 with 4GB VRAM, and I don't get nearly half this performance. How are you getting these crazy results?

ghost commented Oct 11, 2023

@myname36: M2 Ultra has the GPU equivalent of a 3070 with >100 GB of VRAM and no need to copy over PCIe from CPU to GPU.

xucian commented Nov 5, 2023

M2 Ultra

Damn. Could you hook up a few of these to run the 70B? It seems that at this pace Apple Silicon will eventually dominate the ML hardware market.

Nan-Do commented Nov 24, 2023

Processor: Snapdragon 870 / 8GB of RAM
make flags: make -j 8 (compiled on Termux; tried to use CLBlast with the Vulkan backend as there is no native one; OpenCL apps like clinfo and clpeak work fine, but running llama.cpp fails)
Model: zephyr-7b-beta.Q4_K_M.gguf
System info:
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
Log result:

llama_print_timings:      sample time =      25.51 ms /    84 runs   (    0.30 ms per token,  3292.70 tokens per second)
llama_print_timings: prompt eval time =     931.67 ms /     6 tokens (  155.28 ms per token,     6.44 tokens per second)
llama_print_timings:        eval time =   17652.66 ms /    83 runs   (  212.68 ms per token,     4.70 tokens per second)
llama_print_timings:       total time =   18634.25 ms

Pretty cool result for a mobile CPU from 4-5 generations ago; the model is totally usable at ~5 tokens per second.

@jacooooooooool

Alright, so I got GPT-4 to write me a C equivalent. I am not sure as to its quality, but cursory analysis seems to indicate that it is correct, though I think there is a bunch of overhead in the call to pthread_create. Same repo: https://github.com/kiratp/memory-bandwidth

Same M1:

Thread 6 completed. Accumulated value: 499971380.906237
Thread 0 completed. Accumulated value: 500034749.012463
Thread 1 completed. Accumulated value: 499991005.713216
Thread 3 completed. Accumulated value: 500045810.595650
Thread 5 completed. Accumulated value: 500009162.631447
Thread 2 completed. Accumulated value: 500017471.449399
Elapsed time: 1.029004 seconds
Memory bandwidth: 62.196065 GB/s

Same Threadripper:

<Snipping 64 thread readouts>
Thread 20 completed. Accumulated value: 499997710.298199
Elapsed time: 6.650833 seconds
Memory bandwidth: 76.982839 GB/s

------------------------------ My tests --------------------------
For AMD Threadripper processors, I recommend compiling with clang; it makes a difference.

Compilation with gcc & clang on Linux Ubuntu 22
Instructions: https://www.amd.com/content/dam/amd/en/documents/txt/aocc-4.0.0-readme.txt

~/Downloads/memory-bandwidth/c$ gcc memory_bandwidth.c -o memorygcc
~/Downloads/memory-bandwidth/c$ clang memory_bandwidth.c -o memoryclang


~/Downloads/memory-bandwidth/c$ ./memorygcc
Thread 5 completed. Accumulated value: 499996813.458786
Thread 6 completed. Accumulated value: 499960337.415465
Thread 2 completed. Accumulated value: 499968170.145612
Thread 3 completed. Accumulated value: 500054010.424205
Thread 7 completed. Accumulated value: 499965792.708528
Thread 4 completed. Accumulated value: 500011268.358132
Thread 1 completed. Accumulated value: 499988091.667167
Thread 0 completed. Accumulated value: 499986819.865105
Elapsed time: 3.632777 seconds
Memory bandwidth: 17.617375 GB/s

~/Downloads/memory-bandwidth/c$ ./memoryclang
Thread 2 completed. Accumulated value: 499986598.725620
Thread 5 completed. Accumulated value: 499954520.078316
Thread 1 completed. Accumulated value: 499998191.870599
Thread 0 completed. Accumulated value: 499977712.885238
Thread 6 completed. Accumulated value: 499955904.518224
Thread 4 completed. Accumulated value: 499945673.589080
Thread 3 completed. Accumulated value: 499984568.362708
Thread 7 completed. Accumulated value: 500036217.412915
Elapsed time: 2.171946 seconds
Memory bandwidth: 29.466667 GB/s
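
For anyone who wants to reproduce this kind of number without the linked repo, here is a minimal sketch of a multi-threaded read-bandwidth test in the same spirit (my own illustration, not the code discussed above; buffer sizes and the hardware_concurrency fallback are arbitrary choices):

```cpp
// Sketch of a multi-threaded memory read-bandwidth test: each thread sums a
// large private array of doubles, and aggregate bytes-read / wall-time gives GB/s.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::size_t n_threads = std::thread::hardware_concurrency();
    if (n_threads == 0) n_threads = 4;                        // fallback if unknown
    const std::size_t total_doubles = std::size_t(1) << 29;   // ~4 GiB of data overall
    const std::size_t per_thread    = total_doubles / n_threads;

    // One private buffer per thread so each thread streams its own memory.
    std::vector<std::vector<double>> data(n_threads,
                                          std::vector<double>(per_thread, 1.0));
    std::vector<double> sums(n_threads, 0.0);

    const auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < n_threads; ++i) {
        workers.emplace_back([&, i] {
            sums[i] = std::accumulate(data[i].begin(), data[i].end(), 0.0);
        });
    }
    for (auto &w : workers) w.join();
    const auto t1 = std::chrono::steady_clock::now();

    const double secs  = std::chrono::duration<double>(t1 - t0).count();
    const double bytes = double(n_threads) * double(per_thread) * sizeof(double);
    std::printf("checksum: %.0f\n", std::accumulate(sums.begin(), sums.end(), 0.0));
    std::printf("Elapsed time: %f seconds\n", secs);
    std::printf("Memory bandwidth: %f GB/s\n", bytes / secs / 1e9);
    return 0;
}
```

As the gcc vs. clang numbers above show, compiler and flags matter a lot for a loop like this, so build it with optimizations and thread support enabled, e.g. g++ -O3 -pthread bandwidth.cpp.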

github-actions bot added the stale label Mar 25, 2024
github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 9, 2024
drosanda commented Dec 15, 2024

base command

~/llama.cpp/build/bin/llama-cli -m ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf -p "what is the meaning of life?" -n 128

Orange Pi 3B: Rockchip RK3566 quad-core 64-bit processor | 4 GB LPDDR4

llama_perf_sampler_print:    sampling time =     170.97 ms /   136 runs   (    1.26 ms per token,   795.48 tokens per second)
llama_perf_context_print:        load time =    4450.58 ms
llama_perf_context_print: prompt eval time =    4245.68 ms /     8 tokens (  530.71 ms per token,     1.88 tokens per second)
llama_perf_context_print:        eval time =  124973.60 ms /   127 runs   (  984.04 ms per token,     1.02 tokens per second)
llama_perf_context_print:       total time =  129561.47 ms /   135 tokens

intel core i5 13400F | 64GB DDR5

llama_perf_sampler_print:    sampling time =       8.69 ms /   137 runs   (    0.06 ms per token, 15772.51 tokens per second)
llama_perf_context_print:        load time =    1397.88 ms
llama_perf_context_print: prompt eval time =      88.24 ms /     9 tokens (    9.80 ms per token,   102.00 tokens per second)
llama_perf_context_print:        eval time =    2226.40 ms /   127 runs   (   17.53 ms per token,    57.04 tokens per second)
llama_perf_context_print:       total time =    2343.64 ms /   136 tokens

intel core i5 6300u | 16GB DDR4 | WSL

llama_perf_sampler_print:    sampling time =      22.59 ms /   137 runs   (    0.16 ms per token,  6064.63 tokens per second)
llama_perf_context_print:        load time =     920.13 ms
llama_perf_context_print: prompt eval time =     339.01 ms /     9 tokens (   37.67 ms per token,    26.55 tokens per second)
llama_perf_context_print:        eval time =    8976.43 ms /   127 runs   (   70.68 ms per token,    14.15 tokens per second)
llama_perf_context_print:       total time =    9384.25 ms /   136 tokens
