benchmarks? #34
M1 with 7B model: 94.24 ms per token with -t 4 on the command line; with -t 8 it runs at half that speed. |
Using -t 8 on the command line. Note this is in a VM assigned 42 of the server's 44 logical cores, with other services running on the same machine. |
M1 Pro 32GB, 30B model: main: mem per token = 43387780 bytes |
MacBook Pro 2013, Intel i5, 2 cores, 8 GB RAM. Thank you for this awesome project. |
Ryzen 7 3700X, 128GB RAM @ 3200, llama.cpp numbers:
With the 30B model, an RTX 3090 manages 15 tokens/s using text-generation-webui |
Reposting the 1.2 token/second Samsung S22 Ultra result here. (Originally posted in #58) |
Here is my quick look at 7B
7B fp16
13B
13B fp16
30B
65B
This is with 14 / 28 threads. So, in case anyone like me was wondering: does having a million cores in a server CPU get you a usable 65B model? |
It's clear by now that llama.cpp speed mostly depends on maximum single-core performance when comparing within the same CPU architecture, up to a limit beyond which all CPUs of the same architecture perform approximately the same. Memory bandwidth and memory-bus chokepoints appear to be the major bottlenecks after that point. Using more cores can actually slow things down, for example because the extra threads contend for the same memory bandwidth and add synchronization overhead.
With these learnings in mind, it would be good to see benchmark results from anyone who manages to find some as-yet-unknown optimization in their configuration, OS environment, or hardware environment. |
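As a quick sanity check on the bandwidth ceiling described above, generation speed can be bounded by dividing sustained memory bandwidth by the bytes that must be streamed per generated token (roughly the size of the quantized weights). The sketch below is only a back-of-the-envelope estimate; the model sizes and the 50 GB/s figure are illustrative assumptions, not measurements from this thread.

```c
/* Back-of-the-envelope bound: if generation is memory-bandwidth bound,
 * every token has to stream (roughly) the whole quantized model through
 * the CPU, so tokens/s <= bandwidth / model size.
 * All numbers below are illustrative placeholders. */
#include <stdio.h>

int main(void) {
    const char  *name[]     = {"7B", "13B", "30B", "65B"};
    const double model_gb[] = {3.9, 7.3, 18.3, 36.7};   /* rough q4_0 weight sizes */
    const double bandwidth_gbs = 50.0;                   /* e.g. dual-channel DDR4-3200-class */

    for (int i = 0; i < 4; i++) {
        printf("%-3s q4_0 (~%4.1f GB): <= %.1f tokens/s at %.0f GB/s\n",
               name[i], model_gb[i], bandwidth_gbs / model_gb[i], bandwidth_gbs);
    }
    return 0;
}
```

Measured numbers land below this ceiling because of compute overhead, the KV cache, and imperfect bandwidth utilization, but it helps explain why adding cores stops helping once the memory bus is saturated.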
How are you getting such good performance? I'm running an i7-10750H, 32 GB RAM, with
2+ s per token! I get similar with the 4-bit quant, if not worse. Edit: Running with
|
Try:
|
How do you guys get these benchmarking results? When I CTRL+C out of the program, I get no output. Thanks. |
Just curious, why was llama.cpp invented when you can run the models on onnxruntime with CPU backend? Could someone make a comparison of relative performance at the same quantization level and also show the perplexity over a validation set? I guess people just prefer the no-dependency route? But it seems like reinventing the wheel or reimplementing optimized code? https://onnxruntime.ai/docs/build/inferencing.html EDIT: I guess one significant advantage is 4-bit quantization which results in significant memory savings over 8-bit. But how does this affect perplexity? |
The effect of 4bit on perplexity is negligible thanks to GPTQ quantization, act order, and binning. 4bit is twice as fast as 8bit because llama.cpp is efficient enough to be memory bound, not compute bound, even on modest processors. I have not seen comparisons of ONNX CPU speeds to llama.cpp for the same quantization level, but Hugging Face Transformers is roughly 20x slower than llama.cpp. I suspect ONNX is about as efficient as HF Transformers. |
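To put a rough number on the "twice as fast" claim under that memory-bound assumption, compare bytes per weight for the two formats. The block layouts below (q4_0: 32 packed 4-bit weights plus an fp16 scale per block; q8_0: 32 int8 weights plus an fp16 scale) reflect my understanding of the standard GGML formats, so treat the result as an estimate rather than a benchmark.

```c
/* Expected q4_0 vs q8_0 speed ratio under a purely memory-bound model:
 * speed scales inversely with bytes read per weight.
 * Block sizes assume the standard GGML layouts (an approximation). */
#include <stdio.h>

int main(void) {
    const double q4_bytes_per_weight = 18.0 / 32.0;  /* 2-byte scale + 16 bytes of nibbles -> 4.5 bits/weight */
    const double q8_bytes_per_weight = 34.0 / 32.0;  /* 2-byte scale + 32 int8 bytes       -> 8.5 bits/weight */

    printf("q4_0: %.2f bits/weight, q8_0: %.2f bits/weight\n",
           q4_bytes_per_weight * 8.0, q8_bytes_per_weight * 8.0);
    printf("expected q4_0 speedup over q8_0: ~%.2fx\n",
           q8_bytes_per_weight / q4_bytes_per_weight);
    return 0;
}
```

That works out to roughly 1.9x, in line with the "twice as fast" observation above.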
How important is CPU cache size to llama.cpp's performance? Do llama's memory access patterns cause the cache to be evicted often? (Naive me assumes yes, but I really don't know.) |
A: doesn't seem super important: #778 |
I think you can do it with the --mtest parameter |
Wish me luck, I'm running 65B with 6 cores and 32 GB of RAM |
@raghav-deepsource Luck is what you need: you need at least ~60 GB of RAM for the 65B model. :) |
Got it chugging at about 30 seconds per token with "recite the alphabet backwards". Interestingly, my memory usage didn't go up by much. Feels like the code may be paging the weights into memory to reduce usage or something |
CPU: E5-2680v4, MEM: 64GB
$ ./build/bin/Release/main.exe -m ./models/65B/ggml-model-q4_0.bin -t 14 -n 128
system_info: n_threads = 14 / 28 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama_print_timings: load time = 22915.12 ms |
$ ./build/bin/Release/main.exe -m ./models/llama-7B-ggml-int4/ggml-model-q4_0.bin -t 14 -n 128
system_info: n_threads = 14 / 28 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama_print_timings: load time = 2677.89 ms |
M1 Max, maxed GPU, 64 GB. Note that M1 Pro vs. Max matters beyond core count here, since memory bandwidth doubles (200 GB/s -> 400 GB/s). 10 or so Safari tabs in the background; ~6-10% idle CPU consumption observed before the start of the test. Script: https://gist.github.com/kiratp/18826c1c085acf732f480e726b32686c
|
I3-9100 |
For the 56-thread, 2-NUMA-node setup, can you try running it the following way?
$ numactl --interleave=0-1 ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 56
Reference link: |
Memory latency seems to have a significant effect. >15% difference between dual channel DDR4-3200 at CL22 vs. CL16. |
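For context on that comparison, CAS latency in nanoseconds is the CL value divided by the memory clock (half the DDR transfer rate). A minimal sketch of the arithmetic, assuming DDR4-3200 in both cases:

```c
/* CAS latency in ns = CL cycles / memory clock, where the memory clock is
 * half the DDR transfer rate (3200 MT/s -> 1600 MHz). */
#include <stdio.h>

static double cas_ns(double transfer_rate_mts, double cl) {
    double clock_mhz = transfer_rate_mts / 2.0;
    return cl / clock_mhz * 1000.0;   /* cycles / MHz -> ns */
}

int main(void) {
    printf("DDR4-3200 CL22: %.2f ns\n", cas_ns(3200.0, 22.0));  /* ~13.75 ns */
    printf("DDR4-3200 CL16: %.2f ns\n", cas_ns(3200.0, 16.0));  /* ~10.00 ns */
    return 0;
}
```

So the CL16 kit responds roughly 3.75 ns (about 27%) faster per access, which at least points in the same direction as the >15% difference reported above.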
If you previously ran it the other way, you have to drop the page cache once first (or reboot):
# echo 3 > /proc/sys/vm/drop_caches
I'll probably post something that can do better than 'numactl --interleave=all' at some point, but for now that still gives the best performance with the existing code. |
Since everyone's showing off their fancy machines, here's my 16 GB (DDR3) i5-6500 with Linux/OpenBLAS.
7B q4_0:
13B q5_0:
Memory bandwidth:
Personally I care much more about the prompt eval time as my computer takes several minutes to ingest a 2k token prompt once it fills up the context and needs to rotate the buffer. |
Are any benchmarks available for an ASUS Mining Expert B250 with multiple AMD RX 570s? Any compatibility issues? Is this type of hardware configuration suitable for running inference? |
I've recently benchmarked a 7950X-based system (I was looking into how performance changes with thread count vs. multiple full instances vs. RAM speed). The resulting data can be browsed interactively here: https://clulece.github.io/llamma-cpu-based-performance. |
7B q4_0:
on the Hetzner Cloud Arm64 Ampere, 16 VCPU |
Poco F3, (8+5)Gb, llama-7b.ggmlv3.q2_K.bin, opencl: ~1.48 (t/s) |
Does anyone have access to a Sapphire Rapids system and can test with MKL (on those sweet AMX units)? Intel is claiming 2048 INT8 operations per cycle per core. |
llama-2-7b-chat.ggmlv3.q2_K.bin
Running on a Ryzen 9 6900HX with a 12 GB 6850M XT and 32 GB of RAM. |
@ggerganov Just to torture you a little more before yours arrives, here are benchmarks for an M2 Ultra (76-core GPU, 192 GB RAM) as of 8183159, with only 16 CPU threads:
CPU only:
GPU only (-ngl 1):
Oh, also, the M2 GPU coil whine is audible when doing inference, but the fan doesn't turn up at all. |
llama2_7b_chat_uncensored.ggmlv3.q2_K.bin
7B q2_K
13B q4_K_S
30B q2_K
30B q4_K_S
70B q4_K_S platypus2-70b-instruct.gguf.q4_K_S.bin
AMD Ryzen 7 7700 48GB ram on Linux |
13900k, DDR5 6400: 7B q4_0 gen 128:
7B q4_0 pp 512:
|
GCP instance: c3-highcpu-44 Compiled with:
22 threads (the hardware core count) is slower, at around 18-19 t/s.
Q8_0
|
This is on a GCP t2d-standard-32 instance: 32 Milan cores with SMT/HT turned off, so 1 core = 1 physical core. https://cloud.google.com/compute/docs/general-purpose-machines#t2d_machines
This line from llama.cpp seems to explain the prompt-processing (pp) speed curve. Line 1841 in 1f0bccb
|
Currently, I can get a MacBook Pro with an M1 Max for similar money to an M2 Pro. Following the discussion up here, which would be better for Llama inference? |
Wooow, am I missing something here? I have an i7 12700H, 64 GB RAM, and an RTX 3050 with 4 GB VRAM, and I don't get nearly half this performance. How are you getting these crazy results? |
@myname36: M2 Ultra has the GPU equivalent of a 3070 with >100 GB of VRAM and no need to copy over PCIe from CPU to GPU. |
Damn. Could you hook up a few of these to run the 70B? It seems at this pace Apple Silicon will eventually dominate the ML hardware market. |
Processor:
Pretty cool result for a mobile CPU from 4-5 generations ago; the model is totally usable at ~5 tokens per second. |
------------------------------ My tests --------------------------
Compilation: gcc & clang, Linux Ubuntu 22
~/Downloads/memory-bandwidth/c$ gcc memory_bandwidth.c -o memorygcc
~/Downloads/memory-bandwidth/c$ ./memorygcc
|
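The memory_bandwidth.c source isn't shown in the comment above, so the following is only an assumed stand-in for what such a test typically does: stream a buffer much larger than the last-level cache and divide the bytes moved by the elapsed time.

```c
/* Minimal streaming-read bandwidth test: sum a buffer much larger than the
 * last-level cache and report bytes read per second. This is a stand-in for
 * the (not shown) memory_bandwidth.c, not the original program. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void) {
    const size_t n = 512UL * 1024 * 1024 / sizeof(double);  /* 512 MiB of doubles */
    const int iterations = 10;
    double *buf = malloc(n * sizeof(double));
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 1, n * sizeof(double));   /* touch every page so it is resident */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile double sum = 0.0;            /* volatile keeps the loop from being optimized away */
    for (int it = 0; it < iterations; it++)
        for (size_t i = 0; i < n; i++)
            sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gb   = (double)iterations * n * sizeof(double) / 1e9;
    printf("read %.1f GB in %.2f s -> %.2f GB/s (checksum %f)\n", gb, secs, gb / secs, sum);
    free(buf);
    return 0;
}
```

Note that a single-threaded streaming read like this usually undershoots the platform's aggregate multi-channel bandwidth; a multi-threaded version gets closer to the figure that matters for llama.cpp.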
base command: ~/llama.cpp/build/bin/llama-cli -m ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf -p "what is the meaning of life?" -n 128

orangepi 3B: Rockchip RK3566 quad-core 64-bit processor | 4 GB LPDDR4
llama_perf_sampler_print: sampling time = 170.97 ms / 136 runs ( 1.26 ms per token, 795.48 tokens per second)
llama_perf_context_print: load time = 4450.58 ms
llama_perf_context_print: prompt eval time = 4245.68 ms / 8 tokens ( 530.71 ms per token, 1.88 tokens per second)
llama_perf_context_print: eval time = 124973.60 ms / 127 runs ( 984.04 ms per token, 1.02 tokens per second)
llama_perf_context_print: total time = 129561.47 ms / 135 tokens

intel core i5 13400F | 64GB DDR5
llama_perf_sampler_print: sampling time = 8.69 ms / 137 runs ( 0.06 ms per token, 15772.51 tokens per second)
llama_perf_context_print: load time = 1397.88 ms
llama_perf_context_print: prompt eval time = 88.24 ms / 9 tokens ( 9.80 ms per token, 102.00 tokens per second)
llama_perf_context_print: eval time = 2226.40 ms / 127 runs ( 17.53 ms per token, 57.04 tokens per second)
llama_perf_context_print: total time = 2343.64 ms / 136 tokens

intel core i5 6300u | 16GB DDR4 | WSL
llama_perf_sampler_print: sampling time = 22.59 ms / 137 runs ( 0.16 ms per token, 6064.63 tokens per second)
llama_perf_context_print: load time = 920.13 ms
llama_perf_context_print: prompt eval time = 339.01 ms / 9 tokens ( 37.67 ms per token, 26.55 tokens per second)
llama_perf_context_print: eval time = 8976.43 ms / 127 runs ( 70.68 ms per token, 14.15 tokens per second)
llama_perf_context_print: total time = 9384.25 ms / 136 tokens |
Where are the benchmarks for various hardware, e.g. Apple Silicon?