Mistral Large benchmarking
Oleksandr Kuvshynov edited this page Aug 14, 2024 · 4 revisions
Hardware:
- Apple M2 Ultra
- 192 GB RAM
- 76-core GPU
- 24-core CPU

Models:
- Main model: Mistral Large 123B Instruct, Q8
- Draft model: Mistral-7B-v0.3 Instruct, various quantization levels

Context: 4096 tokens
Baseline, main model only:

```
time ./llama-cli -m ../llms/gguf/mistral2407.q8.gguf -n 512 -c 4096 -f ../llama_duo/test_prompt.txt -ngl 99

llama_print_timings:        load time =    6360.38 ms
llama_print_timings:      sample time =      10.40 ms /   512 runs   (    0.02 ms per token, 49211.84 tokens per second)
llama_print_timings: prompt eval time =    1845.44 ms /   108 tokens (   17.09 ms per token,    58.52 tokens per second)
llama_print_timings:        eval time =  102148.51 ms /   511 runs   (  199.90 ms per token,     5.00 tokens per second)
llama_print_timings:       total time =  104026.03 ms /   619 tokens
...
1.25s user 5.25s system 5% cpu 1:50.48 total
```
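As a sanity check, the per-token latency and throughput in the llama-cli eval line follow directly from the reported totals; a minimal sketch of the arithmetic:

```python
# Reproduce the eval-phase numbers from the llama-cli log above.
eval_time_ms = 102148.51  # total eval time reported by llama_print_timings
eval_runs = 511           # tokens generated during eval

ms_per_token = eval_time_ms / eval_runs
tokens_per_second = eval_runs / (eval_time_ms / 1000.0)

print(f"{ms_per_token:.2f} ms per token")   # ~199.90, matching the log
print(f"{tokens_per_second:.2f} tokens/s")  # ~5.00, matching the log
```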
Speculative decoding with duo, Q4 draft model on CPU:

```
time ./_build/duo -m ../llms/gguf/mistral2407.q8.gguf -md ../llms/gguf/mistral7b.Q4.inst.gguf -f ./test_prompt.txt -n 512 --draft 8 -ngl 99 -ngld 0 -c 4096

tokens: 516 tps: 6.09992
736.18s user 11.97s system 800% cpu 1:33.46 total
```

Same setup with the Q3 draft model:

```
time ./_build/duo -m ../llms/gguf/mistral2407.q8.gguf -md ../llms/gguf/mistral7b.Q3.inst.gguf -f ./test_prompt.txt -n 512 --draft 8 -ngl 99 -ngld 0 -c 4096

tokens: 519 tps: 6.23398
650.44s user 9.45s system 717% cpu 1:32.03 total
```
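Comparing the two speculative runs against the 5.00 tps baseline gives a rough sense of the gain from the draft model; a quick back-of-the-envelope computation using the tps figures from the logs above:

```python
# Speedup of the duo runs relative to the plain llama-cli baseline.
baseline_tps = 5.00  # llama-cli eval throughput, main model only
duo_tps = {"Q4 draft": 6.09992, "Q3 draft": 6.23398}

for name, tps in duo_tps.items():
    print(f"{name}: {tps / baseline_tps:.2f}x speedup over baseline")
```

Both draft quantization levels land in the same ballpark, with the smaller Q3 draft slightly ahead here.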