
Mistral Large benchmarking


Configuration

Hardware:

  • Apple M2 Ultra
  • 192 GB unified memory
  • 76-core GPU
  • 24-core CPU

Models:

  • Main model: Mistral Large 123B Instruct, Q8
  • Draft model: Mistral-7B-v0.3 Instruct, various quantization levels

Context: 4096 tokens

No speculation (baseline)

time ./llama-cli -m ../llms/gguf/mistral2407.q8.gguf -n 512 -c 4096 -f ../llama_duo/test_prompt.txt -ngl 99

llama_print_timings:        load time =    6360.38 ms
llama_print_timings:      sample time =      10.40 ms /   512 runs   (    0.02 ms per token, 49211.84 tokens per second)
llama_print_timings: prompt eval time =    1845.44 ms /   108 tokens (   17.09 ms per token,    58.52 tokens per second)
llama_print_timings:        eval time =  102148.51 ms /   511 runs   (  199.90 ms per token,     5.00 tokens per second)
llama_print_timings:       total time =  104026.03 ms /   619 tokens

...
1.25s user 5.25s system 5% cpu 1:50.48 total
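
The reported 5.00 tokens per second is just the eval count divided by the eval time; a quick sanity check with plain bc, using the numbers from the run above:

echo "scale=2; 511 * 1000 / 102148.51" | bc   # -> 5.00 tokens per second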

Q4 draft, 8 tokens

time ./_build/duo -m ../llms/gguf/mistral2407.q8.gguf -md ../llms/gguf/mistral7b.Q4.inst.gguf -f ./test_prompt.txt -n 512 --draft 8 -ngl 99 -ngld 0 -c 4096

tokens: 516 tps: 6.09992
736.18s user 11.97s system 800% cpu 1:33.46 total
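
Relative to the 5.00 t/s no-speculation baseline this is roughly a 1.22x speedup. The ~800% CPU is expected: with -ngld 0 the draft model runs entirely on the CPU.

echo "scale=3; 6.09992 / 5.00" | bc   # -> 1.219, i.e. ~1.22x faster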

Q3 draft, 8 tokens

time ./_build/duo -m ../llms/gguf/mistral2407.q8.gguf -md ../llms/gguf/mistral7b.Q3.inst.gguf -f ./test_prompt.txt -n 512 --draft 8 -ngl 99 -ngld 0 -c 4096

tokens: 519 tps: 6.23398
650.44s user 9.45s system 717% cpu 1:32.03 total
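
The Q3 draft is slightly faster end to end, about a 1.25x speedup over baseline; a plausible reading is that the smaller draft is cheaper to evaluate on the CPU while still getting enough draft tokens accepted.

echo "scale=3; 6.23398 / 5.00" | bc   # -> 1.246, i.e. ~1.25x faster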
