
Mistral Large benchmarking


Configuration

Hardware:

  • Apple M2 Ultra
  • 192 GB unified memory
  • 76-core GPU
  • 24-core CPU

Models:

  • Main model: Mistral Large 123B Instruct, Q8
  • Draft model: Mistral-7B-v0.3 Instruct, various quantization levels

Context: 4096 tokens

No speculation (baseline)

time ./llama-cli -m ../llms/gguf/mistral2407.q8.gguf -n 512 -c 4096 -f ../llama_duo/test_prompt.txt -ngl 99

llama_print_timings:        load time =    6360.38 ms
llama_print_timings:      sample time =      10.40 ms /   512 runs   (    0.02 ms per token, 49211.84 tokens per second)
llama_print_timings: prompt eval time =    1845.44 ms /   108 tokens (   17.09 ms per token,    58.52 tokens per second)
llama_print_timings:        eval time =  102148.51 ms /   511 runs   (  199.90 ms per token,     5.00 tokens per second)
llama_print_timings:       total time =  104026.03 ms /   619 tokens

...
1.25s user 5.25s system 5% cpu 1:50.48 total
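
The reported 5.00 tokens per second is just the eval count divided by the eval time; a quick sanity check with plain bc, using the numbers from the run above:

echo "scale=2; 511 * 1000 / 102148.51" | bc   # -> 5.00 tokens per second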

Q4 draft, 8 tokens

time ./_build/duo -m ../llms/gguf/mistral2407.q8.gguf -md ../llms/gguf/mistral7b.Q4.inst.gguf -f ./test_prompt.txt -n 512 --draft 8 -ngl 99 -ngld 0 -c 4096

tokens: 516 tps: 6.09992
736.18s user 11.97s system 800% cpu 1:33.46 total
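
Relative to the 5.00 t/s no-speculation baseline this is roughly a 1.22x speedup. The ~800% CPU is expected: with -ngld 0 the draft model runs entirely on the CPU.

echo "scale=3; 6.09992 / 5.00" | bc   # -> 1.219, i.e. ~1.22x faster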

Q3 draft, 8 tokens

time ./_build/duo -m ../llms/gguf/mistral2407.q8.gguf -md ../llms/gguf/mistral7b.Q3.inst.gguf -f ./test_prompt.txt -n 512 --draft 8 -ngl 99 -ngld 0 -c 4096

tokens: 519 tps: 6.23398
650.44s user 9.45s system 717% cpu 1:32.03 total
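
The Q3 draft is slightly faster end to end, about a 1.25x speedup over baseline; a plausible reading is that the smaller draft is cheaper to evaluate on the CPU while still getting enough draft tokens accepted.

echo "scale=3; 6.23398 / 5.00" | bc   # -> 1.246, i.e. ~1.25x faster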
