Speculative Decoding in Exllama v2 and llama.cpp comparison : r/LocalLLaMA #391
Labels
llm-benchmarks
testing and benchmarking large language models
llm-experiments
experiments with large language models
llm-inference-engines
Software to run inference on large language models
Speculative Decoding in Exllama v2 and llama.cpp Comparison
Discussion
We discussed speculative decoding (SD) in a previous thread. For those unfamiliar with the feature, it lets an inference engine use a smaller "draft" model to propose candidate tokens, which the larger model then verifies in a single forward pass, so several tokens can be accepted per pass of the big model. In that thread, someone asked for tests of speculative decoding in both Exllama v2 and llama.cpp. Although I generally only run models in GPTQ, AWQ, or exl2 formats, I was interested in doing the exl2 vs. llama.cpp comparison.
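To make the draft-and-verify idea concrete, here is a minimal sketch of greedy speculative decoding. The functions `draft_next`, `target_next`, and `speculative_step` are toy stand-ins of my own, not the ExLlamaV2 or llama.cpp APIs; in a real engine the verification step is one batched forward pass of the large model over all drafted positions, which is where the speedup comes from.

```python
# Minimal sketch of greedy speculative decoding with toy stand-in models.
import random
random.seed(0)

VOCAB_SIZE = 100

def draft_next(tokens):
    # Hypothetical cheap draft model: a deterministic toy rule.
    return (tokens[-1] * 7 + 3) % VOCAB_SIZE

def target_next(tokens):
    # Hypothetical expensive target model: agrees with the draft most of the time.
    return draft_next(tokens) if random.random() < 0.8 else (tokens[-1] + 1) % VOCAB_SIZE

def speculative_step(tokens, k=4):
    """One draft-and-verify step: draft k tokens, keep the accepted prefix."""
    # 1) Draft k tokens autoregressively with the cheap model.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # 2) Verify with the target model. A real engine checks all k positions
    #    in a single batched forward pass; here we loop for clarity.
    accepted, ctx = [], list(tokens)
    for t in draft:
        target_t = target_next(ctx)
        if target_t == t:
            accepted.append(t)          # draft token confirmed by the target
            ctx.append(t)
        else:
            accepted.append(target_t)   # mismatch: take the target's token and stop
            break
    else:
        accepted.append(target_next(ctx))  # bonus token when every draft token passes
    return tokens + accepted

tokens = [1]
for _ in range(8):
    tokens = speculative_step(tokens)
print(tokens)
```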
Test Setup
The tests were run on a 2x RTX 4090, i9-13900K, DDR5 system. Screen captures of the terminal output for both engines are below. If anyone has experience getting llama.cpp's speculative decoding to perform better, please share.
Exllama v2 Results
Model: Xwin-LM-70B-V0.1-4.0bpw-h6-exl2
Draft Model: TinyLlama-1.1B-1T-OpenOrca-GPTQ
Performance is highly variable, but throughput goes from roughly 20 t/s without SD to 40-50 t/s with SD.
No SD (terminal capture)
With SD (terminal capture)
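As a rough sanity check on those numbers (my own back-of-envelope model, not something measured in the post): if the draft model is nearly free and the target accepts each drafted token with probability `alpha`, then with `k` drafted tokens per step one target forward pass yields on average `(1 - alpha**(k+1)) / (1 - alpha)` tokens instead of one.

```python
# Back-of-envelope estimate of speculative decoding speedup, ignoring the
# (small) cost of running the draft model. alpha = per-token acceptance rate,
# k = number of tokens drafted per verification step.
def expected_tokens_per_target_pass(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.7, 0.8):
    print(f"alpha={alpha}: {expected_tokens_per_target_pass(alpha, k=4):.2f} tokens/pass")
```

Acceptance rates in the 0.6-0.8 range put the theoretical ceiling around 2.3x-3.4x, which is consistent with going from ~20 t/s to 40-50 t/s once draft-model overhead is accounted for.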
Suggested labels
{ "key": "speculative-decoding", "value": "Technique for using a smaller 'draft' model to help predict tokens for a larger model" }