Speculative Decoding in Exllama v2 and llama.cpp comparison : r/LocalLLaMA #391
Labels
llm-benchmarks
testing and benchmarking large language models
llm-experiments
experiments with large language models
llm-inference-engines
Software to run inference on large language models
Speculative Decoding in Exllama v2 and llama.cpp Comparison
Discussion
We discussed speculative decoding (SD) in a previous thread. For those unfamiliar with the feature, it lets an inference engine use a smaller "draft" model to propose candidate tokens, which the larger model then verifies in a single forward pass, so several tokens can be accepted per pass of the big model. In that thread, someone asked for tests of speculative decoding in both Exllama v2 and llama.cpp. Although I generally only run models in GPTQ, AWQ, or exl2 formats, I was interested in doing the exl2 vs. llama.cpp comparison.
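To make the draft-and-verify idea concrete, here is a minimal sketch of greedy speculative decoding. The functions `draft_next`, `target_next`, and `speculative_step` are toy stand-ins of my own, not the ExLlamaV2 or llama.cpp APIs; in a real engine the verification step is one batched forward pass of the large model over all drafted positions, which is where the speedup comes from.

```python
# Minimal sketch of greedy speculative decoding with toy stand-in models.
import random
random.seed(0)

VOCAB_SIZE = 100

def draft_next(tokens):
    # Hypothetical cheap draft model: a deterministic toy rule.
    return (tokens[-1] * 7 + 3) % VOCAB_SIZE

def target_next(tokens):
    # Hypothetical expensive target model: agrees with the draft most of the time.
    return draft_next(tokens) if random.random() < 0.8 else (tokens[-1] + 1) % VOCAB_SIZE

def speculative_step(tokens, k=4):
    """One draft-and-verify step: draft k tokens, keep the accepted prefix."""
    # 1) Draft k tokens autoregressively with the cheap model.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # 2) Verify with the target model. A real engine checks all k positions
    #    in a single batched forward pass; here we loop for clarity.
    accepted, ctx = [], list(tokens)
    for t in draft:
        target_t = target_next(ctx)
        if target_t == t:
            accepted.append(t)          # draft token confirmed by the target
            ctx.append(t)
        else:
            accepted.append(target_t)   # mismatch: take the target's token and stop
            break
    else:
        accepted.append(target_next(ctx))  # bonus token when every draft token passes
    return tokens + accepted

tokens = [1]
for _ in range(8):
    tokens = speculative_step(tokens)
print(tokens)
```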
Test Setup
The tests were run on a 2x RTX 4090, i9-13900K, DDR5 system. Screen captures of the terminal output for both engines are below. If anyone has experience getting llama.cpp's speculative decoding to perform better, please share.
Exllama v2 Results
Model: Xwin-LM-70B-V0.1-4.0bpw-h6-exl2
Draft Model: TinyLlama-1.1B-1T-OpenOrca-GPTQ
Performance is highly variable, but throughput goes from roughly 20 t/s without SD to 40-50 t/s with SD.
No SD (terminal capture)
With SD (terminal capture)
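As a rough sanity check on those numbers (my own back-of-envelope model, not something measured in the post): if the draft model is nearly free and the target accepts each drafted token with probability `alpha`, then with `k` drafted tokens per step one target forward pass yields on average `(1 - alpha**(k+1)) / (1 - alpha)` tokens instead of one.

```python
# Back-of-envelope estimate of speculative decoding speedup, ignoring the
# (small) cost of running the draft model. alpha = per-token acceptance rate,
# k = number of tokens drafted per verification step.
def expected_tokens_per_target_pass(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.7, 0.8):
    print(f"alpha={alpha}: {expected_tokens_per_target_pass(alpha, k=4):.2f} tokens/pass")
```

Acceptance rates in the 0.6-0.8 range put the theoretical ceiling around 2.3x-3.4x, which is consistent with going from ~20 t/s to 40-50 t/s once draft-model overhead is accounted for.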
Suggested labels
{ "key": "speculative-decoding", "value": "Technique for using a smaller 'draft' model to help predict tokens for a larger model" }