Possible Speculative Decoding Bug #3919

Closed
williammm001 opened this issue Nov 2, 2023 · 4 comments

Comments

@williammm001

I updated to the latest code, and when running speculative inference with ./speculative the acceptance rate is always 100%. The same settings on the previous version did not produce this result.

@ggerganov
Owner

ggerganov commented Nov 3, 2023

The recent min-p sampling (#3841) interferes with the default logic of the example. After it was introduced, we were drafting only very confident draft tokens (p_accept = 0.8), so the draft was almost always right but rarely speculated.

I've reduced p_accept to 0.5 by default and added CLI args to control it: -pa 0.5 -ps 0.1
See the commit for more info

Results should be better now - let me know if you observe more issues
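
For illustration only, here is a minimal standalone sketch of how a confidence threshold like p_accept gates drafting. This is not the actual example code; draft_top_prob() is a made-up stand-in for the draft model's top-token probability and the numbers are arbitrary. It just shows why a high threshold (e.g. 0.8) yields very short drafts that are almost always accepted.

```cpp
// Illustrative sketch only -- not the llama.cpp speculative example itself.
#include <cstdio>
#include <vector>

// Hypothetical stand-in: probability the draft model assigns to its top token.
static float draft_top_prob(int step) {
    static const float probs[] = {0.95f, 0.90f, 0.70f, 0.40f, 0.30f};
    return probs[step % 5];
}

int main() {
    const int   n_draft  = 16;   // --draft N: maximum tokens to draft per round
    const float p_accept = 0.5f; // -pa: stop drafting below this confidence

    std::vector<int> draft; // drafted token ids (dummy values here)
    for (int i = 0; i < n_draft; ++i) {
        if (draft_top_prob(i) < p_accept) {
            // With p_accept = 0.8 this fires very early, so only near-certain
            // tokens get drafted and the acceptance rate looks like ~100%.
            break;
        }
        draft.push_back(i); // pretend token id
    }
    std::printf("drafted %zu of %d tokens\n", draft.size(), n_draft);
    return 0;
}
```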

@williammm001
Author

Thank you for your reply. The acceptance rate is no longer 100%, but there are still some issues. First, the draft size "--draft N" does not affect the acceptance rate when using different values of N. Second, I ran ./speculative and ./main with the same prompts on the same target model (I tried different llama-2 7B/13B models), and speculative decoding appears to be even slower than not using it, which should not happen.

@ggerganov
Owner

First, the draft size "--draft N" does not affect the acceptance rate when using different values of N

Likely the draft sampling hits the acceptance threshold and stops drafting, so it won't matter if you increase N. You can set -pa 0.0 to never stop drafting, but likely the result would be worse.
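
As a rough illustration of the verification side (again a standalone sketch, not the repo's code; target_sample() is a hypothetical stand-in for the token the target model samples at each drafted position): drafted tokens are kept only up to the first mismatch with the target, which is why drafting more low-confidence tokens with -pa 0.0 tends to lower the acceptance rate.

```cpp
// Illustrative acceptance check -- not llama.cpp's implementation.
#include <cstdio>
#include <vector>

// Hypothetical stand-in: token the target model samples at position i.
static int target_sample(int i) {
    static const int toks[] = {11, 42, 42, 7, 99, 3, 3, 8};
    return toks[i % 8];
}

int main() {
    // Pretend these were drafted; with -pa 0.0 the draft always reaches N,
    // including low-confidence guesses that are more likely to be rejected.
    const std::vector<int> draft = {11, 42, 42, 5, 99, 3, 3, 8};

    size_t n_accepted = 0;
    for (size_t i = 0; i < draft.size(); ++i) {
        if (draft[i] != target_sample((int) i)) {
            break; // first mismatch: the rest of the draft is discarded
        }
        ++n_accepted;
    }
    std::printf("accepted %zu of %zu drafted tokens\n", n_accepted, draft.size());
    return 0;
}
```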

Second, I ran ./speculative and ./main with the same prompts on the same target model

It depends on what models, parameters, and backend you use. Speculative decoding works best with a large F16 target model and a small quantized draft model on CUDA and Metal. For quantized target models, the Metal backend also works well, though you need some manual adjustments to the constants, while CUDA is currently suboptimal in this case due to an inefficient quantized batched implementation.
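
For a rough sense of when speculative decoding pays off, here is a back-of-the-envelope model (all timings below are made-up assumptions, not measurements): a speculative round costs the draft steps plus one batched target verification and produces roughly accept_rate * n_draft + 1 tokens, so it only beats plain decoding when the draft is fast relative to the target and the acceptance rate is high enough.

```cpp
// Back-of-the-envelope estimate, not a benchmark. All numbers are assumed.
#include <cstdio>

int main() {
    const double t_target = 50.0; // ms per target-model decode step (assumed)
    const double t_draft  =  8.0; // ms per draft-model decode step (assumed)
    const int    n_draft  =  8;   // tokens drafted per round (--draft)
    const double accept   =  0.6; // fraction of drafted tokens accepted (assumed)

    // Plain decoding: one target step per generated token.
    const double plain_ms_per_tok = t_target;

    // Speculative round: n_draft draft steps + one batched target verification,
    // yielding roughly accept*n_draft + 1 tokens (a crude approximation).
    const double round_ms        = n_draft * t_draft + t_target;
    const double round_toks      = accept * n_draft + 1.0;
    const double spec_ms_per_tok = round_ms / round_toks;

    std::printf("plain: %.1f ms/token, speculative: %.1f ms/token\n",
                plain_ms_per_tok, spec_ms_per_tok);
    return 0;
}
```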

olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this issue Nov 23, 2023
@github-actions github-actions bot added the stale label Mar 19, 2024
Contributor

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 2, 2024