Possible Speculative Decoding Bug #3919

Closed
williammm001 opened this issue Nov 2, 2023 · 4 comments

Comments

@williammm001

I updated to the latest code, and when running speculative inference with ./speculative the acceptance rate is always 100%. The same settings on the previous version did not produce this result.

@ggerganov
Owner

ggerganov commented Nov 3, 2023

The recent min-p sampling (#3841) interferes with the default logic of the example. After it was introduced, we were drafting only very confident draft tokens (p_accept = 0.8), so the draft was almost always right but rarely speculated.

I've reduced p_accept to 0.5 by default and added CLI args to control it: -pa 0.5 -ps 0.1
See the commit for more info

Results should be better now - let me know if you observe more issues
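
For illustration only, here is a minimal standalone sketch of how a confidence threshold like p_accept gates drafting. This is not the actual example code; draft_top_prob() is a made-up stand-in for the draft model's top-token probability and the numbers are arbitrary. It just shows why a high threshold (e.g. 0.8) yields very short drafts that are almost always accepted.

```cpp
// Illustrative sketch only -- not the llama.cpp speculative example itself.
#include <cstdio>
#include <vector>

// Hypothetical stand-in: probability the draft model assigns to its top token.
static float draft_top_prob(int step) {
    static const float probs[] = {0.95f, 0.90f, 0.70f, 0.40f, 0.30f};
    return probs[step % 5];
}

int main() {
    const int   n_draft  = 16;   // --draft N: maximum tokens to draft per round
    const float p_accept = 0.5f; // -pa: stop drafting below this confidence

    std::vector<int> draft; // drafted token ids (dummy values here)
    for (int i = 0; i < n_draft; ++i) {
        if (draft_top_prob(i) < p_accept) {
            // With p_accept = 0.8 this fires very early, so only near-certain
            // tokens get drafted and the acceptance rate looks like ~100%.
            break;
        }
        draft.push_back(i); // pretend token id
    }
    std::printf("drafted %zu of %d tokens\n", draft.size(), n_draft);
    return 0;
}
```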

@williammm001
Author

Thank you for your reply. The acceptance rate is no longer 100%, but there are still some issues. First, the draft size "--draft N" does not affect the acceptance rate when using different values of N. Second, I ran ./speculative and ./main with the same prompts on the same target model (I tried different llama-2 7B/13B models), and speculative decoding appears to be even slower than not using it, which should not happen.

@ggerganov
Owner

First, the draft size "--draft N" does not affect the acceptance rate when using different values of N

Likely the draft sampling hits the acceptance threshold and stops drafting, so it won't matter if you increase N. You can set -pa 0.0 to never stop drafting, but likely the result would be worse.
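
As a rough illustration of the verification side (again a standalone sketch, not the repo's code; target_sample() is a hypothetical stand-in for the token the target model samples at each drafted position): drafted tokens are kept only up to the first mismatch with the target, which is why drafting more low-confidence tokens with -pa 0.0 tends to lower the acceptance rate.

```cpp
// Illustrative acceptance check -- not llama.cpp's implementation.
#include <cstdio>
#include <vector>

// Hypothetical stand-in: token the target model samples at position i.
static int target_sample(int i) {
    static const int toks[] = {11, 42, 42, 7, 99, 3, 3, 8};
    return toks[i % 8];
}

int main() {
    // Pretend these were drafted; with -pa 0.0 the draft always reaches N,
    // including low-confidence guesses that are more likely to be rejected.
    const std::vector<int> draft = {11, 42, 42, 5, 99, 3, 3, 8};

    size_t n_accepted = 0;
    for (size_t i = 0; i < draft.size(); ++i) {
        if (draft[i] != target_sample((int) i)) {
            break; // first mismatch: the rest of the draft is discarded
        }
        ++n_accepted;
    }
    std::printf("accepted %zu of %zu drafted tokens\n", n_accepted, draft.size());
    return 0;
}
```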

Second, I ran ./speculative and ./main with the same prompts on the same target model

It depends on what models, parameters, and backend you use. Speculative decoding works best with a large F16 target model and a small quantized draft model on CUDA and Metal. For quantized target models, the Metal backend also works well, though you need some manual adjustments to the constants, while CUDA is currently suboptimal in this case due to an inefficient quantized batched implementation.
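
For a rough sense of when speculative decoding pays off, here is a back-of-the-envelope model (all timings below are made-up assumptions, not measurements): a speculative round costs the draft steps plus one batched target verification and produces roughly accept_rate * n_draft + 1 tokens, so it only beats plain decoding when the draft is fast relative to the target and the acceptance rate is high enough.

```cpp
// Back-of-the-envelope estimate, not a benchmark. All numbers are assumed.
#include <cstdio>

int main() {
    const double t_target = 50.0; // ms per target-model decode step (assumed)
    const double t_draft  =  8.0; // ms per draft-model decode step (assumed)
    const int    n_draft  =  8;   // tokens drafted per round (--draft)
    const double accept   =  0.6; // fraction of drafted tokens accepted (assumed)

    // Plain decoding: one target step per generated token.
    const double plain_ms_per_tok = t_target;

    // Speculative round: n_draft draft steps + one batched target verification,
    // yielding roughly accept*n_draft + 1 tokens (a crude approximation).
    const double round_ms        = n_draft * t_draft + t_target;
    const double round_toks      = accept * n_draft + 1.0;
    const double spec_ms_per_tok = round_ms / round_toks;

    std::printf("plain: %.1f ms/token, speculative: %.1f ms/token\n",
                plain_ms_per_tok, spec_ms_per_tok);
    return 0;
}
```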

olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this issue Nov 23, 2023
@github-actions github-actions bot added the stale label Mar 19, 2024
Contributor

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 2, 2024