Possible Speculative Decoding Bug #3919
I updated to the latest code, and when using ./speculative to run speculative inference the acceptance rate is always 100%. The same settings on the previous version did not lead to this result.

Comments
The recent min-p sampling (#3841) interferes with the default logic of the example. After it was introduced, we were drafting only very-very confident draft tokens, hence the 100% acceptance rate you observed. I've reduced the threshold, so results should be better now - let me know if you observe more issues.
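For context, min-p sampling (#3841) keeps only candidates whose probability is at least min_p times the top candidate's probability. Below is a minimal, self-contained sketch of that filter (illustrative only - names such as TokenProb and min_p_filter are made up for this example and are not llama.cpp identifiers):

```cpp
// Minimal illustration (not llama.cpp's implementation) of min-p filtering:
// keep only candidates whose probability is at least min_p * p_max.
#include <algorithm>
#include <cstdio>
#include <vector>

struct TokenProb { int id; float p; };

void min_p_filter(std::vector<TokenProb> & cands, float min_p) {
    if (cands.empty()) return;
    float p_max = 0.0f;
    for (const auto & c : cands) p_max = std::max(p_max, c.p);
    const float cutoff = min_p * p_max;
    cands.erase(std::remove_if(cands.begin(), cands.end(),
                               [cutoff](const TokenProb & c) { return c.p < cutoff; }),
                cands.end());
}

int main() {
    std::vector<TokenProb> cands = {{1, 0.70f}, {2, 0.20f}, {3, 0.06f}, {4, 0.04f}};
    min_p_filter(cands, 0.10f);                          // cutoff = 0.10 * 0.70 = 0.07
    printf("%zu candidates survive\n", cands.size());    // 2: the 0.06 and 0.04 entries are dropped
    return 0;
}
```

Once such a filter runs ahead of the example's drafting heuristic, low-probability candidates are removed before the drafting decision is made, which is consistent with only near-certain tokens being drafted.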
Thank you for your reply. The acceptance rate is not 100% now, but there are still some issues. First, the draft size "--draft N" does not affect the acceptance rate when using different values of N. Second, I tried to run ./speculative and ./main with the same prompts on the same target model (I tried different llama-2 7B/13B models), and it seems that speculative decoding is even slower than not using it, which should not happen.
Likely the draft sampling hits the acceptance threshold and stops drafting, so it won't matter if you increase N. You can set
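To illustrate why increasing --draft N stops mattering once that happens, here is a minimal sketch of a confidence-gated drafting loop that breaks out early. This is an assumption-laden simplification, not the actual speculative.cpp code; DraftStep, p_threshold, and draft_tokens are made-up names for illustration:

```cpp
// Sketch: draft up to n_draft tokens, but stop as soon as the draft model's
// top-token probability falls below a threshold. n_draft is only an upper bound.
#include <cstdio>
#include <vector>

struct DraftStep { int token; float p_top; };  // p_top: draft model's top-token probability

std::vector<int> draft_tokens(const std::vector<DraftStep> & steps, int n_draft, float p_threshold) {
    std::vector<int> drafted;
    for (int i = 0; i < n_draft && i < (int) steps.size(); ++i) {
        if (steps[i].p_top < p_threshold) {
            break;  // not confident enough: stop drafting, no matter how large n_draft is
        }
        drafted.push_back(steps[i].token);
    }
    return drafted;
}

int main() {
    // Hypothetical per-step draft confidences for one speculation round.
    std::vector<DraftStep> steps = {{11, 0.98f}, {42, 0.91f}, {7, 0.55f}, {3, 0.40f}};

    // With a strict threshold only 2 tokens get drafted, whether n_draft is 8 or 64.
    printf("n_draft=8  -> drafted %zu tokens\n", draft_tokens(steps, 8,  0.80f).size());
    printf("n_draft=64 -> drafted %zu tokens\n", draft_tokens(steps, 64, 0.80f).size());
    return 0;
}
```

With a strict threshold the loop drafts the same two tokens whether n_draft is 8 or 64, and since only near-certain tokens survive, the target model accepts virtually all of them: a high acceptance rate, but little to no speedup.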
It depends on what models, parameters and backend you use. Speculative decoding works best with a large F16 target model and a small quantum draft model for CUDA and Metal. For quantum target models, the Metal backend also works well, though you need some manual adjustments to the constants, while CUDA is currently suboptimal in this case due to the inefficient quantum batched implementation.
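For reference, an invocation matching that advice would pair a large F16 target with a small quantized draft, e.g. something like `./speculative -m models/llama-2-13b/ggml-model-f16.gguf -md models/llama-2-7b/ggml-model-q4_0.gguf --draft 16 -ngl 99 -p "..."`. The model paths here are placeholders and the flag spellings (`-md` for the draft model, `-ngl` for GPU offload) are assumptions, so check the example's --help output on your build.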
This issue was closed because it has been inactive for 14 days since being marked as stale.