Replies: 4 comments 5 replies
-
Very interesting concept. Subscribing to this discussion so I can check it out later. 🙂
-
This looks like a poor man's beam search that simply continues with one (arbitrary) beam after some prescribed number of tokens >= 1 have been decoded. The paper seems over-complicated for such a simple concept, which can be described in one sentence. Not that I dislike the idea; adding a parameter to beam search that collapses the full beam after N decoded tokens is a very good one. Alternatives such as dropping to K beams (K < N), or even staging an arbitrary programmable schedule (N0 beams for the first token, N1 <= N0 for the second, N2 <= N1 for the third, etc.), could easily be explored and might give some decoding benefit while reducing the cost of a full beam search; see the sketch below.

Based on my experience, I agree with the comments in the paper that the early tokens can make a disproportionately large difference to the quality of the continuation. I haven't done extensive testing with beam search, but the limited tests I have run show that with only 2 to 3 beams it can sometimes produce correct answers to problems that single-beam greedy decoding gets wrong. Simply healing the last prompt token before generation also guarantees a completion probability equal to or greater than decoding from the un-healed last prompt token. This too can make the difference between a right and a wrong answer, which shows how important the early decoded tokens are given the autoregressive feedback in the decoder.
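As a rough illustration of the two-phase idea (not the paper's algorithm), the "full beam for the first N tokens, then a single greedy continuation from the best beam" behaviour can be sketched with Hugging Face transformers. The model name, prompt, and the N/K values are placeholders chosen for the example:

```python
# Sketch: beam search for the first N decoded tokens, then greedy decoding
# (a single beam) from the best beam's prefix.
# Assumes the Hugging Face `transformers` library; model and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: A farmer has 17 sheep and buys 5 more. How many sheep does he have?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")

# Phase 1: full beam search, but only for the first N decoded tokens.
N, K = 4, 3  # N early tokens under full beam, K beams (placeholders)
beam_prefix = model.generate(
    **inputs,
    num_beams=K,
    do_sample=False,
    max_new_tokens=N,
)

# Phase 2: continue greedily from the best beam's prefix.
full_output = model.generate(
    beam_prefix,
    attention_mask=torch.ones_like(beam_prefix),
    do_sample=False,
    max_new_tokens=128,
)

print(tokenizer.decode(full_output[0], skip_special_tokens=True))
```

A staged schedule (N0, N1, N2, ... beams) would replace phase 1 with a loop that re-runs generate for one token at a time while shrinking num_beams, at correspondingly higher cost.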
-
I am also interested in how CoT-decoding might influence creative writing, in addition to wanting it implemented for reasoning tasks. Upvoted.
-
It would be good to have this in llama.cpp, as the approach shows improvements over baseline on tasks like GSM8K. I have already implemented it in optillm, but users have asked for it to be available via llama.cpp as well - codelion/optillm#65
-
As described in the paper, CoT-decoding explores the top-k alternative tokens at the first decoding step and selects the continuation whose answer has the highest confidence, yielding measurable performance gains and longer chain-of-thought reasoning even from non-instruct LLMs.
A reddit post detailed an independent Python implementation of CoT-decoding (source code). Using Qwen 2.5 0.5B, it achieved roughly a 41% relative improvement on GSM8K (22.82% before, 32.37% after).
All models tested in the paper, both base and instruct-tuned, showed significant accuracy and reasoning improvements on zero-shot problems, both unprompted and prompted.
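For reference, a minimal sketch of the core CoT-decoding loop might look like the following: branch on the top-k first tokens, decode each branch greedily, and score each branch by the average top-1 vs top-2 probability margin. It uses Hugging Face transformers; the model name, k, and scoring over all generated tokens (rather than only the answer span, as the paper does) are simplifying assumptions:

```python
# Sketch of CoT-decoding: branch on the top-k candidates for the first decoded
# token, decode each branch greedily, and keep the branch with the highest
# average top-1 vs top-2 probability gap (a simplified stand-in for the
# paper's answer-span confidence). Assumes Hugging Face `transformers`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def cot_decode(prompt: str, k: int = 5, max_new_tokens: int = 128) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        first_logits = model(**inputs).logits[0, -1]     # logits for the first decoded token
    topk_first = torch.topk(first_logits, k).indices     # k candidate first tokens

    best_text, best_score = "", float("-inf")
    for first_token in topk_first:
        # Append one candidate first token, then continue greedily.
        branch = torch.cat([inputs.input_ids, first_token.view(1, 1)], dim=-1)
        out = model.generate(
            branch,
            attention_mask=torch.ones_like(branch),
            do_sample=False,
            max_new_tokens=max_new_tokens,
            output_scores=True,
            return_dict_in_generate=True,
        )
        # Confidence: mean gap between top-1 and top-2 probabilities per step.
        gaps = []
        for step_scores in out.scores:
            probs = torch.softmax(step_scores[0], dim=-1)
            top2 = torch.topk(probs, 2).values
            gaps.append((top2[0] - top2[1]).item())
        score = sum(gaps) / len(gaps)
        if score > best_score:
            best_score = score
            best_text = tokenizer.decode(out.sequences[0], skip_special_tokens=True)
    return best_text

print(cot_decode("Q: I have 3 apples and buy 2 more. How many apples do I have?\nA:"))
```

A llama.cpp version would follow the same shape: fork the sequence on the top-k first tokens, decode each fork, and rank the forks by the per-token probability margin over the answer tokens.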