Big speed regression with top-p sampling #246
Comments
Sorting all the probabilities seems like overkill. Picking the K largest values can be done in O(n) on average with a selection algorithm such as std::nth_element from the C++ standard library. Alternatively, we can keep the full sort but run it after a top-k step. It should be possible to chain top-k and top-p as described here:
For instance, a top-k step with k=60 before top-p reduces the final sort from the 30K elements of the vocab to just 60, with a very high probability of getting the same result as a full sort.
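To make the suggestion concrete, here is a minimal sketch of chaining top-k and top-p with std::nth_element. It is not the patch from this thread: the names `ProbIndex` and `sample_topk_topp` are made up for illustration, and it assumes `probs` is already a softmax output and `coin` is a uniform random number in [0, 1).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: a probability paired with its token index.
struct ProbIndex {
    float prob;
    int index;
};

// Sketch: top-k truncation via selection, then top-p (nucleus) sampling
// over the k survivors. Returns the sampled token index.
int sample_topk_topp(const std::vector<float>& probs, std::size_t k,
                     float top_p, float coin) {
    assert(!probs.empty() && k > 0);

    std::vector<ProbIndex> cand(probs.size());
    for (std::size_t i = 0; i < probs.size(); ++i) {
        cand[i] = {probs[i], static_cast<int>(i)};
    }

    auto by_prob_desc = [](const ProbIndex& a, const ProbIndex& b) {
        return a.prob > b.prob;
    };

    // Selection step, O(n) on average: afterwards the k largest
    // probabilities sit in [0, k), in unspecified order.
    k = std::min(k, cand.size());
    if (k < cand.size()) {
        std::nth_element(cand.begin(), cand.begin() + k, cand.end(),
                         by_prob_desc);
    }
    cand.resize(k);

    // Full sort, but only over k elements instead of the whole vocab.
    std::sort(cand.begin(), cand.end(), by_prob_desc);

    // Find the shortest prefix whose cumulative probability exceeds top_p.
    // If it never does, keep all k candidates.
    float cumulative = 0.0f;
    std::size_t last = k - 1;
    for (std::size_t i = 0; i < k; ++i) {
        cumulative += cand[i].prob;
        if (cumulative > top_p) {
            last = i;
            break;
        }
    }

    // Sample within the kept prefix, rescaling coin to its total mass.
    const float r = coin * cumulative;
    float cdf = 0.0f;
    for (std::size_t i = 0; i <= last; ++i) {
        cdf += cand[i].prob;
        if (r < cdf) {
            return cand[i].index;
        }
    }
    return cand[last].index;  // guard against rounding error
}
```

The result can differ from a full-sort top-p only when the nucleus would need more than k tokens, which is the "very high probability" caveat above.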
./run
qlora\llama2.c> 1..3 | ForEach-Object { 'runmingw.exe', 'rungcc.exe', 'run.exe' | ForEach-Object { $env:OMP_NUM_THREADS=4; & "./$_" '../out/model110M.bin' -t 0 -n 0 -p "Once upon a time" -s 42 -o 1.0 -k 0 -f 0 -l 0 -b 1 -d inbox } }
achieved tok/s: 89.689637 for MINGW
achieved tok/s: 90.555015 for MINGW
achieved tok/s: 90.187781 for MINGW

qlora\llama2.c> 1..3 | ForEach-Object { 'runmingw.exe', 'rungcc.exe', 'run.exe' | ForEach-Object { $env:OMP_NUM_THREADS=4; & "./$_" '../out/model110M.bin' -t 0 -n 0 -p "Once upon a time" -s 42 -o .5 -k 0 -f 0 -l 0 -b 1 -d inbox } }
achieved tok/s: 88.000000 for MINGW using top_p
achieved tok/s: 88.594440 for MINGW using top_p
achieved tok/s: 87.879048 for MINGW using top_p

It's hard to say, since this is a CPU-bound system with a lot of other overhead, but after running it with a variety of builds over a period of time I would probably concur with these findings, fwiw. (Only a single set of 3 runs is shown for each test, but after a number of such runs these are as representative as any other.)
Well, I submitted a patch. Not sure if it's the way to go, but it seems to help.
I think this is fixed now, ty.
On my system, throughput dropped from 40 tok/s to 33 tok/s, almost 20% slower...
The slowdown is even more dramatic on smaller models.