Faster AVX2 prompt processing for k-quants and IQ4_XS #394
Conversation
This is a remarkable change @ikawrakow. I'm very happy to see that the best quantized formats will now go the fastest. For prompt processing, I'm consistently seeing speedups between 1.2x and 2.0x on x86-64 machines. You even managed to make token generation go faster (which I've found much more difficult), in some cases by as much as 1.33x! Here are my measurements, on three different computers, for three different models.

[Charts: "Iwan Kawrakow's new GEMM function for K-quants", before: 89c189e. Prompt evaluation speed (a.k.a. prefill) in tokens per second; text generation speed (a.k.a. prediction) in tokens per second.]
Looks good to me. Once I get a release out, how would you like to announce it to the world? I would like to write a blog post. If you write your own, then I'm happy to tweet that.

@ikawrakow thank you for this major contribution to the project!
I'm not much into blogging, so if you like writing about this, please go ahead.
As discussed elsewhere, here is a PR that improves AVX2 prompt processing for k-quants and IQ4_XS by a large margin. I did not manage to get the speed gains via tinyBLAS, so I just added a call in llamafile_sgemm() to a separate function that performs the matrix multiplication.
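For readers trying to follow how the change is wired in: the idea is simply that llamafile_sgemm() checks the quantization type and, when it sees a k-quant or IQ4_XS, hands the whole multiplication to the new function instead of the tinyBLAS templates. Below is a minimal sketch of that dispatch shape, with made-up names and a simplified signature; the actual code in the PR differs.

```cpp
// Minimal sketch, with hypothetical names: route k-quant / IQ4_XS matrix
// multiplications to a dedicated kernel before falling back to the generic
// tinyBLAS-based paths. The real llamafile_sgemm() has a richer signature.
enum class QuantType { F16, Q2_K, Q4_K, Q5_K, Q6_K, IQ4_XS /* ... */ };

// Stand-in for the dedicated kernel added by the PR; stubbed out here so the
// sketch compiles. The real function performs the quantized GEMM.
static bool kquant_gemm_avx2(int m, int n, int k, const void* A, int lda,
                             const void* B, int ldb, float* C, int ldc,
                             QuantType type) {
    (void)m; (void)n; (void)k; (void)A; (void)lda;
    (void)B; (void)ldb; (void)C; (void)ldc; (void)type;
    return false;
}

// Returns true if the multiplication was handled; the caller keeps its
// existing fallback path for anything that returns false.
bool sgemm_dispatch(int m, int n, int k, const void* A, int lda,
                    const void* B, int ldb, float* C, int ldc,
                    QuantType type) {
    switch (type) {
        case QuantType::Q2_K:
        case QuantType::Q4_K:
        case QuantType::Q5_K:
        case QuantType::Q6_K:
        case QuantType::IQ4_XS:
            if (kquant_gemm_avx2(m, n, k, A, lda, B, ldb, C, ldc, type))
                return true;
            break;
        default:
            break;
    }
    return false;
}
```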
The table shows a comparison between prompt processing speed on master and with this PR. Not having the llama-bench tool here and not knowing how to better measure performance, I just used the perplexity tool to measure the time for a batch of 512 tokens to get these values. Tested on a 16-core Ryzen-7950X CPU with a 7B LLaMA model.
For reference, here is what I measure on my system for fp16 and quants not affected by this PR:

I.e., all k-quants and IQ4_XS are now faster than fp16!
The speedup in this PR is in most cases better than what I reported here, due to some additional refinements that I have added since that post, but a few percent slower than what I get in my private llama.cpp fork (with Q2_K_S having the most noticeable difference, as I get 178 t/s there). Being new to llamafile, I'm not sure what is causing such performance differences for the exact same matrix multiplication implementation.
The same approach as here results in huge performance gains for the other i-quants (IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S). But having modified these quants in my repository in ways that make them incompatible with mainline llama.cpp i-quants, I have left this part for a future PR.
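As general background on where gains like these usually come from in an AVX2 GEMM kernel (this is a standard technique, not a description of the PR's exact code): each block of activations loaded into a register is reused across several output columns, with independent accumulators kept in registers. Here is a minimal AVX2/FMA sketch of that 1x4 register-blocking pattern on plain floats, assuming k is a multiple of 8; the actual k-quant and IQ4_XS kernels additionally dequantize the quantized blocks on the fly.

```cpp
#include <immintrin.h>

// Horizontal sum of the 8 lanes of an AVX register.
static float hsum(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}

// One row of `a` times four rows of `b` (each of length k, k % 8 == 0),
// producing a 1x4 tile of the output. Each load of `a` is reused for four
// independent FMA accumulators, which is the register-blocking idea that
// keeps the multiply units busy.
static void tile_1x4(int k, const float* a, const float* b, int ldb, float* c) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();
    for (int i = 0; i < k; i += 8) {
        __m256 av = _mm256_loadu_ps(a + i);
        acc0 = _mm256_fmadd_ps(av, _mm256_loadu_ps(b + 0 * ldb + i), acc0);
        acc1 = _mm256_fmadd_ps(av, _mm256_loadu_ps(b + 1 * ldb + i), acc1);
        acc2 = _mm256_fmadd_ps(av, _mm256_loadu_ps(b + 2 * ldb + i), acc2);
        acc3 = _mm256_fmadd_ps(av, _mm256_loadu_ps(b + 3 * ldb + i), acc3);
    }
    c[0] = hsum(acc0);
    c[1] = hsum(acc1);
    c[2] = hsum(acc2);
    c[3] = hsum(acc3);
}
```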
The Ryzen-7950X implements various parts of the AVX512 specification. To make sure that this PR provides a speedup on non-AVX512 CPUs, I also tested on an older Ryzen-5975WX 32-core CPU. Here I get the following performance for fp16 and unaffected quants:

For k-quants and IQ4_XS we have: