Optimized matrix multiplications for i-quants on __aarch64__ #464
Conversation
This carries over what I had done within llama.cpp. In llamafile we have nice performance gains for PP, but we get a performance regression for TG. For now, just adjusted iq2_xxs to also outperform in TG (~10% better @ 4 and 8 threads). Will tackle the other quants next.
So, improving TG speed results in a drop in PP performance. Before I had PP-512 = 56.78 t/s, TG-128 = 12.42 t/s @ 8 threads. Now we have PP-512 = 52.77 t/s, TG-128 = 15.97 t/s @ 8 threads.
Improved TG from 4.96 t/s to 5.43 t/s. Still ~3.5% slower than mainline. PP-512 became slightly better (47.9 vs 46.8 t/s). This is 3.9X mainline (!)
PP stays the same - 3.67X mainline. TG improves slightly to 5.05 t/s from 4.74 t/s @ 4 threads. This is still 15% slower than mainline.
We get 3.32X mainline for PP. TG is, sadly, 0.92X @ 4 threads
We get 2.87X mainline for PP. TG is, sadly, 0.95X @ 4 threads
Turns out we can improve quite a bit by explicitly asking the compiler to never inline some functions, and to always inline some others. With that, PP performance gains are > 3X for all i-quants, reaching 4.3X for iq3_s. TG is also always better, except for iq3_xxs, where it is 0.99X, so re-enabled iqk_mul_mat for Ny = 1.
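For reference, the attributes in question look roughly like this on GCC/Clang. This is a minimal sketch with made-up function names, not the actual kernels in this PR:

```cpp
#include <cstdint>

// Keep a big, cold unpacking helper out of the caller so the hot loop
// stays small and the compiler's inlining budget is spent where it matters.
// (Hypothetical helper, for illustration only.)
__attribute__((noinline))
static void unpack_block(const uint8_t * src, int8_t * dst, int n) {
    for (int i = 0; i < n; ++i) dst[i] = (int8_t)(src[i] & 0x7f);
}

// Force a small, hot helper into the matmul inner loop.
static inline __attribute__((always_inline))
int32_t dot_block(const int8_t * x, const int8_t * y, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}
```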
Turns out changing one method of a quant affects the performance of other quant(s). Is the compiler somehow trying to optimize all template instantiations together? Anyway, with this version I have this:

| cpu_info | model_filename | size | test | t/s |
| ---------------------------: | -------------: | ---------: | ------: | ------: |
| Apple M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | tg128 | 9.02 |
| Apple M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | pp512 | 61.31 |
| Apple M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | tg128 | 10.58 |
| Apple M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | pp512 | 56.11 |
| Apple M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | tg128 | 7.07 |
| Apple M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | pp512 | 45.78 |
| Apple M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | tg128 | 6.40 |
| Apple M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | pp512 | 47.51 |
| Apple M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | tg128 | 5.97 |
| Apple M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | pp512 | 47.98 |

TG is with 4 threads, PP with 8.
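If the instantiations really do interact, one way to rule that out would be to force each quant's kernel into its own translation unit with `extern template`. A purely hypothetical sketch with invented names, not what this PR does:

```cpp
// kernels.h -- names invented for illustration
struct DequantIQ2XXS { /* per-quant unpacking state would live here */ };
struct DequantIQ3S   { /* ... */ };

template <typename Dequantizer>
void mul_mat(int n, const void * vx, const void * vy, float * s);

// Suppress implicit instantiation in every file that includes this header:
extern template void mul_mat<DequantIQ2XXS>(int, const void *, const void *, float *);
extern template void mul_mat<DequantIQ3S>(int, const void *, const void *, float *);

// Each quant then gets exactly one explicit instantiation in its own .cpp,
// so the compiler optimizes each kernel in isolation:
//   iq2xxs.cpp:  template void mul_mat<DequantIQ2XXS>(int, const void *, const void *, float *);
//   iq3s.cpp:    template void mul_mat<DequantIQ3S>(int, const void *, const void *, float *);
```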
With this version we get:

| cpu_info | model_filename | size | test | t/s |
| ---------------------------: | -------------: | ---------: | -----: | ------: |
| Apple M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | tg128 | 10.83 |
| Apple M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | pp512 | 60.82 |
| Apple M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | tg128 | 10.79 |
| Apple M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | pp512 | 57.10 |
| Apple M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | tg128 | 7.45 |
| Apple M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | pp512 | 46.39 |
| Apple M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | tg128 | 6.77 |
| Apple M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | pp512 | 48.74 |
| Apple M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | tg128 | 5.97 |
| Apple M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | pp512 | 48.59 |
Nice! I'm happy to see more ARM improvements. To support your work, I've been focusing on getting llamafile to run on Android these past few days. ARM just said 70% of inference on Android happens on the CPU, so it's potentially the most impactful audience for your work: https://www.theregister.com/2024/05/30/arm_cortex_x925_ai_cores/?td=rt-3a
i-quants offer better quantization quality than k-quants in the 2- and 3-bpw range, but are notoriously slow on the CPU. This PR brings a significant speedup on Arm CPUs, particularly for prompt processing. Performance is still lower than k-quants, but the performance gap is now substantially smaller.
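For readers unfamiliar with the `+dotprod` feature shown in the tables above: Armv8.2's dot-product instructions accumulate 16 int8×int8 products per vector instruction, which is what makes fast CPU kernels for these quants feasible. A minimal sketch of that building block (not the PR's actual kernel; compile with something like `-march=armv8.2-a+dotprod`):

```cpp
#include <arm_neon.h>
#include <cstdint>

// Dot product of two int8 vectors of length n (n a multiple of 16).
static int32_t dot_i8(const int8_t * x, const int8_t * y, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        int8x16_t vx = vld1q_s8(x + i);
        int8x16_t vy = vld1q_s8(y + i);
        acc = vdotq_s32(acc, vx, vy);   // each lane accumulates 4 int8 products
    }
    return vaddvq_s32(acc);             // horizontal add of the 4 lanes
}
```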
The following table compares performance between the main branch and this PR for a 7B LLaMA model on an M2 Max CPU.