Optimized matrix multiplications for i-quants on __aarch64__ #464
Conversation
This carries over what I had done within llama.cpp. In llamafile we have nice performance gains for PP, but we get a performance regression for TG. For now, just adjusted iq2_xxs to also outperform in TG (~10% better @ 4 and 8 threads). Will tackle the other quants next.
So, improving TG speed results in a drop in PP performance. Before I had PP-512 = 56.78 t/s, TG-128 = 12.42 t/s @ 8 threads. Now we have PP-512 = 52.77 t/s, TG-128 = 15.97 t/s @ 8 threads.
Improved TG from 4.96 t/s to 5.43 t/s. Still ~3.5% slower than mainline. PP-512 became slightly better (47.9 vs 46.8 t/s). This is 3.9X mainline (!)
PP stays the same - 3.67X mainline. TG improves slightly to 5.05 t/s from 4.74 t/s @ 4 threads. This is still 15% slower than mainline.
We get 3.32X mainline for PP. TG is, sadly, 0.92X @ 4 threads
We get 2.87X mainline for PP. TG is, sadly, 0.95X @ 4 threads
Turns out we can improve quite a bit by explicitly asking the compiler to never inline some functions, and to always inline some others. With that, PP performance gains are > 3X for all i-quants, reaching 4.3X for iq3_s. TG is also always better, except for iq3_xxs, where it is 0.99X, so re-enabled iqk_mul_mat for Ny = 1.
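For reference, the attributes in question look roughly like this on GCC/Clang. This is a minimal sketch with made-up function names, not the actual kernels in this PR:

```cpp
#include <cstdint>

// Keep a big, cold unpacking helper out of the caller so the hot loop
// stays small and the compiler's inlining budget is spent where it matters.
// (Hypothetical helper, for illustration only.)
__attribute__((noinline))
static void unpack_block(const uint8_t * src, int8_t * dst, int n) {
    for (int i = 0; i < n; ++i) dst[i] = (int8_t)(src[i] & 0x7f);
}

// Force a small, hot helper into the matmul inner loop.
static inline __attribute__((always_inline))
int32_t dot_block(const int8_t * x, const int8_t * y, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}
```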
Turns out changing one method of a quant affects the performance of other quant(s). Is the compiler somehow trying to optimize all template instantiations together? Anyway, with this version I have this:

| cpu_info | model_filename | size | test | t/s |
| ---------------------------: | -------------: | ---------: | ------: | ------: |
| Apple M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | tg128 | 9.02 |
| Apple M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | pp512 | 61.31 |
| Apple M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | tg128 | 10.58 |
| Apple M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | pp512 | 56.11 |
| Apple M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | tg128 | 7.07 |
| Apple M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | pp512 | 45.78 |
| Apple M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | tg128 | 6.40 |
| Apple M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | pp512 | 47.51 |
| Apple M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | tg128 | 5.97 |
| Apple M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | pp512 | 47.98 |

TG is with 4 threads, PP with 8.
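If the instantiations really do interact, one way to rule that out would be to force each quant's kernel into its own translation unit with `extern template`. A purely hypothetical sketch with invented names, not what this PR does:

```cpp
// kernels.h -- names invented for illustration
struct DequantIQ2XXS { /* per-quant unpacking state would live here */ };
struct DequantIQ3S   { /* ... */ };

template <typename Dequantizer>
void mul_mat(int n, const void * vx, const void * vy, float * s);

// Suppress implicit instantiation in every file that includes this header:
extern template void mul_mat<DequantIQ2XXS>(int, const void *, const void *, float *);
extern template void mul_mat<DequantIQ3S>(int, const void *, const void *, float *);

// Each quant then gets exactly one explicit instantiation in its own .cpp,
// so the compiler optimizes each kernel in isolation:
//   iq2xxs.cpp:  template void mul_mat<DequantIQ2XXS>(int, const void *, const void *, float *);
//   iq3s.cpp:    template void mul_mat<DequantIQ3S>(int, const void *, const void *, float *);
```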
With this version we get:

| cpu_info | model_filename | size | test | t/s |
| ---------------------------: | -------------: | ---------: | -----: | ------: |
| Apple M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | tg128 | 10.83 |
| Apple M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | pp512 | 60.82 |
| Apple M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | tg128 | 10.79 |
| Apple M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | pp512 | 57.10 |
| Apple M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | tg128 | 7.45 |
| Apple M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | pp512 | 46.39 |
| Apple M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | tg128 | 6.77 |
| Apple M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | pp512 | 48.74 |
| Apple M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | tg128 | 5.97 |
| Apple M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | pp512 | 48.59 |
Nice! I'm happy to see more ARM improvements. To support your work, I've been focusing on getting llamafile to run on Android these past few days. ARM just said 70% of inference on Android happens on the CPU, so it's potentially the most impactful audience for your work: https://www.theregister.com/2024/05/30/arm_cortex_x925_ai_cores/?td=rt-3a
i-quants offer better quantization quality than k-quants in the 2- and 3-bpw range, but are notoriously slow on the CPU. This PR brings a significant speedup on Arm CPUs, particularly for prompt processing. Performance is still lower than k-quants, but the performance gap is now substantially smaller.
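For readers unfamiliar with the `+dotprod` feature shown in the tables above: Armv8.2's dot-product instructions accumulate 16 int8×int8 products per vector instruction, which is what makes fast CPU kernels for these quants feasible. A minimal sketch of that building block (not the PR's actual kernel; compile with something like `-march=armv8.2-a+dotprod`):

```cpp
#include <arm_neon.h>
#include <cstdint>

// Dot product of two int8 vectors of length n (n a multiple of 16).
static int32_t dot_i8(const int8_t * x, const int8_t * y, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        int8x16_t vx = vld1q_s8(x + i);
        int8x16_t vy = vld1q_s8(y + i);
        acc = vdotq_s32(acc, vx, vy);   // each lane accumulates 4 int8 products
    }
    return vaddvq_s32(acc);             // horizontal add of the 4 lanes
}
```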
The following table compares performance between the main branch and this PR for a 7B LLaMA model on an M2 Max CPU.