Continuous layouts for quantization q4_0c #1073
Conversation
Force-pushed from 48c84f7 to 4f149c2
What do you think about having two separate arrays, one for qs and one for scales?

@unbounded I hope in the next couple of days we confirm that we will proceed with these quantization strategies and merge the ARM NEON implementations.

Not sure what you mean here; do you mean keeping them as two separate allocations? That would somewhat simplify alignment, but I think it would be hard to do that generally for different formats, e.g. q4_1 might use three "arrays".
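To picture the "three arrays" point for q4_1 (quants, deltas, minimums) while still keeping a single allocation, here is a sketch of carving one buffer into back-to-back regions. The names (`q4_1c_view`, `q4_1c_split`) and the exact layout are my own illustration, not the PR's code:

```c
#include <stddef.h>
#include <stdint.h>

#define QK4_1 32  // weights per q4_1 block (as in ggml)

// One allocation, three back-to-back regions for nb blocks:
// all quant nibbles, then all deltas, then all minimums.
typedef struct {
    uint8_t *qs; // nb * QK4_1/2 bytes of packed nibbles
    float   *d;  // nb deltas
    float   *m;  // nb minimums
} q4_1c_view;

// Split a single buffer into the three logical arrays.
static q4_1c_view q4_1c_split(void *buf, size_t nb) {
    q4_1c_view v;
    v.qs = (uint8_t *)buf;
    v.d  = (float *)(v.qs + nb * QK4_1 / 2); // deltas follow the nibbles
    v.m  = v.d + nb;                         // minimums follow the deltas
    return v;
}
```

With one allocation the alignment of the inner arrays has to be managed by hand (here it works out because `nb * QK4_1 / 2` is a multiple of 4), which is exactly the kind of bookkeeping separate allocations would avoid.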
```c
const float dy0 = yds[dst0];
const float dy1 = yds[dst1];

// NOTE: having these as plain int triggers a bug with AVX512 on GCC 12.2
```
https://godbolt.org/z/zj4x6Kz8o

Let me know if you see something wrong with this code, but this looks like a compiler bug to me. GCC spots that it can vectorize with vpdpbusd (impressive), but it looks like it forgot that the first vector has to be unsigned (just as I tend to).
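For reference, the kind of loop vpdpbusd accelerates looks roughly like this (my own minimal sketch of the pattern, not the code from the godbolt link):

```c
#include <stddef.h>
#include <stdint.h>

// Unsigned-by-signed byte dot product: the shape vpdpbusd (AVX512-VNNI)
// accelerates. The instruction multiplies unsigned bytes from its first
// source with signed bytes from its second, so when auto-vectorizing the
// compiler must keep track of which operand is which -- exactly what the
// comment above suggests GCC 12.2 gets wrong in some cases.
int dot_u8_i8(const uint8_t *x, const int8_t *y, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (int)x[i] * (int)y[i];
    }
    return sum;
}
```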
Force-pushed from 4f149c2 to 64a6a29
I'll hold off a bit until it stabilizes, but it should be straightforward to test the same approach for the q4_2 format.

Added SIMD for Arm Neon; it's almost identical to q4_0 except for one step we don't need. I don't have an M1 to test on, but got some timings on an Ampere Altra VM:

q4_0c 7B:
Prefetching seems to be important with this layout; probably the extra memory accesses confuse the hardware prefetcher.

AVX-512:
q4_0c 7B, with prefetch:

Arm Neon:
q4_0c 7B, with prefetch:

On M1 Pro,

Btw, without prefetching (i.e. the previous commit)
Hm, sounds like it is sensitive to the prefetch distance used, then; that's unfortunate. Thanks for checking!
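For context, software prefetch at a fixed distance looks roughly like this (an illustrative sketch using the GCC/Clang `__builtin_prefetch` builtin, not the PR's code; `PREFETCH_DIST` is an arbitrary value, and as noted above the best distance is hardware-dependent):

```c
#include <stddef.h>

// How far ahead of the current position to prefetch, in bytes.
// Too small and the data isn't ready in time; too large and it may be
// evicted again before use -- hence the sensitivity discussed above.
#define PREFETCH_DIST 384

float dot_with_prefetch(const float *x, const float *y, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        // Hint the hardware to fetch data we'll need PREFETCH_DIST bytes
        // from now (read access, low temporal locality). Prefetching past
        // the end of the array is harmless: prefetch hints never fault.
        __builtin_prefetch((const char *)&x[i] + PREFETCH_DIST, 0, 1);
        __builtin_prefetch((const char *)&y[i] + PREFETCH_DIST, 0, 1);
        sum += x[i] * y[i];
    }
    return sum;
}
```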
Somewhat related to this is the fact that Q8_0, as it is after #1083 and #1109, now has two floats that go to waste for Q4_0 and Q4_2, at least for the AVX2 implementation. This makes quantization slower due to calculating unused values, and the vector dot product slower, as it has to churn through more memory. We could define a new format, but this again makes the source code longer:

```c
#define QK8_0 32
typedef struct {
    float  d;          // delta
    int8_t qs[QK8_0];  // quants
} block_q8_0;

#define QK8_1 32
typedef struct {
    float  d;          // delta
    float  s0;         // d * sum(qs[i]) low
    float  s1;         // d * sum(qs[i]) high
    int8_t qs[QK8_1];  // quants
} block_q8_1;
```

Edit: this was done in #1179
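The cost of the two extra floats is easy to quantify: 8 extra bytes per 32-weight block, i.e. 44 vs 36 bytes per block, about 22% more memory for the dot product to churn through. Restating the structs from the comment above as a self-contained size check:

```c
#include <stdint.h>

#define QK8_0 32
typedef struct {
    float  d;          // delta
    int8_t qs[QK8_0];  // quants
} block_q8_0;          // 4 + 32 = 36 bytes -> 1.125 bytes per weight

#define QK8_1 32
typedef struct {
    float  d;          // delta
    float  s0;         // d * sum(qs[i]) low
    float  s1;         // d * sum(qs[i]) high
    int8_t qs[QK8_1];  // quants
} block_q8_1;          // 12 + 32 = 44 bytes -> 1.375 bytes per weight
```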
Introduce alternative quantized formats q4_0c and q8_0c, corresponding exactly to q4_0 and q8_0, except that quantized values and scales are laid out continuously in memory, and the nibbles in q4_0 are rearranged. This should simplify SIMD implementations, at the expense of slightly more complex scalar implementations.
Mostly copied from the q4_0 implementation
Seems significant especially for evaluation time
Force-pushed from 58e10f2 to d53f767
Not supported on some GCC versions
I started a branch for the same approach with the q4_2 data format.

For AMD64 with AVX-512:
q4_2c:
I am a bit torn about the proposed big block size. I was wondering, if we proceed with this approach, would this work?
@ggerganov That is something I can test the performance impact of, but it will probably be a little while before I get around to it. I don't see any reason we couldn't add a "rest" handler for non-block sizes; the main disadvantage would be the padding and the bit of extra code.
q4_2 timings on Ampere Altra, 7B:
q4_2:
q4_2c:

Did some performance testing on M1 and updated the q4_2c branch:
q4_2:
q4_2c:
Some miscellaneous performance observations:

- I saw no performance difference using 64-byte aligned loads with AVX-512.
- Prefetching instructions give no benefit at all on M1 processors.

Some very interesting work in #1256: the "super blocks" mentioned there are probably large enough to capture most of the benefit of this layout, if they are properly arranged. It also uses drastically fewer Float16 numbers, so there is less benefit in doing 2 or 4 F16->F32 conversions at once like we can here.
Adds a q4_0c type that corresponds to the q4_0 format but with a different memory layout.
Draft status, currently only accelerated for AVX-512, will add a PoC of Neon acceleration but wanted to put this out there since there is some experimentation with quantization formats going on now.
The layout consists of all the quantized values first in blocks of 128 nibbles, followed by all the scales.
The nibbles within a block are laid out consecutively in the lower nibbles, and then consecutively in the higher nibbles.
For dot products we use a q8_0c format, with all the qs bytes followed by all the scales.
The big win is for architectures with larger registers like AVX-512, that can now get two continuous blocks of qs by doing roughly
The dot product implementation here borrows from @dfyz's implementation in #933, but becomes simpler because we don't need to do tricks with the byte layout.
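The nibble arrangement described above can be sketched as a small index helper. This is a hypothetical illustration, not the PR's actual code; `QK4_0C` and `q4_0c_get` are made-up names, assuming the 128-nibble blocks described in the layout:

```c
#include <stdint.h>

#define QK4_0C 128  // nibbles per block, per the layout description above

// Return the i-th quantized nibble of a q4_0c block (64 bytes = 128 nibbles).
// Nibbles 0..63 sit consecutively in the low nibbles of bytes 0..63;
// nibbles 64..127 sit consecutively in the high nibbles of the same bytes.
static inline uint8_t q4_0c_get(const uint8_t *block, int i) {
    if (i < QK4_0C / 2) {
        return block[i] & 0x0F;            // low-nibble half of the block
    }
    return block[i - QK4_0C / 2] >> 4;     // high-nibble half of the block
}
```

A SIMD kernel never needs this per-nibble extraction: it can load 64 bytes, mask with 0x0F for the first 64 values, and shift right by 4 for the next 64, with no interleaving tricks.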
Besides the simplified implementation there is also a small improvement in performance:
```
llama_print_timings: prompt eval time =   665.66 ms /     6 tokens (  110.94 ms per token)
llama_print_timings:       total time = 15398.10 ms
```

vs

```
llama_print_timings: prompt eval time =   449.19 ms /     6 tokens (   74.86 ms per token)
llama_print_timings:       total time = 13557.80 ms
```
The SIMD implementation with 128-bit registers like Neon should look very similar to the current implementations, with similar speeds. Possibly some benefit from doing only aligned loads.
The scalar implementations are slightly more complex but I do not see any degraded performance.
Perplexity should be exactly the same as q4_0.
Example timings (7B)
```
system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
```
Current master q4_0:
q4_0 with #933
continuous layout q4_0c:
Todos:
Future improvements: