New Q4_0 implementation using 2x F16 instead of 1x F32 #1026
Conversation
Force-pushed from ec870de to b4c74b7
I would have hoped the new format would be defined like this:
That is, don't force what are essentially two blocks into one struct, and also define a new version number. Pros:
Cons:
Of course, in the long run, we might decide to stop supporting the old format. What do you think? Would that really be slower?
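For illustration, here is a rough sketch of the two layouts under discussion; the struct and field names are hypothetical, contrasting the PR's packed two-scale block with the suggested smaller block plus a format-version bump:

```c
#include <stdint.h>

typedef uint16_t ggml_fp16_t;   // stand-in for ggml's half-precision storage type

// Layout of this PR (hypothetical name): one 32-element block carrying two
// F16 scales, i.e. two 16-element sub-blocks packed into a single struct.
typedef struct {
    ggml_fp16_t d0;      // scale for elements  0..15
    ggml_fp16_t d1;      // scale for elements 16..31
    uint8_t     qs[16];  // 32 x 4-bit quants
} block_q4_0_2xf16;      // 20 bytes per 32 weights

// Alternative suggested above (hypothetical name): keep 16-element blocks with
// a single F16 scale each and bump the file-format version instead.
typedef struct {
    ggml_fp16_t d;       // scale for elements 0..15
    uint8_t     qs[8];   // 16 x 4-bit quants
} block_q4_0_v2;         // 10 bytes per 16 weights, i.e. the same bits per weight
```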
@sw Btw, I'm again reconsidering keeping the SIMD quantization routines. Dropping SIMD quantization support would make changes as in this PR much simpler.
Here is a quick test of the performance impact of quantization on LoRA:
The quantization happens on the CPY operation, and currently this represents about 40% of the time to apply a layer. Changing the quantization to the non-SIMD reference implementation, the CPY takes 80% of the time instead. Caveat:
That said, I absolutely agree.
If we change the quantization implementation, the timing should be much closer I think.

We can reduce the boilerplate in the operator implementations with something like:

```c
#define GGML_TENSOR_SIZES_SRC0(x) // defines ne00, ne01, ..., nb00, nb01, etc..
#define GGML_TENSOR_SIZES_SRC1(x) // defines ne10, ne11, ..., nb10, nb11, etc..
#define GGML_TENSOR_SIZES_DST(x)  // defines ne0,  ne1,  ..., nb0,  nb1,  etc..
#define GGML_GET_PTR_ROW(x, i)    // get ptr to ith row using strides
```
etc.. We can do this refactoring soon, let's say after the quantization work is done. But, the SIMD remains a problem because even if you had multiple small files or macros, changing the format would still mean updating every SIMD variant.

Edit: Hm, actually that's not true. I guess because I was editing over the original implementation.
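As an illustration of the macro idea above, a sketch only (not actual ggml code) of how such helpers could expand to local shape/stride variables and a strided row lookup, assuming the usual `ne`/`nb`/`data` fields of `ggml_tensor`:

```c
// Sketch: one possible expansion of the proposed helpers, assuming
// ggml_tensor's ne (element counts), nb (byte strides) and data fields.
#define GGML_TENSOR_SIZES_SRC0(t)                                  \
    const int64_t ne00 = (t)->ne[0], ne01 = (t)->ne[1],            \
                  ne02 = (t)->ne[2], ne03 = (t)->ne[3];            \
    const size_t  nb00 = (t)->nb[0], nb01 = (t)->nb[1],            \
                  nb02 = (t)->nb[2], nb03 = (t)->nb[3];

#define GGML_GET_PTR_ROW(t, i) ((char *)(t)->data + (size_t)(i)*(t)->nb[1])

// Usage inside an operator:
//   GGML_TENSOR_SIZES_SRC0(src0);
//   const float * row = (const float *) GGML_GET_PTR_ROW(src0, ir);
```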
This is just for a single tensor; applying it to the entire model can take a lot of time, especially with the larger models. Applying the entire LoRA (this is baize-lora-7B) takes 30 seconds on my machine. This is a bit of a pathological case since it also modifies the feed-forward tensors (usually LoRAs only modify the attention tensors), but still, this is very slow for something that has to be done every time.

We can consider the SIMD implementations of functions like quantize and dequantize very low priority, simply ignore them in experiments, and only if the experiment is successful allow other people to implement them later in separate PRs. I think this is what we are already doing in practice.

Edit: removing the roundf from quantize is indeed much faster:
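A minimal sketch of what that change could look like in the scalar path (simplified, not the exact `quantize_row_q4_0` code; function names hypothetical): the per-element `roundf` call can be replaced by adding the offset plus 0.5 and truncating.

```c
#include <math.h>
#include <stdint.h>

// Simplified scalar Q4_0 step: map x to a 4-bit value with offset 8,
// where id = 1/d is the inverse block scale and x*id lies in [-8, 7].

// With roundf (one libm-style rounding per element):
static inline uint8_t q4_with_roundf(float x, float id) {
    return (uint8_t)((int8_t)roundf(x*id) + 8);
}

// Without roundf: x*id + 8.5f lies in [0.5, 15.5], so the truncating cast
// still rounds to nearest (only exact .5 ties are handled differently).
static inline uint8_t q4_without_roundf(float x, float id) {
    return (uint8_t)(x*id + 8.5f);
}
```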
One thing to consider is OpenMP, which supports both multi-threading and SIMD today. It would keep the code simple and portable while leveraging the latest hardware features (multi-core and AVX).
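For example, a loop parallelized and vectorized with OpenMP pragmas alone might look like this (a generic sketch, not tied to any particular ggml kernel; build with `-fopenmp`):

```c
#include <stddef.h>

// Generic sketch: OpenMP handles both threading (parallel for) and
// vectorization (simd) without hand-written pthreads or intrinsics.
void scale_rows(float * dst, const float * src, int nrows, int ncols, float s) {
    #pragma omp parallel for
    for (int r = 0; r < nrows; ++r) {
        #pragma omp simd
        for (int c = 0; c < ncols; ++c) {
            dst[(size_t)r*ncols + c] = src[(size_t)r*ncols + c] * s;
        }
    }
}
```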
Reimplementation continues in #1046
ref #959
ARM NEON-only implementation

Timing

Time per token: ~55 ms, up from ~50 ms for `Q4_0` on `master`
Perplexity

Without BLAS:

- 25 iters: 6.5251
- 655 iters: 6.2319

With BLAS:

- 25 iters: 6.5146
- 655 iters: 6.2316
The new 7B perplexity on this branch with BLAS enabled is: `6.2316`. We can expect a similar value without BLAS thanks to #951.
The perplexity on `master` for the same setup is: `6.2897`. Therefore we observe a delta of `-0.0581` thanks to the 2x F16 scale factors in `Q4_0`.
Somehow I was hoping for a value closer to the `Q4_1` perplexity of `6.0863` reported in #896.

The current RMSE is much higher than the one reported in #896 for this approach.
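For context, the RMSE figures compare the original weights with their values after a quantize/dequantize round trip; a minimal sketch of that computation (helper name hypothetical):

```c
#include <math.h>
#include <stddef.h>

// Root-mean-square error between the original weights x and their
// quantized-then-dequantized reconstruction y, both of length n.
static double weight_rmse(const float * x, const float * y, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double)x[i] - (double)y[i];
        sum += d*d;
    }
    return sqrt(sum / (double)n);
}
```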
Either the claim in #896 that RMSE optimization brings only `0.02` ppl is not entirely correct, or my expectation that 2x F16 `Q4_0` would be similar to `Q4_1` on `master` was not correct.

Took the 2x F16 model from #896 and ran perplexity with the current branch. The result is:

- 655 iters: 6.2039
So indeed, RMSE optimization leads to just about `-0.03` perplexity gain. I guess I don't feel so confident about dropping `Q4_1` given these results.

Conclusions
The new 2x F16 `Q4_0` format is viable: it improves 7B perplexity by `-0.0581` and has almost the same inference speed as the original format (`50 ms per token` vs `55 ms per token` on M1).

Next steps
- Merge `Q4_2` into `master` and have the other arches merged as well
- Improve `Q4_1` by adding 8-bit intermediate results as in Add Q8_0 quantization for intermediate results #951 and potentially implementing `Q4_3` - similar to the approach in this PR
- Evaluate the `output` tensor using `Q4_2` and `Q4_3` (see Measure perplexity delta between Q4_0 and F16 "output" tensor #1003) and decide if more quantization improvements should be pursued