More accurate Q4_0 and Q4_1 quantizations #896
Conversation
in quantize_row_q4_0_reference and quantize_row_q4_1_reference. This reduces the gap to the vectorized versions to ~10% for quantize_row_q4_0 and <15% for quantize_row_q4_1 on the two CPUs I have tried (Ryzen 7950X and M2 Max).
(Edit: it looks like your RMS errors are higher than those posted by @unbounded here: #835 (comment). Maybe it's because you don't seem to be using the value -8. But why is your perplexity so much lower?)
I can confirm that it's faster; however, it changes the output, somewhat subverting the meaning of "reference". I would find it better to make this a separate PR. Checksums for the 7B model compared to master (3e6e70d):
Great job on this; again, though, this should probably be a separate PR. It could also be made to benefit the existing formats (ftypes 2, 3).
But we should eventually switch back to nearestInt() and adapt the test.
if (df0 > 0) {
    kmin = nmax-2; kmax = nmax + 1;
} else {
    kmin = nmax/2; kmax = nmax+1;
`kmax` is the same in both cases; move it outside the if/else or eliminate it entirely in favor of `nmax+1` (if that's what you intended). All in all, this function would benefit from some explanatory comments.
`df0` is the negative of the cost function derivative with respect to the scale at the point where we started. If it is greater than 0, we expect the search range to extend beyond `nmax`, and indeed it does. On occasion one can get a better solution by going to `nmax+2` or even `nmax+3`. In practice, the gain in MSE is so marginal that the added extra computation time is just not worth it. But I have left `kmax` defined explicitly in both branches to remind us that we may want to explore ways to find such marginal improvements more efficiently (and then we would change `kmax` in the `df0 > 0` branch correspondingly).
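For context, a quick gloss on this comment (my own reading, not text from the PR): for a `Q4_0`-style cost $F(a) = \sum_i (x_i - a\,l_i)^2$, the derivative with respect to the scale is

$$\frac{\partial F}{\partial a} = -2\sum_i l_i\,(x_i - a\,l_i),$$

so, up to a constant factor, `df0` would correspond to $\sum_i l_i\,(x_i - a\,l_i)$ evaluated at the starting scale.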
Could you please share the full perplexity results?
Somehow I had it hard-wired in my brain that quants need to be in -7...7 to be comparable to the original Q4_0. But this is clearly not the case, and if we relax this requirement, this simple change brings the rmse down to 0.001966 at the expense of a somewhat longer computation (~67 seconds vs 49 seconds for the 7B model on M2 Max). The perplexity test is still running, but it looks like the improvement compared to the previous version will be quite modest (~0.03) despite the significant improvement in MSE. The change does not affect Q4_1, as there we already use the full range of 16 possible int values.
For completeness, here are the perplexity runs:

Q4_0, 7B

Q4_1, 7B
The RMSE of the 7B model becomes 0.00185228. It looks like the perplexity will end up being around 6.27-6.28.
Basically, we use two Q4_0 quantizations, each covering 16 weights, to quantize a set of 32 weights. We get two separate scaling factors, which we store as fp16, ending up using the exact same 5 bits per weight as the current Q4_0. We end up with an rmse of ~0.00159, so basically the same as the improved `Q4_1`. But this should run faster than `Q4_1` (unless the fp16 -> fp32 conversion is somehow very slow).
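To make the layout concrete, here is a minimal struct sketch of such a block (my own illustration and naming, not code from the PR); two fp16 scales plus 16 bytes of quants give 20 bytes per 32 weights, i.e. 5 bits per weight:

```cpp
// Sketch of the described scheme: a group of 32 weights split into two
// sub-groups of 16, each with its own fp16 scale.  Names are illustrative.
#include <cstdint>

using fp16_t = std::uint16_t;        // assumed raw storage for an IEEE half

struct block_q4_0_split16 {
    fp16_t       d[2];               // one scale per sub-group of 16 weights
    std::uint8_t qs[16];             // 32 quants, two 4-bit values per byte
};
// sizeof(block_q4_0_split16) == 20 bytes for 32 weights -> 5 bits per weight
```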
As in the last commit, but for the Q4_1 type, using the same memory as the existing Q4_1 via fp16. We end up with rmse 0.00125125, maxerr 0.11657715, 95pct<0.0024, median<0.0010 after a quantize - dequantize round trip. This is quite a bit better than Q4_1 with groups of 32 weights, but by far not as good as the 5-bit quantization that uses the same amount of memory, where we had rmse 0.00076131, maxerr 0.05273438, 95pct<0.0016, median<0.0006.
@ggerganov I'm going on vacation today and it is unlikely I will have time to work on this when I come back. There is some interesting stuff in here, but I leave it up to you to decide if you want to close/merge/cherry-pick bits and pieces of it. From what I have seen in these few days, improvements to "classic" 4-bit quantization (i.e.,
Thank you for the analysis - there are definitely interesting results and techniques here.
Btw, I just realized something that might address your second point. I will add this idea to #909
Great point. I had a few minutes before leaving for the airport and tried 8-bit quantization. Just the simplest possible (and very fast) variant gives

q8_0 : rmse 0.00010729, maxerr 0.01030385, 95pct<0.0002, median<0.0002

which shows that you will indeed get a massive gain in accuracy if you quantize directly into 8 bits.
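For illustration, the "simplest possible variant" of 8-bit quantization presumably amounts to per-block absmax scaling; a minimal sketch under that assumption (block size, rounding, and names are mine, not the PR's):

```cpp
// Sketch: absmax 8-bit quantization of a block of weights.
// Each weight maps to an int8 via q = round(x / scale), with scale = amax / 127.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct BlockQ8 {
    float scale;                       // dequantized value = scale * q[i]
    std::vector<std::int8_t> q;
};

BlockQ8 quantize_q8_0(const float* x, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::max(amax, std::fabs(x[i]));
    BlockQ8 out{amax / 127.0f, std::vector<std::int8_t>(n, 0)};
    if (amax == 0.0f) return out;      // all-zero block: keep quants at 0
    const float inv = 127.0f / amax;
    for (int i = 0; i < n; ++i)
        out.q[i] = static_cast<std::int8_t>(std::lround(x[i] * inv));
    return out;
}
```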
}
};
int nthread = std::min(nchunk, int(std::thread::hardware_concurrency()));
std::vector<std::thread> workers(nthread-1);
Use std::jthread so you don't have to loop to join them.
Currently we are targeting C++11, and std::jthread seems to be C++20.
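For reference, the loop-and-join pattern being discussed looks roughly like this in the C++11 style (an illustrative sketch, not the PR's actual code):

```cpp
// Sketch: distribute nchunk work items over hardware threads, do one share on
// the calling thread, and join the workers explicitly (no std::jthread in C++11).
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

void run_chunks(int nchunk, const std::function<void(int)>& process_chunk) {
    const int nthread = std::max(1, std::min(nchunk, (int)std::thread::hardware_concurrency()));
    std::vector<std::thread> workers;
    workers.reserve(nthread - 1);
    for (int t = 1; t < nthread; ++t)
        workers.emplace_back([&, t] {
            for (int chunk = t; chunk < nchunk; chunk += nthread) process_chunk(chunk);
        });
    for (int chunk = 0; chunk < nchunk; chunk += nthread) process_chunk(chunk);  // main thread's share
    for (auto& w : workers) w.join();  // must join before the std::thread objects are destroyed
}
```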
Actually, no:

Line 85 in c12b14b:

CXXFLAGS += -std=c++23 -DGGML_BIG_ENDIAN
* Q4_2 quantization with rmse-optimized scale and quants

  For quantize-stats we get q4_2: rmse 0.00159301, maxerr 0.17480469, 95pct<0.0030, median<0.0012. For 7B perplexity with BLAS enabled we get 6.2038 after 655 chunks. Quantization is slow (~90 seconds on my Mac for 7B) as it is not multi-threaded as in PR #896.

* ggml : satisfy the sanitizer builds

  Not sure why this makes them fail

* Better follow ggml conventions for function names

* Fixed type as per reviewer comment

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
I think we can close this now. Most of what was here is now implemented in PR #1106.
Hi! I am having the following errors while trying to build it:

I llama.cpp build info:

cc -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -c ggml.c -o ggml.o

Any idea?
Your version of gcc is too old, check #1120.
I hadn't noticed that my gcc was old in the Docker image I was using. I'm sorry! Thank you very much!
This allows local build options (like LLAMA_*) to be set in a local file (Makefile.local) instead of having to edit the Makefile or provide a long gmake command line on every build. Using '-include' avoids generating a warning if Makefile.local doesn't exist.
Update

After seeing PR #835, I pushed some more changes that only affect the `Q4_0` results. I now get a lower rmse for the 7B model, and perplexity becomes 6.2644. This is the result on my MacBook with M2 Max. Running the same quantization on a Ryzen 7950X gives completely different results. The test is still running, but so far it looks like it will end up with a ~0.3 higher perplexity. I guess there is a problem with the AVX2 version that is being used there. @ggerganov tells me that the difference I'm observing is simply due to using BLAS on the Mac and not using BLAS on the Ryzen 7950X.

Update 2
OK, it looks like the low perplexities I'm getting are simply due to the fact that I'm running on the Mac, where BLAS is enabled by default. So, basically, most of the reduction in perplexity I'm observing is simply due to the full precision in matrix multiplications. I will rerun perplexity without BLAS (or with BLAS using the reference `Q4_0` quantization) and will post the results. This will better tell us how much one can gain from improving the quantization.

Update 3
Perplexity of the 7B model with reference `Q4_0` quantization and BLAS enabled is 6.2838 after 655 chunks. So, basically, the ~25% reduction in MSE of the quantized weights results in a 0.02 improvement in perplexity. In contrast, full precision in matrix multiplications via BLAS improves perplexity by ~0.3. Which basically means that this PR is pretty pointless.

Update 4
Perplexity results for 7B and 13B with `Q4_0` and `Q4_1` are available here.

I also added a POC for 5-bit quantization. Memory/disk usage is the same as for the current `Q4_1`, by using two `fp16` floats instead of two `fp32` floats. For each quantized value in a set of 32 weights, it stores 4 of the 5 bits as `Q4`. It then uses the 32 bits that are freed up by switching to `fp16` to store a flag indicating whether the corresponding value has the 5th bit set. The encoding/decoding ends up being not too bad in the end. The improvement in `rmse` compared to `Q4_1` after a full round trip of quantization - dequantization is dramatic.

main: total time = 391533.50 ms
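To make the packing concrete, a minimal struct sketch of the described 5-bit POC (my own naming and field order; the PR may differ):

```cpp
// Sketch: a block of 32 weights in the same 24 bytes as the current Q4_1 block.
// The two fp32 coefficients become fp16, and the 32 bits freed that way store
// the 5th (high) bit of each of the 32 quants.
#include <cstdint>

using fp16_t = std::uint16_t;          // assumed raw storage for an IEEE half

struct block_q4_1_5bit {
    fp16_t        d;                   // scale, stored as fp16
    fp16_t        m;                   // offset, stored as fp16
    std::uint32_t high_bits;           // bit i = 5th bit of quant i
    std::uint8_t  qs[16];              // low 4 bits of the 32 quants
};

// Reassemble the full 5-bit quant of weight i (0..31).
inline int quant5(const block_q4_1_5bit& b, int i) {
    const int lo = (b.qs[i / 2] >> (4 * (i & 1))) & 0x0F;
    const int hi = (b.high_bits >> i) & 1;
    return lo | (hi << 4);
}
```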
Update 5

Added a `Q4_0`-like quantization scheme that ends up with an RMSE of ~0.00159 (so basically the same as the best `Q4_1`). Basically, we split a group of 32 weights into two groups of 16 and quantize these separately. We store the two scaling factors as `fp16`, ending up using the exact same amount of memory as the current `Q4_0` (5 bits per weight). A round trip of quantization - dequantization gives the rmse quoted above.

Description
This PR adds new methods for `Q4_0` and `Q4_1` quantization that (almost) exactly solve the mixed integer minimization problem

$$\min_{a,\,b,\,\{l_i\}} \; \sum_i \left(x_i - a\,l_i - b\right)^2$$

where the `x_i` are the original weights, the `l_i` are the quantized weights, and `a, b` are the conversion coefficients (for `Q4_0` there is no offset, i.e. $b = 0$). It is almost exact because in some very rare degenerate cases the method may not find the global minimum. Guaranteeing that the global minimum is obtained is not worth it: the difference in mean-square-error (MSE) to the guaranteed minimum is less than 0.01%, while the guarantee costs at least a 2-fold increase in computation time.
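To make the idea concrete, here is a brute-force sketch of such an rmse-optimized quantization for the `Q4_0` case ($b = 0$). The candidate window and ranges are my own assumptions for illustration; the PR solves the problem more efficiently:

```cpp
// Sketch: choose the block scale by trying candidate scales around the naive
// amax/qmax choice and keeping the one with the lowest squared error.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

struct ScaledQuants {
    float scale;
    std::vector<std::int8_t> q;
};

ScaledQuants quantize_block_rmse(const float* x, int n, int qmin, int qmax) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::max(amax, std::fabs(x[i]));
    ScaledQuants out{0.0f, std::vector<std::int8_t>(n, 0)};
    if (amax == 0.0f) return out;                       // all-zero block

    float best_err = std::numeric_limits<float>::infinity();
    for (int k = 0; k <= 20; ++k) {
        // Candidate "effective maximum quant" between qmax-1 and qmax+1.
        const float scale = amax / (qmax - 1.0f + 0.1f * k);
        float err = 0.0f;
        for (int i = 0; i < n; ++i) {
            const int q = std::min(std::max((int)std::lround(x[i] / scale), qmin), qmax);
            const float diff = x[i] - scale * q;
            err += diff * diff;
        }
        if (err < best_err) { best_err = err; out.scale = scale; }
    }
    for (int i = 0; i < n; ++i)
        out.q[i] = (std::int8_t)std::min(std::max((int)std::lround(x[i] / out.scale), qmin), qmax);
    return out;
}
```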
On the 7B model, the improved `Q4_0` quantization achieves a ~14% reduction in MSE compared to the existing implementation. The improved `Q4_1` quantization is even better, achieving a ~25% reduction in MSE compared to the existing `Q4_1`.

So far I have only measured perplexity for the 7B model. I get 6.3539 for `Q4_0` and 6.0863 for `Q4_1` with the default context size.
For the sake of compatibility, I have kept the format of the existing `Q4_0` and `Q4_1` quantization (i.e., one or two 32-bit floats followed by 16 `uint8_t` containing the quants), so that a model quantized with these new methods can be used without any other changes to the code. This is quite wasteful, as the same 6 bits per weight that are used in `Q4_1` would lead to a massive reduction in MSE if one switched to 5-bit quantization and `fp16` coefficients.
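For reference, the existing layouts described above look roughly like this (field names are illustrative, not copied from ggml.c):

```cpp
// Sketch: the existing Q4_0 / Q4_1 block layouts as described in the text --
// one or two 32-bit floats followed by 16 bytes holding 32 4-bit quants.
#include <cstdint>

constexpr int QK = 32;                 // weights per quantization block

struct block_q4_0 {
    float        d;                    // scale a   (4 bytes)
    std::uint8_t qs[QK / 2];           // 32 quants, 4 bits each (16 bytes) -> 5 bits/weight
};

struct block_q4_1 {
    float        d;                    // scale a   (4 bytes)
    float        m;                    // offset b  (4 bytes)
    std::uint8_t qs[QK / 2];           // 32 quants, 4 bits each (16 bytes) -> 6 bits/weight
};
```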
The new quantization methods are meant to be used for quantizing the original model only. They are by far not fast enough for quantization of intermediate results (in single-threaded mode the new `Q4_0` is ~25 times slower and the new `Q4_1` ~50 times slower than the corresponding existing implementations).

The quantization function will automatically use multi-threading if the chunk of weights given for quantization is large enough. Plugged into the `quantize` example, it gets the job done in about 49 seconds for `Q4_0` on my MacBook (M2 Max) for the 7B model. `Q4_1` quantization of the 7B model takes ~190 seconds.

I have also added a change to the reference (i.e., scalar) versions of the `Q4_0` and `Q4_1` quantization implementations: replacing the `roundf()` function with a better conversion to `int` speeds up the scalar implementation quite a bit, especially on `X86_64` (and `x86`), where the slowness of `round` is legendary. After this change, the reference implementation is only ~10% slower than the vectorized quantization on the two CPUs I have tried (M2 Max and Ryzen 7950X).
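For completeness, the usual "better conversion to int" in this situation is the magic-number rounding trick; a sketch of it (my illustration, not necessarily byte-for-byte what the PR uses):

```cpp
// Sketch: fast round-to-nearest float -> int without calling roundf().
// Adding 1.5 * 2^23 = 12582912.0f forces the value into [2^23, 2^24), where the
// low 23 mantissa bits hold the rounded integer plus a fixed bias of 2^22.
// Only valid for |fval| below about 2^22.
#include <cassert>
#include <cmath>
#include <cstring>

static inline int nearest_int(float fval) {
    assert(std::fabs(fval) <= 4194303.f);     // 2^22 - 1
    const float val = fval + 12582912.f;      // rounding happens in this addition
    int i;
    std::memcpy(&i, &val, sizeof(int));       // reinterpret the float's bit pattern
    return (i & 0x007fffff) - 0x00400000;     // mantissa bits minus the 2^22 bias
}
```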