Q4_0 and Q4_1 quantization breaks RWKV due to weight/activation outliers #12
If we look at percentiles of the 3B model's weights, the 0.001-0.999 percentiles have modest values:
The min 0.001 / max 0.999 percentiles across all matrices tell the same story: the huge values are rare outliers, and most other values fall in a much narrower range.
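For reference, a minimal sketch (not from this thread) of how such statistics can be gathered from a PyTorch RWKV checkpoint; the checkpoint file name is a placeholder:

```python
import numpy as np
import torch

# Hypothetical checkpoint path; substitute the actual .pth file.
state_dict = torch.load("RWKV-4-Pile-3B.pth", map_location="cpu")

for name, tensor in state_dict.items():
    if tensor.dim() != 2:
        continue  # only 2D matrices get quantized
    w = tensor.float().numpy().ravel()
    lo, hi = np.quantile(w, [0.001, 0.999])
    print(f"{name:48s} min={w.min():+8.3f} max={w.max():+8.3f} "
          f"p0.001={lo:+7.3f} p0.999={hi:+7.3f}")
```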
Some links that may be useful for research/impl:
Here is a hacky way to deal with outlier weights/activations: a commit in an experimental branch (do not use it unless you know what you are doing!). What I did:
It is slower now because of the deoptimization, but perplexity is significantly lower:
I have not tested it with larger models yet. I'll continue the experiments...
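For illustration only, here is a rough NumPy sketch of the block-level idea described above: keep the single largest-magnitude value of each block unquantized and Q4_1-quantize the rest. The block size of 32 and "exactly one outlier per block" are assumptions; this is not the code from the experimental branch.

```python
import numpy as np

BLOCK = 32  # assumed block size, matching Q4_1's 32-value blocks

def quantize_block_q4_1_outlier(x):
    """Quantize one block, keeping its largest-magnitude value in full precision."""
    outlier_idx = int(np.argmax(np.abs(x)))
    outlier_val = float(x[outlier_idx])
    rest = np.delete(x, outlier_idx)           # outlier excluded from min/delta

    vmin = float(rest.min())
    delta = (float(rest.max()) - vmin) / 15.0  # 4-bit quants span 0..15
    if delta == 0.0:
        delta = 1.0
    nibbles = np.clip(np.round((x - vmin) / delta), 0, 15).astype(np.uint8)
    return outlier_idx, outlier_val, vmin, delta, nibbles

def dequantize_block(outlier_idx, outlier_val, vmin, delta, nibbles):
    y = vmin + delta * nibbles.astype(np.float32)
    y[outlier_idx] = outlier_val               # restore the outlier exactly
    return y

x = np.random.randn(BLOCK).astype(np.float32)
x[7] = 25.0                                    # simulate an RWKV-style outlier weight
y = dequantize_block(*quantize_block_q4_1_outlier(x))
print("max abs dequantization error:", np.abs(x - y).max())
```

Without the outlier exclusion, the single huge value stretches delta so much that the 15 remaining quantization levels lose almost all resolution for the normal-range weights.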
Results so far for all models (169M to 7B; no data for 7B FP16 because there is not enough memory):
Experimental Q4_1 (as in the previous message -- it stores outliers in a block as-is and does not quantize activations) clearly reduces perplexity compared to vanilla Q4_1, but it is still not worth running 7B INT4 instead of 3B FP16, or even 1.5B FP16. I have ideas about performance optimization, but performance does not matter when it is still better for quality to run smaller models. I'll try to come up with ideas for changing the quantization format even further.
hahnyuan/RPTQ4LLM#1 -- the RPTQ method may be worth keeping an eye on; their repo is the new quantization SOTA. Hope this helps fuel your implementation too.
Can try this for INT4: compute "mx my rx ry" as in https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/model.py. Basically: rescale all rows & columns of w --> compute INT4 x @ w --> rescale the result. Probably you only need rx & ry, and you can compute them using max(abs(w)). And you probably only need them for att.output.weight (maybe ffn.value.weight too).
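As an interpretation of this suggestion (not the ChatRWKV code itself), here is a small NumPy sketch: per-row and per-column max-abs scales ry/rx are divided out of w, so the matrix handed to the INT4 kernel has a flatter dynamic range, and the scales are re-applied to the input and output of the matmul. fake_int4_roundtrip stands in for a real INT4 kernel.

```python
import numpy as np

def fake_int4_roundtrip(w):
    """Symmetric per-tensor 4-bit quantize/dequantize, standing in for an INT4 kernel."""
    scale = np.abs(w).max() / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

def rescaled_int4_matmul(x, w):
    ry = np.abs(w).max(axis=1, keepdims=True)  # per-row max(abs(w)),    shape [in, 1]
    rx = np.abs(w).max(axis=0, keepdims=True)  # per-column max(abs(w)), shape [1, out]
    w_flat = w / (ry * rx)                     # rescaled weight with a tamer range
    w_q = fake_int4_roundtrip(w_flat)          # this is what would be stored in INT4
    # x @ w == ((x * ry.T) @ w_flat) * rx, so rescale the input and the result.
    return (x * ry.T) @ w_q * rx

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
w[10, 20] = 40.0                               # an RWKV-style outlier weight
x = rng.normal(size=(1, 256)).astype(np.float32)

print("naive INT4 error:   ", np.abs(x @ fake_int4_roundtrip(w) - x @ w).max())
print("rescaled INT4 error:", np.abs(rescaled_int4_matmul(x, w) - x @ w).max())
```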
In the end, I decided to create a separate quantization format.
Comparison of perplexity
Lower is better. See the "Perplexity measuring setup" section below for the set-up. Overall, the rule "it's better to use quantized model X than FP16 model X-1" holds.
Performance (per-token latency in ms)
Lower is better. Tests were done on a 16 GB RAM, 4-core/8-thread machine. Overall, ...
Disk/memory consumption
Reflection and future work
What did work (in order of decreasing importance):
What did not work:
What I did not try (this is left for future work):
Moreover, the AVX2 implementation of ...
Perplexity measuring setup
I've been measuring the loss and perplexity of different model sizes and data types on a very small private dataset:
The measuring method may not be entirely correct, but these huge losses and perplexities really do show in the quality of generated text -- it is almost incoherent.
Of course, we need proper measuring on WikiText; but it would be very slow on my hardware, and WikiText is not representative of my use case.
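In case it is useful, perplexity here can be read as exp of the average next-token cross-entropy loss. A generic sketch of that calculation (the model interface and tokens are assumed placeholders, not the actual script behind these numbers):

```python
import math
import torch
import torch.nn.functional as F

def perplexity(model, tokens: torch.Tensor) -> float:
    """tokens: 1D LongTensor of token ids from the evaluation text."""
    with torch.no_grad():
        logits = model(tokens.unsqueeze(0))      # assumed shape: [1, seq_len, vocab]
        # Average cross-entropy of predicting token t+1 from the logits at position t.
        loss = F.cross_entropy(logits[0, :-1], tokens[1:])
    return math.exp(loss.item())
```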
An interesting thing to note is the min and max values of the RWKV matrix weights:
For comparison, LLaMA 7B min and max values are around -2.5 and 2.5!
As a next step, I'll try to determine whether these huge values are outliers, or most weights really are distributed in this range.
I guess we need an alternative quantization scheme for RWKV.