Q4_0 and Q4_1 quantization breaks RWKV due to weight/activation outliers #12
If we look at percentiles of the 3B model's weights, the 0.001-0.999 percentiles have modest values:
The min 0.001 / max 0.999 percentiles across all matrices tell the same story: the huge values are rare outliers, and most other values fall in a much narrower range.
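For reference, a minimal sketch (not from this thread) of how such statistics can be gathered from a PyTorch RWKV checkpoint; the checkpoint file name is a placeholder:

```python
import numpy as np
import torch

# Hypothetical checkpoint path; substitute the actual .pth file.
state_dict = torch.load("RWKV-4-Pile-3B.pth", map_location="cpu")

for name, tensor in state_dict.items():
    if tensor.dim() != 2:
        continue  # only 2D matrices get quantized
    w = tensor.float().numpy().ravel()
    lo, hi = np.quantile(w, [0.001, 0.999])
    print(f"{name:48s} min={w.min():+8.3f} max={w.max():+8.3f} "
          f"p0.001={lo:+7.3f} p0.999={hi:+7.3f}")
```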
Some links that may be useful for research/impl:
Here is a hacky way to deal with outlier weights/activations: a commit in an experimental branch (do not use it unless you know what you are doing!). What I did:
It is slower now because of the deoptimization, but perplexity is significantly lower:
I have not tested it with larger models yet. I'll continue the experiments...
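For illustration only, here is a rough NumPy sketch of the block-level idea described above: keep the single largest-magnitude value of each block unquantized and Q4_1-quantize the rest. The block size of 32 and "exactly one outlier per block" are assumptions; this is not the code from the experimental branch.

```python
import numpy as np

BLOCK = 32  # assumed block size, matching Q4_1's 32-value blocks

def quantize_block_q4_1_outlier(x):
    """Quantize one block, keeping its largest-magnitude value in full precision."""
    outlier_idx = int(np.argmax(np.abs(x)))
    outlier_val = float(x[outlier_idx])
    rest = np.delete(x, outlier_idx)           # outlier excluded from min/delta

    vmin = float(rest.min())
    delta = (float(rest.max()) - vmin) / 15.0  # 4-bit quants span 0..15
    if delta == 0.0:
        delta = 1.0
    nibbles = np.clip(np.round((x - vmin) / delta), 0, 15).astype(np.uint8)
    return outlier_idx, outlier_val, vmin, delta, nibbles

def dequantize_block(outlier_idx, outlier_val, vmin, delta, nibbles):
    y = vmin + delta * nibbles.astype(np.float32)
    y[outlier_idx] = outlier_val               # restore the outlier exactly
    return y

x = np.random.randn(BLOCK).astype(np.float32)
x[7] = 25.0                                    # simulate an RWKV-style outlier weight
y = dequantize_block(*quantize_block_q4_1_outlier(x))
print("max abs dequantization error:", np.abs(x - y).max())
```

Without the outlier exclusion, the single huge value stretches delta so much that the 15 remaining quantization levels lose almost all resolution for the normal-range weights.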
Results so far for all models (169M to 7B; no data for 7B FP16 because there is not enough memory):
Experimental Q4_1 (as in the previous message -- it stores outliers in a block as-is and does not quantize activations) clearly reduces perplexity compared to vanilla Q4_1, but it is still not worth running 7B INT4 instead of 3B FP16, or even 1.5B FP16. I have ideas about performance optimization, but performance does not matter when it is still better for quality to run smaller models. I'll try to come up with ideas for changing the quantization format even further.
hahnyuan/RPTQ4LLM#1 -- the RPTQ method may be worth keeping an eye on; their repo is the new quantization SOTA. Hope this helps fuel your implementation too.
Can try this for INT4: compute "mx my rx ry" as in https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/model.py. Basically: rescale all rows & columns of w --> compute INT4 x @ w --> rescale the result. Probably you only need rx & ry, and you can compute them using max(abs(w)). And you probably only need them for att.output.weight (maybe ffn.value.weight too).
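As an interpretation of this suggestion (not the ChatRWKV code itself), here is a small NumPy sketch: per-row and per-column max-abs scales ry/rx are divided out of w, so the matrix handed to the INT4 kernel has a flatter dynamic range, and the scales are re-applied to the input and output of the matmul. fake_int4_roundtrip stands in for a real INT4 kernel.

```python
import numpy as np

def fake_int4_roundtrip(w):
    """Symmetric per-tensor 4-bit quantize/dequantize, standing in for an INT4 kernel."""
    scale = np.abs(w).max() / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

def rescaled_int4_matmul(x, w):
    ry = np.abs(w).max(axis=1, keepdims=True)  # per-row max(abs(w)),    shape [in, 1]
    rx = np.abs(w).max(axis=0, keepdims=True)  # per-column max(abs(w)), shape [1, out]
    w_flat = w / (ry * rx)                     # rescaled weight with a tamer range
    w_q = fake_int4_roundtrip(w_flat)          # this is what would be stored in INT4
    # x @ w == ((x * ry.T) @ w_flat) * rx, so rescale the input and the result.
    return (x * ry.T) @ w_q * rx

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
w[10, 20] = 40.0                               # an RWKV-style outlier weight
x = rng.normal(size=(1, 256)).astype(np.float32)

print("naive INT4 error:   ", np.abs(x @ fake_int4_roundtrip(w) - x @ w).max())
print("rescaled INT4 error:", np.abs(rescaled_int4_matmul(x, w) - x @ w).max())
```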
In the end, I decided to create a separate quantization format.
Comparison of perplexity
Lower is better. See the "Perplexity measuring setup" section below for the set-up. Overall, the rule "it's better to use quantized model X than FP16 model X-1" holds.
Performance (per-token latency in ms)
Lower is better. Tests were done on a 16 GB RAM, 4-core/8-thread machine. Overall, ...
Disk/memory consumption
Reflection and future work
What did work (in order of decreasing importance):
What did not work:
What I did not try (this is left for future work):
Moreover, the AVX2 implementation of ...
Perplexity measuring setup
I've been measuring the loss and perplexity of different model sizes and data types on a very small private dataset:
The measuring method may not be entirely correct, but these huge losses and perplexities really do show in the quality of generated text -- it is almost incoherent.
Of course, we need proper measuring on WikiText; but it would be very slow on my hardware, and WikiText is not representative of my use case.
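In case it is useful, perplexity here can be read as exp of the average next-token cross-entropy loss. A generic sketch of that calculation (the model interface and tokens are assumed placeholders, not the actual script behind these numbers):

```python
import math
import torch
import torch.nn.functional as F

def perplexity(model, tokens: torch.Tensor) -> float:
    """tokens: 1D LongTensor of token ids from the evaluation text."""
    with torch.no_grad():
        logits = model(tokens.unsqueeze(0))      # assumed shape: [1, seq_len, vocab]
        # Average cross-entropy of predicting token t+1 from the logits at position t.
        loss = F.cross_entropy(logits[0, :-1], tokens[1:])
    return math.exp(loss.item())
```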
An interesting thing to note is the min and max values of the RWKV matrix weights:
For comparison, LLaMA 7B min and max values are around -2.5 and 2.5!
As a next step, I'll try to determine whether these huge values are outliers, or most weights really are distributed in this range.
I guess we need an alternative quantization scheme for RWKV.