
Add Q4_1_O quantization format that preserves outliers in weights and does dot in FP32 #16

Merged 11 commits into master on Apr 8, 2023

Conversation

saharNooby
Collaborator

Q4_1_O is like Q4_1, but with two important differences:

  • for each block, a single outlier (the absmax value) is selected and stored separately, as-is; the remaining values are quantized as if the outlier were not there at all
  • during inference, the dot product in matmul is done in FP32 after dequantizing the weights; in contrast, Q4_1 quantizes the activations and does a quantized dot product

This format greatly improves perplexity compared to Q4_1, but at the cost of inference that is about as slow as FP32.
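To make the format concrete, here is a minimal, self-contained C sketch of what the block layout and quantization could look like. The block size of 32, the struct fields, and the function names are assumptions for illustration only, not the actual rwkv.cpp code:

```c
// Hypothetical Q4_1_O block layout and quantization (sketch, not the real implementation).
#include <float.h>
#include <math.h>
#include <stdint.h>

#define QK 32  // values per block (assumption; the real block size may differ)

typedef struct {
    float    min;            // minimum of the non-outlier values
    float    delta;          // (max - min) / 15 over the non-outlier values
    uint16_t outlier_index;  // position of the absmax value inside the block
    float    outlier_value;  // the outlier itself, stored as-is
    uint8_t  qs[QK / 2];     // 4-bit quants, two per byte
} block_q4_1_o;

static void quantize_block_q4_1_o(const float *x, block_q4_1_o *out) {
    // 1. Select the outlier: the value with the largest magnitude in the block.
    int outlier = 0;
    for (int i = 1; i < QK; i++) {
        if (fabsf(x[i]) > fabsf(x[outlier])) outlier = i;
    }
    out->outlier_index = (uint16_t) outlier;
    out->outlier_value = x[outlier];

    // 2. Compute min/delta over the remaining values, as if the outlier were absent.
    float vmin = FLT_MAX, vmax = -FLT_MAX;
    for (int i = 0; i < QK; i++) {
        if (i == outlier) continue;
        if (x[i] < vmin) vmin = x[i];
        if (x[i] > vmax) vmax = x[i];
    }
    out->min   = vmin;
    out->delta = (vmax - vmin) / 15.0f;
    const float id = out->delta != 0.0f ? 1.0f / out->delta : 0.0f;

    // 3. Quantize all values to 4 bits; the slot at outlier_index is
    //    overwritten with the stored outlier on dequantization anyway.
    for (int i = 0; i < QK; i += 2) {
        int q0 = (int) roundf((x[i]     - vmin) * id);
        int q1 = (int) roundf((x[i + 1] - vmin) * id);
        q0 = q0 < 0 ? 0 : q0 > 15 ? 15 : q0;
        q1 = q1 < 0 ? 0 : q1 > 15 ? 15 : q1;
        out->qs[i / 2] = (uint8_t) (q0 | (q1 << 4));
    }
}
```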

Perplexity comparison on a private dataset (less is better):

1B5-20220929-ctx4096-Q4_0.bin,   loss [3.079], perplexity  21.745
1B5-20220929-ctx4096-Q4_1.bin,   loss [2.655], perplexity  14.231
1B5-20220929-ctx4096-Q4_1_O.bin, loss [2.204], perplexity   9.060
1B5-20220929-ctx4096-FP16.bin,   loss [2.060], perplexity   7.847

3B-20221110-ctx4096-Q4_0.bin,    loss [4.689], perplexity 108.724
3B-20221110-ctx4096-Q4_1.bin,    loss [2.916], perplexity  18.475
3B-20221110-ctx4096-Q4_1_O.bin,  loss [2.406], perplexity  11.093
3B-20221110-ctx4096-FP16.bin,    loss [2.067], perplexity   7.901

Performance comparison (per-token latency, less is better):

1B5 FP32:   213 ms per token
1B5 FP16:   115 ms per token
1B5 Q4_0:   159 ms per token
1B5 Q4_1:   110 ms per token
1B5 Q4_1_O: 207 ms per token
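The latency numbers follow from how the format is used at inference time: each block is dequantized back to full floats (restoring the stored outlier exactly) and the dot product with the activations runs in plain FP32, so the matmul does roughly the same floating-point work as the FP32 path. A sketch under the same assumptions as the block above:

```c
// Dequantize one Q4_1_O block back to QK floats (sketch).
static void dequantize_block_q4_1_o(const block_q4_1_o *b, float *y) {
    for (int i = 0; i < QK; i++) {
        const uint8_t q = (b->qs[i / 2] >> ((i & 1) * 4)) & 0x0F;
        y[i] = b->min + b->delta * (float) q;
    }
    // Restore the outlier exactly.
    y[b->outlier_index] = b->outlier_value;
}

// FP32 dot product of one quantized row (n values, n % QK == 0 assumed) with activations.
static float dot_q4_1_o_fp32(const block_q4_1_o *row, const float *act, int n) {
    float sum = 0.0f;
    float tmp[QK];
    for (int i = 0; i < n / QK; i++) {
        dequantize_block_q4_1_o(&row[i], tmp);
        for (int j = 0; j < QK; j++) {
            sum += tmp[j] * act[i * QK + j];  // plain FP32 multiply-accumulate
        }
    }
    return sum;
}
```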

saharNooby linked an issue on Apr 7, 2023 that may be closed by this pull request
saharNooby mentioned this pull request on Apr 8, 2023
saharNooby merged commit 84e0698 into master on Apr 8, 2023

iacore commented Apr 22, 2023

README.md has changed since this commit.

Current:

4: Q4_1_O, OK quality, moderately fast (20% slower than FP16).
3: Q4_1, worst quality, fast (comparable to FP16).
2: Q4_0, poor quality, very fast.

Which one is correct?

@saharNooby
Collaborator Author

@iacore The current version in README.md is correct.

Note that I'm working on pulling the Q4_2 and Q4_3 formats from ggml; the latest measurements are here. This is not merged into master yet.

saharNooby deleted the outliers-preserving-quantization-PR branch on April 22, 2023 at 15:35
Successfully merging this pull request may close the linked issue: Typo in README.md