Blind testing different quants: initial results + I need your help! #5962
Replies: 20 comments 16 replies
-
Awesome! I'll add a link to the README to get some more attention
-
You could also compare different imatrix files with the same quant.
-
Very interesting. I can't believe how often I ended up choosing the q3 or even q2 over the f16. I thought such a difference in precision would be easier to notice, but apparently not.
-
Would be useful to display the total number of votes accumulated so far
-
As a heads up, the warning at the start of the test is actually valid, and some of the roleplay prompts will make the LLM write (descriptive) sexual content! I voted on around fifteen items and got two of those already.
-
++respect @Artefact2 for including NSFW content in the test. This kind of content really captures the creativity of the model in a way that no other type of benchmark can ;-)
-
If you get me the list of prompts and sampler settings, I don't mind hosting Mixtral (feel free to respond with any other major models you think might be relevant or insightful, and I'll look into it if it ends up not being too expensive) and running them through it at the various quants (assuming that you're pregenerating the responses). I'm pretty sure that, given the deterministic nature of matrix multiplication, it shouldn't matter if I run it through CUDA on a few A6000s, right?
-
This is wonderful. Well thought out, and a beautiful interface on top of that. We need many more public experiments like this one! I'm a little shocked by how poorly IQ1_S performs though. It seems broken rather than just bad. Many of the responses I've seen contained complete garbage, such as a long text with a hyphen after every single word! Even 125M models work better than that. Are you sure the quantization went right?
-
@Artefact2 you need to retest IQ1_S with the two recent PRs
-
I think generation speed could introduce a cognitive bias: if something runs faster with similar output, people assume it's higher quality. For that test to matter you'd have to remove generation speed as a factor - not sure if you do. Tests need to be blind in terms of what people compare. Otherwise, solid.
-
I'm working on a distributed quantization automation - pretty much what Tom (TheBloke) did, but fully automated and easy for newcomers with zero knowledge (if they can spin up a Docker container, they can quant), with sane defaults. I'd love to get some insights about imatrix and some sane defaults for the base system - you can find me as "MrDragonFox" on Discord if you could spare some insights.
-
This is a great addition. Hope it gets popular so more people cast their votes. I think it would also be useful to include models quantized with different so-called SOTA techniques in the future, thus comparing both across techniques and quantization types. However, this may need a separate board.
-
Could you also do the same for various model parameter sizes? 7B, 13B, 70B
-
Are there any other explanations for these results?
-
I have tested 450+ GGUF models against one another and different Qs/IQs of the same model, as well as against the same model in GPTQ, AWQ, and EXL2 format.
By Model: Mixtrals (2x7B, 4x7B, 8x7B) are off-the-scale powerful. Mixes/merges can be hit and miss, but when they "hit" it is often out of the park.
By the "Qs": 13B models: MIN "Q" / "IQ":
For 20B to 34B:
70B:
T/S:
I have found through hard experience (and TBs of download data) that applying a min "Q" / max "Q" to all models just doesn't work. Hope this helps out a bit...
-
Very cool! However, I'm getting this error popping up a lot:
-
I wonder if it's worth extending this with a static evaluation of "right / wrong" evals (basically the results of classic benchmarks), so it's easy to see the sorts of questions Q6 gets right that Q4 does not, or even what kinds of classic benchmark questions Q2 can get right.
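A rough sketch of what that kind of comparison could look like, using made-up per-question grades (none of this data comes from the project; the quant names and helper function are purely illustrative):

```python
# Hypothetical example: given per-quant right/wrong grades on the same benchmark
# questions, list the questions one quant answers correctly that another does not.
graded = {
    "Q6_K": {"q1": True,  "q2": True,  "q3": False},
    "Q4_K": {"q1": True,  "q2": False, "q3": False},
    "Q2_K": {"q1": False, "q2": False, "q3": True},
}

def only_right_in(a, b, graded):
    """Questions quant `a` gets right that quant `b` gets wrong."""
    return [q for q, ok in graded[a].items() if ok and not graded[b][q]]

print(only_right_in("Q6_K", "Q4_K", graded))   # ['q2']
print(only_right_in("Q2_K", "Q6_K", graded))   # ['q3']
```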
-
[My current favorite output...]

Quantized Model Evaluation Project: A Human-Centric Approach

Project Overview
This project aims to quantify the impact of quantization on the Mistral 7B Instruct v0.2 language model, specifically focusing on how human preferences change across different quantization levels. The ultimate goal is to determine the optimal balance between model size/speed and output quality.

Methodology
Prompt Curation: A diverse set of 4,000 prompts was compiled from three reputable sources:
Model Quantization and Generation: The base Mistral 7B Instruct v0.2 model was quantized using the following levels:
Responses were generated for each prompt using the same seed and sampling settings across all quantization levels to ensure comparability.
Human Evaluation Interface: A dedicated interface (https://freya.artefact2.com/llm-eval/) was created to facilitate blind pairwise comparisons between the base model and its quantized versions. Users are presented with a prompt and two model responses and asked to select the preferred one. The Bradley-Terry model is used to estimate the relative strengths of the models based on the collected votes.

Current Results (as of May 21, 2024)
The results presented in the image reveal a clear ranking of model preference:
Mistral-7B-Instruct-v0.2 (FP16) (Original model)
Q6_K and Q5_K: These quantizations are nearly indistinguishable from the original model, suggesting minimal quality loss.
Statistical Significance: The 99% confidence intervals indicate that the observed preference differences are statistically significant, except for the potential overlap between the top-performing quantizations (Q6_K, Q5_K, and potentially IQ4_XS).
Limitations: While the current sample size is relatively large (500+ votes), more votes are needed to refine the estimates for the less popular quantizations and confirm the observed trends.

Ongoing Discussion and Future Directions
The project has sparked discussions around:
Methodology Refinement: Participants have suggested incorporating diverse evaluation criteria (e.g., factual accuracy, creativity, coherence) and including different quantization techniques (e.g., GPTQ, AWQ, EXL2).
Model and Prompt Diversity: There is interest in extending the analysis to larger models (e.g., Mistral Instruct) and exploring how preferences vary across different prompt types.
Overall, this project demonstrates the value of human feedback in assessing the impact of model compression techniques. By incorporating diverse perspectives and refining the methodology, the community can gain valuable insights into the trade-offs between model size, speed, and quality, ultimately leading to more informed decisions about model deployment and optimization.

ALTERNATIVE ITERATION

Quantized Model Evaluation Project: A Human-Centric Approach

Project Overview
This project investigates the impact of quantization on language model (LM) performance, focusing on human preference as the primary evaluation metric. We aim to understand how different quantization techniques affect model quality in real-world applications, beyond traditional metrics like perplexity.

Methodology
Data Collection: Curated a diverse set of 4,000 prompts from established datasets, encompassing general knowledge, logic, math, creativity, and roleplay.
Model Generation: Utilized the Mistral 7B Instruct model as a baseline.
Evaluation Interface: Developed an interactive interface (https://freya.artefact2.com/llm-eval/) enabling users to compare model outputs side-by-side and vote on their preferred response.
Data Analysis: Employed the Bradley-Terry model to rank models based on collected votes, with 99% confidence intervals. Preliminary results with 500+ votes indicate IQ1_S performs significantly worse than other quantizations.

Future Work

Contributing
Cast your votes: Visit the evaluation interface and share your preferences.

Additional Notes
For consideration:
-
A suggestion - could you include a 70B Q1 result set? This is because 7B F16 and 70B Q1 use about the same GPU RAM.
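For context, a quick back-of-the-envelope estimate of why those two configurations land in the same ballpark. The ~1.6 bits-per-weight figure for IQ1-class quants is an approximation, and KV cache and runtime overhead are ignored:

```python
# Rough weight-memory estimate: a 7B model at 16 bits/weight vs. a 70B model
# at ~1.6 bits/weight (approximate figure for IQ1-class quants).
def weight_gib(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 2**30

print(f"7B  @ 16 bpw  ~ {weight_gib(7e9, 16.0):.1f} GiB")   # ~13.0 GiB
print(f"70B @ 1.6 bpw ~ {weight_gib(70e9, 1.6):.1f} GiB")   # ~13.0 GiB
```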
-
If I may ask, would it be possible to do similar testing with HQQ/HQQ+? They claim quite strong results.
-
I've built a small interface to vote on different quantized versions of the same model (in this case, Mistral 7B Instruct). I then use the results to rank the models by their relative strength (using the Bradley-Terry model).
The idea is to see how much quality you're losing, using human preference as a benchmark instead of purely synthetic metrics like perplexity or KL divergence.
So far, using mostly my own votes (around 500), I can say with 99% confidence that IQ1_S is noticeably worse than all the others. The error bars for the other quants are still too large, which is why I need your help submitting votes.
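For anyone curious how such rankings are derived, here is a minimal, self-contained sketch of fitting Bradley-Terry strengths from pairwise votes. The vote data below is made up, and this is not the exact code behind the leaderboard:

```python
# Minimal Bradley-Terry fit from pairwise votes (simple MM-style iteration).
from collections import defaultdict

def bradley_terry(votes, iters=500):
    models = sorted({m for pair in votes for m in pair})
    wins = defaultdict(int)    # total wins per model
    pairs = defaultdict(int)   # number of comparisons per unordered pair
    for winner, loser in votes:
        wins[winner] += 1
        pairs[frozenset((winner, loser))] += 1

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            denom = sum(pairs[frozenset((i, j))] / (strength[i] + strength[j])
                        for j in models if j != i)
            new[i] = wins[i] / denom if denom else strength[i]
        total = sum(new.values())
        strength = {m: s / total for m, s in new.items()}  # normalize
    return strength

# Made-up example votes: (winner, loser) pairs.
votes = [
    ("F16", "Q2_K"), ("Q6_K", "Q2_K"), ("Q6_K", "F16"),
    ("F16", "Q6_K"), ("Q2_K", "IQ1_S"), ("F16", "IQ1_S"),
    ("Q6_K", "IQ1_S"), ("IQ1_S", "Q2_K"),
]
print(bradley_terry(votes))
```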
If you want to help (even if it's just a single vote or two), please head here and read the rules carefully: https://freya.artefact2.com/llm-eval/
The results so far (the pale blue is the margin of error for 99% confidence):
Some info about the methodology:
I used the server example to generate answers (max. 512 tokens) to these prompts, for all the quants, using the same seed and the same sampling settings (see the sketch at the end of this post).
Some ideas I want to try out in the future if this initial experiment is successful:
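The sketch referenced above: a minimal way to generate answers against a running server instance with a fixed seed and fixed sampling settings, so outputs from different quants stay comparable. The address, seed, temperature, and prompt list here are placeholders, not the exact settings used in the experiment:

```python
# Sketch: query a locally running llama.cpp server for completions, keeping the
# seed and sampling settings identical across runs/quants.
import json
import urllib.request

SERVER = "http://127.0.0.1:8080"                          # assumed local address
PROMPTS = ["Explain KL divergence in one paragraph."]     # placeholder prompts

def complete(prompt):
    payload = {
        "prompt": prompt,
        "n_predict": 512,    # max. 512 tokens, as in the experiment
        "seed": 42,          # same seed for every quant (illustrative value)
        "temperature": 0.7,  # same sampling settings for every quant (illustrative)
    }
    req = urllib.request.Request(
        f"{SERVER}/completion",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

for p in PROMPTS:
    print(complete(p))
```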