Blind testing different quants: initial results + I need your help! #5962
Replies: 20 comments 16 replies
-
Awesome! I'll add a link to the README to get some more attention
-
You could also compare different imatrix files with the same quant.
-
Very interesting. I can't believe how often I ended up choosing the q3 or even q2 over the f16. I thought such a difference in precision would be easier to notice, but apparently not.
-
Would be useful to display the total number of votes accumulated so far
-
As a heads up, the warning at the start of the test is actually valid, and some of the roleplay prompts will make the LLM write (descriptive) sexual content! I voted on around fifteen items and got two of those already.
-
++respect @Artefact2 for including NSFW content in the test. This kind of content really captures the creativity of the model in a way that no other type of benchmark can ;-)
-
If you get me the list of prompts and sampler settings, I don't mind hosting Mixtral (feel free to respond with any other major models you think might be relevant or insightful, and I'll look into it if it ends up not being too expensive) and running them through it at the various quants (assuming that you're pregenerating the responses). I'm pretty sure that, given the deterministic nature of matrix multiplication, it shouldn't matter if I run it through CUDA on a few A6000s, right?
-
This is wonderful. Well thought out, and a beautiful interface on top of that. We need many more public experiments like this one! I'm a little shocked by how poorly IQ1_S performs though. It seems broken rather than just bad. Many of the responses I've seen contained complete garbage, such as a long text with a hyphen after every single word! Even 125M models work better than that. Are you sure the quantization went right?
-
@Artefact2 you need to retest IQ1_S with the two recent PRs
-
I think generation speed could introduce a cognitive bias: if something runs faster with similar output, people assume it's higher quality. For that test to matter you'd have to remove generation speed as a factor - not sure if you do. Tests need to be blind in terms of what people compare. Otherwise, solid.
-
I'm working on a distributed quantization automation - pretty much what Tom (TheBloke) did, but fully automated and easy for newcomers with zero knowledge (if they can spin up a Docker container, they can quant), with sane defaults. I'd love to get some insights about imatrix and some sane defaults for the base system - you can find me as "MrDragonFox" on Discord if you could spare some insights.
-
This is a great addition. Hope it gets popular so more people cast their votes. I think it would also be useful to include models quantized with different so-called SOTA techniques in the future, thus comparing both across techniques and quantization types. However, this may need a separate board.
-
Could you also do the same for various model parameter sizes? 7B, 13B, 70B
-
Are there any other explanations for these results?
-
I have tested 450+ GGUF models against one another and different Qs/IQs of the same model, as well as against the same model in GPTQ, AWQ, and EXL2 format.
By Model: Mixtrals (2x7B, 4x7B, 8x7B) are off-the-scale powerful. Mixes/merges can be hit and miss, but when they "hit" it is often out of the park.
By the "Qs": 13B models: MIN "Q" / "IQ":
For 20B to 34B:
70B:
T/S:
I have found through hard experience (and TBs of download data) that applying a min "Q" / max "Q" to all models just doesn't work. Hope this helps out a bit...
-
Very cool! However, I'm getting this error popping up a lot:
-
I wonder if it's worth extending this with a static evaluation of "right / wrong" evals (basically the results of classic benchmarks), so it's easy to see the sorts of questions Q6 gets right that Q4 does not, or even what kinds of classic benchmark questions Q2 can get right.
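A rough sketch of what that kind of comparison could look like, using made-up per-question grades (none of this data comes from the project; the quant names and helper function are purely illustrative):

```python
# Hypothetical example: given per-quant right/wrong grades on the same benchmark
# questions, list the questions one quant answers correctly that another does not.
graded = {
    "Q6_K": {"q1": True,  "q2": True,  "q3": False},
    "Q4_K": {"q1": True,  "q2": False, "q3": False},
    "Q2_K": {"q1": False, "q2": False, "q3": True},
}

def only_right_in(a, b, graded):
    """Questions quant `a` gets right that quant `b` gets wrong."""
    return [q for q, ok in graded[a].items() if ok and not graded[b][q]]

print(only_right_in("Q6_K", "Q4_K", graded))   # ['q2']
print(only_right_in("Q2_K", "Q6_K", graded))   # ['q3']
```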
-
[My current favorite output...]

Quantized Model Evaluation Project: A Human-Centric Approach

Project Overview
This project aims to quantify the impact of quantization on the Mistral 7B Instruct v0.2 language model, specifically focusing on how human preferences change across different quantization levels. The ultimate goal is to determine the optimal balance between model size/speed and output quality.

Methodology
Prompt Curation: A diverse set of 4,000 prompts was compiled from three reputable sources:
Model Quantization and Generation: The base Mistral 7B Instruct v0.2 model was quantized using the following levels:
Responses were generated for each prompt using the same seed and sampling settings across all quantization levels to ensure comparability.
Human Evaluation Interface: A dedicated interface (https://freya.artefact2.com/llm-eval/) was created to facilitate blind pairwise comparisons between the base model and its quantized versions. Users are presented with a prompt and two model responses and asked to select the preferred one. The Bradley-Terry model is used to estimate the relative strengths of the models based on the collected votes.

Current Results (as of May 21, 2024)
The results presented in the image reveal a clear ranking of model preference:
Mistral-7B-Instruct-v0.2 (FP16) (Original model)
Q6_K and Q5_K: These quantizations are nearly indistinguishable from the original model, suggesting minimal quality loss.
Statistical Significance: The 99% confidence intervals indicate that the observed preference differences are statistically significant, except for the potential overlap between the top-performing quantizations (Q6_K, Q5_K, and potentially IQ4_XS).
Limitations: While the current sample size is relatively large (500+ votes), more votes are needed to refine the estimates for the less popular quantizations and confirm the observed trends.

Ongoing Discussion and Future Directions
The project has sparked discussions around:
Methodology Refinement: Participants have suggested incorporating diverse evaluation criteria (e.g., factual accuracy, creativity, coherence) and including different quantization techniques (e.g., GPTQ, AWQ, EXL2).
Model and Prompt Diversity: There is interest in extending the analysis to larger models (e.g., Mistral Instruct) and exploring how preferences vary across different prompt types.
Overall, this project demonstrates the value of human feedback in assessing the impact of model compression techniques. By incorporating diverse perspectives and refining the methodology, the community can gain valuable insights into the trade-offs between model size, speed, and quality, ultimately leading to more informed decisions about model deployment and optimization.

ALTERNATIVE ITERATION

Quantized Model Evaluation Project: A Human-Centric Approach

Project Overview
This project investigates the impact of quantization on language model (LM) performance, focusing on human preference as the primary evaluation metric. We aim to understand how different quantization techniques affect model quality in real-world applications, beyond traditional metrics like perplexity.

Methodology
Data Collection: Curated a diverse set of 4,000 prompts from established datasets, encompassing general knowledge, logic, math, creativity, and roleplay.
Model Generation: Utilized the Mistral 7B Instruct model as a baseline.
Evaluation Interface: Developed an interactive interface (https://freya.artefact2.com/llm-eval/) enabling users to compare model outputs side-by-side and vote on their preferred response.
Data Analysis: Employed the Bradley-Terry model to rank models based on collected votes, with 99% confidence intervals. Preliminary results with 500+ votes indicate IQ1_S performs significantly worse than other quantizations.

Future Work

Contributing
Cast your votes: Visit the evaluation interface and share your preferences.

Additional Notes
For consideration:
-
A suggestion - could you include a 70B Q1 result set? This is because 7B F16 and 70B Q1 use about the same GPU RAM.
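For context, a quick back-of-the-envelope estimate of why those two configurations land in the same ballpark. The ~1.6 bits-per-weight figure for IQ1-class quants is an approximation, and KV cache and runtime overhead are ignored:

```python
# Rough weight-memory estimate: a 7B model at 16 bits/weight vs. a 70B model
# at ~1.6 bits/weight (approximate figure for IQ1-class quants).
def weight_gib(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 2**30

print(f"7B  @ 16 bpw  ~ {weight_gib(7e9, 16.0):.1f} GiB")   # ~13.0 GiB
print(f"70B @ 1.6 bpw ~ {weight_gib(70e9, 1.6):.1f} GiB")   # ~13.0 GiB
```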
-
If I may ask, would it be possible to do similar testing with HQQ/HQQ+? They claim quite strong results.
-
I've built a small interface to vote on different quantized versions of the same model (in this case, Mistral 7B Instruct). I then use the results to rank the models by their relative strength (using the Bradley-Terry model).
The idea is to see how much quality you're losing, using human preference as a benchmark instead of purely synthetic metrics like perplexity or KL divergence.
So far, using mostly my own votes (around 500), I can say with 99% confidence that IQ1_S is noticeably worse than all the others. The error bars for the other quants are still too large, which is why I need your help submitting votes.
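For anyone curious how such rankings are derived, here is a minimal, self-contained sketch of fitting Bradley-Terry strengths from pairwise votes. The vote data below is made up, and this is not the exact code behind the leaderboard:

```python
# Minimal Bradley-Terry fit from pairwise votes (simple MM-style iteration).
from collections import defaultdict

def bradley_terry(votes, iters=500):
    models = sorted({m for pair in votes for m in pair})
    wins = defaultdict(int)    # total wins per model
    pairs = defaultdict(int)   # number of comparisons per unordered pair
    for winner, loser in votes:
        wins[winner] += 1
        pairs[frozenset((winner, loser))] += 1

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            denom = sum(pairs[frozenset((i, j))] / (strength[i] + strength[j])
                        for j in models if j != i)
            new[i] = wins[i] / denom if denom else strength[i]
        total = sum(new.values())
        strength = {m: s / total for m, s in new.items()}  # normalize
    return strength

# Made-up example votes: (winner, loser) pairs.
votes = [
    ("F16", "Q2_K"), ("Q6_K", "Q2_K"), ("Q6_K", "F16"),
    ("F16", "Q6_K"), ("Q2_K", "IQ1_S"), ("F16", "IQ1_S"),
    ("Q6_K", "IQ1_S"), ("IQ1_S", "Q2_K"),
]
print(bradley_terry(votes))
```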
If you want to help (even if it's just a single vote or two), please head here and read the rules carefully: https://freya.artefact2.com/llm-eval/
The results so far (the pale blue is the margin of error for 99% confidence):
Some info about the methodology:
I used the server example to generate answers (max. 512 tokens) to these prompts, for all the quants, using the same seed and the same sampling settings (see the sketch at the end of this post).
Some ideas I want to try out in the future if this initial experiment is successful:
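The sketch referenced above: a minimal way to generate answers against a running server instance with a fixed seed and fixed sampling settings, so outputs from different quants stay comparable. The address, seed, temperature, and prompt list here are placeholders, not the exact settings used in the experiment:

```python
# Sketch: query a locally running llama.cpp server for completions, keeping the
# seed and sampling settings identical across runs/quants.
import json
import urllib.request

SERVER = "http://127.0.0.1:8080"                          # assumed local address
PROMPTS = ["Explain KL divergence in one paragraph."]     # placeholder prompts

def complete(prompt):
    payload = {
        "prompt": prompt,
        "n_predict": 512,    # max. 512 tokens, as in the experiment
        "seed": 42,          # same seed for every quant (illustrative value)
        "temperature": 0.7,  # same sampling settings for every quant (illustrative)
    }
    req = urllib.request.Request(
        f"{SERVER}/completion",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

for p in PROMPTS:
    print(complete(p))
```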