Add quantize-stats command for testing quantization #728
Conversation
Command that calculates some statistics over the errors introduced by quantization, at the moment mean square error and max error for layer weights. Should be useful for testing quantization improvements. Needs some internal state from ggml and llama that should not be part of the public API.
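For illustration, the per-tensor statistics described above could be computed along these lines. This is a minimal sketch using a stand-in 4-bit absmax quantizer; the actual tool would call ggml's quantize/dequantize routines for each format under test, and all names here are illustrative:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Stand-in 4-bit absmax round-trip (an assumption for illustration);
// the real tool would invoke ggml's quantization for each format.
static std::vector<float> round_trip_q4(const std::vector<float> & x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    std::vector<float> out(x.size());
    if (amax == 0.0f) return out;                  // all zeros: exact
    const float d = amax / 7.0f;                   // scale for signed 4-bit range
    for (size_t i = 0; i < x.size(); i++) {
        const int q = (int) std::lround(x[i] / d); // quantize
        out[i] = q * d;                            // dequantize
    }
    return out;
}

struct error_stats {
    double rmse;
    double max_err;
};

// RMSE and max error between reference weights and their quantized round trip.
static error_stats compute_stats(const std::vector<float> & ref,
                                 const std::vector<float> & test) {
    double sum_sq  = 0.0;
    double max_err = 0.0;
    for (size_t i = 0; i < ref.size(); i++) {
        const double e = std::fabs((double) ref[i] - test[i]);
        sum_sq += e * e;
        max_err = std::max(max_err, e);
    }
    return { std::sqrt(sum_sq / ref.size()), max_err };
}
```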
Show some error percentiles, should be less noisy than just the max error.
It was chosen at random, so we should disregard it anyway. This PR tests on real data, right?
Does it make sense to also show a histogram of the bins? Or is showing a histogram of the error distribution enough for analysis?
I'm running out of memory here; is there an easy way this could be made to run on 16GB? I should get around to buying some more... Do you think this could be done more like quantize.cpp, or even as an optional mode of that existing example program, to do batch processing without loading the entire model at once?
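Percentiles can be derived cheaply from the same histogram the tool already accumulates, so both views come almost for free. A sketch, assuming fixed-width bins of absolute error (the bin layout and names are illustrative):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Approximate percentile of absolute errors from a fixed-width histogram.
// Bin i counts errors in [i*bin_width, (i+1)*bin_width); we return the
// upper edge of the first bin where the cumulative count reaches pct%.
static double percentile_from_hist(const std::vector<size_t> & hist,
                                   double bin_width, double pct) {
    size_t total = 0;
    for (size_t c : hist) total += c;
    const size_t target = (size_t) std::ceil(pct / 100.0 * (double) total);
    size_t cum = 0;
    for (size_t i = 0; i < hist.size(); i++) {
        cum += hist[i];
        if (cum >= target) return (double) (i + 1) * bin_width;
    }
    return (double) hist.size() * bin_width;
}
```

The result is only as precise as the bin width, but for spotting regressions between quantization implementations that is usually enough.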
Here's a quick & dirty hack to llama.cpp so that the statistics will be collected in the existing quantize.cpp: sw@cec7cb9. Simply omit the output file parameter:
If the implementation could be made cleaner, I would prefer this approach, as it doesn't load the entire model into memory. That doesn't mean I'm against this PR as it is. We might also take the opportunity to cut down on quantize.cpp's verbosity in normal operation; most people would probably prefer a simple progress indicator.
That's a good point. It didn't occur to me earlier that this PR loads the whole model into memory. If that's the case, the functionality should indeed be rolled into quantize.cpp, which does not do this.
Thank you for your comments; I think it is a good idea to show some more statistics. I thought about showing the weight histogram, but I expect people would rather investigate the model in Python or similar. This should be focused on the needs specific to this project, like testing a new quantization implementation. I don't think loading the entire model into memory should be much of an issue because of the mmap changes - it should evict the model from memory again as needed.
Test quantization in smaller chunks instead of layer-at-a-time.
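Chunked evaluation only needs running aggregates, so peak memory stays at one chunk of dequantized data rather than a whole layer. A minimal sketch of such an accumulator (names are illustrative, not the PR's actual code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Running aggregates over chunks: updating chunk-by-chunk yields exactly
// the same RMSE / max error as a single pass over the whole tensor.
struct running_stats {
    double sum_sq  = 0.0;
    double max_err = 0.0;
    size_t count   = 0;

    void update(const float * ref, const float * test, size_t n) {
        for (size_t i = 0; i < n; i++) {
            const double e = std::fabs((double) ref[i] - test[i]);
            sum_sq += e * e;
            max_err = std::max(max_err, e);
        }
        count += n;
    }

    double rmse() const {
        return count ? std::sqrt(sum_sq / (double) count) : 0.0;
    }
};
```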
Show RMSE instead of MSE - keeps similar range to the other metrics. Regex match on layer pattern.
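The layer filter mentioned above can be as simple as an unanchored std::regex_search over tensor names (the tensor names below are just illustrative examples, not taken from the PR):

```cpp
#include <regex>
#include <string>

// Collect statistics only for tensors whose name matches the
// user-supplied pattern (unanchored match, grep-style).
static bool layer_matches(const std::string & name,
                          const std::string & pattern) {
    return std::regex_search(name, std::regex(pattern));
}
```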
@unbounded: with the latest changes I'm now able to run it with 16GB, thanks for that. Though the resident size still grows to over 12GB; it would be nice if this could be made smaller, but I don't consider it a deal-breaker. You might add
This PR is somewhat incompatible with #729 because
If we merge this PR, then I'd say afterwards we better move the verbose output from
Expose reference quantization implementation and add option to use it for tests.
Expose the internal stuff through the regular .h files.
Just add comments that this is for debugging / testing purposes and one should not rely on these functions in their projects
Move into main header with comment not to use, per PR feedback
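The arrangement suggested above might look roughly like this in the public header. This is only a sketch; the guard macro and the declaration are illustrative, not the exact contents of the PR:

```cpp
// Sketch of an opt-in internal section in llama.h (names are hypothetical).
#ifdef LLAMA_API_INTERNAL
// Internal API: exposed for debugging/testing (e.g. quantize-stats) only.
// Not part of the stable interface; do not rely on it in your projects.
struct ggml_tensor;
const struct ggml_tensor * llama_internal_get_tensor(struct llama_context * ctx,
                                                     const char * name); // hypothetical
#endif
```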
Removed the internal header files. The mixing of C and C++ definitions makes it a bit messy, but I can't really think of a good way to make it cleaner.
@unbounded
Adds a `quantize-stats` binary that calculates some statistics over the errors introduced by quantization of a given model. At the moment it shows mean square error and max error for layer weights, as well as a quantization error histogram. Should be useful for testing quantization improvements without having to do a full perplexity run.
Needs some internal state from ggml and llama that should not be part of the public API, so I moved those out to internal headers - not the prettiest solution but could be useful for other tests as well.
Simple example - short summary of quantization format errors for all layers except .
Another example - quicker test on a single layer, with detailed output: