Cannot save compressed binary or ternary weights, saved as float32 parameters #806
Comments
Could you try again with a […]? To check if the conversion was successful, I recommend opening the […].
@Tombana It does not seem to recognize QuantConv2D either: […] Assuming I manage to get QuantConv2D working, can I then use it with a kernel size of 1x1 to replace the dense layers and benefit from the weight compression? I am asking because I am not sure whether a 1x1 kernel allows efficient bit packing; I don't know how it is implemented in LCE. Also, if you think it should not be too hard to implement QuantDense in LCE, I would perhaps be interested in implementing it, since I would like to do ternary LLM inference on Android, but at the moment I can't even save a TFLite file of the model, because it is bigger than 2 GB when unquantized.
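For reference, the substitution being asked about would look roughly like this: a dense layer applied as a 1x1 convolution. This is only a sketch with made-up layer sizes; whether the converter actually emits an optimized LCE kernel for it depends on the constraints discussed below (in particular, the inputs would need to be binarized too).

```python
import tensorflow as tf
import larq as lq

# Hypothetical sizes, just for illustration.
in_features, out_features = 512, 1024

# A dense layer on [batch, in_features] can be phrased as a 1x1 convolution
# on a [batch, 1, 1, in_features] tensor, so QuantConv2D can stand in for a
# (not yet supported) QuantDense.
dense_as_1x1_conv = tf.keras.Sequential([
    tf.keras.layers.Reshape((1, 1, in_features), input_shape=(in_features,)),
    lq.layers.QuantConv2D(
        filters=out_features,
        kernel_size=1,
        kernel_quantizer="ste_sign",   # binary weights; inputs left in float32 here
        kernel_constraint="weight_clip",
        use_bias=False,
    ),
    tf.keras.layers.Reshape((out_features,)),
])
```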
Yes. It will benefit from the weight compression. However, depending on the shape parameters it might not benefit from the speed improvement because there are currently no optimized kernels implemented for that.
We welcome pull requests for this. Implementing this would consist of two main parts: […]
@Tombana I don't know if this changes anything, but I forgot to say that the inputs to the binary and ternary layers are not quantized; only the kernel weights are. I was assuming that there would be some kind of general mechanism to store quantized weights efficiently, independently of the layer, and then dequantize them on-device to get back to the world of classical f32 operations, which I assume are all already supported. But of course implementing a ternary matmul algorithm would probably be best, since even for f32 inputs, a binary or ternary matmul can be implemented with only additions and subtractions. That should be more efficient than dequantizing the kernel weights and falling back to the normal f32 matmul algorithm. I will look into the files you linked, but I'm afraid I know nothing about MLIR, TableGen or NEON assembly.
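To make the addition/subtraction point concrete, here is a small NumPy sketch of a ternary matrix-vector product that never multiplies by the weights (the function name and shapes are illustrative, not anything from Larq):

```python
import numpy as np

def ternary_matvec(x, w_ternary):
    """x: float32 vector [in_features]; w_ternary: {-1, 0, +1} matrix [in_features, out_features]."""
    out = np.zeros(w_ternary.shape[1], dtype=np.float32)
    for j in range(w_ternary.shape[1]):
        # Add the inputs where the weight is +1, subtract where it is -1, skip the zeros.
        out[j] = x[w_ternary[:, j] == 1].sum() - x[w_ternary[:, j] == -1].sum()
    return out

x = np.random.randn(8).astype(np.float32)
w = np.random.choice([-1, 0, 1], size=(8, 4)).astype(np.int8)
np.testing.assert_allclose(ternary_matvec(x, w), x @ w.astype(np.float32), rtol=1e-5, atol=1e-5)
```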
The LCE binary conv2d layer only supports the case where both inputs and weights are binary.
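As a concrete example of that constraint, the pattern LCE is built around quantizes both sides; something along these lines (the filter count and kernel size are just placeholders):

```python
import larq as lq

# Both the activations and the kernel are binarized, which is the case the
# LCE converter supports with its optimized binary convolution.
binary_conv = lq.layers.QuantConv2D(
    64, (3, 3),
    input_quantizer="ste_sign",
    kernel_quantizer="ste_sign",
    kernel_constraint="weight_clip",
    padding="same",
    use_bias=False,
)
```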
This is a good idea. In theory this is supported, because there would be a set of binary weights, followed by a […]. However, in practice I think that does not work, because the MLIR converter will do 'constant folding': when it detects this […]
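In other words (my own toy illustration, not code from LCE): when a quantizer is applied to a tensor the converter can prove is constant, nothing forces it to keep the quantize op around, and it may simply bake in the dequantized float result.

```python
import tensorflow as tf

w = tf.constant([[0.7, -0.3], [-1.2, 0.4]])  # hypothetical trained weights

@tf.function
def binarized_kernel():
    # tf.sign(w) depends only on a constant, so an optimizing converter is
    # free to evaluate it ahead of time and store the resulting float tensor
    # directly, instead of storing packed bits plus a runtime sign/dequantize op.
    return tf.sign(w)
```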
As a workaround, maybe there is a way to prevent constant folding by adding an Input variable that will be just zero at inference time and added to the weights before […]. Is there any way to store bit-packed weights without relying on […]?
Something like that might work. Perhaps you have to use the input variable even before the […]. Note that this […]
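For readers following along, the workaround under discussion might be wired up roughly like this. It is a hypothetical sketch: OffsetQuantDense, the weight_offset input, and the sizes are made-up names, and as reported below the trick did not end up preserving the quantization on disk.

```python
import tensorflow as tf
import larq as lq

in_features, out_features = 512, 1024  # illustrative sizes

class OffsetQuantDense(tf.keras.layers.Layer):
    """Dense layer whose kernel is quantized after adding a runtime tensor,
    so the quantized kernel is no longer a pure constant that can be folded away."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.quantizer = lq.quantizers.SteSign()  # a ternary quantizer could be used instead

    def build(self, input_shape):
        x_shape, _ = input_shape
        self.kernel = self.add_weight(
            name="kernel", shape=(x_shape[-1], self.units), trainable=True
        )

    def call(self, inputs):
        x, offset = inputs
        # 'offset' is fed as zeros at inference time; adding it keeps the
        # kernel expression dependent on a graph input.
        kernel = self.quantizer(self.kernel + offset[0])
        return tf.matmul(x, kernel)

x_in = tf.keras.Input(shape=(in_features,), name="x")
offset_in = tf.keras.Input(shape=(in_features, out_features), name="weight_offset")
y = OffsetQuantDense(out_features)([x_in, offset_in])
model = tf.keras.Model([x_in, offset_in], y)
```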
@Tombana I tried using the […] I also tried having a fully binary dense module inside of my custom layer and just using its kernel instead of calling the layer, but it has the same behavior with and without the trick: the quantization is lost. Do you have any other idea how to preserve quantization on disk with the current API? In any case, if it is possible, I suggest creating additional, separate functions for quantization and dequantization in the Python API, to allow fine-grained control over the process and the storage of parameters when needed. If such functions can be created in user space, I am very interested, as I need a working solution urgently.
I don't know if this will work, but this is the only other idea I have left: […]
Then rebuild the converter (just a heads up, this can take several hours) and see if it works.
After lots of tries, I found that […]. I haven't tried to change […]
I think it might be a nice feature to be able to run regular layers but with bit-packed weights that get converted to float/int8 at runtime. I'm not sure it has many use cases: binarizing weights often requires a new model architecture that is specifically tuned for this, but perhaps it works for your neural network. The Larq team currently does not have time to pick this up, but perhaps someone else in the community can work on it.
It is useful in the cases where a small memory footprint matters most. The TFLite converter even fails if the resulting TFLite file is larger than 2 GB, which prevents using most LLMs on a smartphone through TFLite or Larq. Quantization of activations is apparently more destructive than quantization of weights, so it makes sense to quantize only the weights in a post-training quantization step. Currently, only models trained from scratch with both quantized weights and activations can benefit from the Larq primitives, if I understand correctly, but supporting mixed quantization schemes would also bring fast inference and a reduced model size to post-training quantized models. I am currently writing my own packing/unpacking algorithm with TensorFlow primitives; I fear the runtime cost might be too large, though, since I can see it creating lots of nodes in the computational graph.
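A user-space packing/unpacking scheme of the kind mentioned here could look something like the following rough sketch, written with plain TensorFlow ops; the function names and the {-1, +1} convention are my own, not Larq's.

```python
import tensorflow as tf

def pack_binary(weights):
    """Pack float32 weights in {-1, +1} (flattened length divisible by 8) into uint8."""
    bits = tf.cast(tf.reshape(weights, [-1, 8]) > 0, tf.int32)            # {-1, +1} -> {0, 1}
    shifts = tf.constant([1, 2, 4, 8, 16, 32, 64, 128], dtype=tf.int32)   # bit positions
    return tf.cast(tf.reduce_sum(bits * shifts, axis=-1), tf.uint8)       # 8 weights per byte

def unpack_binary(packed, shape):
    """Unpack uint8 bytes back to float32 weights in {-1, +1} with the given shape."""
    shifts = tf.constant([1, 2, 4, 8, 16, 32, 64, 128], dtype=tf.int32)
    bits = tf.bitwise.bitwise_and(tf.cast(packed, tf.int32)[:, None], shifts)
    return tf.reshape(tf.cast(bits > 0, tf.float32) * 2.0 - 1.0, shape)   # {0, 1} -> {-1, +1}

w = tf.constant([1., -1., 1., 1., -1., -1., 1., -1.])
tf.debugging.assert_equal(unpack_binary(pack_binary(w), [8]), w)
```

Whether a converter keeps this structure intact is the same constant-folding concern discussed above, and the per-element ops are what show up as the extra nodes in the graph.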
I am trying to save a quantized ternary model to a .tflite file, but larq doesn't seem to save the weights using reduced-precision datatypes and thus compress the file size. After converting and writing to disk, the size of the file is about the same as the one predicted by larq.models.summary for float32 parameters. Even if I try to do the same thing with a simple QuantDense layer, the weights are saved in float32. I am using this kind of code: […]
Am I doing something wrong?
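For context, a minimal version of this kind of setup and conversion could look as follows; the layer sizes and the choice of quantizer are placeholders, not the exact code from the report.

```python
import tensorflow as tf
import larq as lq
import larq_compute_engine as lce

# Toy model with one quantized layer; a ternary kernel quantizer could be
# substituted for "ste_sign" here.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    lq.layers.QuantDense(
        256,
        kernel_quantizer="ste_sign",
        kernel_constraint="weight_clip",
    ),
    tf.keras.layers.Dense(10),
])

# Convert with the Larq Compute Engine converter and write the flatbuffer.
tflite_bytes = lce.convert_keras_model(model)
with open("model.tflite", "wb") as f:
    f.write(tflite_bytes)
```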