Some models in Transformers have merged `qkv_proj` Linear modules, like Phi-3, so our current scheme of adding output activation observers to separate `k_proj` and `v_proj` Linear modules will not work.

We should be able to use the `Cache` and `QuantizedCache` classes in the HF Transformers library, which have been added to most modeling definitions as the mechanism for managing `past_key_values`: https://github.com/huggingface/transformers/blob/8820fe8b8c4b9da94cf1e4761876f85c562e0efe/src/transformers/cache_utils.py#L770

This allows us to implement our own `QuantizedCache` with a quantize and dequantize function, where we can calculate the statistics needed for our kv_cache_quant scheme.
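A minimal sketch of what this could look like, assuming a recent Transformers version where `DynamicCache.update(key_states, value_states, layer_idx, cache_kwargs)` is the cache entry point. The class name `CalibrationQuantizedCache` and the observer/fake-quant helpers are hypothetical, not part of any existing library; the point is only to show where the quantize/dequantize hooks and statistics collection could live:

```python
# Hypothetical sketch; CalibrationQuantizedCache and its helpers are illustrative.
import torch
from transformers.cache_utils import DynamicCache


class CalibrationQuantizedCache(DynamicCache):
    """Cache that records per-layer min/max statistics for K/V tensors and
    fake-quantizes them, so a kv_cache quant scheme can be calibrated without
    attaching observers to (possibly fused) qkv_proj Linear modules."""

    def __init__(self, num_bits: int = 8):
        super().__init__()
        self.num_bits = num_bits
        # running (min, max) per layer for keys and values
        self.k_stats: dict[int, tuple[torch.Tensor, torch.Tensor]] = {}
        self.v_stats: dict[int, tuple[torch.Tensor, torch.Tensor]] = {}

    def _observe(self, stats, layer_idx, tensor):
        # update the running min/max for this layer
        t_min, t_max = tensor.amin(), tensor.amax()
        if layer_idx in stats:
            old_min, old_max = stats[layer_idx]
            t_min = torch.minimum(old_min, t_min)
            t_max = torch.maximum(old_max, t_max)
        stats[layer_idx] = (t_min, t_max)

    def _quantize(self, tensor, t_min, t_max):
        # symmetric fake-quantization using the observed range
        qmax = 2 ** (self.num_bits - 1) - 1
        scale = torch.maximum(t_min.abs(), t_max.abs()).clamp(min=1e-8) / qmax
        return (tensor / scale).round().clamp(-qmax - 1, qmax), scale

    def _dequantize(self, q_tensor, scale):
        return q_tensor * scale

    def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
        self._observe(self.k_stats, layer_idx, key_states)
        self._observe(self.v_stats, layer_idx, value_states)
        # quantize/dequantize round trip so attention sees the quantized values
        qk, k_scale = self._quantize(key_states, *self.k_stats[layer_idx])
        qv, v_scale = self._quantize(value_states, *self.v_stats[layer_idx])
        key_states = self._dequantize(qk, k_scale)
        value_states = self._dequantize(qv, v_scale)
        return super().update(key_states, value_states, layer_idx, cache_kwargs)
```

Depending on the Transformers version, an instance could then be passed as `past_key_values` to `model.generate(...)` during calibration, and the accumulated `k_stats` / `v_stats` would feed the kv_cache_quant scheme.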
Prototype QuantizedCacheConfig impl #87