[Feature] Change kv_cache_scheme to HF QuantizedCache rather than Linear.output_scale #132

Closed · mgoin opened this issue on Aug 14, 2024 · 0 comments · Fixed by #148

mgoin (Member) commented on Aug 14, 2024

Some models in Transformers, such as Phi-3, use a merged qkv_proj Linear module, so our current scheme of attaching output activation observers to separate k_proj and v_proj Linear modules will not work (see the quick check below).
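
To make the problem concrete, here is a quick structural check, assuming the attribute names used in the public Phi-3 modeling code in Transformers (the snippet is purely illustrative):

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model skeleton from config only; random weights are enough to
# inspect the module structure.
config = AutoConfig.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_config(config)

attn = model.model.layers[0].self_attn
print(hasattr(attn, "qkv_proj"))                         # True  -> fused q/k/v projection
print(hasattr(attn, "k_proj"), hasattr(attn, "v_proj"))  # False False -> nothing to attach an output observer to
```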

We should be able to use the Cache and QuantizedCache classes from the HF Transformers library; the Cache abstraction has been added to most modeling definitions as the class that manages past_key_values: https://github.com/huggingface/transformers/blob/8820fe8b8c4b9da94cf1e4761876f85c562e0efe/src/transformers/cache_utils.py#L770

This would let us implement our own QuantizedCache with quantize and dequantize functions, inside which we can calculate the statistics needed for our kv_cache_quant scheme; a rough sketch of the calibration side follows.
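
As a minimal sketch only: this wraps DynamicCache rather than QuantizedCache to stay self-contained, and the class name CalibrationCache, the absmax tracking, and the float8 scale math are assumptions for illustration, not the actual implementation that will land.

```python
import torch
from transformers.cache_utils import DynamicCache


class CalibrationCache(DynamicCache):
    """Behaves like a normal (unquantized) cache, but records per-layer
    absolute-max statistics of the key/value states passing through it."""

    def __init__(self):
        super().__init__()
        self.k_absmax: dict[int, float] = {}  # layer_idx -> running max(|k|)
        self.v_absmax: dict[int, float] = {}  # layer_idx -> running max(|v|)

    def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
        # Observe the raw key/value tensors before storing them as usual.
        k_max = key_states.detach().abs().amax().item()
        v_max = value_states.detach().abs().amax().item()
        self.k_absmax[layer_idx] = max(self.k_absmax.get(layer_idx, 0.0), k_max)
        self.v_absmax[layer_idx] = max(self.v_absmax.get(layer_idx, 0.0), v_max)
        return super().update(key_states, value_states, layer_idx, cache_kwargs)

    def scales(self, qmax: float = 448.0):
        # Symmetric per-tensor scales; 448.0 is the max of float8_e4m3fn.
        k_scales = {i: m / qmax for i, m in self.k_absmax.items()}
        v_scales = {i: m / qmax for i, m in self.v_absmax.items()}
        return k_scales, v_scales
```

During calibration, an instance would be passed to the model forward pass as past_key_values; the per-layer scales it collects could then be serialized into the kv_cache quantization scheme.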

Prototype QuantizedCacheConfig implementation: #87
