[Feature] Change kv_cache_scheme to HF QuantizedCache rather than Linear.output_scale #132

Closed · mgoin opened this issue on Aug 14, 2024 · 0 comments · Fixed by #148

mgoin (Member) commented on Aug 14, 2024

Some models in Transformers, such as Phi-3, use a merged qkv_proj Linear module, so our current scheme of attaching output activation observers to separate k_proj and v_proj Linear modules will not work (see the quick check below).
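
To make the problem concrete, here is a quick structural check, assuming the attribute names used in the public Phi-3 modeling code in Transformers (the snippet is purely illustrative):

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model skeleton from config only; random weights are enough to
# inspect the module structure.
config = AutoConfig.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_config(config)

attn = model.model.layers[0].self_attn
print(hasattr(attn, "qkv_proj"))                         # True  -> fused q/k/v projection
print(hasattr(attn, "k_proj"), hasattr(attn, "v_proj"))  # False False -> nothing to attach an output observer to
```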

We should be able to use the Cache and QuantizedCache classes from the HF Transformers library; the Cache abstraction has been added to most modeling definitions as the class that manages past_key_values: https://github.com/huggingface/transformers/blob/8820fe8b8c4b9da94cf1e4761876f85c562e0efe/src/transformers/cache_utils.py#L770

This would let us implement our own QuantizedCache with quantize and dequantize functions, inside which we can calculate the statistics needed for our kv_cache_quant scheme; a rough sketch of the calibration side follows.
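
As a minimal sketch only: this wraps DynamicCache rather than QuantizedCache to stay self-contained, and the class name CalibrationCache, the absmax tracking, and the float8 scale math are assumptions for illustration, not the actual implementation that will land.

```python
import torch
from transformers.cache_utils import DynamicCache


class CalibrationCache(DynamicCache):
    """Behaves like a normal (unquantized) cache, but records per-layer
    absolute-max statistics of the key/value states passing through it."""

    def __init__(self):
        super().__init__()
        self.k_absmax: dict[int, float] = {}  # layer_idx -> running max(|k|)
        self.v_absmax: dict[int, float] = {}  # layer_idx -> running max(|v|)

    def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
        # Observe the raw key/value tensors before storing them as usual.
        k_max = key_states.detach().abs().amax().item()
        v_max = value_states.detach().abs().amax().item()
        self.k_absmax[layer_idx] = max(self.k_absmax.get(layer_idx, 0.0), k_max)
        self.v_absmax[layer_idx] = max(self.v_absmax.get(layer_idx, 0.0), v_max)
        return super().update(key_states, value_states, layer_idx, cache_kwargs)

    def scales(self, qmax: float = 448.0):
        # Symmetric per-tensor scales; 448.0 is the max of float8_e4m3fn.
        k_scales = {i: m / qmax for i, m in self.k_absmax.items()}
        v_scales = {i: m / qmax for i, m in self.v_absmax.items()}
        return k_scales, v_scales
```

During calibration, an instance would be passed to the model forward pass as past_key_values; the per-layer scales it collects could then be serialized into the kv_cache quantization scheme.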

Prototype QuantizedCacheConfig implementation: #87
