weight only int4 is slower than cutlass int4 #362

Open
zhoutianzi666 opened this issue Mar 19, 2024 · 1 comment

Comments


zhoutianzi666 commented Mar 19, 2024

https://github.com/ModelTC/lightllm/blob/main/lightllm/common/basemodel/triton_kernel/dequantize_gemm_int4.py

The kernel in the file above implements weight-only int4, but it runs at only 50% of the speed of CUTLASS int4. How can this be resolved?
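For readers unfamiliar with the term: in weight-only int4, the weights are stored as packed 4-bit integers and dequantized to floating point as part of the matrix multiply, while the activations stay in floating point. Below is a minimal pure-Python sketch of that computation. The packing order (low nibble first), the per-column scale and zero point, and every function name here are assumptions made for illustration. None of them are taken from the linked lightllm kernel or from CUTLASS.

```python
def pack_int4(values):
    """Pack a flat list of ints in [0, 15] into bytes, two values per byte.

    Assumed layout: the first value goes in the low nibble.
    """
    if len(values) % 2:
        raise ValueError("need an even number of 4-bit values")
    return [(values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
            for i in range(0, len(values), 2)]


def unpack_int4(packed):
    """Inverse of pack_int4: split each byte into its low and high nibble."""
    out = []
    for byte in packed:
        out.append(byte & 0xF)
        out.append((byte >> 4) & 0xF)
    return out


def dequant_matmul(x, packed_w, scales, zeros, k, n):
    """Compute x @ W, where W[r][c] = (q[r][c] - zeros[c]) * scales[c].

    x        : list of rows, each of length k (float activations)
    packed_w : k * n quantized weights, row-major, packed by pack_int4
    scales   : n per-column scales (hypothetical grouping: one group per column)
    zeros    : n per-column zero points
    """
    q = unpack_int4(packed_w)
    w = [[(q[r * n + c] - zeros[c]) * scales[c] for c in range(n)]
         for r in range(k)]
    return [[sum(row[r] * w[r][c] for r in range(k)) for c in range(n)]
            for row in x]
```

A real kernel fuses the unpack and dequantize steps into the GEMM main loop rather than materializing `w`, so how that fusion is scheduled is one plausible source of a speed gap between two implementations.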

hiworldwzj (Collaborator) commented

You can compile your own operator and expose it through a pybind interface, then modify the source code so that inference calls your operator in place of the current implementation.
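One way to wire in such a replacement, sketched under assumptions: try to import the compiled extension at startup and fall back to the existing path if it is missing. The module name `my_int4_ext_hypothetical`, the attribute name `matmul_dequantize_int4`, and the `fallback_matmul` stand-in are all hypothetical. They are not lightllm APIs, and the actual call site to patch has to be found in the lightllm source.

```python
import importlib


def fallback_matmul(x, w):
    """Stand-in for the existing implementation (plain list-of-lists matmul)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def load_int4_matmul(module_name="my_int4_ext_hypothetical",
                     attr="matmul_dequantize_int4"):
    """Return (function, source).

    Uses the compiled extension if it imports and exposes `attr`;
    otherwise returns the fallback so inference still runs.
    """
    try:
        ext = importlib.import_module(module_name)
        return getattr(ext, attr), "extension"
    except (ImportError, AttributeError):
        return fallback_matmul, "fallback"


# Resolve once at import time; call sites then use `int4_matmul` only.
int4_matmul, int4_matmul_source = load_int4_matmul()
```

Resolving the function once and routing every call through a single name keeps the swap to one place in the code, which makes it easy to benchmark the two paths against each other.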
