CUDA: use MMQ instead of cuBLAS by default #8075
Conversation
Nice work making MMQ so fast! Are IQ quants supported by the recent speedups? If not, perhaps it's possible to still use cuBLAS for these by default, as many people like to use IQ quants.
Only legacy quants and K-quants have an MMQ implementation at all. For all other data formats cuBLAS is the only option available, and there is no change.
Would it be possible to have a command line argument to choose MMQ or cuBLAS, as long as the corresponding architectures are compiled? It'd be great for simplicity of choice, and also for downstream implementations like KoboldCPP.
In what cases would you want to use cuBLAS? Command line options have to go through llama.cpp, which requires changes to the llama.cpp API, and then they have to be passed to the backend, which requires adding more exceptions for some backends. They should not be added unless there is a very good reason to do so.
It could maybe be done via an environment variable instead, which would require no changes to the CLI. But with the current structure, where the choice is made at compile time, you can skip some kernel variants that you know will never be used, so making it dynamic would increase compilation time and binary size.
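For illustration only, a minimal sketch of what such an environment-variable override could look like. This is hypothetical: the PR keeps the choice at compile time, and the variable name here is an assumption, not an existing setting.

```cpp
// Hypothetical sketch: an environment-variable override that would avoid any
// CLI or llama.cpp API changes. Neither this function nor the variable name
// exists in the PR; the choice there is made at compile time.
#include <cstdlib>

static bool cuda_force_cublas_from_env() {
    // Treat any non-empty value other than "0" as "force cuBLAS".
    const char * v = std::getenv("LLAMA_CUDA_FORCE_CUBLAS");
    return v != nullptr && v[0] != '\0' && v[0] != '0';
}
```

The backend could consult such a check once at initialization instead of threading a new option through the llama.cpp API, at the cost of having to compile all kernel variants.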
@slaren: in case MMQ doesn't work or performs badly for some reason, cuBLAS might, that's my simple "user based" thinking. If everything is always optimal by default as long as the proper architectures are compiled, then my request is irrelevant, but is that always the case? That being said, I understand your argument well enough and its precedence. @JohannesGaessler That would be great, especially if it is much simpler to implement and maintain. Compilation time or binary size doesn't bother me, as long as the resulting binaries offer a maximum amount of flexibility to end users with even more modest technical literacy than my own.
An environment variable would be much less intrusive, but I don't think it is a good idea to add more environment variables as a preventive measure.
Force-pushed from cc02976 to 5479853, then from 5479853 to 61f3cb6.
This reverts commit a818f30.
This PR makes it so that by default mul_mat_q (MMQ) is used instead of FP16 cuBLAS GEMM, unless the `__dp4a` instruction is unavailable (P100 or older). Performance comparisons can be found in #8062. To make the new kernels actually available I added compute capability 7.5 to CMake. I added a new compilation option `LLAMA_CUDA_FORCE_CUBLAS` with which cuBLAS is always used. I moved code from `common.cuh` to more specialized headers (which is unproblematic because `ggml-cuda.cu` includes them all). I refactored the logic of `ggml_cuda_mul_mat` and moved the MMQ selection logic to the function `ggml_cuda_should_use_mmq`.
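For reference, a minimal sketch of the selection logic described above. The function name `ggml_cuda_should_use_mmq` comes from the PR text; the macro names, the helper, and the exact conditions are assumptions for illustration, not the actual implementation.

```cpp
// Sketch of the MMQ-vs-cuBLAS decision described in the PR description.
// Assumption: the LLAMA_CUDA_FORCE_CUBLAS CMake option maps to a preprocessor
// define, called GGML_CUDA_FORCE_CUBLAS here for illustration.

#define MIN_CC_DP4A 610  // assumption: __dp4a is available from compute capability 6.1

// Assumed helper: only legacy quants and K-quants have an MMQ implementation.
static bool ggml_cuda_type_has_mmq(int type) {
    (void) type;
    return true;  // placeholder; the real check inspects the quantization format
}

static bool ggml_cuda_should_use_mmq(int type, int cc) {
#ifdef GGML_CUDA_FORCE_CUBLAS
    // Compile-time override: always fall back to cuBLAS.
    (void) type; (void) cc;
    return false;
#else
    if (!ggml_cuda_type_has_mmq(type)) {
        return false;  // e.g. IQ quants: cuBLAS remains the only option
    }
    // MMQ relies on the __dp4a instruction, unavailable on P100 (CC 6.0) or older.
    return cc >= MIN_CC_DP4A;
#endif
}
```

With a predicate like this, `ggml_cuda_mul_mat` can branch on a single call instead of hard-coding the cuBLAS path.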