
CUDA: use MMQ instead of cuBLAS by default #8075

Merged

Conversation

JohannesGaessler
Collaborator

This PR makes it so that, by default, mul_mat_q (MMQ) is used instead of FP16 cuBLAS GEMM unless

  • there is no int8 tensor core MMQ implementation available but FP16 tensor cores are (V100, RDNA3), or
  • the __dp4a instruction is unavailable (P100 or older).

Performance comparisons can be found in #8062. To make the new kernels actually available I added compute capability 7.5 to CMake. I also added a new compilation option, LLAMA_CUDA_FORCE_CUBLAS, with which cuBLAS is always used. I moved code from common.cuh to more specialized headers (which is unproblematic because ggml-cuda.cu includes them all), refactored the logic of ggml_cuda_mul_mat, and moved the MMQ selection logic into a function ggml_cuda_should_use_mmq.
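For readers skimming the diff, the following is a minimal sketch of the default selection described above, assuming only the two exceptions listed. The helper name and constants are illustrative rather than the actual ggml identifiers, and the real ggml_cuda_should_use_mmq also depends on the quantization type.

```cpp
// Hypothetical sketch of the per-GPU default described in this PR.
// Names are illustrative; the actual logic lives in ggml_cuda_should_use_mmq.
static bool should_use_mmq_sketch(int cc) {
#ifdef LLAMA_CUDA_FORCE_CUBLAS
    return false; // compile-time override added in this PR: always use cuBLAS
#endif
    const int CC_DP4A     = 610; // __dp4a available from CC 6.1 (P100 is 6.0 and lacks it)
    const int CC_VOLTA    = 700; // V100: FP16 tensor cores, but no int8 MMQ implementation
    const int CC_INT8_MMA = 750; // Turing and newer: int8 tensor core MMQ path

    if (cc < CC_DP4A) {
        return false; // no __dp4a -> fall back to cuBLAS
    }
    if (cc >= CC_VOLTA && cc < CC_INT8_MMA) {
        return false; // FP16 tensor cores without int8 MMQ -> prefer cuBLAS
    }
    return true;      // everything else defaults to MMQ
}
```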

@github-actions bot added the build (Compilation issues), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) labels on Jun 23, 2024
@Dampfinchen

Nice work making MMQ so fast!

Are IQ quants supported by the recent speedups? If not, perhaps it's possible to still use cuBLAS for them by default, as many people like to use IQ quants.

@JohannesGaessler
Collaborator Author

Only legacy quants and K-quants have an MMQ implementation at all. For all other data formats cuBLAS is the only option available, so there is no change for them.

@Nexesenex
Contributor

Nexesenex commented Jun 23, 2024

Would it be possible to have a command line argument to choose MMQ or cuBLAS, as long as the corresponding architectures are compiled? It'd be great for simplicity of choice, and also for downstream implementations like KoboldCPP.
Also, it would help to mention in the CMakeLists which arch is compatible with / faster for each NVIDIA chip generation since Kepler/Maxwell.
And, to make it clear for the uninitiated: does mmvq automatically trigger if MMQ mode is on and the BLAS batch size is <= 8?

@slaren
Collaborator

slaren commented Jun 23, 2024

In what cases would you want to use cuBLAS? Command line options have to go through llama.cpp, which requires changes to the llama.cpp API, and then they have to be passed to the backend, which requires adding more exceptions for some backends. They should not be added unless there is a very good reason to do so.

@JohannesGaessler
Collaborator Author

It could maybe be done via environment variables instead, which would require no changes to the CLI. But with the current structure, where the choice is made at compile time, you can skip some kernel variants that you know will never be used, so there would be an increase in compilation time and binary size if you were to make it dynamic.
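Purely as an illustration of that alternative, a runtime override read from the environment could look like the sketch below. The variable name GGML_CUDA_FORCE_CUBLAS is hypothetical and not something this PR introduces; the PR's LLAMA_CUDA_FORCE_CUBLAS is a compile-time option.

```cpp
// Hypothetical sketch only: a runtime cuBLAS override read from the environment.
// Not part of this PR; the variable name below is made up for illustration.
#include <cstdlib>

static bool force_cublas_from_env() {
    const char * val = std::getenv("GGML_CUDA_FORCE_CUBLAS"); // hypothetical variable
    return val != nullptr && val[0] != '\0' && val[0] != '0';
}
```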

@Nexesenex
Contributor

@slaren : in case MMQ doesn't work or performs badly for some reason, cuBLAS on the other hand might work better; that's my simple "user-based" thinking. If everything is always optimal by default as long as the proper architectures are compiled, then my request is irrelevant, but is that always the case?

That being said, I understand your argument well enough, and it takes precedence.

@JohannesGaessler That would be great, especially if it's much simpler to implement and maintain. Compilation time or binary size doesn't bother me, as long as the resulting binaries offer a maximum amount of flexibility to final users with even more modest tech literacy than my own.

@slaren
Collaborator

slaren commented Jun 23, 2024

An environment variable would be much less intrusive, but I don't think it is a good idea to add more environment variables as a preventive measure.

ggml-cuda.cu — review comment (outdated, resolved)
@mofosyne added the Review Complexity : Medium (generally requires more time to grok but manageable by beginner to medium expertise level) label on Jun 24, 2024
@JohannesGaessler mentioned this pull request on Jun 24, 2024
@JohannesGaessler merged commit a818f30 into ggerganov:master on Jun 24, 2024 (63 checks passed)
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Jun 26, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jun 30, 2024
MagnusS0 pushed a commit to MagnusS0/llama.cpp-normistral-tokenizer that referenced this pull request Jul 1, 2024