CUDA: use MMQ instead of cuBLAS by default #8075
Conversation
Nice work making MMQ so fast! Are IQ quants supported by the recent speedups? If not, perhaps it's possible to still use cuBLAS for these by default, as many people like to use IQ quants.
Only legacy quants and K-quants have an MMQ implementation at all. For all other data formats cuBLAS is the only option available, and there is no change.
Would it be possible to have a command line argument to choose MMQ or cuBLAS, as long as the corresponding architectures are compiled? It'd be great for simplicity of choice, and also for downstream implementations like KoboldCPP.
In what cases would you want to use cuBLAS? Command line options have to go through llama.cpp, which requires changes to the llama.cpp API, and then they have to be passed to the backend, which requires adding more exceptions for some backends. They should not be added unless there is a very good reason to do so.
It could maybe be done via an environment variable instead, which would require no changes to the CLI. But with the current structure, where the choice is made at compile time, you can skip some kernel variants that you know will never be used, so making it dynamic would increase compilation time and binary size.
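For illustration only, a minimal sketch of what such an environment-variable override could look like. This is hypothetical: the PR keeps the choice at compile time, and the variable name here is an assumption, not an existing setting.

```cpp
// Hypothetical sketch: an environment-variable override that would avoid any
// CLI or llama.cpp API changes. Neither this function nor the variable name
// exists in the PR; the choice there is made at compile time.
#include <cstdlib>

static bool cuda_force_cublas_from_env() {
    // Treat any non-empty value other than "0" as "force cuBLAS".
    const char * v = std::getenv("LLAMA_CUDA_FORCE_CUBLAS");
    return v != nullptr && v[0] != '\0' && v[0] != '0';
}
```

The backend could consult such a check once at initialization instead of threading a new option through the llama.cpp API, at the cost of having to compile all kernel variants.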
@slaren: in case MMQ doesn't work or performs badly for some reason, cuBLAS might, that's my simple "user based" thinking. If everything is always optimal by default as long as the proper architectures are compiled, then my request is irrelevant, but is that always the case? That being said, I understand your argument well enough and its precedence. @JohannesGaessler That would be great, especially if it is much simpler to implement and maintain. Compilation time or binary size doesn't bother me, as long as the resulting binaries offer a maximum amount of flexibility to end users with even more modest technical literacy than my own.
An environment variable would be much less intrusive, but I don't think it is a good idea to add more environment variables as a preventive measure.
Force-pushed from cc02976 to 5479853, then from 5479853 to 61f3cb6.
This reverts commit a818f30.
This PR makes it so that by default mul_mat_q (MMQ) is used instead of FP16 cuBLAS GEMM, unless the `__dp4a` instruction is unavailable (P100 or older). Performance comparisons can be found in #8062. To make the new kernels actually available I added compute capability 7.5 to CMake. I added a new compilation option `LLAMA_CUDA_FORCE_CUBLAS` with which cuBLAS is always used. I moved code from `common.cuh` to more specialized headers (which is unproblematic because `ggml-cuda.cu` includes them all). I refactored the logic of `ggml_cuda_mul_mat` and moved the MMQ selection logic to the function `ggml_cuda_should_use_mmq`.
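For reference, a minimal sketch of the selection logic described above. The function name `ggml_cuda_should_use_mmq` comes from the PR text; the macro names, the helper, and the exact conditions are assumptions for illustration, not the actual implementation.

```cpp
// Sketch of the MMQ-vs-cuBLAS decision described in the PR description.
// Assumption: the LLAMA_CUDA_FORCE_CUBLAS CMake option maps to a preprocessor
// define, called GGML_CUDA_FORCE_CUBLAS here for illustration.

#define MIN_CC_DP4A 610  // assumption: __dp4a is available from compute capability 6.1

// Assumed helper: only legacy quants and K-quants have an MMQ implementation.
static bool ggml_cuda_type_has_mmq(int type) {
    (void) type;
    return true;  // placeholder; the real check inspects the quantization format
}

static bool ggml_cuda_should_use_mmq(int type, int cc) {
#ifdef GGML_CUDA_FORCE_CUBLAS
    // Compile-time override: always fall back to cuBLAS.
    (void) type; (void) cc;
    return false;
#else
    if (!ggml_cuda_type_has_mmq(type)) {
        return false;  // e.g. IQ quants: cuBLAS remains the only option
    }
    // MMQ relies on the __dp4a instruction, unavailable on P100 (CC 6.0) or older.
    return cc >= MIN_CC_DP4A;
#endif
}
```

With a predicate like this, `ggml_cuda_mul_mat` can branch on a single call instead of hard-coding the cuBLAS path.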