
ggml : reuse quantum structs across backends #5943

Merged: 5 commits merged into master from gg/ggml-common-decl on Mar 12, 2024

Conversation

@ggerganov (Owner) commented Mar 8, 2024

Trying to see if we can have the quantum structs declared in a single place (e.g. ggml-common.h)

  • The #define GGML_COMMON_AGGR data macro is needed because nvcc does not allow members with constructors (half) in an anonymous struct
  • ggml-common.h is structured in such a way as to allow the following usage (a sketch of the resulting header layout follows this list):
// some-file.h
#define GGML_COMMON_DECL_C
#include "ggml-common.h"
// some-file.c
#include "some-file.h"

#define GGML_COMMON_IMPL_C
#include "ggml-common.h"
  • On the CPU, block_q8_1 now uses fp16 instead of fp32 for d and s. This was already the case for CUDA, and to keep things consistent, I decided to apply the same format on the CPU. This type is used only to quantize activations, so it does not break any existing models
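A minimal sketch of what the guard pattern above could look like inside ggml-common.h, using block_q8_1 as the example (the exact typedefs are simplified here and only the C and CUDA branches are shown; this is an illustration, not the actual header):

// ggml-common.h (simplified sketch)
#if defined(GGML_COMMON_DECL_C)
#include <stdint.h>
typedef uint16_t ggml_half;        // plain storage type, no constructor
typedef uint32_t ggml_half2;
#define GGML_COMMON_AGGR           // empty: anonymous struct is fine in C11
#define GGML_COMMON_DECL
#elif defined(GGML_COMMON_DECL_CUDA)
#include <cuda_fp16.h>
typedef half  ggml_half;           // has a constructor under nvcc
typedef half2 ggml_half2;
#define GGML_COMMON_AGGR data      // name the inner struct to satisfy nvcc
#define GGML_COMMON_DECL
#endif

#ifdef GGML_COMMON_DECL
#define QK8_1 32
typedef struct {
    union {
        struct {
            ggml_half d;           // delta (scale)
            ggml_half s;           // d * sum(qs[i])
        } GGML_COMMON_AGGR;        // x.d / x.s in C, x.data.d / x.data.s in CUDA
        ggml_half2 ds;             // both halves at once, e.g. for half2 loads
    };
    int8_t qs[QK8_1];              // quants
} block_q8_1;
#undef GGML_COMMON_DECL
#endif

#if defined(GGML_COMMON_IMPL_C)
// shared lookup tables / constant data would be emitted here, once per
// translation unit that defines GGML_COMMON_IMPL_C before including this
#endif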

TODO:

  • CPU
  • CUDA
  • Metal
  • SYCL
  • Vulkan (might not be applicable here)
  • Kompute (might not be applicable here)

@ggerganov ggerganov added the demo (Demonstrate some concept or idea, not intended to be merged) label Mar 8, 2024
@ggerganov ggerganov force-pushed the gg/ggml-common-decl branch 3 times, most recently from 40b0a47 to 7a8c050 on March 9, 2024 08:35
Base automatically changed from gg/ggml-common to master March 9, 2024 10:47
@ggerganov (Owner, Author) commented:

I wonder if this change is worth it. @slaren what do you think?

We reduce code duplication, but maybe the macros are a bit too much. Any better way to do this?

@slaren (Collaborator) commented Mar 9, 2024

I am not sure if this would be better, but it is also possible to access the data as a half2 with a reinterpret cast, i.e. *(half2*)&x.d. That wouldn't require changes to the structs for CUDA.
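A minimal sketch of that alternative (hypothetical device code, assuming the adjacent fp16 d and s fields of block_q8_1 described above):

// hypothetical CUDA device helper: load d and s as one half2 via a
// reinterpret cast, without adding a union to the struct definition
__device__ float2 load_ds(const block_q8_1 * x) {
    const half2 ds = *(const half2 *)&x->d;   // relies on d, s being adjacent
    return __half22float2(ds);                // .x = d, .y = s as floats
}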

@ggerganov ggerganov force-pushed the gg/ggml-common-decl branch 3 times, most recently from c5fa4ed to f33ab14 on March 10, 2024 20:37
@ggerganov ggerganov added the need feedback (Testing and feedback with results are needed) label and removed the demo (Demonstrate some concept or idea, not intended to be merged) label Mar 10, 2024
@ggerganov (Owner, Author) commented:

> I am not sure if this would be better, but it is also possible to access the data as a half2 with a reinterpret cast, i.e. *(half2*)&x.d. That wouldn't require changes to the structs for CUDA.

Yup, that's an option, but I'm also not sure it would be better.

I'm 50/50 about merging this. We can potentially adopt this pattern and use it for other stuff that can be shared by introducing more GGML_COMMON_XXX guards. But on the other hand, it is a little ugly. So I don't know.

@slaren (Collaborator) commented Mar 11, 2024

Avoiding code duplication is definitely good. We can always iterate on it in the future if we figure out a cleaner way to do this.

@ggerganov ggerganov marked this pull request as ready for review March 11, 2024 13:06

#define QK4_0 32
#define QI4_0 (QK4_0 / (4 * QR4_0))
#define QR4_0 2
A Contributor commented on this diff:

I think these CUDA-specific macros (the QIx and QRx families) shouldn't be propagated to the other backends. If it so happens that another backend uses these, it would be better to just duplicate them there.
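For context, a sketch of how these constants relate to the q4_0 block layout, with the meanings inferred from how the CUDA kernels use them rather than from any documentation:

// inferred meanings, using block_q4_0 as the example:
#define QK4_0 32                     // weights per block (one fp16 scale + 16 bytes of nibbles)
#define QR4_0 2                      // quant values packed into each byte (two 4-bit nibbles)
#define QI4_0 (QK4_0 / (4 * QR4_0))  // 32-bit ints spanning qs: 32 / (4*2) = 4 ints = 16 bytes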

@ggerganov (Owner, Author) commented:

Ok, will merge this after the CI is green

@ggerganov (Owner, Author) commented:

These thread sanitizer failures seem unrelated to our changes:

https://github.com/ggerganov/llama.cpp/actions/runs/8246447769/job/22552498578?pr=5943#step:5:27

Maybe related to this: google/sanitizers#1716?
Though the runner seems to be running kernel 6.5.0, so I'm not sure.

@ggerganov ggerganov merged commit 8030da7 into master Mar 12, 2024
68 of 69 checks passed
@ggerganov ggerganov deleted the gg/ggml-common-decl branch March 12, 2024 12:27
NeoZhangJianyu pushed a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 12, 2024
* ggml : reuse quant blocks across backends

ggml-ci

* ggml : define helper constants only for CUDA and SYCL

ggml-ci

* ggml : define helper quantum constants for SYCL

ggml-ci
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024 (same commit messages as above)
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024 (same commit messages as above)