CLBlast: q5_0, q5_1, q8_0 dequant kernels #1225
lgtm
For Lines 134 to 137 in c3ca7a5
@ggerganov I did, that happens during the transfer to the GPU. It works just fine for Q5_1. Q5_0 has a different problem that I can only suspect is related to memory alignment.
ggml-opencl.c
Outdated
cl_host_b = (cl_block_q5_0*) malloc(sizeof(cl_block_q5_0) * global / 32);
for (size_t i = 0; i < global / 32; i++) {
cl_host_b[i].d = ggml_fp16_to_fp32(b[i].d);
memcpy(&cl_host_b[i].qh, b[i].qh, sizeof(uint32_t) + QK5_0 / 2);
Use 2x memcpy: one for qh and one for qs.
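A minimal sketch of what the two-memcpy version could look like. The struct layouts below are assumptions inferred from the field names in the snippet (`d`, `qh`, `qs`) and are not copied from the PR; `src_block_q5_0`, `dst_block_q5_0`, and `copy_quants` are hypothetical names used only for illustration.

```c
#include <stdint.h>
#include <string.h>

#define QK5_0 32

/* Assumed source layout: FP16 scale, 4 bytes of high bits, 16 bytes of nibbles. */
typedef struct {
    uint16_t d;              /* FP16 scale stored as raw bits */
    uint8_t  qh[4];          /* fifth bit of each quant */
    uint8_t  qs[QK5_0 / 2];  /* low four bits of each quant */
} src_block_q5_0;

/* Assumed destination layout used for the device transfer. */
typedef struct {
    float    d;
    uint32_t qh;
    uint8_t  qs[QK5_0 / 2];
} dst_block_q5_0;

/* Copy each field separately instead of one memcpy spanning qh and qs,
 * so the copy does not depend on the two fields being contiguous. */
static void copy_quants(dst_block_q5_0 *dst, const src_block_q5_0 *src) {
    memcpy(&dst->qh, src->qh, sizeof(dst->qh));
    memcpy(dst->qs,  src->qs, sizeof(dst->qs));
}
```

Copying per field keeps each `memcpy` within the bounds of a single member, so the result stays correct even if the compiler inserts padding between `qh` and `qs` in either struct.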
Done
Btw might want to add a |
It was an alignment / padding issue. In #1422 I added an |
I had, and may still have, an issue with q5_0 that I can't figure out. On Nvidia, trying to transfer the quantized weights to the device fails with a CL_OUT_OF_RESOURCES error. On AMD and on POCL it leads to a segfault. The problem seems to be specific to 22-byte structs, while 20- or 24-byte structs are fine. I am not sure why this is the case.
As a workaround, I copy the weights into a new struct and do the FP16 to FP32 conversion on the CPU. This seems to add little overhead and works, but it should not be needed. If anyone knows what's going on here, please let me know.
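The workaround described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `packed_q5_0`, `padded_q5_0`, `half_bits_to_float`, and `repack_blocks` are hypothetical names, the layouts are guessed from the 22-byte and 24-byte sizes mentioned in the description, and `half_bits_to_float` is a stand-in for the `ggml_fp16_to_fp32` call seen in the reviewed snippet.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define QK5_0 32

/* Assumed 22-byte host block: 2 (FP16 scale) + 4 (qh) + 16 (qs). */
typedef struct {
    uint16_t d;
    uint8_t  qh[4];
    uint8_t  qs[QK5_0 / 2];
} packed_q5_0;

/* Assumed 24-byte transfer block: 4 (FP32 scale) + 4 (qh) + 16 (qs). */
typedef struct {
    float    d;
    uint32_t qh;
    uint8_t  qs[QK5_0 / 2];
} padded_q5_0;

/* Hypothetical IEEE 754 half-to-float conversion on raw bits. */
static float half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0 && man == 0) {
        bits = sign;                                  /* signed zero */
    } else if (exp == 0) {
        int shift = 0;                                /* subnormal: normalize */
        while ((man & 0x400u) == 0) { man <<= 1; shift++; }
        man &= 0x3FFu;
        bits = sign | ((uint32_t)(113 - shift) << 23) | (man << 13);
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);      /* inf or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Repack n blocks into the wider layout; the caller frees the result. */
static padded_q5_0 *repack_blocks(const packed_q5_0 *src, size_t n) {
    padded_q5_0 *dst = malloc(n * sizeof *dst);
    if (dst == NULL) return NULL;
    for (size_t i = 0; i < n; i++) {
        dst[i].d = half_bits_to_float(src[i].d);
        memcpy(&dst[i].qh, src[i].qh, sizeof dst[i].qh);
        memcpy(dst[i].qs,  src[i].qs, sizeof dst[i].qs);
    }
    return dst;
}
```

On common ABIs `sizeof(packed_q5_0)` is 22 and `sizeof(padded_q5_0)` is 24, so the repacked array has a size that is a multiple of 4 bytes per block. Whether that is what the drivers actually require is not established here.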
I also moved the contents of the .cl file into ggml-opencl.c as requested.