
CLBlast: q5_0, q5_1, q8_0 dequant kernels #1225

Merged (5 commits) on Apr 30, 2023

Conversation

@0cc4m (Collaborator) commented Apr 29, 2023

I had, or still have, an issue with q5_0 that I can't figure out. On Nvidia, trying to transfer the quantized weights to the device leads to a CL_OUT_OF_RESOURCES error. On AMD and on POCL it leads to a segfault. It seems to have a problem with 22-byte structs, while 20- or 24-byte structs are fine. I am not sure why this is the case.

As a workaround I copy the weights into a new struct and do the FP16 to FP32 conversion on CPU. This seems to have little overhead and works, but it should not be needed. If anyone knows what's up here please let me know.
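For illustration, a minimal sketch of what such a host-side conversion struct could look like. The name cl_block_q5_0 and the float d field are taken from the review snippet further down; the exact field order and types here are an assumption, not the merged code:

// Hypothetical host-side mirror of block_q5_0 with the FP16 scale already
// converted to FP32 on the CPU, so the GPU never touches the 22-byte layout.
typedef struct {
    float    d;              // scale, converted from ggml_fp16_t on the CPU
    uint32_t qh;             // high bit of each of the 32 5-bit quants
    uint8_t  qs[QK5_0 / 2];  // low 4 bits, two quants per byte
} cl_block_q5_0;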

I also moved the .cl file into the opencl.c as requested.

@LostRuins (Collaborator) left a comment


lgtm

@ggerganov (Owner)

For Q5_0 you have to copy the qh bytes into a uint32_t instead of casting a pointer as we usually do:

llama.cpp/ggml-cuda.cu, lines 134 to 137 at c3ca7a5:

uint32_t qh;
memcpy(&qh, x[i].qh, sizeof(qh));
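
The point of the memcpy is alignment: x[i].qh is a byte array inside a packed block, so reading it through a cast pointer is undefined behavior. A sketch of the two patterns side by side (names mirror the snippet above):

// Unsafe: x[i].qh may not be 4-byte aligned, so this cast can fault or
// misread on strict-alignment targets.
const uint32_t qh_cast = *(const uint32_t *) x[i].qh;

// Safe: memcpy moves the bytes regardless of alignment.
uint32_t qh;
memcpy(&qh, x[i].qh, sizeof(qh));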

@0cc4m (Collaborator, Author) commented Apr 30, 2023

@ggerganov I did; the failure happens during the transfer to the GPU.

https://github.com/0cc4m/koboldcpp/blob/369d903edabd5c0acd866337542cbb4150485940/ggml-opencl.c#L74-L79

It works just fine for Q5_1. Q5_0 has a different problem that I can only suspect is related to memory alignment.

ggml-opencl.c Outdated
cl_host_b = (cl_block_q5_0*) malloc(sizeof(cl_block_q5_0) * global / 32);
for (size_t i = 0; i < global / 32; i++) {
    cl_host_b[i].d = ggml_fp16_to_fp32(b[i].d);
    // a single memcpy spanning both the qh and qs fields
    memcpy(&cl_host_b[i].qh, b[i].qh, sizeof(uint32_t) + QK5_0 / 2);
}
@ggerganov (Owner)
Use 2x memcpy - one for qh and one for qs

@0cc4m (Collaborator, Author)
Done
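
For reference, a sketch of the change as requested; the qs field name is assumed from the block layout, and the merged commit is authoritative:

cl_host_b[i].d = ggml_fp16_to_fp32(b[i].d);
// one memcpy per field instead of a single copy spanning both
memcpy(&cl_host_b[i].qh, b[i].qh, sizeof(uint32_t));
memcpy(&cl_host_b[i].qs, b[i].qs, QK5_0 / 2);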

@LostRuins (Collaborator)

Btw, you might want to add a #include <stdlib.h>; without it, compilation fails on Termux.

@ggerganov ggerganov merged commit 76a8849 into ggerganov:master Apr 30, 2023
@0cc4m 0cc4m deleted the clblast-further-dequant-kernels branch April 30, 2023 20:18
@SlyEcho (Collaborator) commented May 13, 2023

It was an alignment/padding issue: the half and the uint together took 8 bytes, not 6 as expected, which meant the whole struct was too big and memory was accessed out of bounds.

In #1422 I added an __attribute__((packed)) to the CL struct. It's not an extension or anything; it's part of the OpenCL language.
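For illustration, a minimal sketch of a packed OpenCL struct along those lines (field layout assumed; #1422 has the actual change):

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

// Without the attribute, the compiler inserts 2 bytes of padding after `d`
// so that `qh` is 4-byte aligned: d + qh take 8 bytes instead of 6, and the
// whole block grows to 24 bytes instead of the expected 22.
struct __attribute__((packed)) block_q5_0
{
    half  d;        //  2 bytes: FP16 scale
    uint  qh;       //  4 bytes: high bits of the 32 quants
    uchar qs[16];   // 16 bytes: low 4 bits, two quants per byte
};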
