
OpenCL: Fix duplication of layers in VRAM and RAM, add GPU mul kernel #1653

Merged · 7 commits · Jun 4, 2023

Conversation

0cc4m (Collaborator) commented May 30, 2023

Port further improvements from the CUDA version to OpenCL, specifically:

  • No more duplication of layers: when offloaded to the GPU, they are now stored solely in VRAM (a minimal sketch of the idea follows below)
  • For that reason, the norm calculation is now also done on the GPU
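
A minimal sketch of the idea, not the PR's actual code: the helper name, the error handling, and the assumption that the tensor owns a plain malloc'd host buffer are all illustrative.

    #include <CL/cl.h>
    #include <stdlib.h>
    #include "ggml.h"

    // Illustrative only: copy a tensor's weights into a device buffer and release
    // the host copy, so an offloaded layer lives solely in VRAM instead of being
    // duplicated in RAM.
    static void offload_tensor_to_vram(cl_context ctx, cl_command_queue queue, struct ggml_tensor * tensor) {
        const size_t size = ggml_nbytes(tensor);
        cl_int err;
        cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, size, NULL, &err);
        if (err != CL_SUCCESS) {
            return; // real code would report the error and abort
        }
        // Blocking write so the host memory can be released immediately afterwards.
        clEnqueueWriteBuffer(queue, dev_buf, CL_TRUE, 0, size, tensor->data, 0, NULL, NULL);
        free(tensor->data);              // assumes the tensor owns its host buffer
        tensor->data = (void *) dev_buf; // downstream GPU ops read the cl_mem handle
    }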

github-actions bot (Contributor) left a comment

clang-tidy made some suggestions

ggml-opencl.cpp (resolved)
ggml-opencl.cpp (resolved)
ggml-opencl.h Outdated
@@ -16,6 +17,7 @@ void * ggml_cl_host_malloc(size_t size);
void ggml_cl_host_free(void * ptr);

void ggml_cl_transform_tensor(struct ggml_tensor * tensor);
void ggml_cl_load_data(const char * fname, struct ggml_tensor * tensor, const size_t offset);
github-actions bot (Contributor) commented:

warning: parameter 'offset' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change
void ggml_cl_load_data(const char * fname, struct ggml_tensor * tensor, const size_t offset);
void ggml_cl_load_data(const char * fname, struct ggml_tensor * tensor, size_t offset);

Collaborator

The comment is valid.
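
Roughly, the distinction the warning is about, shown with the same declaration (the empty definition body is only for illustration):

    // const on a by-value parameter has no effect in a declaration; it only
    // prevents reassignment of the parameter inside the definition's body.
    void ggml_cl_load_data(const char * fname, struct ggml_tensor * tensor, size_t offset);        // declaration: drop the const

    void ggml_cl_load_data(const char * fname, struct ggml_tensor * tensor, const size_t offset) { // definition: const is meaningful here
        // offset cannot be modified in this body, which is the only place the qualifier matters
    }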

ggerganov (Owner) left a comment

Haven't tested; the ggml part is OK.

LostRuins (Collaborator) commented May 31, 2023

Hi, I can confirm this works when used directly on @0cc4m's branch.

However, it fails when merged into master due to the malloc changes from @howard0su in #1612, resulting in the error ggml_opencl: clSetKernelArg(*to_fp32_cl, 0, sizeof(cl_mem), &d_Q) error -38 at ggml-opencl.cpp:1029. Reverting the malloc changes solves this.

Performance-wise, there is still a minor (<5%) speed regression, but saving a large chunk of RAM may be worth the tradeoff. (Apparently CUDA's implementation has a similar regression.)
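
For context, -38 is CL_INVALID_MEM_OBJECT: the cl_mem handle passed to clSetKernelArg is not (or is no longer) a valid buffer object, which fits an allocation path changing underneath this code. A hedged sketch of the failing call with explicit error checking (the variable names come from the error message above; the check itself is illustrative):

    cl_int err = clSetKernelArg(*to_fp32_cl, 0, sizeof(cl_mem), &d_Q);
    if (err != CL_SUCCESS) {
        // -38 == CL_INVALID_MEM_OBJECT: d_Q does not refer to a valid cl_mem buffer
        fprintf(stderr, "ggml_opencl: clSetKernelArg(*to_fp32_cl, 0, sizeof(cl_mem), &d_Q) error %d\n", err);
        abort();
    }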

github-actions bot (Contributor) left a comment

clang-tidy made some suggestions

ggml-opencl.cpp (resolved)
0cc4m (Collaborator, Author) commented May 31, 2023

I didn't notice the changes in #1612; it should be fixed now.

SlyEcho (Collaborator) commented May 31, 2023

Tested working.

LostRuins (Collaborator)

It's working now. @SlyEcho, do you observe any speed changes?

SlyEcho (Collaborator) commented May 31, 2023

I do not see a performance regression.

However, I can get a massive boost by changing CL_DMMV_BLOCK_SIZE to 64: generation goes from 75 ms/t to 58 ms/t.

This is with both versions.

LostRuins (Collaborator)

Hm. Changing CL_DMMV_BLOCK_SIZE has no effect for me, but there's no slowdown either, so I see no harm in changing it.

SlyEcho (Collaborator) commented May 31, 2023

I think it depends on the GPU. I can even use 128, but 256 breaks it.

0cc4m (Collaborator, Author) commented May 31, 2023

@SlyEcho Did you use a GCN card? I see a very small reduction in speed with 64 on RDNA2, probably because of this:

GCN in all of its iterations was 64 threads wide, meaning 64 threads were bundled together into a single wavefront for execution. RDNA drops this to a native 32 threads wide.
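
A hedged sketch of how a block size like CL_DMMV_BLOCK_SIZE typically enters the kernel launch; the exact launch code is not quoted in this thread, so the variable names and the one-work-group-per-row mapping are assumptions:

    // The block size is used as the OpenCL local work size, i.e. the number of
    // work-items per work-group. 64 matches a GCN wavefront exactly, while RDNA
    // natively schedules 32-wide waves, which may explain the GCN vs. RDNA2
    // difference observed above.
    const size_t local  = CL_DMMV_BLOCK_SIZE;     // e.g. 32 or 64
    const size_t global = (size_t) nrows * local; // assumed: one work-group per output row
    cl_int err = clEnqueueNDRangeKernel(queue, dmmv_kernel, 1, NULL, &global, &local, 0, NULL, &ev);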

fprintf(stderr, "%s: [opencl] offloading output layer to GPU\n", __func__);
}
fprintf(stderr, "%s: [opencl] total VRAM used: %zu MB\n", __func__, vram_total / 1024 / 1024);
#else
howard0su (Collaborator) commented Jun 1, 2023

Shall we merge the two branches of the #if? They look the same except for the prefix, which doesn't convey much information. We could add one line up front that says whether we are using CUDA or OpenCL for the offloading.

0cc4m (Collaborator, Author)

Probably a good idea, yeah.
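
A rough sketch of the suggested consolidation; GGML_USE_CUBLAS and GGML_USE_CLBLAST are the existing llama.cpp macros, but the variable name and exact messages are illustrative:

    #if defined(GGML_USE_CUBLAS)
        const char * gpu_backend = "cuda";
    #elif defined(GGML_USE_CLBLAST)
        const char * gpu_backend = "opencl";
    #endif
    #if defined(GGML_USE_CUBLAS) || defined(GGML_USE_CLBLAST)
        // One line up front names the backend, then a single code path does the logging.
        fprintf(stderr, "%s: offloading layers via %s\n", __func__, gpu_backend);
        fprintf(stderr, "%s: [%s] offloading output layer to GPU\n", __func__, gpu_backend);
        fprintf(stderr, "%s: [%s] total VRAM used: %zu MB\n", __func__, gpu_backend, vram_total / 1024 / 1024);
    #endif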

ggml-opencl.cpp Outdated
@@ -862,42 +985,46 @@ static void ggml_cl_mul_mat_q_f32(const ggml_tensor * src0, const ggml_tensor *

for (int64_t i03 = 0; i03 < ne03; i03++) {
    for (int64_t i02 = 0; i02 < ne02; i02++) {
        cl_event ev_sgemm;
        size_t ev_idx = 0;
        std::vector<cl_event> events;
Collaborator

Suggest moving this out of the loop and giving it a reasonable initial size; you can call clear() in the inner loop.
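
A sketch of the suggested restructuring; the reserve size and the placeholder loop body are assumptions:

    std::vector<cl_event> events;
    events.reserve(16); // assumed "reasonable initial size"; depends on how many ops are enqueued per iteration

    for (int64_t i03 = 0; i03 < ne03; i03++) {
        for (int64_t i02 = 0; i02 < ne02; i02++) {
            events.clear(); // reuse the allocation instead of constructing a new vector each iteration
            size_t ev_idx = 0;
            // ... enqueue dequantization / sgemm here, pushing the returned cl_event handles into events ...
        }
    }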

0cc4m (Collaborator, Author) commented Jun 2, 2023

I've improved it according to the reviews; is there anything else that should be fixed? Otherwise I think it's good enough now.

LostRuins (Collaborator)

Bump. I think this should be merged; it works well for me.

0cc4m merged commit dcb2ed4 into ggerganov:master on Jun 4, 2023