
OpenCL: Fix duplication of layers in VRAM and RAM, add GPU mul kernel #1653

Merged · 7 commits · Jun 4, 2023

Conversation

0cc4m (Collaborator) commented May 30, 2023

Port further improvements from the CUDA version to OpenCL, specifically:

  • No more duplication of layers: when offloaded to the GPU, they are now stored solely in VRAM (a minimal sketch of the idea follows below)
  • For that reason, the norm calculation is now also done on the GPU
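
A minimal sketch of the idea, not the PR's actual code: the helper name, the error handling, and the assumption that the tensor owns a plain malloc'd host buffer are all illustrative.

    #include <CL/cl.h>
    #include <stdlib.h>
    #include "ggml.h"

    // Illustrative only: copy a tensor's weights into a device buffer and release
    // the host copy, so an offloaded layer lives solely in VRAM instead of being
    // duplicated in RAM.
    static void offload_tensor_to_vram(cl_context ctx, cl_command_queue queue, struct ggml_tensor * tensor) {
        const size_t size = ggml_nbytes(tensor);
        cl_int err;
        cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, size, NULL, &err);
        if (err != CL_SUCCESS) {
            return; // real code would report the error and abort
        }
        // Blocking write so the host memory can be released immediately afterwards.
        clEnqueueWriteBuffer(queue, dev_buf, CL_TRUE, 0, size, tensor->data, 0, NULL, NULL);
        free(tensor->data);              // assumes the tensor owns its host buffer
        tensor->data = (void *) dev_buf; // downstream GPU ops read the cl_mem handle
    }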

github-actions bot (Contributor) left a comment

clang-tidy made some suggestions

ggml-opencl.cpp (resolved)
ggml-opencl.cpp (resolved)
ggml-opencl.h Outdated
@@ -16,6 +17,7 @@ void * ggml_cl_host_malloc(size_t size);
void ggml_cl_host_free(void * ptr);

void ggml_cl_transform_tensor(struct ggml_tensor * tensor);
void ggml_cl_load_data(const char * fname, struct ggml_tensor * tensor, const size_t offset);
github-actions bot (Contributor) commented:

warning: parameter 'offset' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change
void ggml_cl_load_data(const char * fname, struct ggml_tensor * tensor, const size_t offset);
void ggml_cl_load_data(const char * fname, struct ggml_tensor * tensor, size_t offset);

Collaborator

The comment is valid.
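
Roughly, the distinction the warning is about, shown with the same declaration (the empty definition body is only for illustration):

    // const on a by-value parameter has no effect in a declaration; it only
    // prevents reassignment of the parameter inside the definition's body.
    void ggml_cl_load_data(const char * fname, struct ggml_tensor * tensor, size_t offset);        // declaration: drop the const

    void ggml_cl_load_data(const char * fname, struct ggml_tensor * tensor, const size_t offset) { // definition: const is meaningful here
        // offset cannot be modified in this body, which is the only place the qualifier matters
    }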

ggerganov (Owner) left a comment

Haven't tested; the ggml part is OK.

LostRuins (Collaborator) commented May 31, 2023

Hi, I can confirm this works when used directly on @0cc4m's branch.

However, it fails when merged into master due to the malloc changes from @howard0su in #1612, resulting in the error ggml_opencl: clSetKernelArg(*to_fp32_cl, 0, sizeof(cl_mem), &d_Q) error -38 at ggml-opencl.cpp:1029. Reverting the malloc changes solves this.

Performance-wise, there is still a minor (<5%) speed regression, but saving a large chunk of RAM may be worth the tradeoff. (Apparently CUDA's implementation has a similar regression.)
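
For context, -38 is CL_INVALID_MEM_OBJECT: the cl_mem handle passed to clSetKernelArg is not (or is no longer) a valid buffer object, which fits an allocation path changing underneath this code. A hedged sketch of the failing call with explicit error checking (the variable names come from the error message above; the check itself is illustrative):

    cl_int err = clSetKernelArg(*to_fp32_cl, 0, sizeof(cl_mem), &d_Q);
    if (err != CL_SUCCESS) {
        // -38 == CL_INVALID_MEM_OBJECT: d_Q does not refer to a valid cl_mem buffer
        fprintf(stderr, "ggml_opencl: clSetKernelArg(*to_fp32_cl, 0, sizeof(cl_mem), &d_Q) error %d\n", err);
        abort();
    }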

github-actions bot (Contributor) left a comment

clang-tidy made some suggestions

ggml-opencl.cpp (resolved)
0cc4m (Collaborator, Author) commented May 31, 2023

I didn't notice the changes in #1612; it should be fixed now.

SlyEcho (Collaborator) commented May 31, 2023

Tested working.

LostRuins (Collaborator)

It's working now. @SlyEcho, do you observe any speed changes?

SlyEcho (Collaborator) commented May 31, 2023

I do not see a performance regression.

However, I can get a massive boost by changing CL_DMMV_BLOCK_SIZE to 64: generation goes from 75 ms/t to 58 ms/t.

This is with both versions.

LostRuins (Collaborator)

Hm. Changing CL_DMMV_BLOCK_SIZE has no effect for me, but there's no slowdown either, so I see no harm in changing it.

SlyEcho (Collaborator) commented May 31, 2023

I think it depends on the GPU. I can even use 128, but 256 breaks it.

0cc4m (Collaborator, Author) commented May 31, 2023

@SlyEcho Did you use a GCN card? I see a very small reduction in speed with 64 on RDNA2, probably because of this:

GCN in all of its iterations was 64 threads wide, meaning 64 threads were bundled together into a single wavefront for execution. RDNA drops this to a native 32 threads wide.
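
A hedged sketch of how a block size like CL_DMMV_BLOCK_SIZE typically enters the kernel launch; the exact launch code is not quoted in this thread, so the variable names and the one-work-group-per-row mapping are assumptions:

    // The block size is used as the OpenCL local work size, i.e. the number of
    // work-items per work-group. 64 matches a GCN wavefront exactly, while RDNA
    // natively schedules 32-wide waves, which may explain the GCN vs. RDNA2
    // difference observed above.
    const size_t local  = CL_DMMV_BLOCK_SIZE;     // e.g. 32 or 64
    const size_t global = (size_t) nrows * local; // assumed: one work-group per output row
    cl_int err = clEnqueueNDRangeKernel(queue, dmmv_kernel, 1, NULL, &global, &local, 0, NULL, &ev);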

fprintf(stderr, "%s: [opencl] offloading output layer to GPU\n", __func__);
}
fprintf(stderr, "%s: [opencl] total VRAM used: %zu MB\n", __func__, vram_total / 1024 / 1024);
#else
howard0su (Collaborator) commented Jun 1, 2023

Shall we merge the two branches of the #if? They look the same except for the prefix, which doesn't convey much information. We could add one line up front that says whether we are using CUDA or OpenCL for the offloading.

0cc4m (Collaborator, Author)

Probably a good idea, yeah.
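
A rough sketch of the suggested consolidation; GGML_USE_CUBLAS and GGML_USE_CLBLAST are the existing llama.cpp macros, but the variable name and exact messages are illustrative:

    #if defined(GGML_USE_CUBLAS)
        const char * gpu_backend = "cuda";
    #elif defined(GGML_USE_CLBLAST)
        const char * gpu_backend = "opencl";
    #endif
    #if defined(GGML_USE_CUBLAS) || defined(GGML_USE_CLBLAST)
        // One line up front names the backend, then a single code path does the logging.
        fprintf(stderr, "%s: offloading layers via %s\n", __func__, gpu_backend);
        fprintf(stderr, "%s: [%s] offloading output layer to GPU\n", __func__, gpu_backend);
        fprintf(stderr, "%s: [%s] total VRAM used: %zu MB\n", __func__, gpu_backend, vram_total / 1024 / 1024);
    #endif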

ggml-opencl.cpp Outdated
@@ -862,42 +985,46 @@ static void ggml_cl_mul_mat_q_f32(const ggml_tensor * src0, const ggml_tensor *

for (int64_t i03 = 0; i03 < ne03; i03++) {
    for (int64_t i02 = 0; i02 < ne02; i02++) {
        cl_event ev_sgemm;
        size_t ev_idx = 0;
        std::vector<cl_event> events;
Collaborator

Suggest moving this out of the loop and giving it a reasonable initial size; you can call clear() in the inner loop.
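
A sketch of the suggested restructuring; the reserve size and the placeholder loop body are assumptions:

    std::vector<cl_event> events;
    events.reserve(16); // assumed "reasonable initial size"; depends on how many ops are enqueued per iteration

    for (int64_t i03 = 0; i03 < ne03; i03++) {
        for (int64_t i02 = 0; i02 < ne02; i02++) {
            events.clear(); // reuse the allocation instead of constructing a new vector each iteration
            size_t ev_idx = 0;
            // ... enqueue dequantization / sgemm here, pushing the returned cl_event handles into events ...
        }
    }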

0cc4m (Collaborator, Author) commented Jun 2, 2023

I've improved it according to the reviews; is there anything else that should be fixed? Otherwise I think it's good enough now.

LostRuins (Collaborator)

Bump. I think this should be merged; it works well for me.

0cc4m merged commit dcb2ed4 into ggerganov:master on Jun 4, 2023