OpenCL: Fix duplication of layers in VRAM and RAM, add GPU mul kernel #1653
Conversation
clang-tidy made some suggestions
ggml-opencl.h
Outdated
@@ -16,6 +17,7 @@ void * ggml_cl_host_malloc(size_t size);
void ggml_cl_host_free(void * ptr);

void ggml_cl_transform_tensor(struct ggml_tensor * tensor);
void ggml_cl_load_data(const char * fname, struct ggml_tensor * tensor, const size_t offset);
warning: parameter 'offset' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]
void ggml_cl_load_data(const char * fname, struct ggml_tensor * tensor, const size_t offset);
void ggml_cl_load_data(const char * fname, struct ggml_tensor * tensor, size_t offset);
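To illustrate clang-tidy's point with a hypothetical pair of functions (not the PR's code): top-level `const` on a by-value parameter is not part of the function signature, so it has no effect in a declaration and only matters in the definition.

```cpp
#include <cstddef>

// Hypothetical sketch of the clang-tidy rule: the const is dropped in the
// declaration because it is not part of the signature...
size_t next_offset(size_t offset, size_t stride);

// ...but in the definition it usefully prevents accidental reassignment
// of the parameter's local copy.
size_t next_offset(const size_t offset, const size_t stride) {
    // offset += stride;   // would not compile: offset is const here
    return offset + stride;
}
```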
the comment is valid
Haven't tested, the ggml part is OK
Hi, I can confirm this works when used directly on @0cc4m 's branch. However, it fails when merged into master due to malloc changes from @howard0su done in #1612, resulting in an error. Performance-wise, there is still a minor (<5%) speed regression, but the benefit of saving a large chunk of RAM may be worth the tradeoff. (Apparently CUDA's implementation has similar regressions too.)
clang-tidy made some suggestions
I didn't notice the changes in #1612, should be fixed now.
Tested working
It's working now. @SlyEcho do you observe any speed changes?
I do not see a performance regression. However, I can get a massive boost by changing one value: generation goes from 75 ms/t to 58 ms/t. This is with both versions.
Hm. Changing the
I think it depends on the GPU. I can even use 128, but 256 breaks it.
@SlyEcho Did you use a GCN card? I see a very small reduction in speed with 64 on RDNA2, probably because of this:
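Setting the elided snippet aside, a general sketch (illustrative name, not from the PR) of why this local work size constant matters: OpenCL requires the global work size to be a multiple of the local work size, so a larger local size pads short rows with more idle work-items, while a local size above the device/kernel work-group limit makes the enqueue fail outright, which would explain "256 breaks it".

```cpp
#include <cstddef>

// Round a row length up to the next multiple of the chosen local work size,
// as OpenCL requires the global work size to be divisible by the local one.
// The function name is hypothetical, for illustration only.
size_t padded_global_size(size_t n, size_t local_size) {
    return ((n + local_size - 1) / local_size) * local_size;
}
```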
fprintf(stderr, "%s: [opencl] offloading output layer to GPU\n", __func__);
}
fprintf(stderr, "%s: [opencl] total VRAM used: %zu MB\n", __func__, vram_total / 1024 / 1024);
#else
Shall we merge the two branches of the #if? They look the same except for the prefix, which doesn't give that much information. We could add one line ahead that says whether we are using CUDA or OpenCL to do the offloading.
Probably a good idea, yeah.
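The suggested merge could look something like this rough sketch, assuming the two branches really differ only in the backend prefix (names and structure here are hypothetical, not the PR's code):

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Pick the backend name once at compile time, then share the offloading
// log lines between the CUDA and OpenCL builds.
#if defined(GGML_USE_CUBLAS)
static const char * k_offload_backend = "CUDA";
#else   // assume an OpenCL (CLBlast) build for this sketch
static const char * k_offload_backend = "OpenCL";
#endif

// Shared log line, parameterized only by the backend prefix.
std::string offload_log_line(size_t vram_mb) {
    char buf[128];
    std::snprintf(buf, sizeof(buf), "[%s] total VRAM used: %zu MB",
                  k_offload_backend, vram_mb);
    return buf;
}
```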
ggml-opencl.cpp
Outdated
@@ -862,42 +985,46 @@ static void ggml_cl_mul_mat_q_f32(const ggml_tensor * src0, const ggml_tensor *

for (int64_t i03 = 0; i03 < ne03; i03++) {
    for (int64_t i02 = 0; i02 < ne02; i02++) {
        cl_event ev_sgemm;
        size_t ev_idx = 0;
        std::vector<cl_event> events;
Suggest moving this out of the loop and giving it a reasonable initial size. You can use clear() in the inner loop.
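The reviewer's suggestion, as a minimal standalone sketch (the struct and function names are hypothetical stand-ins): construct the vector once before the loops, reserve a plausible capacity, and clear() it per iteration so its storage is reused rather than reallocated for every (i03, i02) pair.

```cpp
#include <vector>

struct Event { int id; };   // stand-in for cl_event in this sketch

int run_batches(int n_outer, int n_inner) {
    std::vector<Event> events;
    events.reserve(16);              // reasonable initial size, reused below
    int total = 0;
    for (int i03 = 0; i03 < n_outer; i03++) {
        for (int i02 = 0; i02 < n_inner; i02++) {
            events.clear();          // keeps the capacity, drops old entries
            events.push_back(Event{i03 * n_inner + i02});
            total += static_cast<int>(events.size());
        }
    }
    return total;
}
```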
I've improved it according to the reviews; is there anything else that should be fixed? Otherwise I think it's good enough now.
Bump. I think this should be merged; it works well for me.
Port further improvements from the CUDA version to OpenCL, specifically: