
[Enhancement]: Implement optimizations used in CTranslate2 #811

Closed
janekb04 opened this issue Apr 6, 2023 · 3 comments
Labels: enhancement (New feature or request), stale


janekb04 commented Apr 6, 2023

CTranslate2 is a "competitor" to llama.cpp that advertises itself with:

Fast and efficient execution on CPU and GPU

The execution is significantly faster and requires less resources than general-purpose deep learning frameworks on supported models and tasks thanks to many advanced optimizations: layer fusion, padding removal, batch reordering, in-place operations, caching mechanism, etc.

I am no expert in LLMs and I don't know what these optimizations involve, but would it be feasible and/or desirable to implement them in llama.cpp or GGML?

ggerganov added the enhancement label Apr 7, 2023

guillaumekln commented Apr 8, 2023

(Hi there, I'm the author of CTranslate2.)

llama.cpp already implements similar optimizations. They often come naturally when reimplementing a model in C/C++.

In my experience the most impactful optimization is to integrate vendor-specific libraries to run the matrix multiplications, which are usually the bottleneck for these models. For example, Apple Accelerate was a huge win for performance when it was first integrated in whisper.cpp. For x64 processors I recommend oneDNN, which has a very good 8-bit GEMM implementation (as fast as Intel MKL).
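For illustration, here is a minimal sketch (not llama.cpp's actual code) of what routing an F32 matrix multiplication through a vendor BLAS looks like, assuming Apple Accelerate on macOS and an OpenBLAS/MKL-style `cblas.h` elsewhere; the function name `matmul_f32_blas` is hypothetical:

```c
// Sketch: delegate the hot matmul to a vendor CBLAS instead of a hand-rolled loop.
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>   // Apple Accelerate exposes the CBLAS interface
#else
#include <cblas.h>                   // OpenBLAS / MKL CBLAS header
#endif

// C = A * B, with A (m x k), B (k x n), C (m x n), all row-major F32.
static void matmul_f32_blas(const float *A, const float *B, float *C,
                            int m, int n, int k) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0f, A, k,   // lda = k for row-major A
                      B, n,   // ldb = n for row-major B
                0.0f, C, n);  // ldc = n for row-major C
}
```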

However, I'm not aware of similar libraries providing efficient 4-bit GEMM at this time, and I also understand that llama.cpp is trying to avoid additional dependencies as much as possible.
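To make the 4-bit point concrete, here is a hedged sketch of the kind of block layout involved (a simplified, Q4_0-like scheme: one scale shared by 32 values stored as nibbles; not llama.cpp's exact layout). Without a native 4-bit GEMM kernel, each block has to be dequantized to float like this before a float GEMM can consume it, which costs extra memory traffic and compute:

```c
// Simplified 4-bit block quantization: 32 values share one scale, two values per byte.
#include <stdint.h>

#define QK 32                     // values per quantization block

typedef struct {
    float   scale;                // per-block scale
    uint8_t qs[QK / 2];           // 32 x 4-bit values packed as nibbles
} block_q4;

// Expand one block back to 32 floats so a float GEMM can use it.
static void dequantize_block_q4(const block_q4 *b, float *dst) {
    for (int i = 0; i < QK / 2; ++i) {
        const int lo = (b->qs[i] & 0x0F) - 8;   // low nibble  -> [-8, 7]
        const int hi = (b->qs[i] >> 4)   - 8;   // high nibble -> [-8, 7]
        dst[2*i + 0] = lo * b->scale;
        dst[2*i + 1] = hi * b->scale;
    }
}
```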


jon-chuang commented Apr 12, 2023

So are we already fusing and tiling the attention layer to fit in CPU SRAM, à la flash attention?

Edit: I guess it is currently being experimented with: #778
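For readers unfamiliar with the idea, here is a minimal single-query sketch of the tiled/online-softmax technique referred to above, written from the flash-attention paper's description rather than from PR #778; all names are illustrative. The point is that the full n x n score matrix is never materialized, and each K/V tile can stay resident in cache (and, in a full implementation, be reused across a block of queries):

```c
// Tiled attention for one query vector with an online softmax:
// running max m and normalizer l are updated per key, so only O(d) state is kept.
#include <math.h>
#include <string.h>

// q: [d], K: [n][d], V: [n][d], out: [d]; block = tile size along the sequence.
static void attention_one_query_tiled(const float *q, const float *K, const float *V,
                                      float *out, int n, int d, int block) {
    float m = -INFINITY;   // running max of scores
    float l = 0.0f;        // running softmax normalizer
    memset(out, 0, d * sizeof(float));

    for (int j0 = 0; j0 < n; j0 += block) {           // one K/V tile per iteration
        const int j1 = (j0 + block < n) ? j0 + block : n;
        for (int j = j0; j < j1; ++j) {
            // score = (q . K[j]) / sqrt(d)
            float s = 0.0f;
            for (int t = 0; t < d; ++t) s += q[t] * K[j*d + t];
            s /= sqrtf((float)d);

            // online softmax update: rescale the accumulator if a new max appears
            const float m_new = (s > m) ? s : m;
            const float scale = expf(m - m_new);
            const float p     = expf(s - m_new);
            for (int t = 0; t < d; ++t) out[t] = out[t] * scale + p * V[j*d + t];
            l = l * scale + p;
            m = m_new;
        }
    }
    for (int t = 0; t < d; ++t) out[t] /= l;  // final normalization
}
```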

github-actions bot added the stale label Mar 25, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
