CTranslate2 is a "competitor" to llama.cpp that advertises itself with:
Fast and efficient execution on CPU and GPU
The execution is significantly faster and requires less resources than general-purpose deep learning frameworks on supported models and tasks thanks to many advanced optimizations: layer fusion, padding removal, batch reordering, in-place operations, caching mechanism, etc.
I am no expert in LLMs and I don't know what these optimizations are, but would it be feasible and/or desirable to implement them in llama.cpp or GGML?
llama.cpp already implements similar optimizations. They often come naturally when reimplementing a model in C/C++.
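To make "layer fusion" concrete (it is the easiest of the listed optimizations to show in a few lines): instead of running a matrix-vector product, a bias addition, and an activation as three separate passes over memory, a fused kernel computes all three in one pass. A minimal C sketch; the function names are illustrative and not taken from either codebase:

```c
#include <stddef.h>

// Unfused: three separate passes, so y is written and re-read three times.
// Computes y = relu(W*x + b) one step at a time.
void matvec(const float *W, const float *x, float *y, size_t m, size_t n) {
    for (size_t i = 0; i < m; i++) {
        float sum = 0.0f;
        for (size_t j = 0; j < n; j++)
            sum += W[i * n + j] * x[j];
        y[i] = sum;
    }
}
void add_bias(float *y, const float *b, size_t m) {
    for (size_t i = 0; i < m; i++) y[i] += b[i];
}
void relu(float *y, size_t m) {
    for (size_t i = 0; i < m; i++) y[i] = y[i] > 0.0f ? y[i] : 0.0f;
}

// Fused: the bias add and activation happen in the epilogue of the same
// loop, so each output element is written exactly once and never re-read.
void fused_matvec_bias_relu(const float *W, const float *x, const float *b,
                            float *y, size_t m, size_t n) {
    for (size_t i = 0; i < m; i++) {
        float sum = b[i];
        for (size_t j = 0; j < n; j++)
            sum += W[i * n + j] * x[j];
        y[i] = sum > 0.0f ? sum : 0.0f;
    }
}
```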
In my experience the most impactful optimization is to integrate vendor-specific libraries to run the matrix multiplications, which are usually the bottleneck for these models. For example, Apple Accelerate was a huge win for performance when it was first integrated into whisper.cpp. For x64 processors I recommend oneDNN, which has a very good 8-bit GEMM implementation (as fast as Intel MKL).
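As a concrete illustration of what "integrate a vendor library" means here: the hot loop reduces to a single GEMM call, which can be handed to whatever BLAS the platform provides. A minimal sketch against the standard CBLAS interface (Apple Accelerate, OpenBLAS, and MKL all export it); the matrix sizes are arbitrary example values:

```c
#include <stdio.h>
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>  // link with -framework Accelerate
#else
#include <cblas.h>                  // link with e.g. -lopenblas
#endif

int main(void) {
    // C = A * B with row-major single-precision matrices.
    const int M = 4, N = 3, K = 2;
    float A[4 * 2] = {1, 2, 3, 4, 5, 6, 7, 8};
    float B[2 * 3] = {1, 0, 1, 0, 1, 1};
    float C[4 * 3] = {0};

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f,   // alpha
                A, K,   // A and its leading dimension
                B, N,   // B and its leading dimension
                0.0f,   // beta
                C, N);  // C and its leading dimension

    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) printf("%6.1f ", C[i * N + j]);
        printf("\n");
    }
    return 0;
}
```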
However, I'm not aware of similar libraries providing efficient 4-bit GEMM at this time, and I also understand that llama.cpp is trying to avoid additional dependencies as much as possible.
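To make the 4-bit point concrete: a vendor GEMM would have to understand the block-quantized weight layout before it could multiply anything. A simplified sketch of the kind of dot product involved, loosely modeled on GGML's block quantization (the block size of 32 and the struct layout are illustrative, not GGML's exact format):

```c
#include <stdint.h>
#include <stddef.h>

#define QBLOCK 32  // illustrative block size

// One block of 32 weights: a per-block scale plus 32 packed 4-bit values.
// This mirrors the general idea of GGML's Q4 formats, not the exact layout.
typedef struct {
    float   scale;           // dequantization scale for the block
    uint8_t qs[QBLOCK / 2];  // two 4-bit values per byte
} block_q4;

// Dot product of a quantized row with a float vector. A stock BLAS
// knows nothing about this layout, which is why it can't be used as-is.
float dot_q4_f32(const block_q4 *blocks, const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t b = 0; b < n / QBLOCK; b++) {
        const block_q4 *blk = &blocks[b];
        for (size_t i = 0; i < QBLOCK / 2; i++) {
            // Unpack two nibbles and shift from [0,15] to [-8,7].
            int lo = (blk->qs[i] & 0x0F) - 8;
            int hi = (blk->qs[i] >> 4)   - 8;
            acc += blk->scale * lo * x[b * QBLOCK + 2 * i];
            acc += blk->scale * hi * x[b * QBLOCK + 2 * i + 1];
        }
    }
    return acc;
}
```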