Incoming backends: Vulkan, Kompute, SYCL #5138
-
I've mentioned it elsewhere, and this is only tangentially related, but GPU offload with the OpenCL backend has been pretty broken ever since the backend rework. The Phi, Mixtral and Falcon model architectures all no longer work, as Slaren explained here: #2059 (comment). I'm not sure what operations each of these new backends will support, but I would like to +1 Slaren's suggestion where "weights not supported by a backend are kept on the CPU instead", which would hopefully allow graceful performance degradation rather than just segfaulting.
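For illustration, a rough sketch of what that CPU fallback could look like against the ggml-backend API. `ggml_backend_supports_op()` exists in `ggml-backend.h`; the placement helper around it is hypothetical, not actual llama.cpp code:

```cpp
// Hypothetical sketch: offload an op to the GPU backend only if that backend
// reports a kernel for it; otherwise keep it on the CPU so the model still
// runs (slower) instead of segfaulting on an unsupported op.
#include "ggml-backend.h"

static ggml_backend_t pick_backend_for_op(
        ggml_backend_t gpu, ggml_backend_t cpu, const struct ggml_tensor * op) {
    if (ggml_backend_supports_op(gpu, op)) {
        return gpu; // GPU kernel available: offload
    }
    return cpu;     // graceful degradation: fall back to CPU
}
```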
-
It would be interesting to see how the Vulkan backend works on Android. I wish the Vulkan API would gain support for NPUs in modern hardware chipsets, if it doesn't already (ahem, Samsung Galaxy S24 AI).
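For anyone wanting to poke at what a given Android device actually exposes, a minimal, self-contained sketch using the standard Vulkan API (nothing llama.cpp-specific) that lists the physical devices the loader can see:

```cpp
// Minimal sketch: enumerate the Vulkan physical devices on a system.
// On Android this shows whether the GPU is visible to Vulkan at all;
// NPUs are generally not exposed through this API today.
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    VkApplicationInfo app = { VK_STRUCTURE_TYPE_APPLICATION_INFO };
    app.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo ci = { VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };
    ci.pApplicationInfo = &app;

    VkInstance instance;
    if (vkCreateInstance(&ci, nullptr, &instance) != VK_SUCCESS) {
        fprintf(stderr, "no Vulkan instance available\n");
        return 1;
    }

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(dev, &props);
        printf("device: %s (type %d)\n", props.deviceName, (int) props.deviceType);
    }

    vkDestroyInstance(instance, nullptr);
    return 0;
}
```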
-
Really nice. I'm curious whether any of these Vulkan implementations will work with the Raspberry Pi 5. Would be nice to take advantage of its GPU. 🤔
-
Can the SYCL backend be used with AMD cards? Also, can the SYCL backend let me use the CPU and GPU together?
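On the second question: SYCL as a standard can target both CPU and GPU devices, but what is actually usable depends on the installed runtime (for AMD cards, e.g. a HIP-based SYCL plugin). A minimal sketch, assuming a working SYCL 2020 toolchain, that lists what the runtime can see:

```cpp
// Minimal sketch: enumerate the devices the SYCL runtime exposes.
// Whether an AMD GPU shows up depends on the installed SYCL backend;
// this uses only standard SYCL 2020 API.
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    for (const auto & platform : sycl::platform::get_platforms()) {
        std::cout << "platform: "
                  << platform.get_info<sycl::info::platform::name>() << "\n";
        for (const auto & dev : platform.get_devices()) {
            std::cout << "  device: "
                      << dev.get_info<sycl::info::device::name>()
                      << (dev.is_gpu() ? " [GPU]" : dev.is_cpu() ? " [CPU]" : "")
                      << "\n";
        }
    }
    return 0;
}
```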
-
Some benchmarks of Vulkan and Kompute on 6750XT/5800X3D. (Model is SOLAR 10.7B Q4_1.)
-
Hmm, not sure if this is already implemented or not, but running the inference workload in compute shaders should improve performance.
-
ref:

There are 3 new backends that are about to be merged into `llama.cpp`. The tentative plan is to do this over the weekend. Due to the large amount of code that is about to be merged, I'm creating this discussion as a quick communication channel between the maintainers in case problems arise.

The main goal after merging the backends is to make the CI green, which would give some level of confidence that the existing stuff has not been broken. Even if the new backends don't function completely as expected, this would be acceptable, as the idea is to improve on them over time. However, we want the CPU, CUDA and Metal backends to remain stable.

I'm thinking to do the merges all at once (in a batch) and sync everything back to the `ggml` and `whisper.cpp` repos.

If you have any general comments / questions, we can discuss them here. I will keep high attention to the discussion until we finalize the merges, and will also put it in the readme for awareness. We can discuss code specifics in the respective PRs as usual and keep this discussion focused on high-priority stuff (if needed).