Neural Engine Support #336
Replies: 19 comments 12 replies
-
@ggerganov Congratulations on getting Metal inference working (#1642), and on getting ggml funded! Now that this project is maturing, are there more plans for Apple Neural Engine (ANE) support? Some resources:
-
In addition, better tensor core support, plus support for the upcoming Meteor Lake VPUs and Ryzen AI-enabled CPUs, could be very beneficial. I believe work done on getting the Neural Engine running could translate directly into even better CUDA support and new DirectML acceleration.
-
Running the C++ code directly on the ANE is not possible. The only solution would be to carve parts of the network out into CoreML models and call them from the C++ code. Maybe the feed-forward blocks could be converted to CoreML and run in parallel. AFAIK this is not easy to do and would add a lot of complicated logic to the code, so we need to consider whether it's worth it given the speed-up the ANE offers.
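To make the shape of that glue concrete, here is a minimal sketch of calling one such carved-out CoreML model from Swift with execution steered toward the ANE. The file name "FeedForward.mlmodelc" and the feature name "hidden_state" are purely illustrative; nothing like this exists in the repo today:

```swift
import CoreML
import Foundation

// Hypothetical: run a pre-exported feed-forward block on the ANE.
// "FeedForward.mlmodelc" and "hidden_state" are made-up names for illustration.
func runFeedForwardOnANE(hidden: MLMultiArray) throws -> MLFeatureProvider {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine   // prefer the ANE, CPU as fallback

    let url = URL(fileURLWithPath: "FeedForward.mlmodelc")
    let model = try MLModel(contentsOf: url, configuration: config)

    // The activations coming out of the ggml graph would have to be copied
    // into `hidden` before this call, and copied back out afterwards.
    let input = try MLDictionaryFeatureProvider(dictionary: ["hidden_state": hidden])
    return try model.prediction(from: input)
}
```

The copies in and out of CoreML feature providers, plus the per-call dispatch overhead, are exactly the synchronization cost mentioned above, so the ANE speed-up would have to be large enough to pay for them.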
-
If they are not willing to run LLM inference on the iPhone, adding Neural Engine support would not be worthwhile.
-
Your statement is not right. I've got an M2 Mac, and the NPU, even though it is small, can run convolutions 2x faster than the GPU. Also, the M2 Max has a different Neural Engine compared with the iPhone. We could do some computations on the ANE in order to reduce the load on the GPU. I think it is something interesting to explore; however, the integration and synchronization inside the code is not trivial.
-
@Marcelo5444 The M2 has a 10-core GPU, so it is possible for highly optimized NPU code to run faster than the GPU. But that is probably not the case for the Pro and Max versions of the M2, as they have the same NPU but a much more powerful GPU. And do you have a source for your claim that "M2 Max has a different Neural Engine compared with the iPhone"?
-
NPU is faster for convolution, but it doesn't have enough speed for transformers. RNNs are not supported at all.
-
As someone who has been using CoreML and the ANE a lot lately, I can tell you that matmul (a big chunk of transformer FLOPS) can run faster on the ANE than on the GPU in some specific cases. Here is a good example: I'm attaching a CoreML model that does 100 matmuls. I get a 4x speed-up on an M2 Mac using the ANE compared with GPU execution (217 ms vs. 1316 ms). You can easily run it with Xcode. This comparison is not 100% fair, as llama.cpp has custom kernels that have been optimized for the GPU, but it shows the ANE has potential. Maybe, if someone has time, we could benchmark these matmuls against the llama.cpp custom kernels.
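For anyone who wants to reproduce this kind of comparison outside of Xcode's performance reports, a rough sketch along these lines should work. The file name "matmuls.mlmodelc", the input feature name "x", and the input shape are assumptions about the attached model, not verified:

```swift
import CoreML
import Foundation

// Time the same compiled CoreML model under two compute-unit settings.
// "matmuls.mlmodelc" and the feature name "x" stand in for the attached
// 100-matmul model; adjust the shape and name to whatever the model declares.
func timePrediction(computeUnits: MLComputeUnits) throws -> TimeInterval {
    let config = MLModelConfiguration()
    config.computeUnits = computeUnits

    let url = URL(fileURLWithPath: "matmuls.mlmodelc")
    let model = try MLModel(contentsOf: url, configuration: config)

    let x = try MLMultiArray(shape: [1, 1024, 1024], dataType: .float32)
    let input = try MLDictionaryFeatureProvider(dictionary: ["x": x])

    _ = try model.prediction(from: input)           // warm-up / compilation
    let start = Date()
    for _ in 0..<10 { _ = try model.prediction(from: input) }
    return Date().timeIntervalSince(start) / 10.0   // average seconds per run
}

do {
    let aneTime = try timePrediction(computeUnits: .cpuAndNeuralEngine)
    let gpuTime = try timePrediction(computeUnits: .cpuAndGPU)
    print(String(format: "ANE: %.1f ms, GPU: %.1f ms", aneTime * 1000, gpuTime * 1000))
} catch {
    print("benchmark failed: \(error)")
}
```

Xcode's Core ML performance report is still worth checking alongside this, because CoreML can silently fall back to GPU or CPU for layers it can't place on the ANE.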
-
@okpatil4u RNNs are supported, at least for inference.
-
This was their official response. Maybe the API has changed.
-
Yes, RNNs are basically loops with matrix multiplications. The .mlmodel I attached previously can be seen as an RNN, so the ANE supports vanilla RNNs.
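To make the "loops with matrix multiplications" point concrete, a vanilla RNN step is just two matmuls and an add per timestep. A rough sketch using the MLTensor API quoted later in this thread (the nonlinearity is left out for brevity, the shapes are illustrative, and whether CoreML actually schedules such a loop on the ANE is not verified here):

```swift
import CoreML

// A vanilla RNN unrolled over `steps` timesteps: h_t = W·x_t + U·h_{t-1}
// (bias and nonlinearity omitted for brevity). Each step is just two matmuls
// and an add — the same kind of op the attached .mlmodel chains together.
func unrolledRNN(steps: Int, hiddenSize: Int) -> MLTensor {
    let W = MLTensor(randomNormal: [hiddenSize, hiddenSize], scalarType: Float.self)
    let U = MLTensor(randomNormal: [hiddenSize, hiddenSize], scalarType: Float.self)
    var h = MLTensor(shape: [1, hiddenSize],
                     scalars: [Float](repeating: 0, count: hiddenSize),
                     scalarType: Float.self)
    for _ in 0..<steps {
        let x = MLTensor(randomNormal: [1, hiddenSize], scalarType: Float.self)
        h = x.matmul(W) + h.matmul(U)   // one recurrent step
    }
    return h
    // Materialize with: await unrolledRNN(steps: 8, hiddenSize: 64)
    //                          .shapedArray(of: Float.self)
}
```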
-
What about the upcoming Intel NPUs and AMD XDNA 2 that are coming in new processors? From 2024 on, all consumer PCs are supposed to have a powerful NPU capable of 50 TOPS, as Windows 12 will reportedly require this, and the figure will increase year over year. Can this type of NPU acceleration be supported and speed up inference with llama? And how do these 50 TOPS translate to tokens per second?
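There is no direct conversion from TOPS to tokens per second, but a compute-only upper bound can be sketched; in practice memory bandwidth, not TOPS, usually limits LLM token generation, so treat this as a ceiling, not a prediction. The model size, quantization, and bandwidth figures below are assumptions for illustration:

```swift
// Back-of-envelope: generating one token with an N-parameter decoder costs
// roughly 2*N multiply-accumulate operations (ignoring attention/KV overhead).
let params = 7.0e9             // assumed 7B-parameter model
let flopsPerToken = 2.0 * params
let npuOpsPerSecond = 50.0e12  // 50 TOPS, assuming every op is usable

let computeCeiling = npuOpsPerSecond / flopsPerToken
print(computeCeiling)          // ~3500 tokens/s, purely theoretical

// Real decoding is memory-bandwidth bound: every token streams all the weights.
// With 4-bit weights (~3.5 GB for 7B) and ~100 GB/s of DRAM bandwidth:
let weightBytes = 3.5e9
let bandwidth = 100.0e9
print(bandwidth / weightBytes) // ~28 tokens/s, a more realistic ceiling
```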
-
Apple released MLX last week: https://github.com/ml-explore/mlx
It might be useful for Neural Engine support. There is an example of using it to run Llama inference: https://github.com/ml-explore/mlx-examples/tree/main/llama
-
What about the new Snapdragon X Elite announced today? It has an NPU capable of 40 TOPS, and they said the new local generative AI features in Windows 24H2 will require this processor and NPU, so there must be some way to leverage the NPU for AI inferencing. The Apple M3 has an 18 TOPS NPU, so this Snapdragon is more than double that. Microsoft announced DirectML support for NPU acceleration of machine learning models, which should boost the use of generative AI models locally without cloud resources. Here they say Copilot AI will require a 40 TOPS NPU to run locally.
-
Snapdragon and llama.cpp page
-
Lots of cool stuff here.
-
Is there any point in supporting such a solution? https://mythic.ai/
-
Apple has a reference implementation for transformers on the ANE.
-
I've done some research on what would be required to utilize the Neural Engine on Apple devices as a ggml backend. Here's an example of a matmul operation using CoreML in Swift, from the Apple documentation:

```swift
let v1 = MLTensor([1.0, 2.0, 3.0, 4.0])
let v2 = MLTensor([5.0, 6.0, 7.0, 8.0])
let v3 = v1.matmul(v2)
v3.shape // is []
await v3.shapedArray(of: Float.self) // is 70.0

let m1 = MLTensor(shape: [2, 3], scalars: [
    1, 2, 3,
    4, 5, 6
], scalarType: Float.self)
let m2 = MLTensor(shape: [3, 2], scalars: [
    7, 8,
    9, 10,
    11, 12
], scalarType: Float.self)
let m3 = m1.matmul(m2)
m3.shape // is [2, 2]
await m3.shapedArray(of: Float.self) // is [[58, 64], [139, 154]]

// Supports broadcasting
let m4 = MLTensor(randomNormal: [3, 1, 1, 4], scalarType: Float.self)
let m5 = MLTensor(randomNormal: [4, 2], scalarType: Float.self)
let m6 = m4.matmul(m5)
m6.shape // is [3, 1, 1, 2]
```

To use the Neural Engine, the tensor operations need to be wrapped so that CoreML is told which compute units to run them on. This is a new API that was not available previously, so using it would mean that Neural Engine support (via CoreML) is only available on recent OS versions. The main benefit would be that we utilize more of the compute available on Apple chips, allowing more operations to run in parallel on hardware that is optimized at the chip level, which should lead to faster inference. I'm not sure I'll have enough time to implement this myself soon, though.
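As a concrete (and entirely hypothetical) illustration of the kind of hook a ggml backend built on this API might expose, here is a small async helper that multiplies two row-major float buffers with MLTensor. The function name and the round trip through Swift arrays are assumptions for illustration, not an existing interface:

```swift
import CoreML

// Hypothetical helper: multiply an (m x k) matrix by a (k x n) matrix using
// MLTensor and return the (m x n) result as a flat row-major array.
// A real ggml backend would avoid these copies and reuse buffers, and would
// still need a way to pin execution to the Neural Engine.
func coremlMatMul(_ a: [Float], _ b: [Float],
                  m: Int, k: Int, n: Int) async -> [Float] {
    let ta = MLTensor(shape: [m, k], scalars: a, scalarType: Float.self)
    let tb = MLTensor(shape: [k, n], scalars: b, scalarType: Float.self)
    let tc = ta.matmul(tb)
    let result = await tc.shapedArray(of: Float.self)
    return result.scalars
}

// Usage: [[1, 2], [3, 4]] x [[5, 6], [7, 8]] = [[19, 22], [43, 50]]
let c = await coremlMatMul([1, 2, 3, 4], [5, 6, 7, 8], m: 2, k: 2, n: 2)
print(c) // [19.0, 22.0, 43.0, 50.0]
```

Whether CoreML actually dispatches standalone tensor ops like this to the ANE (rather than the GPU or CPU) would need to be verified with Xcode's performance tools, since the placement decision is ultimately CoreML's.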
-
Would be cool to be able to lean on the Neural Engine. Even if it wasn't much faster, it'd still be more energy-efficient, I believe.