ggml : cgraph export/import/eval example + GPU support #108
Conversation
@ggerganov I'm a bit curious/interested in this approach; I like that you are trying to separate ggml and the GPU implementation layer like this. I'd be keen to make a quick attempt at executing the ggml graph output you have here using WebGPU from Zig, but I'm not sure exactly how to piece that output together (or even read it, necessarily) - so I wonder if you'd consider adding a C example or something that executes it on the CPU and validates the results it gets, so I could better understand how it works?
@slimsag Will try to prioritise this soon and finalize the export format + a CPU and/or Metal example.
Netron supports many formats of exported graphs already. I think GGML could be easily added. |
Force-pushed from 6264c52 to eed3eac.
Bit of slow progress here, but I think it is starting to work out.
I've been waiting for this for months. Nothing has been as easy to use as llama.cpp.
Ok, I'm finally at the interesting part. I have the […]
Regarding the memory mapping, it looks like I need to use MTLHeap to map the ggml buffer. Everything should go into a single MTLHeap.
Even though that command buffer takes multiple milliseconds, it won't cause a UI hitch. The Apple GPU can execute two separate command buffers concurrently from different MTLCommandQueues.
This is now working as expected and can serve as a proof-of-concept for offloading a ggml computation graph to the GPU. Before merging this, I will move the new import / export functions to the core ggml library. After merging, the next step will be to implement LLaMA inference with the same approach.
This is the first step towards full GPU and custom hardware inference support (see ggerganov/llama.cpp#915).

The idea is to be able to export ggml computation graphs (ggml_cgraph) into standalone .ggml files. These files can later be imported by a separate application and evaluated based on the available hardware / framework (CUDA, Metal, WebGPU, etc.). The computation graph contains everything necessary to perform the inference.
As an example, we export the MNIST computation graph from the mnist example into the file mnist.ggml.
Next, using the mnist-cpu tool, we load the graph and re-evaluate it on the CPU using ggml_graph_compute().
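The import side might look roughly like this; a sketch of what the mnist-cpu tool presumably does, using the ggml_graph_import() API from this PR, with the tensor names and memory size again being assumptions carried over from the sketch above:

```c
#include "ggml.h"
#include <stdio.h>

int main(void) {
    struct ggml_context * ctx_data = NULL;
    struct ggml_context * ctx_eval = NULL;

    // read the serialized graph; ggml_graph_import() fills two contexts,
    // one with the tensor data and one with the graph structure
    struct ggml_cgraph gf = ggml_graph_import("mnist.ggml", &ctx_data, &ctx_eval);
    gf.n_threads = 1; // cgraphs carried the thread count in this ggml era

    // a separate context supplies scratch memory for the computation
    struct ggml_init_params params = {
        .mem_size   = 16*1024*1024,
        .mem_buffer = NULL,
        .no_alloc   = false,
    };
    struct ggml_context * ctx_work = ggml_init(params);

    // fill the input tensor (one 28x28 digit) before evaluating
    struct ggml_tensor * input = ggml_graph_get_tensor(&gf, "input");
    for (int i = 0; i < 28*28; ++i) {
        ((float *) input->data)[i] = 0.0f; // replace with real pixel data
    }

    // re-evaluate the imported graph on the CPU
    ggml_graph_compute(ctx_work, &gf);

    const struct ggml_tensor * probs = ggml_graph_get_tensor(&gf, "probs");
    for (int i = 0; i < 10; ++i) {
        printf("digit %d: %f\n", i, ((float *) probs->data)[i]);
    }

    ggml_free(ctx_work);
    ggml_free(ctx_data);
    ggml_free(ctx_eval);
    return 0;
}
```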
Or we can run it on the Apple Silicon GPU using Metal.
Here is a sample run:
$ dot -Tpng mnist.dot -o mnist.dot.png && open mnist.dot.png
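Presumably the mnist.dot file is produced by ggml's built-in Graphviz dumper, called on the same graph right before export; a one-line sketch that would slot into the export program above:

```c
// dump the forward graph in DOT format; the second argument is the forward
// graph only when dumping a backward graph, so it is NULL here
ggml_graph_dump_dot(&gf, NULL, "mnist.dot");
```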
[Result images: CPU (via ggml) / Metal]