forked from KimHenrikOtte/candle
draft: i64 support (tested on clip model)
1 parent 991a1b0, commit 71fdc57
Showing 8 changed files with 1,820 additions and 1,140 deletions.
71fdc57
I reviewed the code and found a few things:
71fdc57
Thanks for the feedback
I definitely agree that we should switch to enums; it would also solve the problem of selecting the input/output size of pipelines and reduce repeated code. The bools were added just to check whether I am even capable of fixing it, as I am new to everything (including Rust).
All the changes in device.rs were only for debugging and getting the project running, as I initially was not able to get my GPU recognized.
I also noticed that the clip demo example with 20 images loaded took around 23x the time on the GPU as on my CPU (4s CPU -> 94s GPU) with the default (your) device.rs with wgpu::Features::SHADER_INT64 added, with my CPU pinned at 100% and very little GPU utilization. That could also be caused by me using an Intel(R) HD Graphics 4400 Haswell GPU that doesn't have complete Vulkan support (feels less likely). What do you think could be the cause of this, and could parallelism or the log message "wgpu_core::device::resource: Device::maintain: waiting for submission index x" be related?
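(For reference, requesting that feature at device creation looks roughly like the sketch below; this is not the fork's actual device.rs, and exact wgpu signatures vary slightly between versions.)

```rust
// Minimal sketch: opt in to 64-bit integer shader support when creating the
// wgpu device. The adapter must actually report SHADER_INT64 for this to work.
async fn create_device_with_i64() -> (wgpu::Device, wgpu::Queue) {
    let instance = wgpu::Instance::default();
    let adapter = instance
        .request_adapter(&wgpu::RequestAdapterOptions::default())
        .await
        .expect("no suitable GPU adapter found");
    assert!(
        adapter.features().contains(wgpu::Features::SHADER_INT64),
        "adapter does not support SHADER_INT64"
    );
    adapter
        .request_device(
            &wgpu::DeviceDescriptor {
                label: Some("candle-wgpu-device"),
                required_features: wgpu::Features::SHADER_INT64,
                ..Default::default()
            },
            None, // no API trace
        )
        .await
        .expect("failed to create device with SHADER_INT64")
}
```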
71fdc57
I just tested your fork on my pc:
cpu: 550-600ms
gpu: about the same (I got many 550-600ms as well, but also a few 700ms)
I dumped the GPU timings (one also needs to enable the timestamp queries feature and must use the wgpu_debug feature: --features="wgpu_Debug") and analysed the resulting JSON with "/auswertung2.ipynb".
I get the following result: in my case the GPU is only working for 41 ms; the rest seems to be initialization and communication overhead.
It is quite hard to guess why it is so slow for you.
I guess there might be a shader that is badly compiled or too "big" for your GPU (I have seen the 64x64-tiled matmul with 8x8 work per thread be much slower (~60 times) in the browser than native, while other shaders (e.g. matmul 64x64 with 4x4 work per thread) have the same performance).
71fdc57
I just analyzed why the performance was 10 times slower than the actual GPU computation time on my machine: half of the time was spent loading the model into GPU RAM.
If I load the clip model first and then run the same forward pass 2 times in a row, the second call (when everything is fully initialized) is on the order of the GPU computation time.
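The measurement pattern is roughly the following (a sketch with placeholder names, not candle's actual API):

```rust
use std::time::Instant;

// Run one warm-up forward pass so the weights are uploaded to GPU memory and
// the pipelines are built, then time a second pass that mostly measures the
// computation itself. `model`, `input` and `forward` are placeholders here.
fn time_second_pass<M, I, O>(model: &M, input: &I, forward: impl Fn(&M, &I) -> O) {
    let _ = forward(model, input); // warm-up: pays for weight upload and setup
    let start = Instant::now();
    let _ = forward(model, input); // steady state: close to pure GPU compute time
    println!("second forward pass: {:?}", start.elapsed());
}
```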
71fdc57
Thanks @KimHenrikOtte, I ran the clip model again and observed that the GPU version is running a lot faster: 220 images in 10-11s compared to 44s on CPU.
For the GPU version I noticed that both model.forward and model.get_image_features took around 10 seconds when run serially, irrespective of input size, with 100% CPU usage and near-0% GPU usage (core running at 200 MHz), i.e. every model.forward takes 10 more seconds.
Could it be because I am running a slow CPU, or is it loading the model to memory every time I run model.xyz?
I will take a look at the execution time with the procedure you mentioned above and come back to you.
Also, I am not sure what happened yesterday that made the default clip/main.rs take 94s, but I think I might have put model.forward in a loop.
I would also like to confirm whether running model.forward once makes the model stay in GPU memory, and is there a better/recommended way to run models to batch-process data?
71fdc57
Also, do you have any development roadmap in mind that we can follow, so that we are on the same page while implementing things? I will be able to work on this on Saturdays and Sundays.
And do you want to concentrate first on performance optimization, or on getting all the features implemented and then start optimizing for performance?
I am thinking that if we build out all the features, we will have a bigger picture of the problems we could face that might force us to refactor, but I am a bit too early into this project to know whether that could even be a problem.
71fdc57
@KimHenrikOtte I did a little bit of profiling and the duration % looked pretty much identical to yours, +/- 5%. The CPU profiling led me to find out that libvulkan_lvp.so (Vulkan on CPU) was the cause of the CPU usage and the lack of GPU usage, even though it was faster than the raw CPU 🧙. It looks like something I need to investigate (I am speculating that duplicate images being used could be the cause of it).
71fdc57
There are a few things that I think are not possible or too complicated/slow to implement.
E.g. quantized matrices, f16 and proper u8 support.
While reading u8 input values can be achieved by reading a u32 and using bit shifting, writing a single u8 to a buffer is not trivial as there would be race conditions between threads.
For best performance, all shaders with u8 would need a custom impl that processes 4 values at once.
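As a rough illustration of the read path (host-side Rust mirroring what the shader would do; byte order is assumed little-endian, as on typical WebGPU targets):

```rust
// Four u8 values packed into one u32: reading is just shifts and masks, but
// writing a single u8 means rewriting the whole 4-byte word, which races with
// the threads that own the other three bytes.
fn unpack_u8x4(word: u32) -> [u8; 4] {
    [
        (word & 0xFF) as u8,
        ((word >> 8) & 0xFF) as u8,
        ((word >> 16) & 0xFF) as u8,
        ((word >> 24) & 0xFF) as u8,
    ]
}

fn pack_u8x4(bytes: [u8; 4]) -> u32 {
    (bytes[0] as u32)
        | ((bytes[1] as u32) << 8)
        | ((bytes[2] as u32) << 16)
        | ((bytes[3] as u32) << 24)
}
```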
There are proposals to add f16 support to WGSL, but I think these are only for calculations inside the shader; loading from and writing to memory still needs 4-byte alignment.
Things currently missing from the wgpu fork:
- I64 and F64
- Implement all nn-implementations (and argsort).
- Add ability for custom shader
- add documentation
- fix wasm examples (the index.html files are only copies from other examples; the settings panel isn't working)
71fdc57
I just committed a first version with i64 and f64 support (KimHenrikOtte@2d8518f) with some of your changes merged in.
But the commit still has the following problem:
The current implementation expects the same alignment for all input and output buffers, but some shaders require different types on different inputs:
- index_select may use 4-byte aligned index buffers and 8-byte aligned input and output buffers,
- where_cond can use a different alignment for the condition buffer and the input/output buffers,
- matmul loads inputs 16-byte aligned, but currently writes dest_buffer with 4-byte alignment.
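To illustrate the problem (purely hypothetical types, not the fork's actual code), each binding would need its own element width instead of one alignment per pipeline:

```rust
// Hypothetical sketch only: none of these types exist in the fork.
#[derive(Clone, Copy, Debug)]
enum ElementWidth {
    Bytes4,  // e.g. f32/u32 data, u32 index or condition buffers
    Bytes8,  // e.g. i64/f64 data
    Bytes16, // e.g. vec4<f32> loads used by the matmul inputs
}

#[derive(Debug)]
struct OpLayout {
    inputs: Vec<ElementWidth>,
    output: ElementWidth,
}

// index_select on i64 data: 8-byte input, 4-byte index buffer, 8-byte output.
fn index_select_i64_layout() -> OpLayout {
    OpLayout {
        inputs: vec![ElementWidth::Bytes8, ElementWidth::Bytes4],
        output: ElementWidth::Bytes8,
    }
}
```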
71fdc57
That's awesome @KimHenrikOtte, I will take a look at it now.
About the missing stuff: I could pick up the documentation and fixing the wasm examples as a simple first step until I get the hang of the code base.
Getting the documentation done through me could require a little more effort on your part, but it could help clear up any misunderstandings I have about the code base.
Do you have anything in mind that has to be documented?
Do you also want to list out which of the default example models work through WebGPU?
Can you explain what your intentions are for "Add ability for custom shader"? Is this supported by Candle, or are any other accelerated runtimes supporting this?
71fdc57
I mean the ability for anyone using Candle to create and use their own shaders.
This was suggested in the WebGPU candle issues:
But I do not know how easy this could be integrated. You would need a way to specify your own shaders (e.g. by a string).
In general, the code uses a lot of custom structs and enums for various operations, and what they do could be better documented.
71fdc57
Thanks @KimHenrikOtte, I will think about this as well.
71fdc57
@EdupugantiAkhil Yesterday I committed a change to optimise the cache system for growing buffers (the metavoice example showed a huge performance drop because many buffers and bindgroups were created and deleted before).
I got the following results (CPU: AMD Ryzen 7 5800X 8-Core Processor, GPU: NVIDIA GeForce GTX 1080 Ti):
llama2-c:
cpu: 331.49 token/s
gpu: 291.99 token/s,
clip:
cpu elapsed: 250.5971ms
gpu elapsed: 293.1473ms
stable-diffusion:
cpu: 369.8279709s
gpu: 33.4540036s
t5:
cpu: 32 tokens generated (67.43 token/s)
gpu: 32 tokens generated (116.59 token/s)
metavoice:
cpu: 566.7995057s
gpu: 37.9660216s
wuerstchen:
gpu: 148.8636398s
cpu: 986.7038538s
71fdc57
That is super nice @KimHenrikOtte
Last week I spent time getting my GPU situation fixed and was able to get this running on a Vega 7 (AMD 4650G) and an Adreno 725 (Snapdragon 7+ Gen 2) on the Vulkan runtime.
I also have a modified version of matmul5 that saved 10 sec (from 70s -> 58s) off my Vulkan CPU runtime for Whisper, but I have not tested it on a real GPU or double-checked the results (will do that next week, along with the other stuff).
71fdc57
Ok, just for documentation, so we are on the same page:
In matmul.pwgsl there are mostly early versions of the matmul algorithm:
matmul1: naive impl (may be the fastest for small matrix multiplications)
matmul1_16: naive impl with 4 simultaneous loads (only works when k_stride is 1)
matmul1_end: artifact of a previous test; the idea was to use a tiled shader for the output first, and this version then calculates the edge of the matrix multiplication, but other shaders (e.g. matmul7) seem to be faster in this case
matmul5: based on https://cnugteren.github.io/tutorial/pages/page7.html, with a WorkPerThread of 8 and 16x16 tiles (but still no rectangular tiles or wider data loads)
matmul7: a modified version of matmul5 that supports edge cases (e.g. input matrices need not be divisible by 16). Performance measurements show that in some cases it is faster to use matmul7 instead of padding the input matrices and using a matmul5-based shader.
Under "sgemm/matmul..." there are more sophisticated shader models:
They all use the same MatmulHelper.pwgsl base functions with different sets of defines.
They use wider data loads (width = 4), rectangular tiles, WorkPerThread in both dimensions (e.g. each thread could compute 4x4 values), and theoretically prefetching, but this did not show any performance gain.
Also, your matmul5 impl seems to be wrong:
Asub here is a local variable and not shared memory in the workgroup.
Also, I think whether memory can be coalesced depends on the input stride.
In addition, I think you are not loading enough values:
For example, if you have 16x16 tiles and work per thread x = 8, WPTy = 1, then you have 2x16 threads responsible for the 16x16 tile.
To load the 16x16 = 256 values of the tile, each thread must also load 8 values here.
Perhaps it makes more sense to use an sgemm/matmul16x16 version instead of matmul5 if the input is not divisible by 32 but is divisible by 16 (if that is faster).
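A tiny self-contained check of that tile-loading arithmetic (illustrative only, not the fork's shader code):

```rust
// Elements each thread must load for a given tile size and work-per-thread
// configuration: total tile elements divided by threads per workgroup.
fn loads_per_thread(tile_m: u32, tile_n: u32, wpt_m: u32, wpt_n: u32) -> u32 {
    let threads = (tile_m / wpt_m) * (tile_n / wpt_n);
    (tile_m * tile_n) / threads
}

fn main() {
    // 16x16 tile, WPTx = 8, WPTy = 1 -> 2x16 = 32 threads -> 8 loads per thread.
    assert_eq!(loads_per_thread(16, 16, 8, 1), 8);
}
```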
71fdc57
Thanks for the feedback @KimHenrikOtte, I hope I didn't scare you too much with my code 😅
Also, I would like to ask what resources I should look at to get my fundamentals up, as I lack the knowledge even to give you a proper reply.
Maybe going through https://sotrh.github.io/learn-wgpu/, https://cnugteren.github.io/tutorial/, or any other stuff?
Can you give me recommendations from the perspective of a beginner, or "if I were to learn it again, how would I do it"?
I also want to let you know that I really appreciate your effort. Thanks again.
71fdc57
No problem,
I think the learn-wgpu tutorial is mostly aimed at rendering, not compute shaders.
Traditionally for rendering you would create vertex buffers, then call vertex and fragment shaders to transform vertices and compute pixel data.
Apart from this "rendering" pipeline, there is also a "compute" pipeline to handle arbitrary data (the WGPU fork only uses this architecture).
The "Tutorial: OpenCL SGEMM tuning for Kepler" is a great tutorial for understanding the principles of optimizing Matmul for GPUs (although it uses OpenCL, the principles are 100% the same for WGSL).
For visualizations and understanding how compute shaders divide work into multiple threads, I liked this post: "https://webgpufundamentals.org/webgpu/lessons/webgpu-compute-shaders.html"
For a tutorial on how the WGSL syntax works, "https://google.github.io/tour-of-wgsl/" has short descriptions and examples of most of the operations.
71fdc57
@EdupugantiAkhil I had also looked at the article "https://siboehm.com/articles/22/CUDA-MMM" (it optimizes matmul in CUDA and has tons of nice visualizations).
71fdc57
Thanks @KimHenrikOtte, I will go through them. Also feel free to share any resources you find interesting.
71fdc57
@EdupugantiAkhil I just committed some improvements for the matrix multiplication:
I optimized the shader parameters like WPTN, WPTM etc. for better performance (I just tested a few different parameters with the matmul benchmark and looked for the best ones).
The best selection I found was the following, with a throughput of 3161.4 GiB/s (which is a little bit better than the best 64x64 8x8 configuration I found so far, which only got about 2500 GiB/s).
Unfortunately, when matrix B is transposed (matrix B k-stride = 1), the performance is much worse (only 336.65 GiB/s).
I don't really understand why, but I found that with TSK = 8 the shader is a little slower overall, but when B is transposed I measured around 3000 GiB/s again.
So the current impl will use 32x64b (for a transposed B matrix) and 32x64 otherwise.
llama2-c:
gpu: 292.49 token/s
clip:
gpu elapsed: 387.5123ms
stable-diffusion:
gpu 23.9633768s
t5:
gpu: 32 tokens generated (124.99 token/s)
metavoice:
gpu: 30.6853641s
wuerstchen:
gpu: 92.0862131s
71fdc57
Nice work @KimHenrikOtte, I just skimmed through the diffs, and I will check how the code impacts performance on the hardware I have.
It might take some time as I have some personal work that is going to keep me busy till the end of this month. I hope it didn't/doesn't cause too many problems for you. Sorry about the short notice.
https://youtu.be/QQceTDjA4f4?si=3oXrPYOKU2h9wSdK might be able to help (they talk about memory starting at the 7-minute mark, but it's worth a watch if you haven't already, or you might already know about this).
I was also able to go through the resources you gave me. I still feel a little foggy about the WebGPU stuff, but it will get clearer the more I use it.
71fdc57
Hello @KimHenrikOtte,
I apologize for the abrupt pause in communication. I've now wrapped up my personal commitments and am ready to resume my contributions to the project. Do you have anything in mind that I could help you with?
I can go through the new changes you have made, try out some of your WASM examples, and maybe try my luck at fixing the MutexGuard-held-across-an-await-point issues.
If possible, I’d prefer to also contact you through other means, such as Discord or email. You can reach out to me at akhile76@gmail.com.
71fdc57
No problem,
I don't have much time to work on the branch at the moment.
But I think the current branch "KimHenrikOtte/candle at wgpu_cleanup" is at least in a state where it could be merged with candle.
(At least as a "beta" feature, so you can try it out).
Still, there are a few things one might want to improve/implement:
- candle-core argsort impl (I tried it in the beginning with ChatGPT, but I was confused by the implementation; I think this should be doable, though, e.g. see candle-metal-kernels/src/sort.metal).
- candle-nn (ropei, rope, rope_thd) impl
- clippy warnings
- Flash-attention? But I think you would at least need a different bindgroup layout (e.g. more than 3 inputs), which would probably result in more than one bindgroup per operation.
Performance:
Currently there are 3 in-place optimisations (to reuse the buffer for unary, binary, or copy operations if this is the last time the buffer is used). For example, instead of
new_value = a + b;
copy new_value to current_value;
the a + b could write directly to current_value instead of allocating new_value.
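As a self-contained illustration of that idea (plain slices standing in for GPU buffers; not the fork's API):

```rust
// Naive pattern: compute into a temporary buffer, then copy it into the
// destination. This is the extra allocation + copy the optimisation removes.
fn add_then_copy(a: &[f32], b: &[f32], current: &mut Vec<f32>) {
    let new_value: Vec<f32> = a.iter().zip(b).map(|(x, y)| x + y).collect();
    current.clear();
    current.extend_from_slice(&new_value);
}

// In-place pattern: if `current` is not read again afterwards, write the
// result of `a + b` directly into it, skipping the temporary and the copy.
fn add_in_place(a: &[f32], b: &[f32], current: &mut [f32]) {
    for ((dst, x), y) in current.iter_mut().zip(a).zip(b) {
        *dst = x + y;
    }
}
```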