
tune CUDA kernels automatically #206

Merged: 4 commits into JuliaGPU:master, Feb 8, 2021
Conversation

simeonschaub (Contributor)

This is still quite rough around the edges, but I am putting it up for feedback. It automatically splits the threads over the leading dimensions of the ndrange for better performance when the first dimension is small.
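A minimal sketch of the splitting idea, using a hypothetical plain-Julia helper for illustration (this is not the implementation in this PR): given a thread budget and the kernel's ndrange, fill each leading dimension in turn so that a small first dimension does not leave most of the workgroup idle.

```julia
# Hypothetical illustration, not the code in this PR: distribute a thread
# budget over the leading dimensions of an ndrange, filling each in turn.
function split_workgroupsize(max_threads::Int, ndrange::Dims)
    groupsize = ones(Int, length(ndrange))
    for (i, n) in enumerate(ndrange)
        groupsize[i] = min(n, max_threads)                 # take what fits along this dimension
        max_threads = max(1, max_threads ÷ groupsize[i])   # remaining thread budget
    end
    return Tuple(groupsize)
end

split_workgroupsize(256, (3, 1000))  # (3, 85) instead of using only 3 of the 256 threads
```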

@simeonschaub (Contributor, Author)

Trying this out on my real example, it seems like this adds quite a lot of overhead, so the optimal thread size should probably be determined at kernel creation time, not every time it is invoked.

@simeonschaub (Contributor, Author)

It turns out that this overhead came from having to switch from a static to a dynamic workgroupsize, so I am thinking it might be better to have a separate API for determining the optimal workgroupsize for a kernel.
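For context, a rough sketch of the static vs. dynamic distinction, assuming the KernelAbstractions API as it stood around this PR (the CUDA backend still lived in `src/backends/cuda.jl`): the workgroupsize is either fixed when the kernel object is constructed or chosen at launch time, and the latter is where the tuning, and the launch overhead mentioned above, happens.

```julia
using KernelAbstractions, CUDA  # API as of early 2021; later versions changed these names

@kernel function mul2!(A)
    I = @index(Global)
    A[I] = 2 * A[I]
end

A = CUDA.ones(3, 1000)

# Static workgroupsize: fixed when the kernel object is created.
kernel_static = mul2!(CUDADevice(), 256)

# Dynamic workgroupsize: chosen at each launch, which is where this PR's
# automatic tuning can kick in.
kernel_dynamic = mul2!(CUDADevice())

# Both are launched the same way; the call returns an event to wait on.
wait(kernel_dynamic(A; ndrange=size(A)))
```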

@simeonschaub (Contributor, Author)

Well, that was embarrassing... It turns out the weird performance issues I was seeing came from accidentally capping the CUDA memory limit way too low. 🤦 With that fixed, this does actually give a nice speedup, even if I change to a dynamic workgroupsize.

Two review threads on src/backends/cuda.jl (outdated, resolved)
@vchuravy (Member)

vchuravy commented Feb 8, 2021

bors try

bors bot added a commit that referenced this pull request on Feb 8, 2021 (Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>)
@simeonschaub changed the title from "RFC: tune CUDA kernels automatically" to "tune CUDA kernels automatically" on Feb 8, 2021
@simeonschaub marked this pull request as ready for review on February 8, 2021, 20:17
@simeonschaub (Contributor, Author)

simeonschaub commented Feb 8, 2021

Ok, I think I am fairly happy with this now. If you agree with the way this works, it should be good to go from my side.

@vchuravy (Member)

vchuravy commented Feb 8, 2021

bors r+

bors bot merged commit 8d50887 into JuliaGPU:master on Feb 8, 2021
@simeonschaub deleted the sds/autotune branch on February 8, 2021, 21:55
@simeonschaub (Contributor, Author)

Mind also tagging a release with this?

simeonschaub added a commit to simeonschaub/Tullio.jl that referenced this pull request Feb 8, 2021
JuliaGPU/KernelAbstractions.jl#206 added the ability to automatically tune the workgroupsize of CUDA kernels. This PR stops hardcoding a default workgroupsize and lets KernelAbstractions handle that. This does change the workgroupsize from being statically sized to being dynamically sized, but in my testing, even with fairly small workgroupsizes, that didn't really make a difference.
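A quick way to check that claim, building on the hypothetical `kernel_static`/`kernel_dynamic` sketch earlier in this thread (illustrative only, not Tullio's actual benchmark):

```julia
using BenchmarkTools  # reuses kernel_static, kernel_dynamic, and A from the sketch above

# Compare launch plus execution time of the statically and dynamically sized kernels;
# per the commit message above, the difference was negligible even for small workgroupsizes.
@btime wait(kernel_static($A; ndrange=size($A)))
@btime wait(kernel_dynamic($A; ndrange=size($A)))
```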
@vchuravy (Member)

vchuravy commented Feb 9, 2021

Done JuliaRegistries/General#29752
