-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[async] Clone offloaded tasks lazily by caching the AST #1619
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1619 +/- ##
===========================================
- Coverage 86.51% 63.22% -23.30%
===========================================
Files 19 19
Lines 3715 3717 +2
Branches 659 659
===========================================
- Hits 3214 2350 -864
- Misses 362 1239 +877
+ Partials 139 128 -11
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Looks great to me. Thank you so much!
Quick answer: taichi/taichi/transforms/simplify.cpp Lines 791 to 796 in 6514e87
(This reminds me about #1059...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool! LGTM!
To avoid having to clone the offloaded task upon every kernel launch, I've cached the cloned offloaded AST as well. The cached AST, along with its hash, is put inside a struct called
OffloadedCachedData
, and is used as a read-only template. This optimization works because for most of the times, we are just reading the offloaded task AST without any modification.For places that do change the AST, the clone is done via a copy-on-write way. This clone is necessary so as to keep the template untouched. Please see
KernelLaunchRecord::clone_stmt_on_write()
.For now, we only have two places that will modify the AST:
compilation_workers
. Some IR passes will modify it.FPS improvement:
mpm88
: 6 -> 11mpm99
: 12 -> 26Profiling shows that clone is no longer a hotspot:
Another thing is that, I found that #1593 has actually prevented the fusion in the test cases in
test_fuse_dense.py
(probablytest_fuse_dynamic.py
as well, but i didn't test). This was because the kernel wasn't fully simplified, and contained an emptyserial
task:I fixed this by adding another
fully_simplify()
pass in the end ofcompile_to_offloads()
, but I wonder which specific pass can get rid of that empty offloaded..? @xumingkuanI'm not sure how much you like this template notion. Personally I think it's a not-very-intuitive but necessary workaround. Let me know your ideas :)
Related issue = #1608
[Click here for the format server]