
[async] Clone offloaded tasks lazily by caching the AST #1619

Merged: 1 commit, merged on Aug 1, 2020


@k-ye (Member) commented Jul 31, 2020

To avoid cloning the offloaded task on every kernel launch, I've also cached the cloned offloaded AST. The cached AST and its hash are stored together in a struct called OffloadedCachedData, which serves as a read-only template. This optimization works because, most of the time, we only read the offloaded task's AST without modifying it.

For the places that do change the AST, the clone is made in a copy-on-write fashion. This clone is necessary to keep the template untouched. See KernelLaunchRecord::clone_stmt_on_write().

For now, only two places modify the AST:

  1. The first time an AST is compiled in compilation_workers, because some IR passes modify it.
  2. When two ASTs are fused.

FPS improvement:

  • mpm88: 6 -> 11
  • mpm99: 12 -> 26
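For a sense of scale, the reported FPS numbers translate into the speedups and per-frame times below. Frame time is taken as 1000/FPS ms, and measurement variance is ignored.

```latex
\begin{aligned}
\text{mpm88: } & 11/6 \approx 1.83\times, \quad 1000/6 \approx 166.7\ \text{ms} \to 1000/11 \approx 90.9\ \text{ms per frame} \\
\text{mpm99: } & 26/12 \approx 2.17\times, \quad 1000/12 \approx 83.3\ \text{ms} \to 1000/26 \approx 38.5\ \text{ms per frame}
\end{aligned}
```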

Profiling shows that cloning is no longer a hotspot:

[Screenshot: profiler output (Screen Shot 2020-07-31 at 21 20 08)]


I also found that #1593 actually prevented fusion in the test cases in test_fuse_dense.py (probably in test_fuse_dynamic.py as well, but I didn't test it). This was because the kernel wasn't fully simplified and still contained an empty serial task:

kernel {
  $0 = offloaded  
  body {
  }
  $1 = offloaded struct_for(S1dense) grid_dim=0 block_dim=1024 bls= 
  body {
    <i32 x1> $2 = loop $1 index 0
    <i32*x1> $3 = global ptr [S2place_i32], index [$2] activate=false
    <i32 x1> $4 = global load $3
    <i32 x1> $5 = const [1]
    <i32 x1> $6 = add $4 $5
    <i32*x1> $7 : global store [$3 <- $6]
  }
}

I fixed this by adding another fully_simplify() pass at the end of compile_to_offloads(), but I wonder which specific pass can get rid of that empty offloaded task? @xumingkuan


I'm not sure how much you'll like this template notion. Personally, I think it's an unintuitive but necessary workaround. Let me know what you think :)

Related issue = #1608



@k-ye k-ye changed the title [async] Clone offloaded tasks lazily by maintaining a cached template task [async] Clone offloaded tasks lazily by maintaining a cached AST Jul 31, 2020
@k-ye k-ye changed the title [async] Clone offloaded tasks lazily by maintaining a cached AST [async] Clone offloaded tasks lazily by aching the AST Jul 31, 2020
@codecov bot commented Jul 31, 2020

Codecov Report

Merging #1619 into master will decrease coverage by 23.29%.
The diff coverage is 50.00%.


@@             Coverage Diff             @@
##           master    #1619       +/-   ##
===========================================
- Coverage   86.51%   63.22%   -23.30%     
===========================================
  Files          19       19               
  Lines        3715     3717        +2     
  Branches      659      659               
===========================================
- Hits         3214     2350      -864     
- Misses        362     1239      +877     
+ Partials      139      128       -11     
Impacted Files                            Coverage Δ
python/taichi/lang/matrix.py              67.31% <33.33%> (-24.72%) ⬇️
python/taichi/lang/__init__.py            55.20% <100.00%> (-25.01%) ⬇️
python/taichi/lang/core.py                0.00% <0.00%> (-100.00%) ⬇️
python/taichi/lang/exception.py           33.33% <0.00%> (-66.67%) ⬇️
python/taichi/lang/ops.py                 43.11% <0.00%> (-49.41%) ⬇️
python/taichi/lang/kernel_arguments.py    51.61% <0.00%> (-48.39%) ⬇️
python/taichi/lang/shell.py               3.33% <0.00%> (-45.56%) ⬇️
python/taichi/lang/util.py                28.93% <0.00%> (-33.34%) ⬇️
python/taichi/lang/snode.py               67.56% <0.00%> (-26.13%) ⬇️
python/taichi/lang/expr.py                63.79% <0.00%> (-25.29%) ⬇️
... and 7 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6514e87...a3ee94c.

@yuanming-hu yuanming-hu changed the title [async] Clone offloaded tasks lazily by aching the AST [async] Clone offloaded tasks lazily by caching the AST Jul 31, 2020
@yuanming-hu (Member) left a comment

Awesome! Looks great to me. Thank you so much!

@xumingkuan (Contributor) commented Jul 31, 2020

I wonder which specific pass can get rid of that empty offloaded..?

Quick answer:

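void visit(OffloadedStmt *stmt) override {
  // An offloaded task that has a body but no statements does no work: erase it.
  if (stmt->has_body() && stmt->body->statements.empty()) {
    stmt->parent->erase(stmt);
    // Signal that the IR changed so the caller re-runs the traversal.
    throw IRModified();
  }
}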

(This reminds me about #1059...)

@xumingkuan (Contributor) left a comment

Cool! LGTM!

@k-ye k-ye merged commit 81437d6 into taichi-dev:master Aug 1, 2020
@k-ye k-ye deleted the clone branch August 1, 2020 09:24
@yuanming-hu yuanming-hu mentioned this pull request Aug 8, 2020
Labels: None yet
Projects: None yet
3 participants