
[Perf] Improve sparse computation performance #1263

Open
yuanming-hu opened this issue Jun 17, 2020 · 1 comment
Assignees
Labels
feature request Suggest an idea on this project

Comments

@yuanming-hu (Member)

We haven't done systematic performance engineering since we switched to the LLVM backend. I suspect there is a lot of room for performance optimization, especially when it comes to sparsity and hierarchical data structures. Note that we have made pretty big changes to the IR/optimization system without performance regression tests, so sparse computation performance has probably degraded over time (e.g. #1182).

Steps

  1. Inspect sparse computation performance.
    • ...
  2. Enable performance regression tests on an MIT server.
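Step 2 could be driven by a small timing harness that compares each benchmark against a stored baseline; a minimal pure-Python sketch (a hypothetical helper, not an existing Taichi utility — the actual server setup is unspecified here):

```python
import time

def check_regression(bench, baseline_s, tolerance=1.5, repeats=3):
    """Run `bench` several times; flag a regression if even the best
    run is more than `tolerance` times slower than the recorded baseline."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        bench()
        best = min(best, time.perf_counter() - t0)
    return best, best <= baseline_s * tolerance
```

Taking the best of several repeats filters out scheduler noise, which matters when timings from a shared server gate CI.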
@yuanming-hu (Member, Author)

One performance issue I found:

```python
import taichi as ti

ti.init(arch=ti.cuda, print_ir=True, kernel_profiler=True)

a = ti.var(ti.i32)
b = ti.var(ti.i32)

# Two parallel sparse arrays: a pointer level with 128K cells,
# each pointing at a dense block of 1024 elements.
block = ti.root.pointer(ti.i, 1024 * 128)
block.dense(ti.i, 1024).place(a)
block.dense(ti.i, 1024).place(b)

@ti.kernel
def activate():
    # Touch every element so all pointer cells become active.
    for i in range(1024 * 128 * 1024):
        a[i] = i

@ti.kernel
def copy():
    # Struct-for over the active elements of a.
    for i in a:
        b[i] = a[i]

activate()

for i in range(10):
    copy()

ti.kernel_profiler_print()
```

Currently, copy takes 11 ms, with the IR being:

```
kernel {
  $0 = offloaded clear_list S1pointer
  $1 = offloaded listgen S0root->S1pointer
  $2 = offloaded clear_list S2dense
  $3 = offloaded listgen S1pointer->S2dense
  $4 = offloaded struct_for(S2dense) block_dim=0 {
    <i32 x1> $5 = loop $4 index 0
    <gen*x1> $6 = get root
    <i32 x1> $7 = const [0]
    <gen*x1> $8 = [S0root][root]::lookup($6, $7) activate = false
    <gen*x1> $9 = get child [S0root->S1pointer] $8
    <i32 x1> $10 = bit_extract($5 + 0) bit_range=[10, 27)
    <gen*x1> $11 = [S1pointer][pointer]::lookup($9, $10) activate = false
    <gen*x1> $12 = get child [S1pointer->S2dense] $11
    <i32 x1> $13 = bit_extract($5 + 0) bit_range=[0, 10)
    <gen*x1> $14 = [S2dense][dense]::lookup($12, $13) activate = false
    <i32*x1> $15 = get child [S2dense->S3place_i32] $14
    <i32 x1> $16 = global load $15
    <gen*x1> $17 = [S1pointer][pointer]::lookup($9, $10) activate = true
    <gen*x1> $18 = get child [S1pointer->S4dense] $17
    <gen*x1> $19 = [S4dense][dense]::lookup($18, $13) activate = false
    <i32*x1> $20 = get child [S4dense->S5place_i32] $19
    <i32 x1> $21 : global store [$20 <- $16]
  }
}
```

Note that $17 does an extra activation.
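To see why this matters, here is a toy pure-Python model (an assumed structure for illustration, not Taichi's actual runtime) of a pointer cell whose activating lookups are counted:

```python
class PointerNode:
    """Toy model of a pointer SNode level: `activate=True` lookups go
    through an expensive (atomic-on-GPU) activation path; weak lookups
    do not. Purely illustrative, not Taichi internals."""

    def __init__(self, n):
        self.cells = [None] * n   # None = inactive cell
        self.activations = 0      # number of expensive activation attempts

    def lookup(self, i, activate=False):
        if activate:
            self.activations += 1      # models the atomic activate path
            if self.cells[i] is None:
                self.cells[i] = {}     # allocate the child block
        return self.cells[i]

node = PointerNode(4)
node.lookup(2, activate=True)   # first touch: activation is genuinely needed

# The pattern in the IR above: a weak lookup ($11) for the load, then a
# second, activating lookup ($17) of the same cell for the store. Since the
# struct-for only visits active cells, the second activation is redundant.
node.lookup(2, activate=False)
node.lookup(2, activate=True)
print(node.activations)  # 2: one necessary activation + one redundant one
```

In the real kernel this redundant activate path runs once per loop iteration, which is where the measurable overhead comes from.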

After #1270, it takes 6.8 ms (1.6x faster), and the IR becomes:

```
kernel {
  $0 = offloaded clear_list S1pointer
  $1 = offloaded listgen S0root->S1pointer
  $2 = offloaded clear_list S2dense
  $3 = offloaded listgen S1pointer->S2dense
  $4 = offloaded struct_for(S2dense) block_dim=0 {
    <i32 x1> $5 = loop $4 index 0
    <gen*x1> $6 = get root
    <i32 x1> $7 = const [0]
    <gen*x1> $8 = [S0root][root]::lookup($6, $7) activate = false
    <gen*x1> $9 = get child [S0root->S1pointer] $8
    <i32 x1> $10 = bit_extract($5 + 0) bit_range=[10, 27)
    <gen*x1> $11 = [S1pointer][pointer]::lookup($9, $10) activate = false
    <gen*x1> $12 = get child [S1pointer->S2dense] $11
    <i32 x1> $13 = bit_extract($5 + 0) bit_range=[0, 10)
    <gen*x1> $14 = [S2dense][dense]::lookup($12, $13) activate = false
    <i32*x1> $15 = get child [S2dense->S3place_i32] $14
    <i32 x1> $16 = global load $15
    <gen*x1> $17 = get child [S1pointer->S4dense] $11
    <gen*x1> $18 = [S4dense][dense]::lookup($17, $13) activate = false
    <i32*x1> $19 = get child [S4dense->S5place_i32] $18
    <i32 x1> $20 : global store [$19 <- $16]
  }
}
```
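The transformation can be sketched as a tiny deduplication pass over lookup instructions (a simplified stand-in for the actual #1270 change, using an assumed tuple representation of the IR):

```python
def dedupe_lookups(instrs):
    """instrs: list of (dest, parent, index, activate) lookup records in
    program order. Returns (kept, replacements): a later lookup on the same
    (parent, index) is dropped and its destination remapped to the first
    one's. Dropping the *activating* duplicate is only safe here because
    the enclosing struct_for visits active cells only, so the cell is
    already known to be active."""
    first = {}                 # (parent, index) -> dest of first lookup
    kept, replacements = [], {}
    for dest, parent, index, activate in instrs:
        parent = replacements.get(parent, parent)
        key = (parent, index)
        if key in first:
            replacements[dest] = first[key]
        else:
            first[key] = dest
            kept.append((dest, parent, index, activate))
    return kept, replacements

# The two S1pointer lookups from the 11 ms IR ($11 weak, $17 activating):
instrs = [("$11", "$9", "$10", False),
          ("$17", "$9", "$10", True)]
kept, repl = dedupe_lookups(instrs)
print(kept)  # [('$11', '$9', '$10', False)]
print(repl)  # {'$17': '$11'}
```

This mirrors the IR diff above: the activating lookup disappears, and the `get child [S1pointer->S4dense]` instruction now consumes `$11` directly.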
