[Perf] Improve sparse computation performance #1263
Labels: feature request
One performance issue I found:

```python
import taichi as ti

ti.init(arch=ti.cuda, print_ir=True, kernel_profiler=True)

a = ti.var(ti.i32)
b = ti.var(ti.i32)

block = ti.root.pointer(ti.i, 1024 * 128)
block.dense(ti.i, 1024).place(a)
block.dense(ti.i, 1024).place(b)

@ti.kernel
def activate():
    for i in range(1024 * 128 * 1024):
        a[i] = i

@ti.kernel
def copy():
    for i in a:
        b[i] = a[i]

activate()
for i in range(10):
    copy()

ti.kernel_profiler_print()
```

Current
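For readers unfamiliar with Taichi's `pointer`/`dense` hierarchy, here is a rough plain-Python sketch of what the two-level layout in the snippet does. This is illustrative only, not Taichi's actual runtime (which allocates device memory and parallelizes the loops); the class and method names are hypothetical:

```python
# Rough plain-Python model of a two-level pointer -> dense sparse layout.
# Illustrative only; Taichi's real runtime manages GPU memory and parallelism.

BLOCK_SIZE = 1024  # size of each dense block, as in block.dense(ti.i, 1024)

class SparseArray:
    def __init__(self):
        # pointer level: block index -> dense block, allocated lazily
        self.blocks = {}

    def __setitem__(self, i, value):
        b, off = divmod(i, BLOCK_SIZE)
        if b not in self.blocks:          # "activation": allocate on first write
            self.blocks[b] = [0] * BLOCK_SIZE
        self.blocks[b][off] = value

    def __getitem__(self, i):
        b, off = divmod(i, BLOCK_SIZE)
        block = self.blocks.get(b)
        return block[off] if block is not None else 0  # inactive cells read as 0

    def active_indices(self):
        # struct-for analogue: visit only indices inside activated blocks
        for b in sorted(self.blocks):
            base = b * BLOCK_SIZE
            for off in range(BLOCK_SIZE):
                yield base + off

arr = SparseArray()
arr[5] = 42          # activates block 0 only
```

A struct-for such as `for i in a` in the `copy` kernel only visits indices inside activated blocks, which is why block activation and traversal of the active-block list are what the kernel profiler measures here.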
Note that after #1270, it takes 6.8 ms (1.6x faster), and the IR becomes
We haven't done systematic performance engineering since we switched to the LLVM backend. I suspect there is a lot of room for performance optimization, especially when it comes to sparsity and hierarchical data structures. Note that we have made pretty big changes to the IR/Opt system without performance regression tests, so sparse computation performance has probably degraded over time (e.g. #1182).
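One way to catch such silent degradations is a small timing harness that compares a kernel's runtime against a recorded baseline. A minimal sketch, assuming plain wall-clock timing on the Python side (the helper names `time_kernel` and `check_regression` are hypothetical, not existing Taichi utilities):

```python
import time

def time_kernel(fn, warmup=3, repeats=10):
    """Return the median wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):        # warm up JIT compilation and caches
        fn()
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return samples[len(samples) // 2]

def check_regression(current_ms, baseline_ms, tolerance=1.2):
    """True if current_ms is within tolerance-x of the recorded baseline."""
    return current_ms <= baseline_ms * tolerance

# Example use, guarding the `copy` kernel above against slowdowns:
# ms = time_kernel(copy)
# assert check_regression(ms, baseline_ms=6.8)
```

In practice one would store per-kernel baselines alongside the test suite and fail CI when a kernel exceeds its baseline by the tolerance factor, so IR/Opt changes that slow down sparse kernels get flagged immediately.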
Steps