
[perf] Enable load store forwarding across loop hierarchy for vulkan performance. #5350

Closed
turbo0628 opened this issue Jul 7, 2022 · 8 comments · Fixed by #6072
Labels: feature request (Suggest an idea on this project)

turbo0628 (Member) commented Jul 7, 2022

The store_to_load_forwarding transformation in the cfg_optimization pass ignores ExternalPtrStmt. This causes a performance issue in the SPH case. The CUDA backend is insensitive to this optimization, but the Vulkan backend cannot handle this well even on the same hardware.

I think we can get a ~2.2x speedup in the SPH case when this optimization is enabled.

The related source code section is here.

turbo0628 added the feature request label Jul 7, 2022
turbo0628 moved this to Untriaged in Taichi Lang Jul 8, 2022
turbo0628 moved this from Untriaged to In Progress in Taichi Lang Jul 8, 2022
qiao-bo (Contributor) commented Jul 11, 2022

After some investigation, this does not seem to be related to Ndarray handling in the store_to_load_forwarding pass. The code block (as mentioned) does consider Ndarray, which is an ExternalPtrStmt:

if (auto local_load = stmt->cast<LocalLoadStmt>()) {
  bool regular = true;
  auto alloca = local_load->src[0].var;
  for (int l = 0; l < stmt->width(); l++) {
    if (local_load->src[l].offset != l ||
        local_load->src[l].var != alloca) {
      regular = false;
    }
  }
  if (regular) {
    result = get_store_forwarding_data(alloca, i);
  }
} else if (auto global_load = stmt->cast<GlobalLoadStmt>()) {
  if (!after_lower_access) {
    bool store_forwarding = true;
    if (auto global_ptr = global_load->src->cast<GlobalPtrStmt>()) {
      TI_ASSERT(global_ptr->width() == 1);
      auto &snodes = global_ptr->snodes;
      if (snodes[0]->has_adjoint()) {
        // Has adjoint SNode. Skip the store forwarding
        // to keep the global load chain,
        // so that the grad of intermediate variables can be computed
        // by GlobalLoadStmt.
        store_forwarding = false;
      }
    }
    if (store_forwarding) {
      result = get_store_forwarding_data(global_load->src, i);
    }
  }
}

Be it an Ndarray or a Field, the pattern starts with a GlobalLoadStmt, so the else-if clause is executed and store_forwarding remains true.

The problem here is that the get_store_forwarding_data function cannot find the forwarded data for certain scenarios:

result = get_store_forwarding_data(global_load->src, i);

For example, a dummy kernel like this can be optimized by store_to_load_forwarding:

@ti.kernel
def foo_ndarray(acc: ti.any_arr(field_dim=1)):
    for i in range(16):
        acc[i] = 1.0
        acc[i] -= i 
        acc[i] %= i 

However, if the access happens inside a nested loop, for example, the optimization no longer works:

@ti.kernel
def foo_field(acc: ti.template()):
    for i in range(16):
        acc[i] = 1.0
        for j in range(8):
            acc[i] -= j 
        for k in range(4):
            acc[i] %= k 

This happens regardless of Field/Ndarray. The question then becomes how to adapt the get_store_forwarding_data function so that it considers the second case forwardable.

turbo0628 (Member, Author) commented

OK, so this is not a cause of the ndarray vs. field performance gap.

qiao-bo (Contributor) commented Jul 11, 2022

@jim19930609 @strongoier Any suggestions on this? Should we update the get_store_forwarding_data function to be able to optimize this kind of pattern as well?

@ti.kernel
def foo_field(acc: ti.template()):
    for i in range(16):
        acc[i] = 1.0
        for j in range(8):
            acc[i] -= j 
        for k in range(4):
            acc[i] %= k 

Similar to

@ti.kernel
def foo_field(acc: ti.template()):
    for i in range(16):
        var = acc[i]
        var = 1.0
        for j in range(8):
            var -= j 
        for k in range(4):
            var %= k 
        acc[i] = var
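One caveat: this rewrite is only safe when acc[i] is loop-unique, i.e. the only element of acc touched inside the inner loops. A hypothetical counter-example (a sketch only; foo_aliasing is not from this thread) where caching acc[i] in a local variable would change semantics:

@ti.kernel
def foo_aliasing(acc: ti.template()):
    for i in range(16):
        acc[i] = 1.0
        for j in range(8):
            # also touches other elements of acc, so the value of acc[i]
            # cannot safely live in a local variable across this loop
            acc[(i + j) % 16] -= j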

strongoier (Contributor) commented

I think this is definitely some optimization opportunity, which may or may not be taken care of by the backend compilers. However, is this the solution to the current problem you guys are facing?

qiao-bo (Contributor) commented Jul 11, 2022

I think this is definitely some optimization opportunity, which may or may not be taken care of by the backend compilers. However, is this the solution to the current problem you guys are facing?

Partially yes. If this optimization is possible, the Vulkan backend could benefit. (The CUDA backend does not need it, as its runtime optimization is already powerful enough.)

strongoier (Contributor) commented

Partially yes. If this optimization is possible, the Vulkan backend could benefit. (The CUDA backend does not need it, as its runtime optimization is already powerful enough.)

Then feel free to hack it!

turbo0628 changed the title from "[perf] Ndarray should be considered in load store forwarding." to "[perf] Enable load store forwarding across loop hierarchy for vulkan performance." Jul 29, 2022
turbo0628 (Member, Author) commented Jul 29, 2022

Update:

A simplified version that reproduces the problem:

The original baseline version: natural to write, but slow.

@ti.kernel
def func(v : ti.types.ndarray(field_dim=1), d : ti.types.ndarray(field_dim=1), a : ti.types.ndarray(field_dim=1)):
    for i in range(N):
        acc = 0.0
        for j in range(N):
            acc += v[i] * d[j]
        a[i] = acc

The fast version that manually performs load-store forwarding (LSF):

@ti.kernel
def func(v : ti.types.ndarray(field_dim=1), d : ti.types.ndarray(field_dim=1), a : ti.types.ndarray(field_dim=1)):
    for i in range(N):
        acc = 0.0
        v_val = v[i]
        for j in range(N):
            acc += v_val * d[j]
        a[i] = acc

We save v[i] outside the inner loop, as it is actually constant within that loop.

The optimization pass needs two steps (a minimal sketch follows below):

  1. Identify the loads that do not depend on the loop variables.
  2. Move those loads outside the loop.
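A minimal sketch of the two steps on a toy IR, assuming nothing about Taichi's internals (the Load/Loop classes and hoist_invariant_loads below are illustrative stand-ins, not Taichi's actual IR or pass API):

from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Load:                 # stands in for a read such as v[i]
    array: str
    index: str              # the loop variable the index depends on

@dataclass
class Loop:                 # stands in for "for j in range(N): body"
    var: str
    body: List[Union["Loop", Load]] = field(default_factory=list)

def hoist_invariant_loads(loop: Loop) -> List[Load]:
    # Step 1: find loads whose index does not depend on loop.var.
    # Step 2: remove them from the body so they can be emitted before the loop.
    hoisted, kept = [], []
    for stmt in loop.body:
        if isinstance(stmt, Load) and stmt.index != loop.var:
            hoisted.append(stmt)
        else:
            kept.append(stmt)
    loop.body = kept
    return hoisted

# The inner loop of the baseline kernel: v[i] is invariant w.r.t. j, d[j] is not.
inner = Loop(var="j", body=[Load("v", "i"), Load("d", "j")])
print([ld.array for ld in hoist_invariant_loads(inner)])  # prints ['v']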

Another possible solution:
This case SHOULD be handled by the L1 cache. Is the problem that some annotations in our SPIR-V codegen are improper? For example, should v[i] be annotated as being in the Private scope?

Desired performance of this code snippet on an RTX 3080 (the Vulkan shared-memory version is not yet merged, though):
[benchmark screenshot]

Full runnable code:

import time
import taichi as ti
import numpy as np


ti.init(arch=ti.vulkan)

block_dim = 512
nBlocks = 1024
N = nBlocks * block_dim
np.random.seed(42)
v_arr = np.random.randn(N).astype(np.float32)
d_arr = np.random.randn(N).astype(np.float32)
a_arr = np.zeros(N).astype(np.float32)
b_arr = np.zeros(N).astype(np.float32)


@ti.kernel
def init(v : ti.types.ndarray(field_dim=1), d : ti.types.ndarray(field_dim=1), a : ti.types.ndarray(field_dim=1)):
    for i in range(N):
        v[i] = ti.random(ti.f32)
        d[i] = ti.random(ti.f32)
        a[i] = 0.0

@ti.kernel
def func(v : ti.types.ndarray(field_dim=1), d : ti.types.ndarray(field_dim=1), a : ti.types.ndarray(field_dim=1)):
    for i in range(N):
        acc = 0.0
        for j in range(N):
            acc += v[i] * d[j]
        a[i] = acc

@ti.kernel
def func_lsf(v : ti.types.ndarray(field_dim=1), d : ti.types.ndarray(field_dim=1), a : ti.types.ndarray(field_dim=1)):
    for i in range(N):
        acc = 0.0
        v_val = v[i]
        for j in range(N):
            acc += v_val * d[j]
        a[i] = acc

init(v_arr, d_arr, a_arr)
func(v_arr, d_arr, a_arr)
nIter = 10
st = time.time()
for i in range(nIter):
    func(v_arr, d_arr, a_arr)
ti.sync()
et = time.time()
print(f"Baseline time {(et - st) / nIter}")
st = time.time()
for i in range(nIter):
    func_lsf(v_arr, d_arr, a_arr)
ti.sync()
et = time.time()
print(f"LSF time {(et - st) / nIter}")

lin-hitonami (Contributor) commented Sep 19, 2022

Desired performance of this code snippet on an RTX 3080 (the Vulkan shared-memory version is not yet merged, though): [benchmark screenshot]

(full runnable code quoted from the comment above)

I don't know if I'm testing it in the wrong way, but on my machine (master branch, RTX 3080) the baseline time is shorter than the LSF time...

[Taichi] version 1.1.3, llvm 10.0.0, commit 9d27b94e, linux, python 3.9.12
[Taichi] Starting on arch=vulkan
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:pick_physical_device@368] Found Vulkan Device 0 (llvmpipe (LLVM 12.0.0, 256 bits))
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:pick_physical_device@368] Found Vulkan Device 1 (NVIDIA GeForce RTX 3080)
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:find_queue_families@139] Async compute queue 2, graphics queue 0
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:find_queue_families@139] Async compute queue 2, graphics queue 0
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:create_logical_device@436] Vulkan Device "NVIDIA GeForce RTX 3080" supports Vulkan 0 version 1.2.175
Baseline time 0.1443796157836914
LSF time 0.5295180082321167

ailzhang added a commit that referenced this issue Sep 22, 2022
…r is on (#6129)

related: #5350

Thanks @turbo0628 and @lin-hitonami for catching this! It turns out that
codegen-ing non-semantic-info SPIR-V code isn't a no-op when validation is
off. It slows down program execution, so we should properly guard it.
Repository owner moved this from In Progress to Done in Taichi Lang Sep 23, 2022
lin-hitonami added a commit that referenced this issue Sep 23, 2022
Related issue = fixes #5350 

Global variables can't be store-to-load forwarded after the `lower-access`
pass, so we need to run `simplify` before it. This should speed up the
program in all circumstances.

Caching loop-invariant global vars to local vars sometimes speeds up the
program and sometimes slows it down, so I made it controllable via the
compiler config.

FPS of Yu's program on RTX3080 on Vulkan:
Original: 19fps
Simplified before lower access: 30fps
Cached loop-invariant global vars to local vars: 41fps

**This PR does things as follows:**
1. Extract a base class `LoopInvariantDetector` from
`LoopInvariantCodeMotion`. This class maintains information to detect
whether a statement is loop-invariant.
2. Let LICM move `GlobalPtrStmt`, `ArgLoadStmt` and `ExternalPtrStmt`
out of the loop so that they become loop-invariant.
3. Add `CacheLoopInvariantGlobalVars` to move out loop-invariant global
variables that are loop-unique in the offloaded task.
4. Add pass `cache_loop_invariant_global_vars` after `demote_atomics` and
before `demote_dense_struct_fors` (because loop-uniqueness can't be
correctly detected after `demote_dense_struct_fors`) and add a compiler
config flag to control it (a hedged usage sketch follows below).
5. Add pass `full_simplify` before `lower_access` to enable
store-to-load forwarding for GlobalPtrs.
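
A hedged usage sketch for step 4: the description above does not spell out the flag name, so the keyword below (cache_loop_invariant_global_vars, assumed to mirror the pass name) and its exposure through ti.init() are assumptions, not a confirmed API:

import taichi as ti

# Assumption: the flag shares the pass name and, like other CompileConfig
# fields (e.g. debug), can be passed to ti.init() as a keyword argument.
ti.init(arch=ti.vulkan, cache_loop_invariant_global_vars=True)

# With the flag on, the baseline `func` kernel from the earlier comments
# should behave like the manually rewritten `func_lsf`: v[i] is cached in a
# local variable outside the j loop.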


Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>