
[perf] Enable load store forwarding across loop hierarchy for vulkan performance. #5350

Closed
turbo0628 opened this issue Jul 7, 2022 · 8 comments · Fixed by #6072
Labels: feature request (Suggest an idea on this project)

turbo0628 (Member) commented Jul 7, 2022

The store_to_load_forwarding transformation in the cfg_optimization pass ignores ExternalPtrStmt. This causes a performance issue in the SPH case. The CUDA backend is insensitive to this optimization, but the Vulkan backend cannot handle this well even on the same hardware.

I think we can get a ~2.2x speedup in the SPH case when this optimization is enabled.

The related source code section is here.

turbo0628 added the feature request label Jul 7, 2022
turbo0628 moved this to Untriaged in Taichi Lang Jul 8, 2022
turbo0628 moved this from Untriaged to In Progress in Taichi Lang Jul 8, 2022
qiao-bo (Contributor) commented Jul 11, 2022

After some investigation, this does not seem to be related to Ndarray handling in the store_to_load_forwarding pass. The code block (as mentioned) does consider Ndarray, which is an ExternalPtrStmt:

if (auto local_load = stmt->cast<LocalLoadStmt>()) {
  bool regular = true;
  auto alloca = local_load->src[0].var;
  for (int l = 0; l < stmt->width(); l++) {
    if (local_load->src[l].offset != l ||
        local_load->src[l].var != alloca) {
      regular = false;
    }
  }
  if (regular) {
    result = get_store_forwarding_data(alloca, i);
  }
} else if (auto global_load = stmt->cast<GlobalLoadStmt>()) {
  if (!after_lower_access) {
    bool store_forwarding = true;
    if (auto global_ptr = global_load->src->cast<GlobalPtrStmt>()) {
      TI_ASSERT(global_ptr->width() == 1);
      auto &snodes = global_ptr->snodes;
      if (snodes[0]->has_adjoint()) {
        // Has adjoint SNode. Skip the store forwarding
        // to keep the global load chain,
        // so that the grad of intermediate variables can be computed
        // by GlobalLoadStmt.
        store_forwarding = false;
      }
    }
    if (store_forwarding) {
      result = get_store_forwarding_data(global_load->src, i);
    }
  }
}

Be it an Ndarray or a Field, the pattern starts with a GlobalLoadStmt, so the else-if clause is executed and store_forwarding remains true.

The problem here is that the get_store_forwarding_data function cannot find the forwarded data for certain scenarios:

result = get_store_forwarding_data(global_load->src, i);

For example, a dummy kernel like this can be optimized by store_to_load_forwarding:

@ti.kernel
def foo_ndarray(acc: ti.any_arr(field_dim=1)):
    for i in range(16):
        acc[i] = 1.0
        acc[i] -= i 
        acc[i] %= i 

However, if the access happens inside a nested loop, for example, the optimization no longer works:

@ti.kernel
def foo_field(acc: ti.template()):
    for i in range(16):
        acc[i] = 1.0
        for j in range(8):
            acc[i] -= j 
        for k in range(4):
            acc[i] %= k 

This happens regardless of Field/Ndarray. The question then becomes how to adapt the get_store_forwarding_data function so that it considers the second case forwardable.

turbo0628 (Member, Author) commented

OK, so this is not a cause of the ndarray vs. field performance gap.

qiao-bo (Contributor) commented Jul 11, 2022

@jim19930609 @strongoier Any suggestions on this? Should we update the get_store_forwarding_data function to be able to optimize this kind of pattern as well?

@ti.kernel
def foo_field(acc: ti.template()):
    for i in range(16):
        acc[i] = 1.0
        for j in range(8):
            acc[i] -= j 
        for k in range(4):
            acc[i] %= k 

Similar to

@ti.kernel
def foo_field(acc: ti.template()):
    for i in range(16):
        var = acc[i]
        var = 1.0
        for j in range(8):
            var -= j 
        for k in range(4):
            var %= k 
        acc[i] = var
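One caveat: this rewrite is only safe when acc[i] is loop-unique, i.e. the only element of acc touched inside the inner loops. A hypothetical counter-example (a sketch only; foo_aliasing is not from this thread) where caching acc[i] in a local variable would change semantics:

@ti.kernel
def foo_aliasing(acc: ti.template()):
    for i in range(16):
        acc[i] = 1.0
        for j in range(8):
            # also touches other elements of acc, so the value of acc[i]
            # cannot safely live in a local variable across this loop
            acc[(i + j) % 16] -= j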

strongoier (Contributor) commented

I think this is definitely some optimization opportunity, which may or may not be taken care of by the backend compilers. However, is this the solution to the current problem you guys are facing?

qiao-bo (Contributor) commented Jul 11, 2022

I think this is definitely some optimization opportunity, which may or may not be taken care of by the backend compilers. However, is this the solution to the current problem you guys are facing?

Partially yes. If this optimization is possible, the Vulkan backend could benefit. (The CUDA backend does not need it, as its runtime optimization is already powerful enough.)

strongoier (Contributor) commented

Partially yes. If this optimization is possible, the Vulkan backend could benefit. (The CUDA backend does not need it, as its runtime optimization is already powerful enough.)

Then feel free to hack it!

turbo0628 changed the title from "[perf] Ndarray should be considered in load store forwarding." to "[perf] Enable load store forwarding across loop hierarchy for vulkan performance." Jul 29, 2022
turbo0628 (Member, Author) commented Jul 29, 2022

Update:

A simplified version that reproduces the problem:

The original baseline version: natural to write, but slow.

@ti.kernel
def func(v : ti.types.ndarray(field_dim=1), d : ti.types.ndarray(field_dim=1), a : ti.types.ndarray(field_dim=1)):
    for i in range(N):
        acc = 0.0
        for j in range(N):
            acc += v[i] * d[j]
        a[i] = acc

The fast version that manually performs load-store forwarding (LSF):

@ti.kernel
def func(v : ti.types.ndarray(field_dim=1), d : ti.types.ndarray(field_dim=1), a : ti.types.ndarray(field_dim=1)):
    for i in range(N):
        acc = 0.0
        v_val = v[i]
        for j in range(N):
            acc += v_val * d[j]
        a[i] = acc

We save v[i] outside the inner loop, as it is actually constant within that loop.

The optimization pass needs two steps (a minimal sketch follows below):

  1. Identify the loads that do not depend on the loop variables.
  2. Move those loads outside the loop.
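A minimal sketch of the two steps on a toy IR, assuming nothing about Taichi's internals (the Load/Loop classes and hoist_invariant_loads below are illustrative stand-ins, not Taichi's actual IR or pass API):

from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Load:                 # stands in for a read such as v[i]
    array: str
    index: str              # the loop variable the index depends on

@dataclass
class Loop:                 # stands in for "for j in range(N): body"
    var: str
    body: List[Union["Loop", Load]] = field(default_factory=list)

def hoist_invariant_loads(loop: Loop) -> List[Load]:
    # Step 1: find loads whose index does not depend on loop.var.
    # Step 2: remove them from the body so they can be emitted before the loop.
    hoisted, kept = [], []
    for stmt in loop.body:
        if isinstance(stmt, Load) and stmt.index != loop.var:
            hoisted.append(stmt)
        else:
            kept.append(stmt)
    loop.body = kept
    return hoisted

# The inner loop of the baseline kernel: v[i] is invariant w.r.t. j, d[j] is not.
inner = Loop(var="j", body=[Load("v", "i"), Load("d", "j")])
print([ld.array for ld in hoist_invariant_loads(inner)])  # prints ['v']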

Another possible solution:
This case SHOULD be handled by the L1 cache. Is the problem that some annotations in our SPIR-V codegen are improper? For example, should v[i] be annotated as being in the Private scope?

Desired performance of this code snippet on an RTX 3080 (the Vulkan shared-memory version is not yet merged, though):
[benchmark screenshot]

Full runnable code:

import time
import taichi as ti
import numpy as np


ti.init(arch=ti.vulkan)

block_dim = 512
nBlocks = 1024
N = nBlocks * block_dim
np.random.seed(42)
v_arr = np.random.randn(N).astype(np.float32)
d_arr = np.random.randn(N).astype(np.float32)
a_arr = np.zeros(N).astype(np.float32)
b_arr = np.zeros(N).astype(np.float32)


@ti.kernel
def init(v : ti.types.ndarray(field_dim=1), d : ti.types.ndarray(field_dim=1), a : ti.types.ndarray(field_dim=1)):
    for i in range(N):
        v[i] = ti.random(ti.f32)
        d[i] = ti.random(ti.f32)
        a[i] = 0.0

@ti.kernel
def func(v : ti.types.ndarray(field_dim=1), d : ti.types.ndarray(field_dim=1), a : ti.types.ndarray(field_dim=1)):
    for i in range(N):
        acc = 0.0
        for j in range(N):
            acc += v[i] * d[j]
        a[i] = acc

@ti.kernel
def func_lsf(v : ti.types.ndarray(field_dim=1), d : ti.types.ndarray(field_dim=1), a : ti.types.ndarray(field_dim=1)):
    for i in range(N):
        acc = 0.0
        v_val = v[i]
        for j in range(N):
            acc += v_val * d[j]
        a[i] = acc

init(v_arr, d_arr, a_arr)
func(v_arr, d_arr, a_arr)
nIter = 10
st = time.time()
for i in range(nIter):
    func(v_arr, d_arr, a_arr)
ti.sync()
et = time.time()
print(f"Baseline time {(et - st) / nIter}")
st = time.time()
for i in range(nIter):
    func_lsf(v_arr, d_arr, a_arr)
ti.sync()
et = time.time()
print(f"LSF time {(et - st) / nIter}")

lin-hitonami (Contributor) commented Sep 19, 2022

Desired performance of this code snippet on an RTX 3080 (the Vulkan shared-memory version is not yet merged, though): [benchmark screenshot]

(full runnable code quoted from the comment above)

I don't know if I'm testing it in the wrong way, but on my machine (master branch, RTX 3080) the baseline time is shorter than the LSF time...

[Taichi] version 1.1.3, llvm 10.0.0, commit 9d27b94e, linux, python 3.9.12
[Taichi] Starting on arch=vulkan
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:pick_physical_device@368] Found Vulkan Device 0 (llvmpipe (LLVM 12.0.0, 256 bits))
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:pick_physical_device@368] Found Vulkan Device 1 (NVIDIA GeForce RTX 3080)
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:find_queue_families@139] Async compute queue 2, graphics queue 0
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:find_queue_families@139] Async compute queue 2, graphics queue 0
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:create_logical_device@436] Vulkan Device "NVIDIA GeForce RTX 3080" supports Vulkan 0 version 1.2.175
Baseline time 0.1443796157836914
LSF time 0.5295180082321167

ailzhang added a commit that referenced this issue Sep 22, 2022
…r is on (#6129)

related: #5350

Thanks @turbo0628 and @lin-hitonami for catching this! It turns out that
codegen-ing non-semantic-info SPIR-V code isn't a no-op when validation is
off. It slows down program execution, so we should properly guard it.
Repository owner moved this from In Progress to Done in Taichi Lang Sep 23, 2022
lin-hitonami added a commit that referenced this issue Sep 23, 2022
Related issue = fixes #5350 

Global variables can't be store-to-load forwarded after the `lower-access`
pass, so we need to run `simplify` before it. This should speed up the
program in all circumstances.

Caching loop-invariant global vars to local vars sometimes speeds up the
program and sometimes slows it down, so I made it controllable via the
compiler config.

FPS of Yu's program on RTX3080 on Vulkan:
Original: 19fps
Simplified before lower access: 30fps
Cached loop-invariant global vars to local vars: 41fps

**This PR does things as follows:**
1. Extract a base class `LoopInvariantDetector` from
`LoopInvariantCodeMotion`. This class maintains information to detect
whether a statement is loop-invariant.
2. Let LICM move `GlobalPtrStmt`, `ArgLoadStmt` and `ExternalPtrStmt`
out of the loop so that they become loop-invariant.
3. Add `CacheLoopInvariantGlobalVars` to move out loop-invariant global
variables that are loop-unique in the offloaded task.
4. Add pass `cache_loop_invariant_global_vars` after `demote_atomics` and
before `demote_dense_struct_fors` (because loop-uniqueness can't be
correctly detected after `demote_dense_struct_fors`) and add a compiler
config flag to control it (a hedged usage sketch follows below).
5. Add pass `full_simplify` before `lower_access` to enable
store-to-load forwarding for GlobalPtrs.
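
A hedged usage sketch for step 4: the description above does not spell out the flag name, so the keyword below (cache_loop_invariant_global_vars, assumed to mirror the pass name) and its exposure through ti.init() are assumptions, not a confirmed API:

import taichi as ti

# Assumption: the flag shares the pass name and, like other CompileConfig
# fields (e.g. debug), can be passed to ti.init() as a keyword argument.
ti.init(arch=ti.vulkan, cache_loop_invariant_global_vars=True)

# With the flag on, the baseline `func` kernel from the earlier comments
# should behave like the manually rewritten `func_lsf`: v[i] is cached in a
# local variable outside the j loop.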


Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>