Parallelized for-loop performance same as serial one #4541

Closed
houkensjtu opened this issue Mar 15, 2022 · 1 comment
Labels
question Question on using Taichi

Comments

@houkensjtu
Contributor

I tried to compare the computational performance of a parallel and a serial Taichi kernel with the following code:

import taichi as ti
import time

ti.init(arch=ti.gpu, default_fp=ti.f32)

n = 4096
v1 = ti.field(dtype=float, shape = n)
v2 = ti.field(dtype=float, shape = n)

@ti.kernel
def init():
    for i in range(n):
        v1[i] = 1.0
        v2[i] = 2.0

@ti.kernel
def reduce_para()->ti.f32:
    n = v1.shape[0]
    sum = 0.0
    for i in range(n):
        sum += v1[i]*v2[i]
    return sum

@ti.kernel
def reduce_seri()->ti.f32:
    n = v1.shape[0]
    sum = 0.0    
    for _ in range(1):
        for i in range(n):
            sum += v1[i]*v2[i]
    return sum

print('Initializing...')
init()

print('Reducing in Taichi scope with a parallel kernel...')
start = time.time()
print(reduce_para())
print(time.time() - start)

print('Reducing in Taichi scope with serial kernel...')
start = time.time()
print(reduce_seri())
print(time.time() - start)

Surprisingly, the parallelized version yielded almost the same performance as the serial one. (0.02 sec on my M1 2020 MacBook Pro for ti.cpu)
I'm not sure if it's

  1. The nature of this computation, or
  2. My coding problem, or
  3. Taichi language's optimization problem

that caused such behavior. Any suggestions or comments are welcome.

@houkensjtu houkensjtu added the question Question on using Taichi label Mar 15, 2022
@qiao-bo
Contributor

qiao-bo commented Mar 16, 2022

Share a few findings here, also for future visitors:

  1. The measurement you posted includes compilation time, since this is the first time the kernel is executed. When the data size is small, the reported time does not reflect the actual computation time. The recommended way is to skip the first execution and average over a number of repeated runs. For example:
from time import perf_counter

num_runs = 1000
print('Initializing...')
init()

reduce_para()  # warm-up run: skips compilation time
print('Reducing in Taichi scope with a parallel kernel...')
start = perf_counter()
for _ in range(num_runs):
    reduce_para()
# ti.sync()  # for CUDA: wait for queued kernels before reading the clock
print((perf_counter() - start) / num_runs * 1000, "ms")

reduce_seri()  # warm-up run: skips compilation time
print('Reducing in Taichi scope with serial kernel...')
start = perf_counter()
for _ in range(num_runs):
    reduce_seri()
# ti.sync()  # for CUDA: wait for queued kernels before reading the clock
print((perf_counter() - start) / num_runs * 1000, "ms")
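As a side note, the warm-up-then-average pattern above can be factored into a small helper. This is just a sketch: `bench` is an illustrative name, not a Taichi API, and the `sync` hook is where you would pass `ti.sync` on CUDA:

```python
from time import perf_counter

def bench(kernel, num_runs=1000, sync=None):
    """Average runtime of `kernel` in ms, skipping the first (compiling) run."""
    kernel()                       # warm-up: triggers JIT compilation
    if sync is not None:
        sync()                     # e.g. ti.sync on CUDA: finish warm-up work
    start = perf_counter()
    for _ in range(num_runs):
        kernel()
    if sync is not None:
        sync()                     # drain the async queue before stopping the clock
    return (perf_counter() - start) / num_runs * 1000

# usage sketch: print(bench(reduce_para, sync=ti.sync), "ms")
```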
  2. For arch=ti.gpu (we discuss the CPU backend later; it's a slightly different story), running the above code locally gives me:
[Taichi] version 0.9.1, llvm 10.0.0, commit e2e0e669, linux, python 3.9.7
[Taichi] Starting on arch=cuda
Initializing...
Reducing in Taichi scope with a parallel kernel...
0.02098149503581226 ms
Reducing in Taichi scope with serial kernel...
0.1074029229930602 ms

(Nvidia RTX3080 + Driver 470)
This is kind of expected as the parallel version is faster than the serial one. If you increase the data size, the performance gap will increase. So far so good.

  3. When using arch=ti.cpu, the above code gives (local env: i9-11900K):
[Taichi] version 0.9.1, llvm 10.0.0, commit e2e0e669, linux, python 3.9.7
[Taichi] Starting on arch=x64
Initializing...
Reducing in Taichi scope with a parallel kernel...
0.029548327962402254 ms
Reducing in Taichi scope with serial kernel...
0.017637054028455168 ms

In this case, the parallel version becomes slower than the serial kernel, which is consistent with your observations on the Mac. The main reason is the block_dim selection in the CPU backend. If you print the IR with print_ir=True, you will find that block_dim is set to 32. This is probably too low: your data of size 4096 is divided into 128 blocks, which are then distributed among your CPU threads. For small kernels that are light in computation, you want each thread to do many iterations of work itself instead of being busy with scheduling and switching overhead. We can verify this by playing with the block_dim of the parallel kernel. For example, increase it to 1024:

@ti.kernel
def reduce_para()->ti.f32:
    n = v1.shape[0]
    sum = 0.0
    ti.block_dim(1024)
    for i in range(n):
        sum += v1[i]*v2[i]
    return sum

This gives much closer performance:

[Taichi] version 0.9.1, llvm 10.0.0, commit e2e0e669, linux, python 3.9.7
[Taichi] Starting on arch=x64
Initializing...
Reducing in Taichi scope with a parallel kernel...
0.018682260008063167 ms
Reducing in Taichi scope with serial kernel...
0.01732683094451204 ms

Nevertheless, the parallel kernel is still slower than the serial one. Now we have to look at another dimension: data size. To benefit most from the parallel approach, the data needs to be large enough to 'saturate' the processor cores. If we increase n from 4096 to 409600 and rerun the test, we get:

Reducing in Taichi scope with a parallel kernel...
0.10280601400882006 ms
Reducing in Taichi scope with serial kernel...
0.49146420997567475 ms

Now, the parallel kernel is faster ;)

A side note: a good block_dim is tricky to set. On one hand, you want to increase it to avoid the overhead problem above. On the other hand, you want to keep it small enough that there are 'enough' blocks to execute: think of a scenario where kernel execution time fluctuates greatly, and you want multiple waves of blocks available to minimize the tail effect.
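To make the tradeoff concrete, here is the block-count arithmetic for the numbers above (`num_blocks` is just an illustrative helper, not a Taichi function):

```python
import math

def num_blocks(n, block_dim):
    # how many blocks the outermost parallel loop is split into
    return math.ceil(n / block_dim)

print(num_blocks(4096, 32))    # default block_dim of 32 -> 128 tiny blocks
print(num_blocks(4096, 1024))  # block_dim of 1024 -> only 4 blocks
```

With only 4 blocks, each thread gets long uninterrupted runs of work, but there is little slack left to rebalance if one block finishes late.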

We will think about a good heuristic to better set this value for CPUs and GPUs; it's on our roadmap @turbo0628 ;)

Regarding block_dim, there are some related previous discussions:
Why from_numpy is faster than kernel initialization
Enhance the default scheduling mechanism of the CPU backend
