
Why from_numpy is faster than kernel initialization #3734

Closed
FantasyVR opened this issue Dec 7, 2021 · 1 comment

Labels
question Question on using Taichi

FantasyVR (Collaborator) commented Dec 7, 2021

import taichi as ti
import numpy as np
import time

ti.init(arch=ti.cpu)

n = 1000000000
v_ti = ti.field(ti.f32, shape=n)
v_tiFnp = ti.field(ti.f32, shape=n)

@ti.kernel
def init():
    for i in range(n):
        v_ti[i] = i

# Test from_numpy performance
start = time.time()
v_np = np.arange(n, dtype=np.float32)
v_tiFnp.from_numpy(v_np)
ti.sync()
end = time.time()
print(f">> Time using from_numpy: {end - start}")

# Test kernel initialization performance
start = time.time()
init()
ti.sync()
end = time.time()
print(f">> Time using init kernel: {end - start}")
The results show:
>> Time using from_numpy: 3.888772964477539
>> Time using init kernel: 6.807356357574463
strongoier (Contributor) commented:
After printing out the CHI IRs, I found that the block_dim set for from_numpy() is 128, while the block_dim set for your init() kernel is 32. On the CPU backend, block_dim is the number of iterations handled in a single task, and all tasks are scheduled onto the threads you configure. Assume we are using the default setting, where the number of threads equals the number of processors on your machine.
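
One way to inspect the CHI IR yourself is the print_ir init flag (a minimal sketch; the exact IR output format varies across Taichi versions):

import taichi as ti

# print_ir=True dumps the CHI IR of every compiled kernel; each offloaded
# parallel loop is printed together with its block_dim.
ti.init(arch=ti.cpu, print_ir=True)

x = ti.field(ti.f32, shape=16)

@ti.kernel
def fill():
    for i in range(16):  # range for: block_dim defaults to 32
        x[i] = i

fill()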

Let's first see why block_dim matters. In the case of your init() kernel, each iteration takes the same amount of time. Therefore, as long as the total number of iterations executed by each processor stays roughly the same, reducing the number of tasks (by increasing block_dim) reduces the total time, because the scheduling overhead shrinks. On my local machine, changing your kernel to:

@ti.kernel
def init():
    ti.block_dim(1 << 10)
    for i in range(n):
        v_ti[i] = i

yields a 25x speedup.

Now let's try to understand why different block_dims are set in your program. This is because from_numpy() is internally implemented with a kernel using a struct for, whose block_dim defaults to 128, while your init() kernel uses a range for, whose block_dim defaults to 32.
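
For comparison, a minimal sketch of the same initialization written as a struct for over the field (reusing v_ti from the snippet above), which picks up the 128 default without an explicit ti.block_dim call:

@ti.kernel
def init_struct_for():
    # Struct for: iterates over all indices of v_ti; its block_dim
    # defaults to 128, like the kernel behind from_numpy().
    for i in v_ti:
        v_ti[i] = i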
