Parallelized for-loop performance same as serial one #4541
Sharing a few findings here, also for future visitors:
(NVIDIA RTX 3080 + driver 470)
In this case, the parallel version becomes slower than the serial kernel, which is consistent with your observations on the Mac. The main reason is the ...
This gives much closer performance:
Nevertheless, the parallel kernel is still slower than the serial one. Now we have to look at another dimension: data size. To benefit most from the parallel approach, the data needs to be large enough to saturate the processor cores. If we increase the previous ...
Now the parallel kernel is faster ;) A side note: we will think about a good heuristic model to better set this dimension value for CPUs and GPUs; it's on our roadmap @turbo0628 ;)
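The saturation argument above can be sketched with a toy launch-overhead cost model (a hypothetical illustration, not Taichi's actual scheduler; the overhead, throughput, and core-count numbers below are made up):

```python
# Toy cost model: serial time vs. parallel time as data size N grows.
# Assumptions (all hypothetical): a fixed per-launch overhead for the
# parallel path, and perfect scaling across CORES once launched.

LAUNCH_OVERHEAD = 1e-4   # seconds: hypothetical fixed cost of a parallel launch
PER_ELEMENT = 1e-8       # seconds per element of work, hypothetical
CORES = 8                # hypothetical number of cores

def serial_time(n: int) -> float:
    return n * PER_ELEMENT

def parallel_time(n: int) -> float:
    return LAUNCH_OVERHEAD + n * PER_ELEMENT / CORES

# For small N the fixed overhead dominates and serial wins; past the
# break-even point the parallel version pulls ahead.
# Break-even: n*p = overhead + n*p/c  =>  n = overhead*c / (p*(c-1))
break_even = LAUNCH_OVERHEAD * CORES / (PER_ELEMENT * (CORES - 1))

print(f"break-even at ~{break_even:.0f} elements")
for n in (10**3, 10**7):
    winner = "serial" if serial_time(n) < parallel_time(n) else "parallel"
    print(f"N={n:>8}: {winner} wins")
```

Under these made-up constants, serial wins below roughly 10^4 elements and parallel wins well above it, which matches the qualitative behavior described above.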
I tried to compare the computational performance of a Taichi kernel with the following code:
Surprisingly, the parallelized version yielded almost the same performance as the serial one (0.02 s on my 2020 M1 MacBook Pro with ti.cpu).
I'm not sure if it's ... that caused such behavior. Any suggestions or comments are welcome.
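Since the original snippet is omitted above, here is a hypothetical sketch of how such a serial-vs-parallel comparison is usually timed, with plain Python stand-ins instead of real Taichi kernels. With Taichi you would also want a warm-up call (the first invocation includes JIT compilation) and, on GPU backends, a `ti.sync()` before stopping the clock:

```python
import time

def benchmark(fn, *args, warmup=1, repeats=5):
    """Time fn(*args): run warm-up iterations first (with Taichi the first
    call triggers JIT compilation), then return the best of `repeats` runs."""
    for _ in range(warmup):
        fn(*args)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        # With a GPU backend, call ti.sync() here before reading the clock;
        # otherwise you only measure the asynchronous kernel launch.
        best = min(best, time.perf_counter() - t0)
    return best

# Hypothetical stand-in for a serial kernel; a parallel variant would be
# benchmarked the same way.
def serial_sum(n):
    s = 0
    for i in range(n):
        s += i
    return s

t = benchmark(serial_sum, 100_000)
print(f"serial_sum best of 5: {t:.6f} s")
```

Taking the best of several repeats (rather than a single run) reduces noise from other processes and from caching effects.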