
Enhance the default scheduling mechanism of the CPU backend #3750

Open

strongoier opened this issue Dec 8, 2021 · 3 comments

Labels: advanced optimization, discussion, feature request, welcome contribution

Comments

@strongoier (Contributor)

Concisely describe the proposed feature
Currently, the scheduling mechanism of the CPU backend is similar to schedule(dynamic, chunk) in OpenMP, where chunk is set to block_dim in Taichi. Although users can manually specify block_dim to get the desired behavior, newcomers tend to rely on Taichi's defaults.
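As a rough sketch of what manual tuning looks like today (assuming the `ti.loop_config(block_dim=...)` API of recent releases; older versions exposed a similar `ti.block_dim(...)` call), a user can pin the chunk size right before a parallel loop:

```python
import taichi as ti

ti.init(arch=ti.cpu)

x = ti.field(ti.f32, shape=1_000_000)

@ti.kernel
def scale(k: ti.f32):
    # Ask the runtime to hand out loop iterations to worker threads in
    # chunks of 64, analogous to OpenMP's schedule(dynamic, 64).
    ti.loop_config(block_dim=64)
    for i in x:
        x[i] *= k

scale(2.0)
```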

Currently, block_dim defaults to 32 in the CPU backend. However, as exposed in #3734, this is not always a good choice for performance. To avoid misleading users about Taichi's performance, we hope to enhance the default behavior: adaptively determine block_dim as a heuristic function of the number of threads, the number of loop iterations, and the estimated workload of a single iteration.
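To make the proposal concrete, a minimal sketch of such a heuristic could look like the following (the function name, constants, and thresholds are all hypothetical and only for illustration, not an actual implementation):

```python
import math

def choose_block_dim(num_threads: int, num_iterations: int,
                     est_work_per_iter: float,
                     chunks_per_thread: int = 8,
                     min_chunk_work: float = 1e4) -> int:
    # Load balancing: give each thread roughly `chunks_per_thread` chunks so
    # the dynamic scheduler can even out irregular iterations.
    balance_bound = max(1, num_iterations // (num_threads * chunks_per_thread))
    # Overhead: keep each chunk's estimated total work above `min_chunk_work`
    # so per-chunk dispatch cost is amortized for very cheap iterations.
    overhead_bound = max(1, math.ceil(min_chunk_work / max(est_work_per_iter, 1.0)))
    # Prefer the larger of the two bounds (favoring low dispatch overhead),
    # but never exceed the loop length.
    return min(num_iterations, max(balance_bound, overhead_bound))

# E.g., 16 threads and 1M cheap iterations yield a chunk size in the
# thousands rather than the fixed default of 32.
print(choose_block_dim(16, 1_000_000, est_work_per_iter=10.0))
```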

Discussions and contributions are welcome!

@turbo0628 (Member)

I think this block_dim corresponds to vectorization for SIMD units. 32 is reasonable for AVX-512, but redundant and inflexible for a large number of processors.

There are other things that are important for CPU performance, especially cache blocking and register planning; perhaps an additional layer of IR could enable better CPU scheduling. Do you have any thoughts?

@k-ye (Member) commented Dec 29, 2021

There is currently no CPU vectorization in Taichi, and we haven't explored the scheduling space much yet. I think restoring CPU vectorization could be the more economical first step, WDYT?

@turbo0628 (Member)

> I think restoring CPU vectorization could be the more economical first step, WDYT?

Agree. Vectorization is the most beneficial technique, and we can easily improve some CPU-specific use cases through the LLVM backend. Schedulers should definitely come in a second phase.

Are there any suitable cases or performance issues that we can work on?
