
Enhance the default scheduling mechanism of the CPU backend #3750

Open

strongoier opened this issue Dec 8, 2021 · 3 comments

Labels: advanced optimization, discussion, feature request, welcome contribution

Comments

@strongoier (Contributor)

Concisely describe the proposed feature
Currently, the scheduling mechanism of the CPU backend is similar to schedule(dynamic, chunk) in OpenMP, where chunk is set to block_dim in Taichi. Although users can manually specify block_dim to get the desired behavior, newcomers tend to rely on Taichi's defaults.
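As a rough sketch of what manual tuning looks like today (assuming the `ti.loop_config(block_dim=...)` API of recent releases; older versions exposed a similar `ti.block_dim(...)` call), a user can pin the chunk size right before a parallel loop:

```python
import taichi as ti

ti.init(arch=ti.cpu)

x = ti.field(ti.f32, shape=1_000_000)

@ti.kernel
def scale(k: ti.f32):
    # Ask the runtime to hand out loop iterations to worker threads in
    # chunks of 64, analogous to OpenMP's schedule(dynamic, 64).
    ti.loop_config(block_dim=64)
    for i in x:
        x[i] *= k

scale(2.0)
```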

Currently, block_dim defaults to 32 in the CPU backend. However, as exposed in #3734, this is not always a good choice for performance. To avoid misleading users about Taichi's performance, we hope to enhance the default behavior: adaptively determine block_dim as a heuristic function of the number of threads, the number of loop iterations, and the estimated workload of a single iteration.
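To make the proposal concrete, a minimal sketch of such a heuristic could look like the following (the function name, constants, and thresholds are all hypothetical and only for illustration, not an actual implementation):

```python
import math

def choose_block_dim(num_threads: int, num_iterations: int,
                     est_work_per_iter: float,
                     chunks_per_thread: int = 8,
                     min_chunk_work: float = 1e4) -> int:
    # Load balancing: give each thread roughly `chunks_per_thread` chunks so
    # the dynamic scheduler can even out irregular iterations.
    balance_bound = max(1, num_iterations // (num_threads * chunks_per_thread))
    # Overhead: keep each chunk's estimated total work above `min_chunk_work`
    # so per-chunk dispatch cost is amortized for very cheap iterations.
    overhead_bound = max(1, math.ceil(min_chunk_work / max(est_work_per_iter, 1.0)))
    # Prefer the larger of the two bounds (favoring low dispatch overhead),
    # but never exceed the loop length.
    return min(num_iterations, max(balance_bound, overhead_bound))

# E.g., 16 threads and 1M cheap iterations yield a chunk size in the
# thousands rather than the fixed default of 32.
print(choose_block_dim(16, 1_000_000, est_work_per_iter=10.0))
```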

Discussions and contributions are welcome!

@turbo0628 (Member)

I think this block_dim corresponds to vectorization for SIMD units. 32 is reasonable for AVX-512, but redundant and inflexible for a large number of processors.

There are other things that are important for CPU performance, especially cache blocking and register planning; perhaps an additional layer of IR could enable better CPU scheduling. Do you have any thoughts?

@k-ye (Member) commented Dec 29, 2021

There is currently no CPU vectorization in Taichi, and we haven't explored the scheduling space much yet. I think restoring CPU vectorization could be the more economical first step, WDYT?

@turbo0628 (Member)

> I think restoring CPU vectorization could be the more economical first step, WDYT?

Agree. Vectorization is the most beneficial technique, and we can easily improve some CPU-specific use cases through the LLVM backend. Schedulers should definitely come in a second phase.

Are there any suitable cases or performance issues that we can work on?
