Halide and Adams 2019 autoscheduler performance drastically decreases with environment variable KMP_AFFINITY set to granularity=fine,scatter #8538

ivangarcia44 · 2024-12-23T21:59:47Z

We are comparing the performance of Halide with Adams 2019 on various sizes of matrix multiplication against another technology.

As part of that comparison we set the following two environment variables:

export KMP_AFFINITY=granularity=fine,scatter
export OMP_NUM_THREADS=6

Halide's runtime performance drops by around 5x when KMP_AFFINITY is set as above, compared with leaving it unset. The OMP_NUM_THREADS environment variable has little effect. The other technology's runtime performance is largely unaffected by these two environment variables.

Is it known why the KMP_AFFINITY setting above affects Halide's runtime performance? What would the recommended setting be? Please let me know if you have a link to the recommended environment variable settings for getting the best performance from Halide and Adams 2019.

My machine is an AMD EPYC 74F3 24-Core Processor (x86_64) with 10 CPUs.

Thanks,
Ivan

abadams · 2024-12-23T23:15:26Z

Are you using a custom thread pool? Or are you reusing your OpenMP threads for Halide's threads somehow? As far as I can tell, KMP_AFFINITY should only affect code using OpenMP. Maybe all of Halide's threads are getting pinned to the same core as the main thread. I advise doing your Halide tests in a separate process without KMP_AFFINITY set.
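One way to test the pinning hypothesis on Linux is to print the CPU affinity mask of the main thread. This is a minimal sketch, not something from Halide itself: the helper names `allowed_cpus` and `print_allowed_cpus` are made up for illustration, and it assumes glibc's `sched_getaffinity`. On Linux a newly created thread inherits the affinity mask of the thread that creates it, so if the main thread has been pinned to one core, worker threads started afterwards would plausibly end up on that core too.

```cpp
// Linux-only sketch: report which CPUs the calling thread is allowed to run on.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the CPU_* macros on some toolchains
#endif
#include <sched.h>   // sched_getaffinity, cpu_set_t, CPU_ZERO, CPU_ISSET

#include <cassert>
#include <cstdio>
#include <vector>

// Hypothetical helper: returns the CPU indices in the calling thread's
// affinity mask, or an empty vector if the query fails.
std::vector<int> allowed_cpus() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    // A pid of 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return {};
    }
    std::vector<int> cpus;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &mask)) {
            cpus.push_back(cpu);
        }
    }
    return cpus;
}

// Hypothetical helper: prints the mask with a label, so that output taken
// at different points in the program can be told apart.
void print_allowed_cpus(const char *label) {
    assert(label != nullptr);
    const std::vector<int> cpus = allowed_cpus();
    std::printf("%s: %zu CPU(s) allowed:", label, cpus.size());
    for (int cpu : cpus) {
        std::printf(" %d", cpu);
    }
    std::printf("\n");
}
```

Calling `print_allowed_cpus` once at the start of `main` and again just before the first Halide pipeline runs, with and without KMP_AFFINITY set, should show whether the mask shrinks to a single CPU in the slow configuration.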

But matrix multiplication is really not a good use case for Adams 2019. You can write down a good schedule for a matrix multiply directly, but it's somewhat fiddly (see test/performance/matrix_multiplication.cpp). Adams 2019 is designed for imaging pipelines, and would have to get extraordinarily lucky to find that matrix multiply schedule. It won't even attempt the rfactor, so any split-k schedules are out, and if you don't add the wrapper Func yourself, it's going to be forced to do a whole separate pass just to zero-initialize the output. It also doesn't use Func::in() so there can't be any staging of inputs, which is sometimes helpful. Scheduling a matrix multiply is unlike scheduling most other code (e.g. register pressure is the key concern for the inner loop, tiled storage actually makes sense for the memory hierarchy, etc).

If I were autoscheduling CPU matrix multiplies in Halide I'd just use the schedule from that test and add autotuning over the split factors (mostly tile_y and tile_k).
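The autotuning idea can be sketched as an exhaustive search over candidate split factors. The code below is a plain C++ stand-in, not the Halide schedule from the test: `matmul_tiled`, `time_matmul`, `autotune`, and `TileChoice` are hypothetical names, and only `tile_y` and `tile_k` come from the comment above. With Halide, the benchmark callback would instead rebuild and time the pipeline using those factors in its schedule.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Row-major n x n multiply, C = A * B, tiled over output rows (tile_y)
// and over the reduction dimension (tile_k).
void matmul_tiled(const std::vector<float> &a, const std::vector<float> &b,
                  std::vector<float> &c, int n, int tile_y, int tile_k) {
    assert(tile_y > 0 && tile_k > 0);
    std::fill(c.begin(), c.end(), 0.0f);
    for (int y0 = 0; y0 < n; y0 += tile_y) {
        for (int k0 = 0; k0 < n; k0 += tile_k) {
            const int y1 = std::min(y0 + tile_y, n);
            const int k1 = std::min(k0 + tile_k, n);
            for (int y = y0; y < y1; ++y) {
                for (int k = k0; k < k1; ++k) {
                    const float a_yk = a[y * n + k];
                    for (int x = 0; x < n; ++x) {
                        c[y * n + x] += a_yk * b[k * n + x];
                    }
                }
            }
        }
    }
}

// Times one run of the tiled multiply and returns the elapsed seconds.
double time_matmul(int n, int tile_y, int tile_k) {
    const std::size_t count = static_cast<std::size_t>(n) * n;
    std::vector<float> a(count, 1.0f), b(count, 2.0f), c(count);
    const auto start = std::chrono::steady_clock::now();
    matmul_tiled(a, b, c, n, tile_y, tile_k);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

struct TileChoice {
    int tile_y;
    int tile_k;
    double seconds;
};

// Stand-in for "build and benchmark with these split factors".
using Benchmark = std::function<double(int tile_y, int tile_k)>;

// Tries every (tile_y, tile_k) pair and keeps the fastest one.
TileChoice autotune(const std::vector<int> &tile_ys,
                    const std::vector<int> &tile_ks,
                    const Benchmark &run) {
    TileChoice best{-1, -1, std::numeric_limits<double>::infinity()};
    for (int ty : tile_ys) {
        for (int tk : tile_ks) {
            const double seconds = run(ty, tk);
            if (seconds < best.seconds) {
                best = {ty, tk, seconds};
            }
        }
    }
    return best;
}
```

A call such as `autotune({4, 8, 16}, {32, 64, 128}, [](int ty, int tk) { return time_matmul(512, ty, tk); })` would pick the fastest pair for a 512 x 512 problem on the current machine. A real tuner would likely take the best of several runs per candidate to reduce timing noise.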
