Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: mark a Func as no_profiling, to prevent injection of profiling. #8136

Closed
wants to merge 2 commits into from

Conversation

mcourteaux
Copy link
Contributor

This PR adds a small change that allows you to .no_profiling() on a Func to prevent the corresponding ProducerConsumer node to not be instrumented with profiling. Therefore all the collected samples are attributed to the enclosing ProducerConsumer. I wanted to give this a try to see what the generated code looks like.

I was thinking this might be useful to get rid of this overhead manually when you are working on pipelines that have a Func without any loops sitting in the innermost loop nest level, e.g.: the Func gets compiled to only 1 to 5 instructions.

Feedback and opinions welcome!


My use case:

This code full of profiling injections (with an unrolled loop around it):
image

Became this:
image

Meanwhile, the profile report went from:

neonraw_highlight_ratio_extractor
 total time: 1389.949829 ms  samples: 1309  runs: 48  time/run: 28.957289 ms
 average threads used: 2.828113
 heap allocations: 0  peak heap usage: 0 bytes
    halide_malloc:                0.000ms ( 0.0%)   threads: 0.000 
    halide_free:                  0.000ms ( 0.0%)   threads: 0.000 
    ratio_map:                   20.969ms (72.4%)   threads: 2.685 
    repeat_edge:                  3.639ms (12.5%)   threads: 3.219  stack: 2040
    max_field_hpass:              0.617ms ( 2.1%)   threads: 3.678  stack: 1920
    max_field:                    0.263ms ( 0.9%)   threads: 2.000  stack: 1536
    min_field_hpass:              0.462ms ( 1.5%)   threads: 2.333  stack: 640
    max_channel:                  0.088ms ( 0.3%)   threads: 1.500  stack: 68
    min_field:                    0.044ms ( 0.1%)   threads: 1.500  stack: 512
    b:                            0.352ms ( 1.2%)   threads: 1.750  stack: 32
    g:                            0.154ms ( 0.5%)   threads: 1.571  stack: 32
    r:                            0.110ms ( 0.3%)   threads: 4.000  stack: 32
    lum:                          0.619ms ( 2.1%)   threads: 4.222  stack: 32
    lum_pow3:                     0.044ms ( 0.1%)   threads: 1.500  stack: 32
    lum4:                         1.547ms ( 5.3%)   threads: 3.542  stack: 32
    w_0:                          0.000ms ( 0.0%)   threads: 0.000  stack: 32
    w_1:                          0.022ms ( 0.0%)   threads: 14.000 stack: 32
    w_2:                          0.022ms ( 0.0%)   threads: 2.000  stack: 32

to:

neonraw_highlight_ratio_extractor
 total time: 1376.182739 ms  samples: 1295  runs: 48  time/run: 28.670473 ms
 average threads used: 2.325097
 heap allocations: 0  peak heap usage: 0 bytes
    halide_malloc:                0.000ms ( 0.0%)   threads: 0.000 
    halide_free:                  0.000ms ( 0.0%)   threads: 0.000 
    ratio_map:                   24.013ms (83.7%)   threads: 2.245 
    repeat_edge:                  3.434ms (11.9%)   threads: 2.647  stack: 2040
    max_field_hpass:              0.515ms ( 1.7%)   threads: 3.130  stack: 1920
    max_field:                    0.176ms ( 0.6%)   threads: 3.500  stack: 1536
    min_field_hpass:              0.486ms ( 1.6%)   threads: 2.181  stack: 640
    max_channel:                  0.000ms ( 0.0%)   threads: 0.000  stack: 68
    min_field:                    0.044ms ( 0.1%)   threads: 8.500  stack: 512
    b:                            0.000ms ( 0.0%)   threads: 0.000  stack: 32
    g:                            0.000ms ( 0.0%)   threads: 0.000  stack: 32
    r:                            0.000ms ( 0.0%)   threads: 0.000  stack: 32
    lum:                          0.000ms ( 0.0%)   threads: 0.000  stack: 32
    lum_pow3:                     0.000ms ( 0.0%)   threads: 0.000  stack: 32
    lum4:                         0.000ms ( 0.0%)   threads: 0.000  stack: 32
    w_0:                          0.000ms ( 0.0%)   threads: 0.000  stack: 32
    w_1:                          0.000ms ( 0.0%)   threads: 0.000  stack: 32
    w_2:                          0.000ms ( 0.0%)   threads: 0.000  stack: 32

@abadams abadams added the dev_meeting Topic to be discussed at the next dev meeting label Mar 5, 2024
@mcourteaux mcourteaux marked this pull request as draft March 6, 2024 10:45
@mcourteaux
Copy link
Contributor Author

I still run in a situation where the information from calling Func::no_profiling() is not fully propagated to the ProducerConsumer node. I don't see where it goes wrong. It works for quite some of the Funcs I call this on, but not all.

image

In the screenshot above, w_0, w_1, and w_2 should not have the injection be done on the produce.

@steven-johnson steven-johnson requested a review from abadams March 6, 2024 17:18
@abadams
Copy link
Member

abadams commented Mar 6, 2024

The idea seems useful, but I don't think this should be a flag on ProducerConsumer nodes for such a niche purpose. The nice thing about ProducerConsumer nodes is that they give you the name of the Func being produced/consumed. We never make ones that don't correspond directly to a Func. So I think a better way to implement it is to look up the Func in the environment passed around during lowering and read the flag from its schedule.

@mcourteaux
Copy link
Contributor Author

Replaced by #8143. Thanks for the feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dev_meeting Topic to be discussed at the next dev meeting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants