
Improve code size and compile time for local laplacian app #7927

Merged: 9 commits into main, Nov 21, 2023

Conversation

abadams (Member) commented Oct 31, 2023

This reduces compile time for the manual local laplacian schedule from 4.9s to 2.2s, and reduces code size from 126k to 82k.

It's all from avoiding loop partitioning. Most of the reduction comes from avoiding a pointless boundary condition in the output Func, which triggered partitioning. A smaller amount comes from using RoundUp and Partition::Never. The Partition::Never calls are responsible for a 3% reduction in code size and compile times by themselves.

This has basically no effect on runtime. It seems to reduce it very slightly, but it's in the noise.
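
For concreteness, here is a minimal sketch of the two techniques named above (hypothetical Func and Var names, not the app's actual schedule):

Func out("out");
Var x("x"), y("y");
out(x, y) = x + y;

// RoundUp removes the loop tail entirely (safe when over-compute past
// the edge is acceptable), so there is no tail case to partition.
// Partition::Never emits one loop body instead of
// prologue/steady-state/epilogue specializations.
out.vectorize(x, 8, TailStrategy::RoundUp)
   .partition(x, Partition::Never)
   .partition(y, Partition::Never);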

There's also a drive-by fix to Generator.h, exporting the Partition enum class.

abadams (Member, Author) commented Oct 31, 2023

Updated to also drop partitioning in y on the padded input, which saves another 2k code size.

@@ -81,10 +81,10 @@ class LocalLaplacian : public Halide::Generator<LocalLaplacian> {
         // Reintroduce color (Connelly: use eps to avoid scaling up noise w/ apollo3.png input)
         Func color;
         float eps = 0.01f;
-        color(x, y, c) = outGPyramid[0](x, y) * (floating(x, y, c) + eps) / (gray(x, y) + eps);
+        color(x, y, c) = input(x, y, c) * (outGPyramid[0](x, y) + eps) / (gray(x, y) + eps);
A reviewer (Member) commented on this change:

This is not a bit-exact change? I guess it's very close and doesn't really matter (especially if it helps to avoid the boundary condition), just wanted to double-check.

abadams (Member, Author) replied:

Yes, this is not bit exact, but it made more sense to me with the change to the scaling of this term. Before it took the ratio of the input color channel to the input grayscale image, and applied that ratio to the output grayscale. Now it computes the ratio of the output grayscale to the input grayscale, and applies that as a scaling factor to the input.

abadams (Member, Author) added:

(The only difference is which term in the numerator gets a +eps)

abadams commented Oct 31, 2023

Also updated to include a cuda schedule. Some thoughts come to mind from using this:

  • Now that we have Never/Auto/Always, maybe Auto should default to off instead of on inside shader code.
  • One mistake I made was applying .partition after a split, and having Halide partition the other child Var instead. Sometimes this other child Var is anonymous (e.g. in .parallel(y, 32)). It's possibly better practice to apply .partition before splits so that child Vars just inherit the policy (see the sketch after this list).
  • Two pieces of sugar would be nice: never_partition(Var), but also never_partition(), which would apply to all Vars of that Func. For most of these cases I just knew that I didn't want any kind of loop partitioning for a particular Func.
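
A minimal sketch of the ordering pitfall in the second bullet (hypothetical Funcs, using the Func::partition API discussed in this thread):

Func f("f"), g("g");
Var x("x"), y("y"), yo("yo"), yi("yi");
f(x, y) = x + y;
g(x, y) = x + y;

// After a split there are two child Vars, and with .parallel(y, 32) the
// outer child is anonymous, so it can't be named in .partition() at all.
f.split(y, yo, yi, 32);
f.partition(yi, Partition::Never);  // easy to target the wrong child; yo keeps Auto

// Applying the directive before the split lets both children inherit it.
g.partition(y, Partition::Never);
g.split(y, yo, yi, 32);  // yo and yi both inherit Never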

avx512 schedule reduces code size from 95k to 60k
cuda schedule reduces code size from 261k to 194k

No impact on performance
abadams commented Nov 1, 2023

Updated to also cover the interpolate app. Reduces code size by 30% on CPU and 20% on GPU, with no impact on performance. Once again, I just turned it off for every Func in the cuda schedule.

@mcourteaux have you found a case where you actually want automatic loop partitioning inside a cuda kernel? Now that it's possible to turn on manually, I'm really wondering if it should be off by default in shaders.

mcourteaux (Contributor) replied:

I can't remember offhand if there were. I'll check it out tomorrow. But in general, I think it is a sane strategy not to do loop partitioning in shaders and kernels. That was at least my intuition.

mcourteaux added:

I guess that for very large convolutions, loop partitioning the reduction loop might be beneficial.

Take an RDom over an 11x11 domain, for example. The clamping overhead might become significant there, and it might be beneficial to get rid of it in the steady state. The subtlety is that it might be cool to partition based on the gpu block size; that way you still avoid branch divergence, i.e. all threads in every block of the outer ring of blocks run the clamping code, and all other threads run the optimized code.

Although I think this is a different type of partitioning than what currently happens. You'd want:

if (on-outer-ring) { for with clamping } else { for without clamping }

Instead of the prologue, steady state, and epilogue loops.
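
As a concrete (hypothetical) version of that example, every tap of a padded 11x11 convolution carries clamping that partitioning could strip from the steady state:

ImageParam input(Float(32), 2);
ImageParam kernel(Float(32), 2);
Var x("x"), y("y");

// repeat_edge clamps every access to the input's bounds.
Func padded = BoundaryConditions::repeat_edge(input);

RDom r(-5, 11, -5, 11);  // 11x11 reduction domain
Func conv("conv");
conv(x, y) += padded(x + r.x, y + r.y) * kernel(r.x + 5, r.y + 5);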

mcourteaux commented Nov 2, 2023

Thinking about this, I think Always is really more like AutoWithExpectedEffect. What Auto and Always currently do is what could be the enum value PrologueSteadystateEpilogue or EpiloguePrologue or whatever. The loop partitioning strategy I mentioned could be RoundToGPUBlock or something. I'm starting to see potential for more than simply Always.

I think we are early enough to still tweak the API, as probably nobody besides the two of us has started using this.

abadams commented Nov 2, 2023

For serial loops we generate (prologue, steady-state, epilogue), but for parallel loops like the gpu blocks loop we currently generate:

if (not in prologue or epilogue) {
  version without clamping
} else {
  version with clamping
}

This happens in x and y, so in a shader you might see something like:

if (not at top or bottom) {
  if (not at left or right) {
    version without clamping
  } else {
    version with clamping in x only
  }
} else {
  version with clamping in x and y
}

Here's a real example from a repeat edge boundary condition in cuda:

  gpu_block<CUDA> (repeat_edge$2.s0.v15.v17.__block_id_y, 0, t307) {
   gpu_block<CUDA> (repeat_edge$2.s0.v14.v16.__block_id_x, 0, t308) {
    gpu_thread<CUDA> (.__thread_id_y, 0, 8) {
     gpu_thread<CUDA> (.__thread_id_x, 0, 8) {
      if ((repeat_edge$2.s0.v15.v17.__block_id_y.prologue <= repeat_edge$2.s0.v15.v17.__block_id_y) && (repeat_edge$2.s0.v15.v17.__block_id_y < repeat_edge$2.s0.v15.v17.__block_id_y.epilogue)) {
       let repeat_edge$2.s0.v15.v19.base.s = min(repeat_edge$2.s0.v15.v17.__block_id_y*8, repeat_edge$2.extent.1 + -8)
       if ((t225 <= repeat_edge$2.s0.v14.v16.__block_id_x) && (repeat_edge$2.s0.v14.v16.__block_id_x < t226)) {
        let t242 = (repeat_edge$2.s0.v14.v16.__block_id_x*8) + t309
        let t300 = (repeat_edge$2.min.1 + repeat_edge$2.s0.v15.v19.base.s) + .__thread_id_y
        repeat_edge$2[((repeat_edge$2.stride.1*t300) + ((repeat_edge$2.s0.v14.v16.__block_id_x*8) - (repeat_edge$2.min.1*repeat_edge$2.stride.1))) + .__thread_id_x] = p0[((((t300 + 1)*p0.stride.1) + t242) + .__thread_id_x) + 1] + (p0[((((t300 + -1)*p0.stride.1) + t242) + .__thread_id_x) + -1] + p0[((p0.stride.1*t300) + t242) + .__thread_id_x])
       } else {
        let repeat_edge$2.s0.v14.v18.base.s = min(repeat_edge$2.s0.v14.v16.__block_id_x*8, repeat_edge$2.extent.0 + -8)
        let t301 = (repeat_edge$2.s0.v15.v19.base.s + t311) + .__thread_id_y
        let t302 = (repeat_edge$2.min.0 + repeat_edge$2.s0.v14.v18.base.s) + .__thread_id_x
        repeat_edge$2[((((repeat_edge$2.min.1 + repeat_edge$2.s0.v15.v19.base.s) + .__thread_id_y)*repeat_edge$2.stride.1) + (repeat_edge$2.s0.v14.v18.base.s - (repeat_edge$2.min.1*repeat_edge$2.stride.1))) + .__thread_id_x] = p0[max(min(t302 + 2, t312) + -1, p0.min.0) + (((t301 + 1)*p0.stride.1) - p0.min.0)] + (p0[max(min(t302, t312) + -1, p0.min.0) + (((t301 + -1)*p0.stride.1) - p0.min.0)] + p0[max(min(t312 + -1, t302), p0.min.0) + ((p0.stride.1*t301) - p0.min.0)])
       }
      } else {
       let repeat_edge$2.s0.v15.v19.base.s = min(repeat_edge$2.s0.v15.v17.__block_id_y*8, repeat_edge$2.extent.1 + -8)
       let repeat_edge$2.s0.v14.v18.base.s = min(repeat_edge$2.s0.v14.v16.__block_id_x*8, repeat_edge$2.extent.0 + -8)
       let t304 = (repeat_edge$2.s0.v15.v19.base.s + t311) + .__thread_id_y
       let t305 = (repeat_edge$2.min.0 + repeat_edge$2.s0.v14.v18.base.s) + .__thread_id_x
       repeat_edge$2[((((repeat_edge$2.min.1 + repeat_edge$2.s0.v15.v19.base.s) + .__thread_id_y)*repeat_edge$2.stride.1) + (repeat_edge$2.s0.v14.v18.base.s - (repeat_edge$2.min.1*repeat_edge$2.stride.1))) + .__thread_id_x] = p0[max(min(t305 + 2, t312) + -1, p0.min.0) + (((max(min(t304 + 2, p0.extent.1), 1) + -1)*p0.stride.1) - p0.min.0)] + (p0[max(min(t305, t312) + -1, p0.min.0) + (((max(min(p0.extent.1, t304), 1) + -1)*p0.stride.1) - p0.min.0)] + p0[max(min(t312 + -1, t305), p0.min.0) + ((max(min(p0.extent.1 + -1, t304), 0)*p0.stride.1) - p0.min.0)])
      }
     }
    }
   }
  }

mcourteaux (Contributor) replied:
but for parallel loops like the gpu blocks loop we currently generate:

if (not in prologue or epilogue) {
  version without clamping
} else {
  version with clamping
}

I don't really know what your pipeline/schedule was here, but I don't understand what caused the if-logic to only check against the block_id and not the thread_id.

In case that was coincidence and not generally the case: in general, the if-condition on GPU architectures with thread blocks could make use of the alternative partitioning strategy I briefly hinted at in my last comment. All threads in a warp are going to execute the same instructions, so if the control flow introduced by loop partitioning causes a warp to enter both branches, we pay the cost of both branches for that warp. A way to remedy this branch divergence within the warp is to make sure that all warps in the block follow the same control flow. You could probably achieve this relatively easily with something along the lines of:

if (in-prologue-or-epilogue-rounded-conservatively-towards-gpu-block-boundary) {
    variant with clamping;
} else {
    variant without clamping;
}

abadams commented Nov 3, 2023

It is generally the case. It checked against the block id only because automatic loop partitioning partitions the outermost possible loop that will remove the boundary condition, and the block loop is outside the thread loop.

abadams added the dev_meeting label (Topic to be discussed at the next dev meeting) on Nov 10, 2023
abadams commented Nov 10, 2023

Sugar proposed in the dev meeting:

never_partition(variadic list of vars...)
never_partition_all()
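
In sketch form (hypothetical usage of the proposed sugar, on a hypothetical Func):

Func f("f");
Var x("x"), y("y");
f(x, y) = x + y;

f.never_partition(x, y);  // variadic list of Vars
f.never_partition_all();  // every loop of f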

abadams removed the dev_meeting label on Nov 10, 2023
y wasn't being partitioned, but this more clearly says "I'm optimizing
for code size"
mcourteaux (Contributor) commented:

Super cool to see that there is some real use already for this scheduling directive! 😄 It's satisfying to see code size reductions at the same performance.

mcourteaux added:

Seeing these code size improvements, I wonder what the impact would be on your amazing work on the median filters, where compile times and code size started skyrocketing. I don't have access to the paper, so I can't eyeball the pipeline, and I don't even know whether Halide was doing partitioning there.

@@ -102,21 +102,36 @@ class LocalLaplacian : public Halide::Generator<LocalLaplacian> {
             // Nothing.
         } else if (get_target().has_gpu_feature()) {
             // GPU schedule.
-            // 3.19ms on an RTX 2060.
+            // 2.9ms on an RTX 2060.
mcourteaux (Contributor) commented on this change:

Interesting. If loop partitioning did not impact performance, then what did in this PR? Newer LLVM? Newer CUDA driver? Or the non-bit-exact change from above?

abadams commented Nov 14, 2023

In another branch, I found myself wanting to partition an inner loop in the tail case of an outer loop, because that tail case could be quite large, and it seemed reasonable to me that Always should do this. I added sugar for always too, for that reason.
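
Presumably the always variants mirror the never ones; a sketch, assuming the naming is symmetric to the proposal above:

Func f("f");
Var x("x");
f(x) = x;

f.always_partition(x);     // force partitioning of the x loop
f.always_partition_all();  // force it for every loop of f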

mcourteaux (Contributor) replied:
I found myself wanting to partition an inner loop in the tail case of an outer loop

How do I interpret this? Like:

for (int xo = 0; xo < 10; ++xo) { // outer
   for (int x = 0; x < 32; ++x) { // inner
       // partition this?
   }
}
// tail of outer?
for (int x = 0; x < 20; ++x) { // inner loop in tail of outer?
    // partition this as well?
}

I must say I don't have a good understanding of the tail strategies in Halide. Tail strategies are for when a split loop does not factor perfectly into a number of iterations for the outer loop and a number of iterations for the inner loop; a tail gets inserted after the outer loop to handle the remaining iterations. Loop partitioning, however, exists because of potential body simplification in the presence of BoundaryConditions, which applies all around the domain of the input. I don't understand how a tail and loop partitioning interact.


Regarding the modified test that now expects 9 zones, how do I count those?

for outer.epilogue { for inner {} }
for outer.ss {
  for inner.prologue {}
  for inner.ss {}
  for inner.epilogue {}
}
// tail
for inner.prologue {}
for inner.ss {}
for inner.epilogue {}

This way I only count 7, and I really don't understand. I think a better description of what is happening and what is expected in that test would be nice.

abadams commented Nov 14, 2023

Loop tails in Halide are implemented using loop partitioning rather than by adding an additional scalar loop at the end. For GuardWithIf, for example, we just generate a single loop nest that looks like this:

for xo:
  for xi:
    let x = xo * 8 + xi
    if (likely(x < width)) f(x) = ...

and then rely on loop partitioning to separate the tail case from the steady state. In some situations this simplifies down to a scalar loop that does width % 8 iterations, but in others you'll get different code. E.g. if xi is vectorized and you're on an architecture that has predicated store instructions, it'll just do a single predicated vector computation. In general, scalar loop tails are bad for performance because they're not vectorized, and they can be large on architectures with wide vectors, so we try to avoid them.
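
As a self-contained sketch of that GuardWithIf structure (standard Halide API; the split factor and extent are arbitrary):

#include "Halide.h"
using namespace Halide;

int main() {
    Func f("f");
    Var x("x"), xo("xo"), xi("xi");
    f(x) = x * 2;

    // GuardWithIf wraps the body in if (likely(...)) rather than emitting
    // a separate scalar tail loop; loop partitioning then peels the tail.
    f.split(x, xo, xi, 8, TailStrategy::GuardWithIf).vectorize(xi);

    f.realize({100});  // 100 is not a multiple of 8, so there is a tail
    return 0;
}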

If you have a tail strategy and a boundary condition, then the prologue handles the region before the start of the boundary, the steady state handles the interior of the image up to some multiple of the split factor, and the epilogue handles both the region past the end of the boundary and the loop tail case.

The 9 zones in the test are these:

top left      | top middle    | top right
middle left   | middle        | middle right
bottom left   | bottom middle | bottom right

With Partition::Auto, we don't partition the prologue and epilogue in y, so you get this instead:

top
middle left | middle | middle right
bottom

abadams commented Nov 14, 2023

In the case I was wrestling with, I just had partitioning from loop tails, not from boundary conditions, so I only have an epilogue, not a prologue. I was getting code with 3 cases that divide the domain like this:

steady state | x epilogue
y epilogue (spanning the full width)

but I want code with 4 cases that divides the domain like this:

steady state               | x epilogue
x steady state, y epilogue | x, y epilogue

mcourteaux (Contributor) replied:
Thank you for the clarification that tail strategies are basically implemented by modifying the IR such that loop partitioning will trigger. This is clearly the case for GuardWithIf, as the likely() function is used in the if-condition. I wonder if loop partitioning is also at play when using ShiftInwards. Eyeballing the code, ShiftInwards also makes use of likely(), which will trigger loop partitioning and remove the min(), while still getting you the full-length inner loop. This would mean that the directives for disabling loop partitioning can now also significantly reduce code size, at close to no cost, for vectorized loops that use ShiftInwards.

I think this explains confusion I had a long while ago: I was trying to get small code and was using ShiftInwards, as that, in my head, meant using min() to make sure the inner loop fully fits inside the dimensions of the Func. Yet the code it generated was huge and still duplicated almost everything. I thought that was something you could expect from GuardWithIf, since there you indeed need a scalar loop for the tail most of the time. But I expected a vectorized inner loop to be fully compatible with ShiftInwards, which it is; loop partitioning was still duplicating the code, though, fully confusing me about what was happening.
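
If that reading is right, the combination would look something like this sketch (hypothetical Func; assumes the never_partition_all sugar discussed earlier in this thread):

Func blur("blur");
Var x("x"), y("y");
blur(x, y) = cast<float>(x + y);

// ShiftInwards keeps full vector iterations by shifting the last vector
// inwards (a min() on the loop start); disabling partitioning then keeps
// the single min()-guarded loop instead of a specialized steady state.
blur.vectorize(x, 8, TailStrategy::ShiftInwards)
    .never_partition_all();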

abadams commented Nov 21, 2023

Failures unrelated. Merging.

abadams merged commit 8c28a73 into main on Nov 21, 2023. 17 of 19 checks passed.