
Improve code size and compile time for local laplacian app #7927

Merged: 9 commits into main, Nov 21, 2023

Conversation

abadams (Member) commented Oct 31, 2023

This reduces compile time for the manual local laplacian schedule from 4.9s to 2.2s, and reduces code size from 126k to 82k.

It's all from avoiding loop partitioning. Most of the reduction comes from avoiding a pointless boundary condition in the output Func, which triggered partitioning. A smaller amount comes from using RoundUp and Partition::Never. The Partition::Never calls are responsible for a 3% reduction in code size and compile times by themselves.

This has basically no effect on runtime. It seems to reduce it very slightly, but it's in the noise.
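
For concreteness, here is a minimal sketch of the two techniques named above (hypothetical Func and Var names, not the app's actual schedule):

Func out("out");
Var x("x"), y("y");
out(x, y) = x + y;

// RoundUp removes the loop tail entirely (safe when over-compute past
// the edge is acceptable), so there is no tail case to partition.
// Partition::Never emits one loop body instead of
// prologue/steady-state/epilogue specializations.
out.vectorize(x, 8, TailStrategy::RoundUp)
   .partition(x, Partition::Never)
   .partition(y, Partition::Never);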

There's also a drive-by fix to Generator.h, exporting the Partition enum class.

abadams (Member, Author) commented Oct 31, 2023

Updated to also drop partitioning in y on the padded input, which saves another 2k code size.

@@ -81,10 +81,10 @@ class LocalLaplacian : public Halide::Generator<LocalLaplacian> {
         // Reintroduce color (Connelly: use eps to avoid scaling up noise w/ apollo3.png input)
         Func color;
         float eps = 0.01f;
-        color(x, y, c) = outGPyramid[0](x, y) * (floating(x, y, c) + eps) / (gray(x, y) + eps);
+        color(x, y, c) = input(x, y, c) * (outGPyramid[0](x, y) + eps) / (gray(x, y) + eps);
A reviewer (Member) commented on this change:

This is not a bit-exact change? I guess it's very close and doesn't really matter (especially if it helps to avoid the boundary condition), just wanted to double-check.

abadams (Member, Author) replied:

Yes, this is not bit exact, but it made more sense to me with the change to the scaling of this term. Before it took the ratio of the input color channel to the input grayscale image, and applied that ratio to the output grayscale. Now it computes the ratio of the output grayscale to the input grayscale, and applies that as a scaling factor to the input.

abadams (Member, Author) added:

(The only difference is which term in the numerator gets a +eps)

abadams commented Oct 31, 2023

Also updated to include a cuda schedule. Some thoughts come to mind from using this:

  • Now that we have Never/Auto/Always, maybe Auto should default to off instead of on inside shader code.
  • One mistake I made was applying .partition after a split, and having Halide partition the other child Var instead. Sometimes this other child Var is anonymous (e.g. in .parallel(y, 32)). It's possibly better practice to apply .partition before splits so that child Vars just inherit the policy (see the sketch after this list).
  • Two pieces of sugar would be nice: never_partition(Var), but also never_partition(), which would apply to all Vars of that Func. For most of these cases I just knew that I didn't want any kind of loop partitioning for a particular Func.
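
A minimal sketch of the ordering pitfall in the second bullet (hypothetical Funcs, using the Func::partition API discussed in this thread):

Func f("f"), g("g");
Var x("x"), y("y"), yo("yo"), yi("yi");
f(x, y) = x + y;
g(x, y) = x + y;

// After a split there are two child Vars, and with .parallel(y, 32) the
// outer child is anonymous, so it can't be named in .partition() at all.
f.split(y, yo, yi, 32);
f.partition(yi, Partition::Never);  // easy to target the wrong child; yo keeps Auto

// Applying the directive before the split lets both children inherit it.
g.partition(y, Partition::Never);
g.split(y, yo, yi, 32);  // yo and yi both inherit Never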

avx512 schedule reduces code size from 95k to 60k
cuda schedule reduces code size from 261k to 194k

No impact on performance
abadams commented Nov 1, 2023

Updated to also cover the interpolate app. Reduces code size by 30% on CPU and 20% on GPU, with no impact on performance. Once again, I just turned it off for every Func in the cuda schedule.

@mcourteaux have you found a case where you actually want automatic loop partitioning inside a cuda kernel? Now that it's possible to turn on manually, I'm really wondering if it should be off by default in shaders.

mcourteaux (Contributor) replied:

I can't remember offhand if there were. I'll check it out tomorrow. But in general, I think it is a sane strategy not to do loop partitioning in shaders and kernels. That was at least my intuition.

mcourteaux added:

I guess that for very large convolutions, loop partitioning the reduction loop might be beneficial.

Take an RDom over an 11x11 domain, for example. The clamping overhead might become significant there, and it might be beneficial to get rid of it in the steady state. The subtlety is that it might be cool to partition based on the gpu block size; that way you still avoid branch divergence, i.e. all threads in every block of the outer ring of blocks run the clamping code, and all other threads run the optimized code.

Although I think this is a different type of partitioning than what currently happens. You'd want:

if (on-outer-ring) { for with clamping } else { for without clamping }

Instead of the prologue, steady state, and epilogue loops.
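
As a concrete (hypothetical) version of that example, every tap of a padded 11x11 convolution carries clamping that partitioning could strip from the steady state:

ImageParam input(Float(32), 2);
ImageParam kernel(Float(32), 2);
Var x("x"), y("y");

// repeat_edge clamps every access to the input's bounds.
Func padded = BoundaryConditions::repeat_edge(input);

RDom r(-5, 11, -5, 11);  // 11x11 reduction domain
Func conv("conv");
conv(x, y) += padded(x + r.x, y + r.y) * kernel(r.x + 5, r.y + 5);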

mcourteaux commented Nov 2, 2023

Thinking about this, I think Always is really more like AutoWithExpectedEffect. What Auto and Always currently do is what could be the enum value PrologueSteadystateEpilogue or EpiloguePrologue or whatever. The loop partitioning strategy I mentioned could be RoundToGPUBlock or something. I'm starting to see potential for more than simply Always.

I think we are early enough to still tweak the API, as probably nobody besides the two of us has started using this.

abadams commented Nov 2, 2023

For serial loops we generate (prologue, steady-state, epilogue), but for parallel loops like the gpu blocks loop we currently generate:

if (not in prologue or epilogue) {
  version without clamping
} else {
  version with clamping
}

This happens in x and y, so in a shader you might see something like:

if (not at top or bottom) {
  if (not at left or right) {
    version without clamping
  } else {
    version with clamping in x only
  }
} else {
  version with clamping in x and y
}

Here's a real example from a repeat edge boundary condition in cuda:

  gpu_block<CUDA> (repeat_edge$2.s0.v15.v17.__block_id_y, 0, t307) {
   gpu_block<CUDA> (repeat_edge$2.s0.v14.v16.__block_id_x, 0, t308) {
    gpu_thread<CUDA> (.__thread_id_y, 0, 8) {
     gpu_thread<CUDA> (.__thread_id_x, 0, 8) {
      if ((repeat_edge$2.s0.v15.v17.__block_id_y.prologue <= repeat_edge$2.s0.v15.v17.__block_id_y) && (repeat_edge$2.s0.v15.v17.__block_id_y < repeat_edge$2.s0.v15.v17.__block_id_y.epilogue)) {
       let repeat_edge$2.s0.v15.v19.base.s = min(repeat_edge$2.s0.v15.v17.__block_id_y*8, repeat_edge$2.extent.1 + -8)
       if ((t225 <= repeat_edge$2.s0.v14.v16.__block_id_x) && (repeat_edge$2.s0.v14.v16.__block_id_x < t226)) {
        let t242 = (repeat_edge$2.s0.v14.v16.__block_id_x*8) + t309
        let t300 = (repeat_edge$2.min.1 + repeat_edge$2.s0.v15.v19.base.s) + .__thread_id_y
        repeat_edge$2[((repeat_edge$2.stride.1*t300) + ((repeat_edge$2.s0.v14.v16.__block_id_x*8) - (repeat_edge$2.min.1*repeat_edge$2.stride.1))) + .__thread_id_x] = p0[((((t300 + 1)*p0.stride.1) + t242) + .__thread_id_x) + 1] + (p0[((((t300 + -1)*p0.stride.1) + t242) + .__thread_id_x) + -1] + p0[((p0.stride.1*t300) + t242) + .__thread_id_x])
       } else {
        let repeat_edge$2.s0.v14.v18.base.s = min(repeat_edge$2.s0.v14.v16.__block_id_x*8, repeat_edge$2.extent.0 + -8)
        let t301 = (repeat_edge$2.s0.v15.v19.base.s + t311) + .__thread_id_y
        let t302 = (repeat_edge$2.min.0 + repeat_edge$2.s0.v14.v18.base.s) + .__thread_id_x
        repeat_edge$2[((((repeat_edge$2.min.1 + repeat_edge$2.s0.v15.v19.base.s) + .__thread_id_y)*repeat_edge$2.stride.1) + (repeat_edge$2.s0.v14.v18.base.s - (repeat_edge$2.min.1*repeat_edge$2.stride.1))) + .__thread_id_x] = p0[max(min(t302 + 2, t312) + -1, p0.min.0) + (((t301 + 1)*p0.stride.1) - p0.min.0)] + (p0[max(min(t302, t312) + -1, p0.min.0) + (((t301 + -1)*p0.stride.1) - p0.min.0)] + p0[max(min(t312 + -1, t302), p0.min.0) + ((p0.stride.1*t301) - p0.min.0)])
       }
      } else {
       let repeat_edge$2.s0.v15.v19.base.s = min(repeat_edge$2.s0.v15.v17.__block_id_y*8, repeat_edge$2.extent.1 + -8)
       let repeat_edge$2.s0.v14.v18.base.s = min(repeat_edge$2.s0.v14.v16.__block_id_x*8, repeat_edge$2.extent.0 + -8)
       let t304 = (repeat_edge$2.s0.v15.v19.base.s + t311) + .__thread_id_y
       let t305 = (repeat_edge$2.min.0 + repeat_edge$2.s0.v14.v18.base.s) + .__thread_id_x
       repeat_edge$2[((((repeat_edge$2.min.1 + repeat_edge$2.s0.v15.v19.base.s) + .__thread_id_y)*repeat_edge$2.stride.1) + (repeat_edge$2.s0.v14.v18.base.s - (repeat_edge$2.min.1*repeat_edge$2.stride.1))) + .__thread_id_x] = p0[max(min(t305 + 2, t312) + -1, p0.min.0) + (((max(min(t304 + 2, p0.extent.1), 1) + -1)*p0.stride.1) - p0.min.0)] + (p0[max(min(t305, t312) + -1, p0.min.0) + (((max(min(p0.extent.1, t304), 1) + -1)*p0.stride.1) - p0.min.0)] + p0[max(min(t312 + -1, t305), p0.min.0) + ((max(min(p0.extent.1 + -1, t304), 0)*p0.stride.1) - p0.min.0)])
      }
     }
    }
   }
  }

mcourteaux (Contributor) replied:
but for parallel loops like the gpu blocks loop we currently generate:

if (not in prologue or epilogue) {
  version without clamping
} else {
  version with clamping
}

I don't really know what your pipeline/schedule was here, but I don't understand what caused the if-logic to only check against the block_id and not the thread_id.

In case that was coincidence and not generally the case: in general, the if-condition on GPU architectures with thread blocks could make use of the alternative partitioning strategy I briefly hinted at in my last comment. All threads in a warp are going to execute the same instructions, so if the control flow introduced by loop partitioning causes a warp to enter both branches, we pay the cost of both branches for that warp. A way to remedy this branch divergence within the warp is to make sure that all warps in the block follow the same control flow. You could probably achieve this relatively easily with something along the lines of:

if (in-prologue-or-epilogue-rounded-conservatively-towards-gpu-block-boundary) {
    variant with clamping;
} else {
    variant without clamping;
}

abadams commented Nov 3, 2023

It is generally the case. It checked against the block id only because automatic loop partitioning partitions the outermost possible loop that will remove the boundary condition, and the block loop is outside the thread loop.

abadams added the dev_meeting label (Topic to be discussed at the next dev meeting) on Nov 10, 2023
abadams commented Nov 10, 2023

Sugar proposed in the dev meeting:

never_partition(variadic list of vars...)
never_partition_all()
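
In sketch form (hypothetical usage of the proposed sugar, on a hypothetical Func):

Func f("f");
Var x("x"), y("y");
f(x, y) = x + y;

f.never_partition(x, y);  // variadic list of Vars
f.never_partition_all();  // every loop of f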

abadams removed the dev_meeting label on Nov 10, 2023
y wasn't being partitioned, but this more clearly says "I'm optimizing
for code size"
mcourteaux (Contributor) commented:

Super cool to see that there is some real use already for this scheduling directive! 😄 It's satisfying to see code size reductions at the same performance.

mcourteaux added:

Seeing these code size improvements, I wonder what the impact would be on your amazing work on the median filters, where compile times and code size started skyrocketing. I don't have access to the paper, so I can't eyeball the pipeline, and I don't even know whether Halide was doing partitioning there.

@@ -102,21 +102,36 @@ class LocalLaplacian : public Halide::Generator<LocalLaplacian> {
             // Nothing.
         } else if (get_target().has_gpu_feature()) {
             // GPU schedule.
-            // 3.19ms on an RTX 2060.
+            // 2.9ms on an RTX 2060.
mcourteaux (Contributor) commented on this change:

Interesting. If loop partitioning did not impact performance, then what did in this PR? Newer LLVM? Newer CUDA driver? Or the non-bit-exact change from above?

abadams commented Nov 14, 2023

In another branch, I found myself wanting to partition an inner loop in the tail case of an outer loop, because that tail case could be quite large, and it seemed reasonable to me that Always should do this. I added sugar for always too, for that reason.
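
Presumably the always variants mirror the never ones; a sketch, assuming the naming is symmetric to the proposal above:

Func f("f");
Var x("x");
f(x) = x;

f.always_partition(x);     // force partitioning of the x loop
f.always_partition_all();  // force it for every loop of f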

mcourteaux (Contributor) replied:
I found myself wanting to partition an inner loop in the tail case of an outer loop

How do I interpret this? Like:

for (int xo = 0; xo < 10; ++xo) { // outer
   for (int x = 0; x < 32; ++x) { // inner
       // partition this?
   }
}
// tail of outer?
for (int x = 0; x < 20; ++x) { // inner loop in tail of outer?
    // partition this as well?
}

I must say I don't have a good understanding of the tail strategies in Halide. Tail strategies are for when a split loop does not factor perfectly into a number of iterations for the outer loop and a number of iterations for the inner loop; a tail gets inserted after the outer loop to handle the remaining iterations. Loop partitioning, however, exists because of potential body simplification in the presence of BoundaryConditions, which applies all around the domain of the input. I don't understand how a tail and loop partitioning interact.


Regarding the modified test that now expects 9 zones, how do I count those?

for outer.epilogue { for inner {} }
for outer.ss {
  for inner.prologue {}
  for inner.ss {}
  for inner.epilogue {}
}
// tail
for inner.prologue {}
for inner.ss {}
for inner.epilogue {}

This way I only count 7, and I really don't understand. I think a better description of what is happening and what is expected in that test would be nice.

abadams commented Nov 14, 2023

Loop tails in Halide are implemented using loop partitioning rather than by adding an additional scalar loop at the end. For GuardWithIf, for example, we just generate a single loop nest that looks like this:

for xo:
  for xi:
    let x = xo * 8 + xi
    if (likely(x < width)) f(x) = ...

and then rely on loop partitioning to separate the tail case from the steady state. In some situations this simplifies down to a scalar loop that does width % 8 iterations, but in others you'll get different code. E.g. if xi is vectorized and you're on an architecture that has predicated store instructions, it'll just do a single predicated vector computation. In general, scalar loop tails are bad for performance because they're not vectorized, and they can be large on architectures with wide vectors, so we try to avoid them.
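
As a self-contained sketch of that GuardWithIf structure (standard Halide API; the split factor and extent are arbitrary):

#include "Halide.h"
using namespace Halide;

int main() {
    Func f("f");
    Var x("x"), xo("xo"), xi("xi");
    f(x) = x * 2;

    // GuardWithIf wraps the body in if (likely(...)) rather than emitting
    // a separate scalar tail loop; loop partitioning then peels the tail.
    f.split(x, xo, xi, 8, TailStrategy::GuardWithIf).vectorize(xi);

    f.realize({100});  // 100 is not a multiple of 8, so there is a tail
    return 0;
}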

If you have a tail strategy and a boundary condition, then the prologue handles the region before the start of the boundary, the steady state handles the interior of the image up to some multiple of the split factor, and the epilogue handles both the region past the end of the boundary and the loop tail case.

The 9 zones in the test are these:

top left      | top middle    | top right
middle left   | middle        | middle right
bottom left   | bottom middle | bottom right

With Partition::Auto, we don't partition the prologue and epilogue in y, so you get this instead:

top
middle left | middle | middle right
bottom

abadams commented Nov 14, 2023

In the case I was wrestling with, I just had partitioning from loop tails, not from boundary conditions, so I only have an epilogue, not a prologue. I was getting code with 3 cases that divide the domain like this:

steady state | x epilogue
y epilogue (spanning the full width)

but I want code with 4 cases that divides the domain like this:

steady state               | x epilogue
x steady state, y epilogue | x, y epilogue

mcourteaux (Contributor) replied:
Thank you for the clarification that tail strategies are basically implemented by modifying the IR such that loop partitioning will trigger. This is clearly the case for GuardWithIf, as the likely() function is used in the if-condition. I wonder if loop partitioning is also at play when using ShiftInwards. Eyeballing the code, ShiftInwards also makes use of likely(), which will trigger loop partitioning and remove the min(), while still getting you the full-length inner loop. This would mean that the directives for disabling loop partitioning can now also significantly reduce code size, at close to no cost, for vectorized loops that use ShiftInwards.

I think this explains confusion I had a long while ago: I was trying to get small code and was using ShiftInwards, as that, in my head, meant using min() to make sure the inner loop fully fits inside the dimensions of the Func. Yet the code it generated was huge and still duplicated almost everything. I thought that was something you could expect from GuardWithIf, since there you indeed need a scalar loop for the tail most of the time. But I expected a vectorized inner loop to be fully compatible with ShiftInwards, which it is; loop partitioning was still duplicating the code, though, fully confusing me about what was happening.
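
If that reading is right, the combination would look something like this sketch (hypothetical Func; assumes the never_partition_all sugar discussed earlier in this thread):

Func blur("blur");
Var x("x"), y("y");
blur(x, y) = cast<float>(x + y);

// ShiftInwards keeps full vector iterations by shifting the last vector
// inwards (a min() on the loop start); disabling partitioning then keeps
// the single min()-guarded loop instead of a specialized steady state.
blur.vectorize(x, 8, TailStrategy::ShiftInwards)
    .never_partition_all();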

abadams commented Nov 21, 2023

Failures unrelated. Merging.

abadams merged commit 8c28a73 into main on Nov 21, 2023. 17 of 19 checks passed.