Improve code size and compile time for local laplacian app #7927
Conversation
This reduces compile time for the manual local laplacian schedule from 4.9s to 2.2s, and reduces code size from 126k to 82k. Most of the reduction comes from avoiding a pointless boundary condition in the output Func. A smaller amount comes from avoiding loop partitioning using RoundUp and Partition::Never. The Partition::Never calls are responsible for a 3% reduction in code size and compile times by themselves. This has basically no effect on runtime. It seems to reduce it very slightly, but it's in the noise.
Updated to also drop partitioning in y on the padded input, which saves another 2k of code size.
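For illustration, the shape of the scheduling change described above (hypothetical Func and Vars, not the exact lines from the app): RoundUp leaves no loop tail to partition, and Partition::Never tells the compiler not to generate prologue/epilogue copies of the loops at all.

```c++
Func out("out");
Var x("x"), y("y"), xi("xi");
out(x, y) = x + y;  // stand-in for the real output Func

out.split(x, x, xi, 8, TailStrategy::RoundUp)  // no tail case left to partition
   .vectorize(xi)
   .partition(x, Partition::Never)             // and never partition these loops
   .partition(y, Partition::Never);
```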
```
@@ -81,10 +81,10 @@ class LocalLaplacian : public Halide::Generator<LocalLaplacian> {
         // Reintroduce color (Connelly: use eps to avoid scaling up noise w/ apollo3.png input)
         Func color;
         float eps = 0.01f;
-        color(x, y, c) = outGPyramid[0](x, y) * (floating(x, y, c) + eps) / (gray(x, y) + eps);
+        color(x, y, c) = input(x, y, c) * (outGPyramid[0](x, y) + eps) / (gray(x, y) + eps);
```
This is not a bit-exact change? I guess it's very close and doesn't really matter (especially if it helps to avoid the boundary condition), just wanted to double-check.
Yes, this is not bit exact, but it made more sense to me with the change to the scaling of this term. Before, it took the ratio of the input color channel to the input grayscale image and applied that ratio to the output grayscale. Now it computes the ratio of the output grayscale to the input grayscale and applies that as a scaling factor to the input.
(The only difference is which term in the numerator gets a +eps)
Also updated to include a cuda schedule. Some thoughts come to mind from using this:
- avx512 schedule reduces code size from 95k to 60k
- cuda schedule reduces code size from 261k to 194k
- No impact on performance
Updated to also affect the interpolate app. Reduces code size by 30% on CPU, and 20% on GPU. No impact on performance. Once again I just turned it off for every Func in the cuda schedule. @mcourteaux have you found a case where you actually want automatic loop partitioning inside a cuda kernel? Now that it's possible to turn it on manually, I'm really wondering if it should be off by default in shaders.
I can't remember offhand if there were. I'll check it out tomorrow. But in general, I think it is a sane strategy to not do loop partitioning in shaders and kernels. That was at least my intuition.
I guess that for very large convolutions, loop partitioning the reduction loop might be beneficial. An RDom over an 11x11 domain, for example. The amount of clamping overhead might become significant there, and it might be beneficial to get rid of it in the steady state. The subtlety here is that it might be cool to loop partition based on the gpu block size. That way you still avoid the branch divergence, i.e. all threads in every block of the outer ring of blocks run the clamping code, and all other threads run the optimized code. Although I think this is a different type of partitioning than what currently happens: you'd want a single per-block condition instead of the prologue, steady state, and epilogue loops. Something like the sketch below.
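An illustrative sketch in plain C++ (made-up loop bounds and names; this is not what Halide currently generates): the outer loops stand in for gpu blocks, the inner loops for gpu threads, and the branch depends only on the block indices, so every thread in a block takes the same path.

```c++
// blocks_x/blocks_y and block_w/block_h: grid and block dimensions (assumed defined)
for (int by = 0; by < blocks_y; by++) {             // gpu blocks in y
    for (int bx = 0; bx < blocks_x; bx++) {         // gpu blocks in x
        bool boundary_block = bx == 0 || bx == blocks_x - 1 ||
                              by == 0 || by == blocks_y - 1;
        for (int ty = 0; ty < block_h; ty++) {      // gpu threads in y
            for (int tx = 0; tx < block_w; tx++) {  // gpu threads in x
                if (boundary_block) {
                    // outer ring of blocks: clamp every tap of the RDom
                } else {
                    // interior blocks: steady state, no clamping
                }
            }
        }
    }
}
```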
Thinking about this, I think we are early enough to still tweak the API, as probably nobody besides the two of us has started using this.
For serial loops we generate (prologue, steady-state, epilogue), but for parallel loops like the gpu blocks loop we currently generate:
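Roughly like this (illustrative C++ analogue, not actual compiler output): the parallel loop stays a single loop, and each iteration selects between the simplified body and the general one.

```c++
// num_blocks_x, steady_min_x, steady_max_x: assumed defined
for (int bx = 0; bx < num_blocks_x; bx++) {  // conceptually the parallel gpu-blocks loop
    if (bx >= steady_min_x && bx <= steady_max_x) {
        // steady state: boundary condition simplified away
    } else {
        // prologue/epilogue iterations: general body with clamping
    }
}
```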
This happens in x and y, so if in a shader you might see something like:
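Again illustrative, not real output:

```c++
for (int by = 0; by < num_blocks_y; by++) {      // gpu blocks in y
    for (int bx = 0; bx < num_blocks_x; bx++) {  // gpu blocks in x
        if (by >= steady_min_y && by <= steady_max_y) {
            if (bx >= steady_min_x && bx <= steady_max_x) {
                // interior blocks: fully simplified body
            } else {
                // left/right edge blocks
            }
        } else {
            // top/bottom edge blocks: general body
        }
    }
}
```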
Here's a real example from a repeat edge boundary condition in cuda:
I don't really know what your pipeline/schedule was here, but I don't really understand what caused the if-logic to only check against the block index. In case that was a coincidence and not generally the case, then, in general, the if-condition on GPU architectures with thread blocks could make use of the alternative partitioning strategy I briefly hinted at in my last comment. All threads in a warp are going to be executing the same instructions. If the control-flow logic introduced by loop partitioning causes the warp to enter both branches, we pay the cost of both branches for that warp. A way to remedy this branch divergence within the warp is to make sure that all warps in the block follow the same control flow. You could probably achieve this relatively easily with something along the lines of:
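For example (hypothetical names, just sketching the idea): make the condition depend only on values that are uniform across the block, so no warp ever sees both sides of the branch.

```c++
// block_id_x/block_id_y and num_blocks_x/num_blocks_y: assumed defined per launch
bool interior_block =
    block_id_x > 0 && block_id_x < num_blocks_x - 1 &&
    block_id_y > 0 && block_id_y < num_blocks_y - 1;
if (interior_block) {
    // simplified body: no clamping of the taps
} else {
    // general body: clamp every tap
}
```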
It is generally the case. It checked against the block id only because automatic loop partitioning partitions the outermost possible loop that will remove the boundary condition, and the block loop is outside the thread loop.
Sugar proposed in dev_meeting: never_partition(variadic list of vars...) |
y wasn't being partitioned, but this more clearly says "I'm optimizing for code size"
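For reference, the sugar would look something like this (hypothetical Func/Vars):

```c++
// Equivalent to f.partition(x, Partition::Never).partition(y, Partition::Never)
f.never_partition(x, y);
```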
Super cool to see that there is some real use already for this scheduling directive! 😄 It's satisfying to see code size reductions at the same performance.
Seeing these code size improvements, I wonder what the impact would be on your amazing work on the median filters, where compile times and code size started skyrocketing. I don't have access to the paper, so I can't eyeball the pipeline, and so I don't even know if Halide was doing partitioning.
```
@@ -102,21 +102,36 @@ class LocalLaplacian : public Halide::Generator<LocalLaplacian> {
             // Nothing.
         } else if (get_target().has_gpu_feature()) {
             // GPU schedule.
-            // 3.19ms on an RTX 2060.
+            // 2.9ms on an RTX 2060.
```
Interesting. If loop partitioning did not impact performance, then what did in this PR? Newer LLVM? Newer CUDA driver? Or the non-bit-exact change from above?
In another branch, I found myself wanting to partition an inner loop in the tail case of an outer loop, because that tail case could be quite large, and it seemed reasonable to me that Always should do this. I added sugar for always too, for that reason.
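Presumably something along these lines (hypothetical Func/Var names):

```c++
// Partition xi even when it ends up inside the tail case of the outer loop.
f.always_partition(xi);
```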
How do I interpret this? Like:
I must say I don't have a good understanding of the tail strategies in Halide. Tail strategies are for when a split loop does not factor perfectly into a number of iterations for the outer loop and a number of iterations for the inner loop. A tail gets inserted after the outer loop that handles the remaining (modulo) iterations. However, partitioning exists because of potential body simplification in the case of BoundaryConditions, which happens all around the domain of the input. I don't understand how a tail and loop partitioning interact. Regarding the modified test that now expects 9 zones, how do I count those?
This way I only count 7, and I really don't understand. I think a better description of what is happening and expected in that test would be nice.
Loop tails in Halide are implemented using loop partitioning rather than by adding an additional scalar loop at the end. For GuardWithIf, for example, we just generate a single loop nest that looks like this:
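For a split by 8, a C++ analogue of that loop nest might look like this (illustrative names, not the actual IR):

```c++
for (int xo = 0; xo < (extent + 7) / 8; xo++) {
    for (int xi = 0; xi < 8; xi++) {
        int x = x_min + xo * 8 + xi;
        if (x <= x_max) {  // likely(...) in the real IR
            // loop body
        }
    }
}
```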
and then rely on loop partitioning to separate the tail case from the steady state. In some situations this will simplify down to a scalar loop that does width % 8 iterations, but in others you'll get different code. E.g. if xi is vectorized and you're on an architecture that has predicated store instructions, it'll just do a single predicated vector computation. In general, scalar loop tails are bad for performance, because they're not vectorized, and they can be large on architectures with wide vectors, so we try to avoid them. If you have a tail strategy and a boundary condition, then the prologue handles the region before the start of the boundary, the steady state handles the interior of the image up to some multiple of the split factor, and the epilogue handles the region past the end of the boundary together with the loop tail case. The 9 zones in the test are these zones:
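Schematically (the diagram is illustrative, not copied from the test), that's the cross product of (prologue, steady state, epilogue) in x and y:

```
+------------+------------+------------+
| y prologue | y prologue | y prologue |
| x prologue | x steady   | x epilogue |
+------------+------------+------------+
| y steady   | y steady   | y steady   |
| x prologue | x steady   | x epilogue |
+------------+------------+------------+
| y epilogue | y epilogue | y epilogue |
| x prologue | x steady   | x epilogue |
+------------+------------+------------+
```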
With Partition::Auto, we don't partition the prologue and epilogue in y, so you get this instead:
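Schematically again (illustrative):

```
+--------------------------------------+
| y prologue (x left unpartitioned)    |
+------------+------------+------------+
| y steady   | y steady   | y steady   |
| x prologue | x steady   | x epilogue |
+------------+------------+------------+
| y epilogue (x left unpartitioned)    |
+--------------------------------------+
```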
In the case I was wrestling with, I just had partitioning from loop tails, not from boundary conditions, so I only had an epilogue, not a prologue. I was getting code with 3 cases that divide the domain like this:
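Schematically (an illustrative sketch, proportions not meaningful):

```
+--------------------------+---------+
|                          |         |
|       steady state       | x tail  |
|                          |         |
+--------------------------+---------+
|    y tail (x unpartitioned)        |
+------------------------------------+
```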
but I want code with 4 cases that divides the domain like this:
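I.e. something like (again an illustrative sketch):

```
+--------------------------+---------+
|                          |         |
|       steady state       | x tail  |
|                          |         |
+--------------------------+---------+
|    y tail, steady in x   | x and y |
|                          |  tail   |
+--------------------------+---------+
```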
Thank you for the clarification that tail strategies are basically implemented by modifying the IR such that loop partitioning will trigger. This is the case for GuardWithIf, then. I think this explains confusion I had a long while ago: I was trying to get small code, and was using GuardWithIf.
Failures unrelated. Merging.
Improve code size and compile time for local laplacian and interpolate apps
This reduces compile time for the manual local laplacian schedule from 4.9s to 2.2s, and reduces code size from 126k to 82k.
It's all from avoiding loop partitioning. Most of the reduction comes from avoiding a pointless boundary condition in the output Func, which triggered partitioning. A smaller amount comes from using RoundUp and Partition::Never. The Partition::Never calls are responsible for a 3% reduction in code size and compile times by themselves.
This has basically no effect on runtime. It seems to reduce it very slightly, but it's in the noise.
There's also a drive-by fix to Generator.h, exporting the Partition enum class.