JIT: loop interchange optimization #4358
Have you seen a compiler which implements the optimization described in that StackOverflow post? 😄
@mikedn, ouch! Like the Japanese spirit says, "if nobody has done it, I must do it" ㊗️ [Unwarranted] I think that while Intel's compiler is good at loop-interchange optimization, it is also leading in the LIH arena, but I am not sure what kinds of areas are fuzzy and most challenging (research-wise).
Yes, Intel's compiler does this optimization; all others vectorize the inner loop and don't do anything else interesting. The interesting thing about the Intel compiler is that it does this only with /O2, not with /O1. That may imply that the compiler wasn't specifically performing this optimization, and that the result was instead a fortunate by-product of other optimizations. Loop interchange requires data dependence analysis, and the compiler might have done that for autovectorization purposes.
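For readers unfamiliar with the transformation, here is a minimal C# sketch of what loop interchange does (hypothetical names; not code from this issue or the StackOverflow post). Swapping the loop order turns a strided, cache-hostile traversal of a row-major 2D array into a sequential one, without changing the result:

```csharp
using System;

class LoopInterchangeSketch
{
    // Column-major traversal: the inner loop strides across rows,
    // touching a far-apart memory location on every access.
    public static long SumColumnMajor(int[,] a)
    {
        long sum = 0;
        for (int j = 0; j < a.GetLength(1); j++)
            for (int i = 0; i < a.GetLength(0); i++)
                sum += a[i, j];
        return sum;
    }

    // After interchange: the inner loop walks memory sequentially,
    // which is the layout loop interchange would produce here.
    public static long SumRowMajor(int[,] a)
    {
        long sum = 0;
        for (int i = 0; i < a.GetLength(0); i++)
            for (int j = 0; j < a.GetLength(1); j++)
                sum += a[i, j];
        return sum;
    }

    static void Main()
    {
        const int n = 500;
        var a = new int[n, n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i, j] = i + j;

        // Both orders compute the same sum; only locality differs.
        Console.WriteLine(SumColumnMajor(a) == SumRowMajor(a)); // prints True
    }
}
```

Proving that the swap is legal is exactly where the data dependence analysis mentioned above comes in: the compiler must show no iteration reads a value written by an iteration that the new order would move after it.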
I don't think this is a good fit for CoreCLR. The runtime's job is to take an intermediate representation (IR) and make it target specific as quickly as possible: JIT = Just In Time (compilation). Additionally, the IR has no concept of loops or other structured programming techniques. It only understands branches (goto). Combine this with the runtime's goal of supporting multiple languages, and it becomes very difficult to determine what is a candidate for further optimization and what should not be touched due to side effects that may occur. In the example given above, it is entirely possible (from a JIT point of view) that another thread is updating the array, and so the order of operations is important. Given the above, this type of optimization is better suited to Roslyn, etc. Additionally, compiler optimizations should not be viewed as a substitute for human programming skill.
I completely disagree with OtherCrashOverride. In the 90's, developers were instructed not to try to outsmart the compiler: just write readable code and let the optimizer take care of the details of making it fast. Now, we appear to be back to hand-optimizing code because the JIT'er isn't smart enough. Also, the argument that another thread might modify the array is flawed. Optimizing compilers simply don't work that way... if they did, they could never do "loop fusion", for example. Finally, the idea of "JIT = Just In Time (compilation)" is, IMHO, the biggest problem. I'd bet that the majority of code in use today runs on servers and does not care how long it takes to compile. Result: the mandate is to have a "sub-optimizing" compiler.
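For context, the loop fusion mentioned above merges two adjacent loops over the same range into one pass. A hedged C# sketch (hypothetical names, not from this thread) of why fusion relies on the same single-thread assumption as interchange:

```csharp
using System;

class FusionSketch
{
    // Before fusion: two separate passes over `a`.
    public static void Unfused(int[] a, int[] b, int[] c)
    {
        for (int i = 0; i < a.Length; i++) b[i] = a[i] * 2;
        for (int i = 0; i < a.Length; i++) c[i] = a[i] + 1;
    }

    // After fusion: one pass. The transformation is legal only under
    // the optimizer's standard assumption that no other thread mutates
    // `a` between the first loop's read of a[i] and the second's.
    public static void Fused(int[] a, int[] b, int[] c)
    {
        for (int i = 0; i < a.Length; i++)
        {
            b[i] = a[i] * 2;
            c[i] = a[i] + 1;
        }
    }

    static void Main()
    {
        var a = new[] { 1, 2, 3 };
        var b1 = new int[3]; var c1 = new int[3];
        var b2 = new int[3]; var c2 = new int[3];
        Unfused(a, b1, c1);
        Fused(a, b2, c2);
        Console.WriteLine(b1[2] == b2[2] && c1[2] == c2[2]); // prints True
    }
}
```

If concurrent mutation had to be assumed, the fused version could observe an `a[i]` the unfused version never would, which is why memory models let compilers treat non-volatile reads this way.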
Agreeing with @TPSResearch. We've always known that JITs are kind of a dumb idea, and this is just another reiteration of that principle. Yes, there's a place for them, particularly in debugging and testing when the goal is to iterate as quickly as possible, but for deployment of a working product, a good AOT compiler will beat a good JIT hands down, every time. An optimization like this is probably not suitable for the JIT, but it should be shunted over towards LLILC, which has the development of an AOT compiler as its long-term goal.
If you are looking to get this in LLILC then wouldn't it really be LLVM where the optimization should be performed? |
There will always be a difference between code written for clarity and maintenance and code written for performance. The main thing that has changed since the 90's is first-class Ahead-Of-Time (AOT) compilation support, since Windows on ARM does not allow execution of JIT code (W^X). AOT is where we can spend as much time as needed performing deep optimizations in an off-line process. It's important that the JIT do its job as fast as possible: a server environment may not care; use cases outside of that will. CoreCLR serves many domains outside of servers and languages outside of C#.

I don't think anyone is under the impression that this is a trivial optimization. Should those who write performant code be penalized by a slower JIT that will not find anything to optimize? With AOT the penalty is an O(1) cost; with JIT it's an O(n) cost. In the end, what is being discussed is outside the scope of JIT: it is modification of IL, not compilation of it.

Someone should put forth a pull request for the optimization feature so that it can be tested, both to answer how much of a degradation it causes in JIT time and to show the real-world performance increases that can be expected. It's entirely possible that it will have no statistically meaningful benefit for the majority of existing, real-world code.
Right. The takeaways here that apply to many JIT optimization issues are:
@Eyas If "AOT is coming to C# in the form of .NET Native," that really doesn't do much for people using other CLR languages, because it doesn't appear that they're open-sourcing the .NET Native toolchain. So that's kind of an unhelpful observation.
I believe this type of optimization is outside the scope of both JIT and AOT. This is something for Roslyn. JIT and AOT deal with IL, which is analogous to dealing with processor assembly code. Compilers typically optimize at a higher level. Optimizing in Roslyn is analogous to optimizing in a C/C++ compiler. It makes more sense because it has a better view and understanding of what is going on with the code. However, I think this too is probably better served by a compiler warning, Visual Studio 'lightbulb' suggestion, or an offline tool rather than forcing it on everyone. The author of the code should be made aware and also given the option to alter the source code or ignore it. The option to ignore it becomes important when there is a purposeful order of operations due to the algorithm being multi-threaded.
Meanwhile, should this be a candidate for Roslyn's IL generation? @jaredpar, can you please suggest what the plan of attack should ideally be here?
Easy test, can you prove the array is not being mutated? |
@jasonwilliams200OK @OtherCrashOverride The problem with saying "oh, just do it in the front-end compiler" is that, again, there are a lot more than just one (or even two) CLR languages! "Just do it in the compiler" means we're back to the bad old days when every compiler needs to reinvent every wheel every time. Wasn't getting away from that the entire point of having a common language infrastructure in the first place?
It's probably worth mentioning that this is something the LLVM back-end (dotnet/llilc) can provide: Mono has implemented LLVM backend support already. They also retain a classic JIT, due to the fact that an LLVM-based JIT adds considerable time to the JIT process, which is undesirable to some customers.
That is completely irrelevant. Optimizers do loop optimizations on IR, and have done so for a long time.
Reads do not constitute side-effects and as such they can be reordered and/or eliminated. The JIT already does this kind of stuff.
No, it is not something for Roslyn. Roslyn does very few optimizations on its own and lacks the infrastructure to perform optimizations like the one discussed here. It is certainly in scope of AOT and it might also be in scope of JIT in certain circumstances. At the end of the day, .NET Framework does partial AOT and has done that since day 1. It's called NGEN.
That is incorrect. Compilers actually optimize at a lower level. Optimizing in Roslyn isn't analogous to optimizing in the C/C++ compiler: Roslyn is the equivalent of the frontend, and the C/C++ compiler, just like Roslyn, does little to no optimization in the frontend. The better view is simply not there, because the frontend simply lacks the infrastructure needed by a lot of modern code optimizations.
RyuJIT already does loop invariant code motion. Neither RyuJIT nor LLVM does loop interchange (well, it's quite possible that LLVM does it, but it still doesn't optimize the example code as mentioned in the StackOverflow post).
Great! So we can close this issue. |
@mikedn, can you point me to the place in the RyuJIT code where this optimization takes place? I will try to figure out what needs to be done to make loop-interchange optimization possible, as part of the solution from the linked answer is loop interchange (the very first step). To my knowledge, LLVM, Clang and GCC do not provide loop-interchange optimization either. Neither does MSVC. Intel has invested a heavy amount of resources to make it possible in their compiler. Some areas in LIO are still in the research phase, since the heuristics involved are quite prone to backfire given a little mistake.
@jasonwilliams200OK For loop invariant code motion check |
Just wanted to ask something: how is this related to LICM? I see it more as an opportunity to do loop interchange followed by dimensionality reduction of I-loop. |
Once you do loop interchange you might rely on LICM to hoist the |
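One way to picture interchange followed by LICM is the sketch below. It is a hedged illustration with a hypothetical jagged array (assumed rectangular), not the actual code under discussion: once `i` becomes the outer index, the `a[i]` load is invariant in the inner loop and LICM can hoist it.

```csharp
using System;

class LicmAfterInterchange
{
    // With j as the outer index, a[i] is re-loaded on every inner
    // iteration, and nothing is invariant in the inner loop.
    // Assumes a rectangular jagged array (all rows equal length).
    public static long SumJOuter(int[][] a)
    {
        long sum = 0;
        for (int j = 0; j < a[0].Length; j++)
            for (int i = 0; i < a.Length; i++)
                sum += a[i][j];
        return sum;
    }

    // After interchange, the a[i] load depends only on the outer
    // index, so LICM can hoist it out of the inner loop.
    public static long SumIOuter(int[][] a)
    {
        long sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            int[] row = a[i]; // hoisted: invariant in the inner loop
            for (int j = 0; j < row.Length; j++)
                sum += row[j];
        }
        return sum;
    }

    static void Main()
    {
        var a = new int[3][];
        for (int i = 0; i < 3; i++)
        {
            a[i] = new int[4];
            for (int j = 0; j < 4; j++) a[i][j] = i + j;
        }
        Console.WriteLine(SumJOuter(a) == SumIOuter(a)); // prints True
    }
}
```

This is the sense in which interchange can be an enabling transformation: it does not remove work by itself, but it exposes invariants that later passes like LICM can exploit.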
Given that loop interchange, loop unswitching, and loop reduction are likely out of the scope of the JIT in the near to medium term (even if we might want them, aspirationally), I'm going to close this issue. |
Source: http://stackoverflow.com/a/11303693/863980
From the link above it is evident that the following two methods are logically equivalent, due to loop-invariant code motion:
and
The driver (`Main`) method looks like:

Diffing the produced disasm of the `UnHoisted` and `Hoisted` methods results in:

The execution time of `UnHoisted()` is much higher than that of `Hoisted()`.
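Since the original method bodies are not reproduced above, here is a hypothetical pair (not the actual code from the StackOverflow post) illustrating the kind of `UnHoisted`/`Hoisted` pattern being compared:

```csharp
using System;

class HoistingSketch
{
    // Recomputes the loop-invariant product x * y on every iteration;
    // LICM should be able to produce the Hoisted form from this one.
    public static long UnHoisted(int[] a, int x, int y)
    {
        long sum = 0;
        for (int i = 0; i < a.Length; i++)
            sum += a[i] * (x * y); // x * y is loop-invariant
        return sum;
    }

    // The hand-hoisted version: the invariant is computed once.
    public static long Hoisted(int[] a, int x, int y)
    {
        long sum = 0;
        int k = x * y; // hoisted out of the loop
        for (int i = 0; i < a.Length; i++)
            sum += a[i] * k;
        return sum;
    }

    static void Main()
    {
        var a = new[] { 1, 2, 3, 4 };
        // Both compute (1+2+3+4) * 21 = 210.
        Console.WriteLine(UnHoisted(a, 3, 7) == Hoisted(a, 3, 7)); // prints True
    }
}
```

The reported timing gap between the two forms is what motivates asking whether the JIT's LICM could be taught to close it automatically.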
Do we have a starting point for this kind of optimization, such as some basic heuristics already in place that need to be enhanced, or would it require an implementation from the ground up?
Cc @mikedn, @omariom, @cmckinsey
category:cq
theme:loop-opt
skill-level:expert
cost:extra-large