A method to control FMA execution #310
Yes, please. The standard requires that "any expression in parentheses shall be treated as a data entity" (10.1.8 in F'202X). Also, sometimes one really doesn't want FMA.
Thanks @klausler, if the computation in parentheses is meant to be computed separately, then perhaps that's the best news. I created a repository to test this: https://github.com/marshallward/fma_test

On GNU 13.1.1, asm inspection shows that the parenthesized cases are evaluated separately. On Intel 2021.7.1, the functions for each case produce identical asm output. For Nvidia 22.5, the asm is again identical for all cases.

GNU is doing exactly what we want here. If this just means that I have to take the discussions to the vendors, then that would be a lot simpler.
While it's useful to have some way to control FMA code generation in Fortran, IMHO this should be left to the Fortran processor (compiler) side, for several reasons.

Of course, someone also needs to draft the proposal before the language committee can proceed on discussing this.
@wyphan Thanks very much for giving me a chance to clarify this point. We are not asking for an explicit FMA instruction, nor are we asking for the ability to direct an expression to use FMAs. In fact, this is the opposite of what we want. We only want clarity on the order of rounding. To summarize the comments above: there is no issue with a processor deciding to use an FMA when computing an expression. I will respond to your points below, but I believe we are in closer agreement than perhaps suggested by your comment.

Again, to be clear: I am not suggesting that an "FMA" operator be added to the standard. We only want an explicit method to clarify the order of rounding. Thank you very much for responding, since it gives me a chance to clarify this point.
I've gone back to check this more carefully, and Intel appears to do what I expect if I use the appropriate flags. It seems that this is not the issue that I thought it was, so I think that I will close this issue and raise it with Nvidia. Thanks to both @klausler and @wyphan for their feedback.
I've updated the repository so that it is a little more structured: https://github.com/marshallward/fma-op-order I will try to continue tracking the issue there as I learn more.
After some internal discussion, we've decided that it would be beneficial to continue exploring options to permit more predictable use of FMAs. I was so caught up in the very real problems with the expressions above that I was too hasty in closing this issue.

As long as we do not know what the compiler would do, we are left uncertain about the outcome. To echo the concerns by @wyphan, I do not think that a dedicated FMA function or operator should be introduced. It would deviate too far from the "formula translation" foundations of Fortran. But I wonder if some kind of rules could be considered, where an expression would always apply an FMA where possible (assuming the processor permits it) and (for example) would always be applied left-to-right. And if the processor does not support FMA, then it would fall back to separate multiply and add operations. For example, GNU and Nvidia currently apply the FMA regardless of parentheses.

Obviously this may potentially cripple performance in some cases, but so can parentheses. Warnings can inform about a suboptimal ordering. And answer-changing optimizations can always be turned on when appropriate. I admit that this could be seen as far too harsh from a compiler developer's perspective, and perhaps even pointless from most users' perspectives. But it's the sort of thing that could allow me to convince operational forecasters that FMAs are a valuable optimization.
Why not use one set of parentheses to force a single FMA opportunity here?
Sorry, I was trying to show how an expression like that might be affected by some kind of left-to-right rule. Parentheses would override such a rule. But perhaps that is a distraction from the discussion.
These simple cases can probably be handled with some rules. But how about other sources of bit-level variation, such as vectorization?
FMA is a concern only for expressions involving addition and subtraction of real multiplications, and this specific issue deals with such expressions that have multiple multiplications. How does it apply to any of the situations that you listed?
To follow on @klausler: the issues described above can be circumvented in various ways. We are focused on how to make the evaluation of such expressions predictable.

We can turn the nondeterministic choice of FMA into a deterministic one. I am hoping to get us to a place where we can use FMAs in our forecasts, but unfortunately the current state is to simply disable them. (The particular example above is just one instance.)
F'202X, coming to your local compiler sometime in the next few years, has IEEE_FMA() to force the use of FMA, but it will (probably? the standard doesn't say) fail if the target processor doesn't support FMA operations. If you want explicit syntax for FMA, here it is.
@klausler to answer your question: I am reacting to the requirement from the very top of this issue, namely bit reproducibility.

And the list I provided shows all the ways that bit reproducibility gets broken (based on my understanding), and I am asking how that can be addressed to recover bit reproducibility. For example, vectorization: you need to choose a different width for different platforms (and thus get different bit results), or you can disable vectorization altogether (then you can get predictable bit results).

If the goal is not to address any of the other issues that I listed, then I don't see the point of trying to get FMA exactly right, since that alone can't get any real-world codes that use things from my list to be bit reproducible. So if you can first motivate the big picture and the main goals of this "bit reproducibility" effort, then it will help me understand how to solve these little steps, such as FMA.
I will preface my reply by saying that this is already a requirement for us, and one which is already satisfied by every operational weather and climate model which I have used. So this isn't an aspirational or hypothetical thing.

To answer your earlier question @certik, we simply don't allow most of the operations you suggest. This isn't the place to get into details, but we either use them in situations which are reproducible, or implement alternatives when they are not. (I presented on some of this at a meeting which we both attended; the presentation is here.)

Vectorization is a challenge, one on par with FMAs, and you are right to raise it as another example. I don't believe we have enough control to guarantee reproducibility. For now, let me just say that we do not take full advantage of vectorization. But this would deserve a separate issue.

If you are asking why bit reproducibility is a concern, then the short answer is that we need to replicate our forecasts - even those containing known errors - and also need them as reference solutions for new parameterizations which may be introduced after the runs are completed. (Output is stored at a lower time resolution and is often insufficient for analysis.)

I am raising FMAs because they threatened one of the few things left to us - commutativity - and I felt that I could articulate the problem. I also worry that we will simply disable them and lose out on significant performance. I'm simply trying to avoid that scenario if at all possible.
No, I think bit reproducibility is very useful for the reasons you stated, to catch bugs and ensure that your implementation agrees with reference results. Once it agrees, then you can always enable all optimizations that will change bits, in exchange for performance.

Btw, when I test LFortran against GFortran to ensure that we can compile something correctly, I also require bit-to-bit reproducibility; it has already caught so many bugs, and it's easy to catch them (if any bit disagrees after an operation, it's a bug). How do you tackle MPI decomposition --- have you figured out a way to do it that is always bit-to-bit reproducible for a given N MPI ranks (or even the same bits for any N)?

Overall, defining or figuring out how to make the compilers execute a large program exactly, to deliver bit-to-bit exact results, is helpful. So yes, the solution using parentheses should work, I think.

@marshallward can you create a "main" issue for bit reproducibility, and then individual issues for each subset, and link this "fma" issue there? Then let's discuss each case separately. There will be "themes" or "approaches" that we can reuse for each of these individual issues.
Most of the work in our models is field updates, which are reproducible under decomposition as long as domain halos are updated. Summations and other reduction operations in most models that I have seen gather the values onto a process (either root or all) and do an ordered local sum. Our model uses something like a fixed-precision solution over multiple integer fields (default range is 10^-42 to 10^41; see paper), but the idea is similar. Neither is very fast, and the poor performance is only tolerable because it is an infrequent operation.

I may be missing some details, but that is most of it. You're welcome to reach out if you have other questions.
I agree that parentheses help to solve many of these problems, including the initial example above. I will check with others and see if this is a path forward for us.
Good idea, I would appreciate the opportunity! I would not expect the language to accommodate all of our needs, but it could be helpful to document the most challenging use cases. |
@marshallward I see, thanks for the clarification about the MPI. Makes sense to me. Essentially you trade some performance in exchange for bit reproducibility.
Perfect, go ahead and do that. I think that most of your needs can be accommodated by a compiler or compiler options. Compilers should coordinate on many such options; in a way this is outside of the standard, but I think this site is great for that and we should pursue it.
We require bit reproducibility of our models, which in turn requires a predictable order of operations for our floating point arithmetic. For the most part, parentheses are sufficient to define the order of operations, but this becomes more challenging when fused multiply-adds (FMAs) are enabled.
FMA operations are almost ubiquitous now. When platforms report their peak performance, it relies heavily on FMAs. Every modern platform accessible to me includes FMAs as an atomic, vectorizable operation. For the foreseeable future, they would seem to be a fundamental operation, on par with addition or multiplication.
The challenge for us is the difference in accuracy. If `rn()` is the rounding function, then FMAs increase the accuracy of the operation. For the expression `f = a*b + c`, separate multiply and add operations compute `f = rn(rn(a*b) + c)`, while an FMA computes `f = rn(a*b + c)`. That is, FMA eliminates one of the rounding operations, thereby changing its bit representation.
For simple `a*b + c` expressions, the FMA is unambiguous. But consider the expression `f = a*b + c*d`.

With FMAs disabled, the order of operations is unambiguous: compute `a*b` and `c*d` independently, and sum the results. That is, `f = rn(rn(a*b) + rn(c*d))`.

However, if FMAs are enabled, there are two possible results: `f = rn(a*b + rn(c*d))` or `f = rn(rn(a*b) + c*d)`.
But an expression like `a*b + c*d` offers no apparent way to select either operation. I would like to propose that some kind of parsing rules be considered to make this an unambiguous operation.
I am not entirely sure how this could be done - which is why I am posting this as a discussion - but one starting point would be to honor parentheses, so that an expression such as `(a*b) + c*d` rounds `a*b` before the addition.
I have tested this in a few compilers, and only GNU appears to do something like this (Edit: GNU and Intel can both do this, with appropriate flags). The others seem to ignore the parentheses and simply apply the FMA in all cases. (Compiled with -O0 but FMA enabled; perhaps -O gets overridden here...? I can gather details if there is interest.)

I don't know how far one can carry this; perhaps there are some problems in more complex expressions. But I would like to hear any thoughts on this proposal, especially from compiler developers.