A miscompilation bug in LICMPass (concurrency) #64188
This is basically the same issue discussed in the review of https://reviews.llvm.org/D153932; our escape analysis doesn't properly account for atomic fences. (See also https://reviews.llvm.org/D137707.) I'm still a little surprised to see it actually show up like this. CC @jdoerfert @aeubanks
The example here doesn't use fences though. Is the example valid as given? If this is just about fences, then it goes straight on the "technically a bug, but we're not going to fix it" pile as far as I'm concerned.
I didn't mean "fence" instructions specifically; that's just what I was thinking about looking at the patch I mentioned. The example is valid, as far as I can tell. The key construct here is the "relaxed" store of the address. Note that if the store of the address is not atomic, another barrier would be necessary, and if the store of the address is "release", it naturally enforces its own ordering. So the miscompile is only visible with a "relaxed" store of a newly allocated pointer, I think. As a practical matter, it seems relatively unlikely anyone would stumble over this by accident; lock-free data structures that allocate memory would normally either use a "release" store of the pointer, or use some other form of locking. I'm a little uncomfortable assuming nobody will use atomics this way, but the potential performance regressions from restricting escape analysis around calls are also sort of nasty.
For LICM, we'd need to peel one iteration for this to be handled. FWIW, atomics do not have to show up anywhere for this to appear. I'm not sure the situation is really common enough to show up if we forbid LICM from doing this when the pointer of the final store escapes later.
An ad hoc remedy for this particular example is to move
This could be made legal by adding a release fence before the address gets leaked, right? So this kind of code motion becomes a trade-off, where the cost of the fence is weighed against the benefit of moving the stores around.
This isn't really about LICM in my opinion; it's some sort of combination provenance + concurrency issue (alias analysis will tell you that your locals haven't escaped before you pass them to extern functions; but this isn't true in a world where concurrency can cause time travel). Brief example with no loops: https://gcc.godbolt.org/z/4qKs4TYG3

I think the right fix for this is at the language level rather than LLVM though (and that the language should retroactively bless LLVM's behavior); assuming non-escaped locals don't alias random pointers passed to you seems like it must be fundamental.

Edit: fixed typo in godbolt link
The new testcase is interesting because there is no release barrier... which means no other thread can access the variable "local". But the thread that allocated it is fine, I guess? There's a happens-before relationship between the accesses (and the deallocation) because they're on the same thread.

I can imagine a language rule somehow trying to address this, but it would be weird. We'd have to divide up memory into "single-threaded" and "multi-threaded" variables. A variable starts out "single-threaded", then becomes "multi-threaded" when its address is stored to memory visible to other threads (or the memory where the address is stored becomes visible to other threads, or the address is converted to an integer). And then we require a release-acquire edge between the store of the address, and any load that accesses the address. Or something along those lines; the key here is that "publishing" the address of a variable becomes a special operation.

The current C++ model, by contrast, only requires that there's a release-acquire edge between the allocation/initialization of the variable and any accesses to the variable's value from another thread; the address itself is irrelevant.
This doesn't help; a "release" only has an effect if there's an "acquire" on the other side.
The thing I was thinking of is applying something like a happens-before rule not just to the accesses, but also to pointers' provenances. So e.g. in the original example, we declare that at thread2's load of p, even though p's value was obtained via a valid happens-before chain, its provenance was not. This sort of captures the notion of publication in a way that meshes with the rest of the memory model. FWIW, the C++ committee concurrency study group will talk about this in the next couple of days and see if anyone has anything clever to say.
The simplest way to "solve" the issue is just to forbid relaxed loads/stores of pointers: without those, you can't do anything weird (ignoring inttoptr conversions; not sure how we'd handle those). Saying that relaxed loads/stores of pointers don't have provenance is basically the same thing as forbidding them: a pointer without provenance is just an integer. If you want to allow them, you need some way to transfer the provenance alongside the pointer. The notion of "publishing" is a way out of this conflict: it means you need a barrier at exactly one point, the point of first publication. After that, the provenance transfers like you'd normally expect. This minimizes the amount of code made undefined, and keeps existing compiler optimizations working.
Some (evolving) slides for an upcoming C++ concurrency group presentation are here. So far, I like David's suggestion, as I understand it. Can we paraphrase it as disallowing copying of a pointer unless the lifetime-start of the object it refers to happens-before the copy?
The C++ memory model already forbids accessing an object before it's created, or after it's destroyed. The issue here is atomic operations that happen between the object being created and the object's address "escaping". LLVM optimizations assume such atomic operations are irrelevant... but the C++ memory model doesn't attach any special significance to the point where the pointer escapes. So the fix here is to make the point where the pointer escapes special: there has to be a happens-before edge between the escape point, and any use of the address in other threads.
@davidtgoldblatt isn't yours a completely different issue? It doesn't involve any release/acquire. It seems more like the example from this recent talk. In your example,
Usually the provenance is part of the pointer value itself. So how would this work, where does the gap in the provenance hb-chain come from?
Ah, I was fooled by the
The issue is that the "escape point" is not really a well-defined notion, is it? As compiler analyses get better and better, they may be able to push the "escape point" further and further down program execution. How did you plan to define "escape point"?
I think my take is that the fundamental issue here is that the presence of concurrency can "move up" (in sequenced-before) the effective time when some thread's private-seeming data has escaped. That's done with load buffering in my example, but can also happen with store reordering (as in the example here). In this view, the release/acquire synchronization in the original example is a detail -- it's just a way to give defined behavior to our ability to detect that early-escape, but it's not really the cause.
I think I would say something like "at T2.3, the load should return a pointer value lacking provenance, because the store of the pointer value does not happen before the load". This isn't quite workable (because of standalone fences), but something in that direction at least feels plausible to me as a language-level fix.

Maybe more succinctly: the root cause of all these issues is that some line of code gets a pointer value whose bits are equal to a valid pointer, but which it shouldn't be allowed to use. That's the same situation that we want provenance rules to solve; you have to come by your pointer values honestly. Using relaxed accesses to pass them to other threads is cheating, in the same way abusing the one-past-the-end-of-an-array exception to get a pointer that happens to be equal to some other object's address is cheating.
That makes sense, thanks. Though I do worry it can be fiendishly difficult in practice to reason about this... except ofc by just always using release-acquire when pointers are involved.^^ However, does that capture all cases of this issue? Here is another example, which I got from a colleague (Chung-Kil Hur):
This should be equivalent to using release stores / acquire loads on

I guess you could say that at T2.2 there is no happens-before to T1.5 since relaxed doesn't induce HB -- but that would be basically saying fences don't work when doing the standard message passing idiom with pointers. Which seems bad? I suppose that's what you meant when you said your proposal isn't quite workable because of fences.
The point where the pointer becomes accessible to another thread. This is, as you note, basically impossible for the compiler to compute precisely; it's a dynamic property of the code at runtime. But the compiler doesn't need to compute it precisely; a conservative approximation is sufficient. Standard compiler escape analysis is such an approximation.
Even at runtime, this would be a very strange notion to introduce into the very definition of what is Defined Behavior and what is not. How is it defined? A global scan of all memory transitively reachable through all pointers in a thread's stack? That's a hell of a property to reason about if I want to justify precisely why my program is correct.
Not quite sure. :) This sort of property isn't completely unprecedented; proposals for defining inttoptr are sort of similar.
The proposals for defining inttoptr that I am involved in (mostly on the Rust side) don't look anything like that. ;)
Yeah, exactly. (I mean, not limited to just fences of course -- any sort of opaque library synchronization could work too.) But that at least feels like something we could put some words around, because now we have some synchronization primitive to hang it off of. Maybe by introducing "provisional provenance": it starts out acting like empty provenance but can get promoted if certain synchronization conditions are met.

Something like: if there are an atomic pointer store S of value V and an atomic pointer load L, where L takes its value from S[1], then if S happens before L, L obtains V. If S does not happen before L, then L obtains V', which has the same address as V but a provisional version of V's provenance. On dereference of V', if there is some evaluation P such that the lifetime start of the object to which V refers happens before P, L happens before P, and P happens before the dereference, then the dereference acts as though it has V's provenance. Otherwise, it acts as though it has empty provenance.

The intuition being, we pretend that the provenance for some storage lives with the storage, being stored to at its lifetime start. At some unspecified point after we load a pointer value, we try to load its provenance from the storage and update our pointer value with it; these are the conditions we'd need to avoid UB in that case. I think this blesses LLVM's current behavior on the various weirdo test cases without rendering too much sane code UB.

[1] Not this exactly -- we probably need something about release sequences here.

I'm not 100% sure this could really be formalized because of the operational semantics / axiomatic semantics mismatch in provenance and the memory model, but OTOH we already have issues there anyways.
There's also the "there is some evaluation P" quantification, which doesn't seem a good fit for formalization -- it seems to require running hypothetical alternative executions. Can this be stated in terms of just the current execution and its memory model graphs?
Is it fundamental, though? With your example, neither GCC nor MSVC performs the problematic optimization, though Clang and ICC do. If I change the relaxed store to an opaque function call, then ICC stops performing the optimization, leaving Clang as the odd one out. In other words, other compilers seem to apply "non-escaped locals" optimizations only to locals that never escape, as opposed to identifying when each local escapes and applying the optimizations at points where it hasn't escaped yet. (In ICC's case, the relaxed store should be enough to block the optimization, but the fact that an opaque call does block it suggests that the relaxed-store case may be a relatively localized bug in ICC, as opposed to proving that ICC follows a general policy of "aliasing can't happen until after the escape", as LLVM currently does. Maybe. ICC is closed source, so I'm only speculating.)

Now to speculate even more loosely: if I had to guess what kind of scenario benefits most from this kind of optimization, I'd guess something like this: a long function contains a local that's manipulated many times, but mostly by value. However, at some point a pointer to it is passed to another, opaque function. In reality, the second function only accesses the pointer for the duration of the call (i.e. it could be

But even in such a scenario, LLVM's time-based escape analysis only provides a benefit until the first escape. After that, the local must be treated as forever aliased. That is a pretty limited benefit! It doesn't sound so bad to forgo it, especially if other C/C++ compilers already do so.

If this were Rust code, then under most circumstances the compiler should be allowed to treat the local as non-aliased both before and after the call. The specifics are complicated, and the spec-permitted optimizations are mostly unimplemented, but if they were implemented, I believe they would render time-based escape analysis mostly redundant for Rust code.
This is just an artifact of the C++ standardese weasel wording we use to avoid being precise about where a memory or synchronization operation happens (see e.g. here) -- not really meant as a quantification across multiple executions.
I should say, despite the discussion here, I'm not convinced modifying the standard is the right approach. The discussion here indicates that any modification we make would be confusing. And we don't know what the actual performance penalty is for just following the standard as it's currently written.
Yes. The relevant code is llvm::PointerMayBeCapturedBefore.
I don't see how local variables are any better off here. Sending a reference to another thread without a fence would require unsafe code, but it's not clear to me it would be undefined behavior.
You can directly translate the original example to Rust by replacing "malloc" with "Box::new()". |
Do you mean that in full generality, including the load-buffering-like examples? My impression was that everyone thought that those were too weird / harmful to be supported, while most people thought that the "establish release/acquire synchronization through some opaque function call" ones should remain correct per the standard (and that all we disagreed about was which set this example fell into).
You mean, specifically, the examples that don't involve any fences (#64188 (comment))? I hadn't really thought about it before... but if the standard could change something there, it might be helpful. Specifically, to remove the happens-before relationship between loads/stores on the same thread if the pointer operand for one of the operations comes from another thread. From the perspective of LLVM optimizations, it would be straightforward to make llvm::PointerMayBeCapturedBefore() treat function calls that aren't marked "nosync" as escape points for pointers that eventually escape. That solves the cases involving fences, but not the cases that only involve relaxed load/store.
In particular, we currently don't track whether a function contains any relaxed load/store ops, so if we need to track that, it would make the implementation a bit more complicated. The performance impact is 99% the same either way, though: the primary impact is treating unknown functions as an escape point for pointers that eventually escape.
I think, yeah, we should change the standard to make LLVM's current behavior allowed in those cases. I also think, though, that this specific case is more like the ones you describe as not involving any fences than it seems at first glance. I think the pattern we want to allow is the one you mentioned upthread, discussed in https://reviews.llvm.org/D153932. In the usages everyone agrees should be allowed, the pattern is:

Publishing thread:

Consuming thread:

In this example, though, the underlying pattern is:

which is the same root issue as in some of the unsynchronized, relaxed-only examples we're looking at elsewhere. I'm OK jettisoning it to make alias analysis (in the sense of "your arguments don't alias storage you allocate") more viable.
It gets subtle fast -- I think you need to consider not just function calls, but things that happen after your function returns, so you can't just infer some nosync-variant in a bottom-up call graph traversal (without giving up on sret/noalias optimizations).
I was thinking that between the allocation of a variable and the first atomic operation after the allocation, we could treat the variable as "non-escaped" relative to any other memory accesses. That said, in the fenceless variants, I guess it's possible to load the address of a variable before the variable is actually allocated. Which would cause problems for alias analysis.
The problem is that it's hard to establish what, exactly, it means to "get a pointer". For the case where the load and the store are on the same thread, I think we can just forbid loading a pointer before you store it. For the cross-thread cases, the definition gets more complicated. And even if we do manage to establish a definition, what have we actually accomplished? The handling required for cases like D153932 still basically requires treating unknown functions as escape points.
But that would essentially be a volatile access in that case?
The aforementioned provenance-y C++ paper is here, if anyone is curious: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p3292r0.html . I think probably to the people on this thread none of the content is new; but perhaps the framing may help focus some thinking. (I don't really have a fix, but the ideal outcome in my mind is that things like this issue get fixed in the compiler, but things like pointers having escaped before their allocation get fixed in the standard, which I think most people agree with; the thing I'd like to narrow down on is the moral justification for a difference).
Really interesting thread. I'm no expert, so if I share some thoughts I hope you can point out what I've misunderstood.
I liked what you said before more, with the "treat function calls that aren't marked "nosync" as escape points for pointers that eventually escape". Treating the variable as non-escaped sounds a bit too permissive.
I don't understand this sentence at all. What does it mean to load a pointer before you store it?
You'd probably need a more strict form of a semantic address dependency - a kind of semantic allocation dependency, which determines which allocation to access. For example,
If we have this kind of semantic allocation dependency, then we could say that having no hb between an allocation and a read on which an access to that allocation has a semantic allocation dependency is UB, based on the fact that we shouldn't have known about the allocation in the first place. Like, if

In this case, leaking the pointer ahead of time is forbidden. But publishing the pointer after a fence (maybe in an opaque function) is allowed. And you'd know that the allocation can't escape "out of thin air", but only through proper synchronization - so if the pointer never "eventually escapes", then unknown functions can't make it escape. They can only add proper synchronization for when the pointer eventually does escape, so your suggestion from before about treating functions that might help to synchronize as escape points for allocations that do eventually escape sounds spot-on.

The point of distinguishing this from address dependencies is that, for example, a hash table ought to be allowed, even if the key is not synchronized with the allocation. That's fine because the key just tells us which part of the allocation to access, but doesn't tell us which allocation to access. The question is how to define this s-alloc correctly, since the hash might point us to a linked list... So in some sense we would have an allocation dependence on the key, but we would have to exclude that somehow to avoid false positives. Perhaps the provenance rules can be used here (in the sense that s-alloc = "get-provenance-from") but I don't have it at my fingertips whether it can cover all the corner cases.

EDIT: I just saw the comment by @davidtgoldblatt discussing this approach and that it doesn't work because of how the fences create the hb relationship.
An ugly solution might be to switch to a different model of hb just for the sake of defining "provenance races" - something along the ppo based models, where sb;[F>rel];sb would give hb' but sb would not - and require that the allocation hb' the load, and the load hb' the access unless the allocation full-on hb the load. Perhaps better but still ugly, one could define a relation hb[r] for each read r which contains those hb relations that "use r" (i.e., where the proof of the hb relation makes a reference to r at some point), and require that if r --s-alloc--> b, then alloc^-1(b) --hb[r]--> b
I am not too familiar with Rust but my assumption would be that if you have a mutable reference during some event e1 and any other reference during some event e2, and the events e1 and e2 aren't ordered by hb (e.g., because there are no fences), then you immediately have UB.
Basically the example at https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p3292r0.html#early-escape .
Something like https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p3292r0.html#early-escape
You don't need a mutable reference in this context, I think; just an UnsafeCell or something like that.
Thanks, now I understand what it means. The example you shared is even more "crass" than loading the pointer before storing it - it is loading the pointer even before the allocation. One could make a more precise example with something in this direction:
With sufficient barriers in the remainder of the code, one could ensure that every load and store of the pointer in the program happens after the allocation. But I assume the alias analysis would still allow reordering them, right? Perhaps one needs something stronger than ordering with the allocation, as I thought - if I get provenance from some event, then I need to have ordering with that event. The allocation is just one place I could get the provenance from...
I can't say I've read all of this. I think a release fence, as mentioned earlier, would suffice. Specifically, if the pointer is written atomically with release ordering instead of relaxed, then there is a dependency ordering (on everything but DEC Alpha). But I guess that's just the issue here: the arbitrary function could contain a release-ordering store.
I guess I was stuck thinking about what the standard allows. But of course you're right: just because the standard allows having an alias does not mean that it can happen in the implementation, and that's what's relevant for the optimization. I don't think DEC Alpha is a problem in the examples I've seen, since we have load-to-store dependencies, which IIRC Alpha does not reorder. (There might be examples like that too, though.)

So from a purely technical standpoint, it's maybe fine to say that we just need a kind of release ordering (from a fence or release store) when storing the pointer at a time where otherwise we can assume that the pointer is unique (I think restrict may have the same issues as new). Nevertheless, morally speaking, it seems more reasonable (and easier to remember) to have a definition that mirrors the way hb works - something in the direction of: allocations need to "happen-before" pointer stores and accesses to the allocation, or you are not storing the provenance / are racing the allocation, for some good definition of "happen-before".

I think one of the main things I'm confused about is whether "accidental synchronization" (e.g., with a store release of a flag, rather than a fence) should still provide provenance. Perhaps it doesn't make much difference for which optimizations are allowed?
Well, accidental synchronization is rarely accidental; e.g. a shared UnsafeCell guarded by a lock. This one is a bit weird since it's not quite that pattern: it relies on a dependency ordering given by the fact that we've read the value N rather than N - 1.
I should be more precise about what I mean by accidental synchronization. Informally, I mean that the pointer store->load are not synchronized, but in the very end you happen to be synchronized anyways. Or from another view, you are allowed to leak and receive pointers to objects before they are allocated, as long as you later wait through some other means until the allocation finishes. Formally, I am thinking of something like a
with
, i.e. a happens-before where the store->load appear somewhere in the proof - if you have this kind of hb, then it is "properly synchronized", otherwise "accidentally synchronized". In the case of an UnsafeCell protected by a lock, the pattern is
which would be "properly synchronized" by this definition. On the other hand,
(or the example with the flag) would be "accidentally synchronized", since at the time of the access we do have happens-before, but we actually get the pointer without any synchronization. In other words, the synchronization has nothing to do with passing the pointer.

From my point of view, only allowing the pattern with "proper synchronization" would theoretically allow more optimizations: if there were any way to know that foo() does not contain a standalone release fence, we could reorder non-leaked accesses around it. And indeed, with good LTO or with some hypothetical marking like 'no_release_fence', you could sometimes know that. But is that an important optimization in practice?

If "accidental hb" is enough to allow the pattern (as in David's proposal), then it becomes much easier to formalize, but those optimizations would be forbidden - any release operation, including an unlock(), would suffice to prevent reordering across the function.

One issue is that checking "accidental hb" only at the end points forbids reasonable compiler optimizations if we have a full loop and return the pointer back to the original thread, like in one of David's examples without fences.
becomes
and the additional step introduces UB. Another option discussed above (if I understood correctly) would be along the lines of checking any hb on final access, but additionally if it is the same thread as the allocation, then the load must be sequenced after a store that stores the same provenance. I find myself often forgetting about all the examples discussed here and which ones are intended to be allowed or forbidden. I'll try to summarize my current understanding:
I think the dependency is a red herring, since there are also much simpler examples with just a binary flag; and the synchronization actually seems to come from normal release->acquire ordering. I believe it's just that Sung-Hwan made sure to synchronize with the last of these stores to avoid a data race.
About restrict, I'm still trying to untangle all the words in the standard definition (looking at C11 now), but I think it may really be a similar situation.
This does not violate "Every other lvalue used to access the value of X shall also have its address based on P.", does it? I don't see any other clause that would render this UB, but I would assume alias analysis will tell you that magic and p are not the same (EDIT: I checked with David's example and it ends up doing the same optimization as with new). I think this will make "has been stored before" really hard to make precise, since before calling
LICMPass of Clang++ (13.0.0+) optimizes the above function into the following (C++ syntax is used for brevity; see godbolt for the IR code):

This is a miscompilation bug. By compiling and running a concurrent program that involves the `bug` function on the Armv8 processors Apple M1 and Cavium ThunderX2, we actually observed program behaviors forbidden by the original program before the optimization. The actual program we tested can be found here. (The forbidden behavior can be observed a couple of times throughout millions to billions of program runs. The test code includes some intricate machinery, e.g., to repeat the test many times.)

In the following, we explain a simplified version of the concurrent program we used and show why the above LICM is a miscompilation bug.

To see a buggy execution, let's consider two threads where the first thread calls the `bug` function.

Thread 1:

Thread 2:

Also suppose that the `foo` function is given as follows:

Now, if we create two threads as follows, the program should never print `bug!!!` before the problematic LICM in the `bug` function. Indeed, once `thread2` reads `N` from `a` in an acquire access mode (line T2.2), the last call to the `foo` function (line A6, writing `N` to `a` with a release mode) synchronizes with the acquire read. Therefore, the last write to `p` (line A5, `*p = N`) by `thread1` happens-before the read from `*p` (line T2.5, `*p != N`) by `thread2`, and `thread2` can only read `N` from `*p`. Note that there is no data race in this program (before the LICM).

However, after the above LICM, the program becomes racy on `*p`, and `thread2` can read a value other than `N` from `*p`. We actually observed this program printing "bug!!!" by compiling and running it on the Armv8 processors Apple M1 and ThunderX2!

Roughly, the above LICM is wrong because it moves a non-atomic write `*p = i` (line A5) after a release write (line A6). This is wrong even though `p` is never leaked before it is returned. Specifically, LLVM applies the above LICM and inlines the `bug` function in `thread1`, resulting in the following code:

Now, the write to `p` (line O5) is not properly synchronized anymore, and `thread2` can read an uninitialized value from `p` (line T2.5). In particular, such behavior is allowed by the Arm architecture because it can reorder the two independent stores O5 and O6. Therefore, this program can print "bug!!!", which is originally forbidden.

We conclude that the pointer `p` in the function `bug` should not be treated as completely local even before it is leaked (line A9).