RFC: Interfacing LLVM's "atomic volatile" in Rust #212
Conversation
This reads really nicely. Good job!
Emphasize why load/store tearing is bad. Co-Authored-By: gnzlbg <gnzlbg@users.noreply.github.com>
Typographical fix Co-Authored-By: gnzlbg <gnzlbg@users.noreply.github.com>
In the case of shared memory, there is no reason why volatile accesses should be limited to primitive types. In fact, it is very common to want to read/write an entire struct in the shared memory. With this API you would have to read/write each field of the struct individually, which makes the code more complicated for no benefit. Shared memory tends to have two components:
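As a minimal illustration of the whole-struct access argued for above (the SharedHeader layout and the shm_base pointer are invented for this sketch), today's API already lets you volatile-read a struct in one call, just without any tearing guarantee:

```rust
use std::ptr;

// Hypothetical layout of a header living in a shared-memory mapping.
#[repr(C)]
struct SharedHeader {
    sequence: u64,
    length: u32,
    flags: u32,
}

// SAFETY contract (caller's responsibility): `shm_base` points to a valid,
// properly aligned SharedHeader inside the shared mapping.
unsafe fn read_header(shm_base: *const SharedHeader) -> SharedHeader {
    // One volatile read of the whole struct instead of one per field; note that
    // this may still be lowered to several machine accesses (no tearing guarantee).
    unsafe { ptr::read_volatile(shm_base) }
}
```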
At no point do we ever need volatile atomics. In fact, volatile atomics would only be useful for theoretical MMIO which must be accessed with atomic instructions instead of normal loads and stores, but AFAIK no such hardware exists. Also, memory orderings don't really make sense in the context of MMIO. It is sometimes necessary to ensure ordering of writes to MMIO, but the fence instructions emitted by atomics will only ensure synchronization with other CPU cores. A specialized set of fence instructions is needed to ensure synchronization with other memory-mapped hardware (on ARM this is the difference between …)
1. Concurrent use of volatile memory operations on a given memory location is considered to be Undefined Behavior, and may therefore result in unintended compilation output if detected by the optimizer.
I disagree with this definition of volatile accesses: concurrent volatile operations do not create data races (assuming both sides of the concurrent access are volatile), they just produce garbage data (i.e. whatever happened to be in that memory location at the time). This is what is meant by volatile accesses acting just like a load or store instruction.
I'll accept that my interpretation is controversial, but this is how compilers implement volatile because it matches how all of the existing C/C++ code uses volatile.
This is not a definition; it just documents that we currently do not guarantee the behavior of programs where this happens, which is true, because our read_volatile/write_volatile functions are implemented on top of LLVM, and LLVM does not provide this guarantee.
I think everybody agrees that everything would be much better if LLVM were to guarantee this, but we opened a documentation bug and it doesn't look like LLVM will be guaranteeing this any time soon: https://bugs.llvm.org/show_bug.cgi?id=42435
In fact, LLVM actively warns that it considers concurrent volatile accesses to be data races in its atomics documentation:
The rule is essentially that all memory accessed with basic loads and stores by multiple threads should be protected by a lock or other synchronization; otherwise, you are likely to run into undefined behavior. If your frontend is for a “safe” language like Java, use Unordered to load and store any shared variable. Note that NotAtomic volatile loads and stores are not properly atomic; do not try to use them as a substitute. (Per the C/C++ standards, volatile does provide some limited guarantees around asynchronous signals, but atomics are generally a better solution.)
I'll accept that my interpretation is controversial, but this is how compilers implement volatile because it matches how all of the existing C/C++ code uses volatile.
We tried to convince LLVM to make this their official docs, and they refused. So, while what you say is true for current LLVM, this may change at any time, and LLVM does not seem willing to commit to a proper guarantee here. Until LLVM changes its stance, our hands are tied.
cc @elast0ny
This is definitely a case where the MMIO and shared memory use cases are at odds with each other. In principle, the shared memory use case can be handled on top of primitives for MMIO (guaranteed absence of tearing) by building a memcpy operation on top of those. But in practice, I could accept this as a motivation for keeping
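As a rough sketch of that idea (my own illustration, using today's ptr::read_volatile/ptr::write_volatile on usize words as stand-ins for whatever tearing-free per-element primitive would be provided), a volatile memcpy could look like this:

```rust
use std::ptr;

// SAFETY contract (caller's responsibility): both regions are valid for `words`
// machine words, properly aligned, and do not overlap.
unsafe fn volatile_copy_words(dst: *mut usize, src: *const usize, words: usize) {
    for i in 0..words {
        // Each word is copied with an individual volatile load + volatile store,
        // so the copy as a whole is not atomic, but every element access stays a
        // distinct, non-elided memory operation.
        let word = unsafe { ptr::read_volatile(src.add(i)) };
        unsafe { ptr::write_volatile(dst.add(i), word) };
    }
}
```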
I believe that this assertion is not generally true. The single most central assumption of the Rust Abstract Machine, which almost every single compiler optimization relies on for correctness, is that the contents of memory can only change when the program allows it to. Atomics are not exempt from this rule. They simply ask rustc to consider the possibility of interaction between threads instead of restricting the analysis to the current thread. But even with atomics, it should be okay for rustc to optimize this program...

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static x: AtomicBool = AtomicBool::new(true);

fn main() {
    x.store(true, Ordering::Relaxed);
    println!("{}", x.load(Ordering::Relaxed));
}
```

...into this one...

```rust
use std::sync::atomic::AtomicBool;

static x: AtomicBool = AtomicBool::new(true);

fn main() {
    println!("{}", true);
}
```

...and indeed, rustc already performs this sort of optimization today when presented with a very simple program like the one above. For sure, today, the optimizations will go away as soon as your program becomes just a tad more complex. But there's no fundamental reason why that is the case; it's just a limitation of the optimizer's current program analysis capabilities. Future-proofing against this is why we need atomic volatile operations in shared memory scenarios: to tell the compiler that its knowledge of memory accesses in a certain region, and of their visibility to the outside world, is not complete, even if there's no obvious side-effect going on, and that a conservative stance towards optimization must be adopted. Because what rustc's optimizer does on toy programs today, it may do on non-toy programs in the future...
Using plain loads and stores for interprocess communication is subtly incorrect for the same reason that using non-volatile atomics is:
Unless you get lucky with the aforementioned poor ability of rustc and LLVM to optimize atomics-based code, these effects are likely to break shared memory communication based on plain loads and stores. The underlying issue is that the compiler's model of an isolated address space is incorrect in the case of shared memory, and must be amended by disabling the kind of optimization described above through use of volatile. As @RalfJung pointed out in the context of seqlocks, using volatile memory accesses in an attempt to turn data races into well-defined behavior is a hack which isn't guaranteed to continue working in the future by either the current Rust definition of undefined behavior or LLVM's current implementation of non-atomic volatile. Which is why LLVM has atomic volatile, and I think Rust should have it too.
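To make the failure mode concrete, here is a minimal sketch (my own example, not from the RFC) of a shared-memory pattern that plain loads break: since nothing tells the compiler that another process can modify *flag, it may hoist the read out of the loop and spin forever.

```rust
// `flag` is assumed to point into a shared-memory mapping that another process
// writes to. With a plain (non-volatile, non-atomic) load, the compiler may
// read the flag once and keep looping on the cached result.
unsafe fn wait_for_peer(flag: *const bool) {
    loop {
        let ready = unsafe { *flag }; // plain load: assumed not to change behind our back
        if ready {
            break;
        }
        std::hint::spin_loop();
    }
}
```

A volatile load keeps the read inside the loop; making it an atomic volatile load would additionally give the racing access well-defined behavior.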
As explained by this RFC, volatile atomics bring two very useful properties with respect to current volatile operations even if the only thing that one uses them for are
Indeed, orderings stronger than …
IMO this avoids relying on LLVM to do the tearing for us, meaning we can quite clearly document what is happening.
I don't agree with the "which is why" part here. Seqlocks shouldn't use any volatile. Atomic volatile is useful for shared-memory concurrency, to answer questions like "what happens on a data race". It is more well-behaved than non-atomic volatile because its behavior is better specified. I'd love it if we could get rid of non-atomic volatile as that would side-step a lot of thorny questions, but alas, some hardware disagrees... but I feel it would be fine for such hardware to have to jump through hoops and not use the normal convenient APIs.
Bad wording, I suspect. This was intended to respond to @Amanieu's repeated assertions that using volatile to handle data races is okay because it would just return garbage torn data. Volatile and atomic are orthogonal. You need volatile when your memory accesses are visible to someone else (or conversely you are seeing someone else's memory accesses), and you need atomics when one thread is racing with another. If you are racing with a foreign process in shared memory, you need both. Ergo, atomic volatile is useful.
You may like the section where I mention that we could actually support more hardware with atomic volatile if we expanded our model of atomic operations beyond global cache coherence (aka …)
This doesn't sound right. If that were true then any memory that interacts with a system call would need to be accessed with volatile, because anything that happens in the kernel is outside the Rust Abstract Machine. Consider the case of a buffer being filled from a file: the compiler clearly cannot assume that the contents of the buffer remain the same across
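For concreteness, a tiny sketch of that scenario (plain std::fs code, nothing volatile about it): the buffer is handed to an opaque call, so the compiler already has to assume its contents may differ afterwards.

```rust
use std::fs::File;
use std::io::Read;

fn read_magic(path: &str) -> std::io::Result<[u8; 4]> {
    let mut buf = [0u8; 4];
    // The buffer is passed by reference into an opaque call (ultimately a syscall),
    // so the compiler cannot assume its contents are unchanged afterwards.
    File::open(path)?.read_exact(&mut buf)?;
    Ok(buf)
}
```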
rustc can optimize this program because it can prove that the address of
Again, you are misunderstanding why the compiler is allowed to perform these optimizations: it is because there is no synchronization operation (i.e. an atomic operation or a fence) between the loads/stores, so it is allowed to assume the memory contents have not changed. However, any shared data structure (whether intra-process or in shared memory) will use atomic operations on the control signals to control access to data, which prevents merging the loads/stores across synchronization operations.
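A minimal single-process sketch of the "atomics on the control signals, plain accesses on the data" pattern described here (the READY/DATA names are made up):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Control signal: accessed with atomics.
static READY: AtomicBool = AtomicBool::new(false);
// Payload: accessed with plain loads/stores.
static mut DATA: u64 = 0;

fn producer() {
    unsafe { DATA = 42 };                  // plain store to the payload
    READY.store(true, Ordering::Release);  // synchronization operation: publish
}

fn consumer() -> Option<u64> {
    if READY.load(Ordering::Acquire) {     // synchronization operation: observe
        Some(unsafe { DATA })              // plain load of the payload
    } else {
        None
    }
}
```

The Release store / Acquire load pair is the synchronization operation that stops the compiler from merging or moving the payload accesses across it; the disagreement in this thread is about whether the same reasoning can be relied upon when the other side of the exchange lives in a different process.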
Not impossible, but likely very unreliable in practice, at least in the general case where you have some arbitrary data structures containing
First, I don't know why you'd be passing an …. Second, I don't see why you'd want the default for FFI functions to be more conservative than the default for non-FFI functions. In your scenario, if you had a Rust function where:
the compiler would have to assume that it was multithreaded. The same should apply to FFI functions with pointer parameters. If anything, FFI code is more likely to do "weird things" than Rust code, not less. There could of course be unsafe annotations for FFI functions to make assumptions about how they behave: not just single-threadedness, but stronger constraints like "it won't keep using this pointer after it's done executing". But they shouldn't be the default.
Indeed, making this work reliably on trait objects would probably require either of...
I guess of those three, a compiler understanding of sealed traits would be the option that's closest to being actually feasible in the medium term... and it definitely wouldn't resolve the general problem, but only solve it for sealed traits.
For Arc I'm not sure if I can think of a good example, but for the others I could totally see myself dumping a struct with an AtomicXyz or Mutex member to some kind of I/O interface...
Fair enough, that's also what I remembered the #215 conclusion was. If we accept it, then we go back to the "we don't really have a use for atomic volatile beyond atomic loads and stores" starting point above. To answer your underlying question about why I find this conclusion somewhat unsatisfying (but probably not so unsatisfying as to feel like relitigating it too much)...
FFI is special in how opaque it is. With Rust code, as it's currently compiled at least, the compiler has a lot of freedom to look through the code of the crate that is being compiled and its dependencies and to infer many useful properties from it, whereas with FFI we only get to do that in very specific compilation scenarios (cross-language LTO) and at a very low level (LLVM only). This means that in the FFI case, how we set the default interface contract is very important, because often it's the only interface contract we're going to get. Unlike in Rust, where the compiler can potentially infer a lot more than what the developer told it in the code. You could reasonably argue, however, that writing FFI bindings is already so much work that it wouldn't hurt too much to ask performance-conscious people to add "may not do X or Y" manual annotations to the FFI bindings which are used in the hot parts of their code. (Note also that there is some argument in there for being careful if we ever try to stabilize some form of Rust ABI for e.g. dynamic linking, as it will have mostly the same problems as C FFI if done in the most straightforward way of mapping functions to ELF/PE symbols without any extra metadata.) By the way, if I didn't manage to convince you that I'm insane yet...
Would this allow things like excluding memory which could not have possibly been touched by any other thread from …? That intuitively seems even more far-fetched than eliding atomics in single-threaded code, but if doable, it could be a way to resolve the longstanding Great Atomics Ordering expert debate by making the "safe choice" of using …
Alright, I'm finally done giving this a good read and updating it with respect to the outcomes of the recent discussion with @RalfJung. In the end, the magnitude of the changes made this pretty much a full rewrite, which, together with a shortage of volunteer time on my side, is why it took so long. Things that I'm still wondering about:
To answer the latter question, I could use the opinion of people who have implemented compiler backends for those kinds of targets. Any suggestions for a good place to ask the question?
I'd start on IRLO.
(Replying to @RalfJung's comment which GitHub is now hiding under a "show resolved" button...)
Overly long answer: Yeah, it’s common to have a sequence like:
You often do need some kind of synchronization between steps 1 and 2, but it’s dependent on the hardware and how you’ve configured it. It might be any of:
Traditionally the synchronization is done as a separate operation; prepping for DMA is usually considered a more heavyweight activity that doesn’t need to be optimized as tightly as atomics. But, for example, ARMv8 has a dedicated store-release instruction, and AFAIK it can provide the necessary guarantees for this, depending on how the memory is configured. In other words, it’s a use case where ‘volatile atomic release store’ would have meaningful semantics, and where you can’t get the same instruction sequence a different way (as opposed to architectures where the assembly for store-release is just a barrier followed by a normal store). It’s just… not really necessary and not usually done. But it’s at least plausible that some future system would have a stronger demand for it.
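For illustration, the shape of that sequence might look like the following sketch (the buffer, the doorbell register, and the choice of a Release fence as the "some kind of synchronization" step are all invented here; as noted above, a plain atomic fence may not be the right barrier for the device):

```rust
use std::ptr;
use std::sync::atomic::{fence, Ordering};

// SAFETY contract (caller's responsibility): `buffer` is a valid DMA buffer of
// `len` bytes and `doorbell` is a mapped device register.
unsafe fn kick_dma(buffer: *mut u8, len: usize, doorbell: *mut u32) {
    // Step 1: prepare the data with ordinary stores.
    for i in 0..len {
        unsafe { ptr::write(buffer.add(i), 0xAB) };
    }
    // Step 2: some hardware-appropriate synchronization; a Release fence is used
    // here as a placeholder, but it may not suffice for device memory.
    fence(Ordering::Release);
    // Step 3: volatile store to the doorbell register to start the transfer.
    // A volatile atomic release store would fuse steps 2 and 3 into one operation.
    unsafe { ptr::write_volatile(doorbell, 1) };
}
```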
I opened an IRLO thread to discuss the remaining issue of applicability of unordered atomics to super-weak memory models (language virtual machines, GPUs...)
New concern from the GPU discussion, which is also relevant on x86 IIRC: some hardware architectures define SIMD loads and stores to be element-wise atomic, not atomic as a whole. If I'm aiming for total deprecation of …
This is what we were discussing here too: #209
Ah, yes, thank you for reminding me about this topic and linking to it! Reading through it again gives me the impression that volatile SIMD loads and stores are a rather tricky beast, because...
As far as I can tell, our options here are to...
So, I've been having a very interesting chat with @mcy, who had some reservations about the currently proposed API. Here's my biased and incomplete summary. First of all, the concerns:
We felt that it should be possible to do better on these fronts by trying to stay closer to the original design of … With this in mind, a possible alternative API design would be to use a trait to express volatile operations...

```rust
// This trait is unsafe to implement, because implementors must guarantee that the
// operations are indeed atomic and volatile, with semantics as described by this RFC.
unsafe trait VolatileLoadStore {
    // This operation is unsafe to use because the pointer might not be valid
    unsafe fn load_volatile(self: *const Self) -> Self;

    // This operation is unsafe to use because the pointer might not be valid
    unsafe fn store_volatile(self: *mut Self, val: Self);
}
```

...and then we would implement that trait for every type which has native machine load and store operations. Speaking personally, I do like this design direction better than the current design, but that's another major RFC rewrite, so I wanted to test the waters here before changing all the text. Here's my quick biased evaluation. Advantages:
Drawbacks:
Neutral:
While I do not have that much time to dedicate to this RFC at the moment, I just wanted to mention that the ongoing discussion about non-…
Alright, 2-3 months later I think it's fair to say that I will not manage to finish this one myself anytime soon, as life got busy with too many other things. I hope that the notes which I left at the end of this PR thread will be enough to help anyone interested in pushing this forward into taking things over from where I left them. Otherwise, feel free to ask me any question about whatever seems unclear.
Basically my proposal to move forward on volatile support in Rust, by interfacing LLVM's atomic volatile, which I consider to be a better match for volatile's intended "defer to hardware memory model" semantics than today's ptr::[read|write]_volatile in every single respect. Emerges from the #152 discussion, and a step toward resolving it, but will not fully resolve that problem because as far as I know, that's impossible without LLVM involvement.
Having this RFC reviewed here is intended as a way to weed out obvious problems in a small and focused group before going for the broader audience of a pre-RFC on IRLO.
Rendered