
UB due to lack of a full fence in full_fence (on x86) #70

Closed
RalfJung opened this issue Aug 13, 2023 · 5 comments · Fixed by #71

Comments

@RalfJung

I was recently made aware of this code:

use core::sync::atomic::{self, AtomicUsize, Ordering};

// HACK(stjepang): On x86 architectures there are two different ways of executing
// a `SeqCst` fence.
//
// 1. `atomic::fence(SeqCst)`, which compiles into a `mfence` instruction.
// 2. `_.compare_exchange(_, _, SeqCst, SeqCst)`, which compiles into a `lock cmpxchg` instruction.
//
// Both instructions have the effect of a full barrier, but empirical benchmarks have shown
// that the second one is sometimes a bit faster.
//
// The ideal solution here would be to use inline assembly, but we're instead creating a
// temporary atomic variable and compare-and-exchanging its value. No sane compiler to
// x86 platforms is going to optimize this away.
atomic::compiler_fence(Ordering::SeqCst);
let a = AtomicUsize::new(0);
let _ = a.compare_exchange(0, 1, Ordering::SeqCst, Ordering::SeqCst);
atomic::compiler_fence(Ordering::SeqCst);

As the comment says, this is UB -- and the idea that no sane compiler is going to optimize atomics is a myth. However, I admit I don't know enough about this specific case to say whether this is a risk for this crate.

What I am wondering is: why doesn't this use inline assembly? That is the intended mechanism for forcing the compiler to generate code in a particular way, and it is clearly sound here. The alternative is even mentioned in the comment, but unfortunately without an explanation of why it was discarded.
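For reference, here is a minimal sketch of what the inline-assembly route could look like on x86_64 (my illustration, not the crate's actual fix; it assumes asm!, which has been stable since Rust 1.59):

// Sketch: a full fence via inline assembly on x86_64. Any `lock`-prefixed
// read-modify-write is a full memory barrier on x86, like `mfence` but
// often a bit cheaper on common microarchitectures.
#[cfg(target_arch = "x86_64")]
fn full_fence() {
    // SAFETY: or-ing 0 into an existing stack slot leaves memory unchanged;
    // the `lock` prefix supplies the barrier. An asm! block without
    // `nomem`/`readonly` options is also a compiler-level barrier.
    unsafe {
        core::arch::asm!("lock or qword ptr [rsp], 0");
    }
}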

@notgull
Member

notgull commented Aug 13, 2023

The MSRV of the smol-rs project previously did not allow for inline assembly. See smol-rs/smol#244. But now it does, so we should probably bump the MSRV so we can directly use the code that we want.

@taiki-e
Collaborator

taiki-e commented Aug 13, 2023

Filed #71 to resolve this.

why doesn't this use inline assembly?

When this crate was first released, inline assembly was not stable. And for a long time after that, our MSRV was older than 1.59, the release that stabilized inline assembly.

Since we have recently raised the MSRV of some smol-rs crates to 1.63, I think this is an appropriate time to address this issue.

and the idea that no sane compiler is going to optimize atomics is a myth. However, I admit I don't know enough about this specific case to say whether this is a risk for this crate.

The C++20 memory model states that SC operations "establish a single total modification order of all atomic operations that are so tagged". https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering

While it may be possible to "merge two atomic operations when there are only pure operations in the middle", as described in the article you linked, I don't think it is allowed to completely remove independent SC operations (unless the compiler can prove, by inspecting the whole program, that there is never another SC operation that could synchronize with them). (It may be possible to replace the SC CAS with an SC fence or another SC RMW, but that doesn't lose the property we need here.)
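As a concrete illustration of the kind of merging the article describes (my example, not taken from it), two adjacent SC stores with nothing observable in between may be coalesced; that is different from removing an operation outright:

use std::sync::atomic::{AtomicUsize, Ordering};

// The first store may be folded into the second, because no thread is
// guaranteed to observe the intermediate value 1.
fn merged(x: &AtomicUsize) {
    x.store(1, Ordering::SeqCst);
    x.store(2, Ordering::SeqCst);
}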

So, unless the C++ memory model changes the property mentioned above in the future, I expect that no wrong code will be generated here. That said, inline assembly is the "definitely correct" choice, and I agree that using it is preferable.

@RalfJung
Author

RalfJung commented Aug 13, 2023

The MSRV of the smol-rs project previously did not allow for inline assembly. See smol-rs/smol#244. But now it does, so we should probably bump the MSRV so we can directly use the code that we want.

Ah, if it's just an MSRV issue that is great! And thanks @taiki-e for getting the ball rolling with that PR. :)

If the lock cmpxchg can be faster than mfence, it might also be worth trying to get LLVM to generate the former instead of the latter for SC fences -- then all code would benefit.

The C++20 memory model states that SC operations "establish a single total modification order of all atomic operations that are so tagged". https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering

Yeah, but the way that order interacts with the other orders isn't always entirely how one would expect, I think... so it's certainly not obvious to me that compilers are not allowed to drop the access. In particular, this order AFAIK does not participate in determining synchronizes-with edges -- but an SC fence has the effects of a release-acquire fence and can hence establish synchronizes-with edges.
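As a concrete (simplified) sketch of that last point, assuming the usual fence-to-fence synchronization rule:

use std::sync::atomic::{fence, AtomicBool, AtomicUsize, Ordering};

static DATA: AtomicUsize = AtomicUsize::new(0);
static FLAG: AtomicBool = AtomicBool::new(false);

// Thread A
fn producer() {
    DATA.store(42, Ordering::Relaxed);
    fence(Ordering::SeqCst); // also acts as a release fence
    FLAG.store(true, Ordering::Relaxed);
}

// Thread B
fn consumer() {
    if FLAG.load(Ordering::Relaxed) {
        fence(Ordering::SeqCst); // also acts as an acquire fence
        // If the load above saw `true`, the two fences synchronize and
        // this read is guaranteed to see 42.
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    }
}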

You have to study the formalizations of this model; I don't think the English prose can be precise enough for subtle questions like this.

@RalfJung
Author

RalfJung commented Aug 13, 2023

I don't think it is allowed to completely remove independent SC operations

LLVM certainly does not agree with your reasoning: this becomes a NOP.

use std::sync::atomic::{AtomicUsize, Ordering};

pub fn test() {
    let x = AtomicUsize::new(0);
    // A dead SeqCst load on an atomic that never escapes: LLVM compiles
    // this whole function down to a plain return.
    x.load(Ordering::SeqCst);
}

Stores also get removed. I think they just didn't implement the pass that removes dead RMWs.
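For instance, the analogous dead store (my variant of the example above) also compiles down to a plain return:

use std::sync::atomic::{AtomicUsize, Ordering};

pub fn test_store() {
    let x = AtomicUsize::new(0);
    // Dead SeqCst store on a non-escaping atomic: also removed by LLVM.
    x.store(1, Ordering::SeqCst);
}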

@taiki-e
Collaborator

taiki-e commented Aug 13, 2023

Oh, that is interesting. I can reproduce it, even with a very old rustc, and even with compiler_fence around the SC load/store: https://godbolt.org/z/xsvG5Ke58
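Roughly the variant from that godbolt link (reconstructed here, so the exact code may differ): even bracketed by compiler fences, the dead SC load still disappears.

use std::sync::atomic::{compiler_fence, AtomicUsize, Ordering};

pub fn test() {
    compiler_fence(Ordering::SeqCst);
    let x = AtomicUsize::new(0);
    x.load(Ordering::SeqCst);
    compiler_fence(Ordering::SeqCst);
}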
