
Alter Once to remove unnecessary SeqCst usage #31650

Closed

Conversation

AlisdairO
Contributor

Just had a look at Once, and it looks like it's unnecessarily using SeqCst where acquire/release semantics would be sufficient. Would appreciate a review to be sure, but I can't see why it would require SeqCst.
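To make the ordering argument concrete, here is a minimal, hypothetical sketch (not the actual std implementation; the type name and state encoding are invented) of a Once-like type where an Acquire load on the fast path pairs with a Release store by the initializing thread:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

// Hypothetical simplified Once, only sketching the ordering argument.
// State encoding: 0 = not yet run, 1 = done.
pub struct SimpleOnce {
    state: AtomicUsize,
    lock: Mutex<()>,
}

impl SimpleOnce {
    pub const fn new() -> SimpleOnce {
        SimpleOnce { state: AtomicUsize::new(0), lock: Mutex::new(()) }
    }

    pub fn call_once<F: FnOnce()>(&self, f: F) {
        // Fast path: this Acquire load synchronizes-with the Release store
        // below, so everything the closure wrote is visible to the caller.
        if self.state.load(Ordering::Acquire) == 1 {
            return;
        }
        let _guard = self.lock.lock().unwrap();
        // Under the mutex a Relaxed load suffices: the mutex itself
        // synchronizes competing initializers.
        if self.state.load(Ordering::Relaxed) == 0 {
            f();
            // Release publishes the closure's effects to later Acquire loads.
            self.state.store(1, Ordering::Release);
        }
    }
}
```

Nothing here needs a total order across unrelated atomics, which is all SeqCst adds over Acquire/Release.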

@rust-highfive
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @nikomatsakis (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@@ -71,7 +71,7 @@ impl Once {
#[stable(feature = "rust1", since = "1.0.0")]
pub fn call_once<F>(&'static self, f: F) where F: FnOnce() {
// Optimize common path: load is much cheaper than fetch_add.
if self.cnt.load(Ordering::SeqCst) < 0 {
if self.cnt.load(Ordering::Acquire) < 0 {
Contributor


Slightly more efficient would be to use a Relaxed load and only fence(Acquire) if the count is in fact negative.
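A sketch of that suggestion (the helper name is hypothetical; std's Once encodes "initialized" as a negative count):

```rust
use std::sync::atomic::{fence, AtomicIsize, Ordering};

// Do a Relaxed load, and only pay for an Acquire fence on the early-exit
// path, where the initialized data will actually be read.
fn is_initialized(cnt: &AtomicIsize) -> bool {
    if cnt.load(Ordering::Relaxed) < 0 {
        // Upgrade the Relaxed load: this fence, combined with the Release
        // operation performed by the initializing thread, establishes the
        // same happens-before edge an Acquire load would.
        fence(Ordering::Acquire);
        true
    } else {
        false
    }
}
```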

Member


I disagree; in this case you would want the acquire to be on the load itself. This is because some architectures (ARM, AArch64) have a load-acquire instruction that is much faster than the fence instruction. Since this branch is going to be taken in the majority of cases, you would want to avoid the fence overhead here.

@gereeter
Contributor

Memory orderings are tricky enough that I think that having comments explaining why the more relaxed bounds are correct would be very useful.

@@ -102,11 +102,11 @@ impl Once {
// calling `call_once` will return immediately before the initialization
// has completed.

let prev = self.cnt.fetch_add(1, Ordering::SeqCst);
let prev = self.cnt.fetch_add(1, Ordering::AcqRel);
Contributor


This can be Relaxed with a fence(Acquire) in the case where the count is negative: since swap and fetch_add are atomic, no extra synchronization is needed to guarantee that the number of threads that pass through the fetch_add seeing a positive integer is exactly the number that will be read as the result of the swap.
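The same idea applied to the fetch_add might look like this (a sketch with a hypothetical helper name; the RMW itself only needs to be atomic, and Acquire ordering is only needed on the path that observes initialization already done):

```rust
use std::sync::atomic::{fence, AtomicIsize, Ordering};

// Returns true if initialization had already finished (negative count),
// false if this thread must participate in initialization.
fn register_caller(cnt: &AtomicIsize) -> bool {
    let prev = cnt.fetch_add(1, Ordering::Relaxed);
    if prev < 0 {
        // Initialization already finished; synchronize with the Release
        // operation that published its results before returning early.
        fence(Ordering::Acquire);
        return true;
    }
    false
}
```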

@AlisdairO
Contributor Author

@gereeter Thanks a lot for the thorough review - really made me think about/learn more about the use of relaxed ordering, where I'd just been concerned with the sequential consistency aspect. Made changes which hopefully address your thoughts.

@AlisdairO
Contributor Author

Ah, looks like there's an issue filed for this: #27610

@gereeter
Contributor

@Amanieu (continuing out of a now hidden comment)

Pro load(Acquire):

  • You are correct that having a fence inside the branch will prevent the use of a load-acquire instruction on ARMv8 and AArch64. Moreover, PowerPC and ARMv7 are forced to use a data barrier instead of an instruction barrier after a branch.
  • The common case requires the Acquire, and we should optimize that case even at the expense of the less common case.
  • The use of a mutex and several atomic operations swamps the overhead introduced by a single superfluous lightweight barrier.

Pro load; fence(Acquire):

  • ARMv8 and AArch64 also introduced the lighter dmb ld fence, so the overhead of using a separate fence should not be too large.
  • A separate fence is more descriptive, both to a reader and to the compiler, as it more accurately captures what synchronization is needed.
  • Related to the previous point, a sufficiently smart compiler could do more to optimize a separate fence because it restricts ordering less. Note that LLVM is nowhere near this point.

Neither of the two options would even allow LLVM to produce the optimal code on PowerPC and ARMv7, which would be an instruction sync only in the branch for the early exit. The fence would prevent LLVM from using an instruction sync at all, while the load(Acquire) would force there to be an instruction sync on both branches.

Given all that, I'm inclined to support switching back to a load(Acquire) along with a comment explaining the reasoning. This choice optimizes for the common case and the more modern processor.

@AlisdairO
Contributor Author

@gereeter @Amanieu thanks - I've altered the first fence back into a simple load(Acquire).

@alexcrichton
Member

I'm personally very wary of relaxing any orderings in the standard library, especially if we're coming up with the design ourselves. These are notoriously hard to get right unfortunately. Along those lines, do you have any benchmarks to show the impact of these relaxed orderings? It would be good to get a handle on what sort of runtimes we're talking about. I suspect x86 will benefit, but the greatest benefit is likely from ARM.

I also suspect that we only really need to relax one ordering in this function to get any real benefit, which is the first one. None of the orderings really have any contention which needs to be super fast, so I would personally prefer to see what our best perf increase would be and then move as much as possible back to SeqCst while retaining the same perf wins.

@AlisdairO
Contributor Author

@alexcrichton The benefit of relaxing SeqCst for the first instruction will likely be zero on x86 - the extra barrier for SeqCst comes on the write. Acquire/Release comes for free on x86, while the extra fence for sequential consistency adds (IIRC) about 100 cycles to each write.

I can understand a fear of using Ordering::Relaxed - I do tend to think it can leave code harder to change safely. On the other hand, I'd argue against insisting on full sequential consistency everywhere in the standard library that isn't absolutely performance critical. The difference between sequential consistency and acquire/release is not complicated for simple code like this, and given the broad scope of usage of standard library constructs, it seems a shame to spend cycles on obviously pointless work - even if the overall benefit is pretty minor.
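The cost asymmetry on x86 can be illustrated with a pair of trivial functions (a sketch; the exact codegen depends on the compiler and target, but the comments reflect the typical lowering):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// On x86, a Release store typically compiles to a plain `mov`, while a
// SeqCst store needs an `xchg` (or `mov` plus `mfence`) - that fence is
// the ~100-cycle cost mentioned above. Loads, by contrast, are a plain
// `mov` for both Acquire and SeqCst on x86.
pub fn store_release(x: &AtomicUsize) {
    x.store(1, Ordering::Release); // mov
}

pub fn store_seqcst(x: &AtomicUsize) {
    x.store(1, Ordering::SeqCst); // xchg: the expensive part
}
```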

@alexcrichton
Member

To me at least these orderings seem pretty non-trivial. I've found that to reason through these you basically have to reason about all orderings with respect to other orderings, and there's quite a few going on here.

In terms of performance I think we'll only really get wins by fixing the first few (the ones that check to see if the initialization has already run and finished). Only modifying those would be much easier to reason about (to me at least)

@AlisdairO
Contributor Author

Yeah, I don't by any means intend to say that Acquire/Release vs sequential consistency is always trivial, just that it is in this particular case: if we ignore Relaxed and use Acquire/Release for everything, all the synchronization necessary to guarantee visibility of the results of the closure call occurs on the single self.cnt variable. That being the case, sequential consistency is irrelevant, because SeqCst only adds guarantees over AcqRel when at least two different atomic variables are modified by different threads.
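For context on the "two different atomic variables" point: the smallest litmus test where SeqCst matters is the store-buffering pattern. A sketch (the function is illustrative, not from the PR):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Two atomics written by two different threads. With SeqCst on all four
// operations there is a single total order, so r1 and r2 cannot both be
// false; with only Acquire/Release that outcome would be permitted.
// A Once that synchronizes on one variable never has this shape.
static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);

fn store_buffering() -> (bool, bool) {
    X.store(false, Ordering::SeqCst);
    Y.store(false, Ordering::SeqCst);
    let t1 = thread::spawn(|| {
        X.store(true, Ordering::SeqCst);
        Y.load(Ordering::SeqCst)
    });
    let t2 = thread::spawn(|| {
        Y.store(true, Ordering::SeqCst);
        X.load(Ordering::SeqCst)
    });
    (t1.join().unwrap(), t2.join().unwrap())
}
```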

All that said, any performance gain from this is obviously going to be too minor to quibble extensively about, so I'm of course perfectly happy to change this so that it just uses Acquire on the early exit checks. I'm busy tonight but I'll sort it out as soon as I can :-).

@alexcrichton
Member

Ok, I think that making the first two operations unconditionally Acquire is still correct in terms of memory orderings, and I suspect that'd get 99.9% of the benefit of the patch. (and would be much easier to reason about).

I'd want to confirm with others still, though.

@Amanieu
Member

Amanieu commented Feb 16, 2016

Since we're talking about optimizing the hot path, I think it would be nice to split the cold path of call_once into a separate function that is marked with #[cold].
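A sketch of that structure (a hypothetical simplified type; the slow path here is not safe for concurrent initializers - it only illustrates the #[cold] split, which keeps call_once small and inlinable while the compiler lays out and optimizes for the fast path):

```rust
use std::sync::atomic::{AtomicIsize, Ordering};

pub struct OnceSketch {
    cnt: AtomicIsize,
}

impl OnceSketch {
    pub const fn new() -> OnceSketch {
        OnceSketch { cnt: AtomicIsize::new(0) }
    }

    pub fn call_once<F: FnOnce()>(&self, f: F) {
        // Fast path: already initialized (negative count in this sketch).
        if self.cnt.load(Ordering::Acquire) < 0 {
            return;
        }
        self.call_once_slow(f);
    }

    #[cold]
    fn call_once_slow<F: FnOnce()>(&self, f: F) {
        // A real implementation would arbitrate between racing callers
        // here; this sketch just runs the closure and marks completion
        // with a Release store.
        f();
        self.cnt.store(-1, Ordering::Release);
    }
}
```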

@alexcrichton
Member

I would be ok doing so assuming that benchmarks are shown that it's a sizable improvement.

@nikomatsakis
Contributor

r? @alexcrichton

I am not familiar enough with this code to have a strong opinion, and don't care to become so. ;)

cc @aturon

@arthurprs
Contributor

+1 for cold path split

@bors
Contributor

bors commented Mar 27, 2016

☔ The latest upstream changes (presumably #32325) made this pull request unmergeable. Please resolve the merge conflicts.

@huonw huonw added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Apr 26, 2016
@huonw
Member

huonw commented Apr 26, 2016

This has been stuck in limbo for a while, so @rust-lang/libs should discuss it. I'm in favour of optimising it, but we should definitely be sure it is correct. I imagine there's other implementations of once (or similar) that use weaker orderings which we could consult with.

As @alexcrichton says, it seems like we could at the very least just weaken the first ordering (together with its associated Release) to get the main fast path fast.

@Amanieu
Member

Amanieu commented Apr 27, 2016

The big issue here is that we need to construct a Mutex object and then destroy it when it is no longer needed. I did a quick survey of existing implementations of pthread_once and __cxa_guard_acquire. They generally use one of two approaches, both of which avoid this issue:

  • Use a global mutex. This prevents separate initializers from running concurrently, but this doesn't seem to be much of a problem in practice. The mutex is recursive to avoid deadlocks.
  • Use a futex, which doesn't require any initialization or destruction. Similar functionality is available on Windows with SRWLock; however, I don't think there is an equivalent on OSX.

@alexcrichton
Member

@Amanieu note that after #32325 we no longer have a mutex at all, so that concern is essentially moot.

In my opinion the only ordering which needs to change is this one, which can likely be relaxed to Acquire as pointed out here. I believe this will have no impact at all on x86/x86_64, as a SeqCst load is just a mov instruction (I think), but it may have an impact on ARM (where it should be benchmarked).

@AlisdairO
Contributor Author

My apologies, this completely fell off my radar. I shall try to take a look at it this weekend.

@AlisdairO
Contributor Author

OK, looks like it's a pretty simple change at this point. @alexcrichton is correct on the potential performance impact, although alas I have no multi-core ARM machine to test on.

@Amanieu
Member

Amanieu commented May 2, 2016

I ran some tests on ARM & AArch64 machines and the performance of load(Acquire) and load(SeqCst) are identical. However I still think this change is the right thing to do.

@alexcrichton
Member

The libs team discussed this during triage yesterday and the conclusion was that we don't want to merge this at this time. This change unfortunately reduces readability, as the non-SeqCst ordering is unexpected; it won't have a perf impact on x86, and @Amanieu has measured no impact on ARM.

Along those lines, without evidence in favor of this, we'd like to keep the code as it is today, but thanks regardless for the PR @AlisdairO!

@huonw
Member

huonw commented May 5, 2016

Additionally, @alexcrichton looked at the ARM assembly, and apparently it was identical for both Acquire and SeqCst.

@Amanieu
Member

Amanieu commented May 5, 2016

It seems that while load(Acquire) and load(SeqCst) generate the exact same assembly on ARM, there is a difference on PowerPC:

load(Acquire):

    ld 3, 0(3)
    lwsync

load(SeqCst):

    sync
    ld 3, 0(3)
    lwsync

Since sync is a pretty heavyweight barrier, I would expect it to have a significant performance impact.

@alexcrichton
Member

Perhaps, but if we're just fishing for platforms where this makes a difference then we'll of course find such a platform. It still stands that no performance measurements show an improvement and it decreases code readability, so until that state changes we'll likely leave as-is.

@jeehoonkang
Contributor

I have a different opinion on readability. According to this paper, SeqCst load/store instructions are broken. (Disclosure: I'm an author.) There is no way to compile SeqCst accesses as specified in the C11 standard to the Power/ARM architectures. The paper proposes a fix, but it is much weaker and more complicated than expected. So in my opinion, contrary to common belief, one should be very wary of using SeqCst accesses.

@alexcrichton
Member

@jeehoonkang heh I think you're far more well versed on this topic than I, so I'm definitely willing to defer to you!

I was mostly just probing for rationale on #44331 as the rule of thumb seems to be SeqCst is the "most correct" in terms of "hopefully it's just a compiler bug if it goes wrong".
