
Alter Once to remove unnecessary SeqCst usage #31650

Closed

Conversation

AlisdairO
Contributor

Just had a look at Once, and it looks like it's unnecessarily using SeqCst where acquire/release semantics would be sufficient. Would appreciate a review to be sure, but I can't see why it would require SeqCst.
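To make the ordering argument concrete, here is a minimal, hypothetical sketch (not the actual std implementation; the type name and state encoding are invented) of a Once-like type where an Acquire load on the fast path pairs with a Release store by the initializing thread:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

// Hypothetical simplified Once, only sketching the ordering argument.
// State encoding: 0 = not yet run, 1 = done.
pub struct SimpleOnce {
    state: AtomicUsize,
    lock: Mutex<()>,
}

impl SimpleOnce {
    pub const fn new() -> SimpleOnce {
        SimpleOnce { state: AtomicUsize::new(0), lock: Mutex::new(()) }
    }

    pub fn call_once<F: FnOnce()>(&self, f: F) {
        // Fast path: this Acquire load synchronizes-with the Release store
        // below, so everything the closure wrote is visible to the caller.
        if self.state.load(Ordering::Acquire) == 1 {
            return;
        }
        let _guard = self.lock.lock().unwrap();
        // Under the mutex a Relaxed load suffices: the mutex itself
        // synchronizes competing initializers.
        if self.state.load(Ordering::Relaxed) == 0 {
            f();
            // Release publishes the closure's effects to later Acquire loads.
            self.state.store(1, Ordering::Release);
        }
    }
}
```

Nothing here needs a total order across unrelated atomics, which is all SeqCst adds over Acquire/Release.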

@rust-highfive
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @nikomatsakis (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@@ -71,7 +71,7 @@ impl Once {
#[stable(feature = "rust1", since = "1.0.0")]
pub fn call_once<F>(&'static self, f: F) where F: FnOnce() {
// Optimize common path: load is much cheaper than fetch_add.
if self.cnt.load(Ordering::SeqCst) < 0 {
if self.cnt.load(Ordering::Acquire) < 0 {
Contributor


Slightly more efficient would be to use a Relaxed load and only fence(Acquire) if the count is in fact negative.
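A sketch of that suggestion (the helper name is hypothetical; std's Once encodes "initialized" as a negative count):

```rust
use std::sync::atomic::{fence, AtomicIsize, Ordering};

// Do a Relaxed load, and only pay for an Acquire fence on the early-exit
// path, where the initialized data will actually be read.
fn is_initialized(cnt: &AtomicIsize) -> bool {
    if cnt.load(Ordering::Relaxed) < 0 {
        // Upgrade the Relaxed load: this fence, combined with the Release
        // operation performed by the initializing thread, establishes the
        // same happens-before edge an Acquire load would.
        fence(Ordering::Acquire);
        true
    } else {
        false
    }
}
```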

Member


I disagree; in this case you would want the acquire to be on the load itself. This is because some architectures (ARM, AArch64) have a load-acquire instruction that is much faster than the fence instruction. Since this branch is going to be taken in the majority of cases, you would want to avoid the fence overhead here.

@gereeter
Contributor

Memory orderings are tricky enough that I think that having comments explaining why the more relaxed bounds are correct would be very useful.

@@ -102,11 +102,11 @@ impl Once {
// calling `call_once` will return immediately before the initialization
// has completed.

let prev = self.cnt.fetch_add(1, Ordering::SeqCst);
let prev = self.cnt.fetch_add(1, Ordering::AcqRel);
Contributor


This can be Relaxed with a fence(Acquire) in the case where the count is negative: since swap and fetch_add are atomic, no extra synchronization is needed to guarantee that the number of threads that pass through the fetch_add seeing a positive integer is exactly the number that will be read as the result of the swap.
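The same idea applied to the fetch_add might look like this (a sketch with a hypothetical helper name; the RMW itself only needs to be atomic, and Acquire ordering is only needed on the path that observes initialization already done):

```rust
use std::sync::atomic::{fence, AtomicIsize, Ordering};

// Returns true if initialization had already finished (negative count),
// false if this thread must participate in initialization.
fn register_caller(cnt: &AtomicIsize) -> bool {
    let prev = cnt.fetch_add(1, Ordering::Relaxed);
    if prev < 0 {
        // Initialization already finished; synchronize with the Release
        // operation that published its results before returning early.
        fence(Ordering::Acquire);
        return true;
    }
    false
}
```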

@AlisdairO
Contributor Author

@gereeter Thanks a lot for the thorough review - really made me think about/learn more about the use of relaxed ordering, where I'd just been concerned with the sequential consistency aspect. Made changes which hopefully address your thoughts.

@AlisdairO
Contributor Author

Ah, looks like there's an issue filed for this: #27610

@gereeter
Contributor

@Amanieu (continuing out of a now hidden comment)

Pro load(Acquire):

  • You are correct that having a fence inside the branch will prevent the use of a load-acquire instruction on ARMv8 and AArch64. Moreover, PowerPC and ARMv7 are forced to use a data barrier instead of an instruction barrier after a branch.
  • The common case requires the Acquire, and we should optimize that case even at the expense of the less common case.
  • The use of a mutex and several atomic operations swamps the overhead introduced by a single superfluous lightweight barrier.

Pro load; fence(Acquire):

  • ARMv8 and AArch64 also introduced the lighter dmb ld fence, so the overhead of using a separate fence should not be too large.
  • A separate fence is more descriptive, both to a reader and to the compiler, as it more accurately captures what synchronization is needed.
  • Related to the previous point, a sufficiently smart compiler could do more to optimize a separate fence because it restricts ordering less. Note that LLVM is nowhere near this point.

Neither of the two options would even allow LLVM to produce the optimal code on PowerPC and ARMv7, which would be an instruction sync only in the branch for the early exit. The fence would prevent LLVM from using an instruction sync at all, while the load(Acquire) would force there to be an instruction sync on both branches.

Given all that, I'm inclined to support switching back to a load(Acquire) along with a comment explaining the reasoning. This choice optimizes for the common case and the more modern processor.

@AlisdairO
Contributor Author

@gereeter @Amanieu thanks - I've altered the first fence back into a simple load(Acquire).

@alexcrichton
Member

I'm personally very wary of relaxing any orderings in the standard library, especially if we're coming up with the design ourselves. These are notoriously hard to get right unfortunately. Along those lines, do you have any benchmarks to show the impact of these relaxed orderings? It would be good to get a handle on what sort of runtimes we're talking about. I suspect x86 will benefit, but the greatest benefit is likely from ARM.

I also suspect that we only really need to relax one ordering in this function to get any real benefit, which is the first one. None of the orderings really have any contention which needs to be super fast, so I would personally prefer to see what our best perf increase would be and then move as much as possible back to SeqCst while retaining the same perf wins.

@AlisdairO
Contributor Author

@alexcrichton The benefit of relaxing SeqCst for the first instruction will likely be zero on x86 - the extra barrier for SeqCst comes on the write. Acquire/Release comes for free on x86, while the extra fence for sequential consistency adds (IIRC) about 100 cycles to each write.

I can understand a fear of using Ordering::Relaxed - I do tend to think it can leave code harder to change safely. On the other hand, I'd argue against insisting on full sequential consistency everywhere in the standard library that isn't absolutely performance critical. The difference between sequential consistency and acquire/release is not complicated for simple code like this, and given the broad scope of usage of standard library constructs, it seems a shame to spend cycles on obviously pointless work - even if the overall benefit is pretty minor.
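The cost asymmetry on x86 can be illustrated with a pair of trivial functions (a sketch; the exact codegen depends on the compiler and target, but the comments reflect the typical lowering):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// On x86, a Release store typically compiles to a plain `mov`, while a
// SeqCst store needs an `xchg` (or `mov` plus `mfence`) - that fence is
// the ~100-cycle cost mentioned above. Loads, by contrast, are a plain
// `mov` for both Acquire and SeqCst on x86.
pub fn store_release(x: &AtomicUsize) {
    x.store(1, Ordering::Release); // mov
}

pub fn store_seqcst(x: &AtomicUsize) {
    x.store(1, Ordering::SeqCst); // xchg: the expensive part
}
```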

@alexcrichton
Member

To me at least these orderings seem pretty non-trivial. I've found that to reason through these you basically have to reason about all orderings with respect to other orderings, and there's quite a few going on here.

In terms of performance I think we'll only really get wins by fixing the first few (the ones that check to see if the initialization has already run and finished). Only modifying those would be much easier to reason about (to me at least)

@AlisdairO
Contributor Author

Yeah, I don't by any means intend to say that Acquire/Release vs sequential consistency is always trivial, just that it is in this particular case: if we ignore Relaxed and use Acquire/Release for everything, all the synchronization necessary to guarantee visibility of the results of the closure call occurs on the single self.cnt variable. That being the case, sequential consistency is irrelevant, because SeqCst only adds guarantees over AcqRel when at least two different atomic variables are modified by different threads.
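For context on the "two different atomic variables" point: the smallest litmus test where SeqCst matters is the store-buffering pattern. A sketch (the function is illustrative, not from the PR):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Two atomics written by two different threads. With SeqCst on all four
// operations there is a single total order, so r1 and r2 cannot both be
// false; with only Acquire/Release that outcome would be permitted.
// A Once that synchronizes on one variable never has this shape.
static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);

fn store_buffering() -> (bool, bool) {
    X.store(false, Ordering::SeqCst);
    Y.store(false, Ordering::SeqCst);
    let t1 = thread::spawn(|| {
        X.store(true, Ordering::SeqCst);
        Y.load(Ordering::SeqCst)
    });
    let t2 = thread::spawn(|| {
        Y.store(true, Ordering::SeqCst);
        X.load(Ordering::SeqCst)
    });
    (t1.join().unwrap(), t2.join().unwrap())
}
```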

All that said, any performance gain from this is obviously going to be too minor to quibble extensively about, so I'm of course perfectly happy to change this so that it just uses Acquire on the early exit checks. I'm busy tonight but I'll sort it out as soon as I can :-).

@alexcrichton
Member

Ok, I think that making the first two operations unconditionally Acquire is still correct in terms of memory orderings, and I suspect that'd get 99.9% of the benefit of the patch. (and would be much easier to reason about).

I'd want to confirm with others still, though.

@Amanieu
Member

Amanieu commented Feb 16, 2016

Since we're talking about optimizing the hot path, I think it would be nice to split the cold path of call_once into a separate function that is marked with #[cold].
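A sketch of that structure (a hypothetical simplified type; the slow path here is not safe for concurrent initializers - it only illustrates the #[cold] split, which keeps call_once small and inlinable while the compiler lays out and optimizes for the fast path):

```rust
use std::sync::atomic::{AtomicIsize, Ordering};

pub struct OnceSketch {
    cnt: AtomicIsize,
}

impl OnceSketch {
    pub const fn new() -> OnceSketch {
        OnceSketch { cnt: AtomicIsize::new(0) }
    }

    pub fn call_once<F: FnOnce()>(&self, f: F) {
        // Fast path: already initialized (negative count in this sketch).
        if self.cnt.load(Ordering::Acquire) < 0 {
            return;
        }
        self.call_once_slow(f);
    }

    #[cold]
    fn call_once_slow<F: FnOnce()>(&self, f: F) {
        // A real implementation would arbitrate between racing callers
        // here; this sketch just runs the closure and marks completion
        // with a Release store.
        f();
        self.cnt.store(-1, Ordering::Release);
    }
}
```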

@alexcrichton
Member

I would be ok doing so assuming that benchmarks are shown that it's a sizable improvement.

@nikomatsakis
Contributor

r? @alexcrichton

I am not familiar enough with this code to have a strong opinion, and don't care to become so. ;)

cc @aturon

@arthurprs
Contributor

+1 for cold path split

@bors
Contributor

bors commented Mar 27, 2016

☔ The latest upstream changes (presumably #32325) made this pull request unmergeable. Please resolve the merge conflicts.

@huonw huonw added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Apr 26, 2016
@huonw
Member

huonw commented Apr 26, 2016

This has been stuck in limbo for a while, so @rust-lang/libs should discuss it. I'm in favour of optimising it, but we should definitely be sure it is correct. I imagine there's other implementations of once (or similar) that use weaker orderings which we could consult with.

As @alexcrichton says, it seems like we could at the very least just weaken the first ordering (together with its associated Release) to get the main fast path fast.

@Amanieu
Member

Amanieu commented Apr 27, 2016

The big issue here is that we need to construct a Mutex object and then destroy it when it is no longer needed. I did a quick survey of existing implementations of pthread_once and __cxa_guard_acquire. They generally use one of two approaches, both of which avoid this issue:

  • Use a global mutex. This prevents separate initializers from running concurrently, but this doesn't seem to be much of a problem in practice. The mutex is recursive to avoid deadlocks.
  • Use a futex, which doesn't require any initialization or destruction. Similar functionality is available on Windows with SRWLock; however, I don't think there is an equivalent on OSX.

@alexcrichton
Member

@Amanieu note that after #32325 we no longer have a mutex at all, so that concern is essentially moot.

In my opinion the only ordering which needs to change is this one, which can likely be relaxed to Acquire as pointed out here. I believe this will have no impact at all on x86/x86_64, as a SeqCst load is just a mov instruction (I think), but it may have an impact on ARM (where it should be benchmarked).

@AlisdairO
Contributor Author

My apologies, this completely fell off my radar. I shall try to take a look at it this weekend.

@AlisdairO
Contributor Author

OK, looks like it's a pretty simple change at this point. @alexcrichton is correct on the potential performance impact, although alas I have no multi-core ARM machine to test on.

@Amanieu
Member

Amanieu commented May 2, 2016

I ran some tests on ARM & AArch64 machines and the performance of load(Acquire) and load(SeqCst) are identical. However I still think this change is the right thing to do.

@alexcrichton
Member

The libs team discussed this during triage yesterday and the conclusion was that we don't want to merge this at this time. This change unfortunately reduces readability, as the non-SeqCst ordering is unexpected; it won't have a perf impact on x86, and @Amanieu has measured no impact on ARM.

Along those lines, without evidence in favor of this, we'd like to keep the code as it is today, but thanks regardless for the PR @AlisdairO!

@huonw
Member

huonw commented May 5, 2016

Additionally, @alexcrichton looked at the ARM assembly, and apparently it was identical for both Acquire and SeqCst.

@Amanieu
Member

Amanieu commented May 5, 2016

It seems that while load(Acquire) and load(SeqCst) generate the exact same assembly on ARM, there is a difference on PowerPC:

load(Acquire):

    ld 3, 0(3)
    lwsync

load(SeqCst):

    sync
    ld 3, 0(3)
    lwsync

Since sync is a pretty heavyweight barrier, I would expect it to have a significant performance impact.

@alexcrichton
Member

Perhaps, but if we're just fishing for platforms where this makes a difference then we'll of course find such a platform. It still stands that no performance measurements show an improvement and it decreases code readability, so until that state changes we'll likely leave as-is.

@jeehoonkang
Contributor

I have a different opinion on readability. According to this paper, SeqCst load/store instructions are broken. (Disclosure: I'm an author.) There is no way to compile SeqCst accesses as specified in the C11 standard to the Power/ARM architectures. The paper proposes a fix, but it is much weaker and more complicated than expected. So in my opinion, contrary to common belief, one should be very wary of using SeqCst accesses.

@alexcrichton
Member

@jeehoonkang heh I think you're far more well versed on this topic than I, so I'm definitely willing to defer to you!

I was mostly just probing for rationale on #44331 as the rule of thumb seems to be SeqCst is the "most correct" in terms of "hopefully it's just a compiler bug if it goes wrong".
