
Add some support for atomic operations on mutable fields and Ptrs #37847

Merged: 1 commit merged into master on Jun 2, 2021

Conversation

@vtjnash (Member) commented Oct 1, 2020

This starts much of the work of implementing Julia Atomics Manifesto for fields. In particular, it adds:

  • the @atomic macro
  • the ability to specify a field of a mutable object as atomic
  • the ability to specify memory orderings to getfield, setfield!, and isdefined (with error checking and codegen optimization)
  • the runtime Intrinsics for doing our standard set of atomic operations on Ptr, plus fences
  • the generic function wrappers for all operations (e.g. atomic_getproperty)
  • doc strings for new functions
  • support for atomics (pointers and fields) of up to 8 bytes in size that are guaranteed to be compatible with C (on some processors, these will imply locks managed by libc), and unlimited sizes of atomics for fields (which will imply the creation of a lock in the object—currently at the start)
  • most @atomic macro special forms and field declarations
  • swapfield!, mutatefield!, and replacefield! functions
  • renamed cmpswap to replace

It does not yet add (which can be future work for a later PR):

  • optimizations (codegen) for the new Ptr Intrinsics
  • optimizations for the swapfield!, mutatefield!, and replacefield! functions
  • the lowering support for @atomic specifiers on variables (closure or global)
  • updates to Threads manual (which currently is just the same as the above link)
  • does not safely handle GC during field loads while locked (deadlock hazard) (fixed)
  • model modifyfield!_tfunc better (with op call)

closes #29943, closes #32455, closes #37683

(Resolved review threads on base/Base.jl, base/expr.jl, src/julia-syntax.scm, and src/intrinsics.cpp)
@JeffBezanson (Member):

Looks great! It's a lot of stuff, but really pretty straightforward.

(Resolved review thread on src/builtins.c)
Keno added a commit that referenced this pull request Nov 24, 2020
On x86 loads have implicit acquire semantics, but the same is
not true on other architectures (e.g. aarch64). We've been
observing test failures on aarch64 as a result (#38555).
Currently, we don't really have a good way to modify fields
in an atomic way (pending #37847), so this just achieves
the same thing using llvmcall to try and get the tests passing.
Keno added a commit that referenced this pull request Nov 25, 2020 (same commit message as the Nov 24 commit above)
@vtjnash force-pushed the jn/intrinsic-atomics branch 2 times, most recently from 258f06f to 8372038, on March 13, 2021 04:39
(Resolved review threads on base/docs/basedocs.jl and src/ast.c)
@vtjnash force-pushed the jn/intrinsic-atomics branch 3 times, most recently from f56ddb3 to 2b70189, on April 3, 2021 03:48
@tkf (Member) left a comment:

Thanks for keeping track of the discussion in the Manifesto document. I just went through the docs and had a look at the code a bit.

Misc questions

Release barrier for each constructor

Motivation section of Julia Atomics Manifesto mentions that

This is particularly relevant for ensuring the type-tag of the object is always valid and that the fields of immutable objects (or simply constant fields) always are seen to have their correct value. This implies a release barrier must exist between the end of the constructor and the first time the object is stored to a shared location from a given thread.

Isn't it sufficient to use sequentially consistent ordering for the task enqueue/dequeue? Since any non-atomic data transfer between tasks is done via the closure stored in the Task, which in turn is transferred through the task queue, a sequentially consistent store in the task queue seems to be sufficient to guarantee the visibility of the memory operations prior to spawn. It sounds like a much cheaper (presumably free) strategy than requiring a release fence after every constructor, which can get in the way of the compiler.

Weak atomics and boxed fields

TODO: however, that still may leave open the question of whether @atomic :relaxed a.x += 1 should get upgraded to include a release barrier if we decide not to allocate x inline.

I was actually writing a comment on exactly this point (although I suppose you need acquire on the read side as well). But I wonder if it should be considered undefined behavior and implemented as a compile-time error, since it likely indicates a misuse of the API. Declaring this undefined behavior leaves us an opportunity to implement auto-promotion afterwards.

(Side note: if we document that this is detected as an error, adding auto-promotion is a breaking change. Thus, we need to either not document it or document that it is undefined behavior if we want to define it later.)

Atomic fields can be read without specifying an ordering (or specifying ":not_atomic"), which will automatically upgrade it to a ":relaxed" read. This is useful for generic code (like show_any).

For the same reason, this is not enough when the field is boxed, since the store of the pointer to a heap-allocated object can be reordered with respect to (potentially temporarily invariant-breaking) manipulations "prior" to the store.

(But I don't think we can make something like show_any race-free in general anyway, since any thread can be breaking the internal invariants of a mutable object; e.g., adding non-zeros to a sparse array.)

API usability for writing lock-free data structures

When implementing a lock-free hash table, I found a couple of atomics use cases that I'm not sure how to fit within the API proposed in this PR. Some more notes to the comments below are in JuliaConcurrent/ConcurrentCollections.jl#4

Multi-field atomics

There was a case where I needed to use atomic load on each field and CAS on two fields. This is for avoiding cmpxchg16b for just loading the fields. Schematically, I think it can be mapped to the API of this PR like this:

mutable struct SixteenBytes
    @atomic a::Int64
    @atomic b::Int64
end
x = SixteenBytes(0, 0)
a = @atomic x.a
b = @atomic x.b
cmpswapfield!(x, (:a, :b), ===, (a, b), (2, 3))  # hypothetical API use

Or is it better/required to declare it at the field level (for alignment control)?

mutable struct SixteenBytes
    @union begin  # something like C/C++'s union (hypothetical API)
        @atomic ab::Tuple{Int64,Int64}
    end begin
        @atomic a::Int64
        @atomic b::Int64
    end
end
x = SixteenBytes(0, 0)
a = @atomic x.a
b = @atomic x.b
cmpswapfield!(x, :ab, ===, (a, b), (2, 3))

This API may not be needed for the first version. But I think it's worth starting to consider it now, to make adding it later easier.

Defining order-generic user function

Some user functions defined in terms of atomic operations can also be used with different or non-atomic orderings. For example, this would be useful in non-atomic batch construction of a hash table. To support such use cases, it would be useful if the ordering flags were singletons, so that users don't have to worry about constant-propagation heuristics.

It is possible for users to define their own singletons that are manually mapped to the symbols at use sites. However, it does not look like Julia Atomics Manifesto's Alternatives Considered section explicitly considers this pattern and rejects it. Although the simplicity argued for by the Manifesto is appealing, defining singletons for each code base may introduce cognitive overhead.

Also, I think it's an example where the point mentioned in the Manifesto

Atomic-declared fields can only be written with atomic orderings (and vice versa).

can make it harder to write non-blocking data structures.

However, an unsafe adapter interface that wraps field get/set could be enough for such use cases. It may even be possible to prototype/experiment with this outside Base if the representation of @atomic fields is straightforward enough.

Atomic getindex/setindex!

Future work section of the Manifesto mentions that

However, there are additional complexities with this that make it seem rare a user would be using atomic operations on individual entries of an array, both due to the algorithmic complexity of such usage will likely prefer towards wrapper objects that define locks, and the rate of cache misses would be likely also to kill performance of such an algorithm.

I think adding atomic getindex/setindex! actually enables interesting programming opportunities. For one thing, a concurrent dictionary I implemented already required atomic getindex/setindex!. It would be useful for implementing other array-based lock-free data structures such as array-based queues.

(Resolved review thread on base/Base.jl)
Comment on lines +507 to +533
l, r = esc(ex.args[1]), esc(ex.args[2])
return :(getproperty($l, $r, $order))
Member:

Can we have a guarantee that an @atomic call is lowered to a single atomic API call and nothing else? That is to say, disallow invocations such as

@atomic a.b.c
@atomic list.head += list.tail.tail.tail.head
@atomic a.x, b.x = b.x, a.x

...? The asymmetry in the lowering of the last expression is especially concerning to me. I think allowing composite expressions can give a false impression and potentially promote misuse that assumes everything inside @atomic is done in a transactional manner.

Since allowing more expressions inside the macro is backward compatible (unless detection of the error is part of the API guarantee), I think it'd be better to allow lowering to only one call inside @atomic, at least in the first version of the API. Also, I think it'd be better to mechanically verify this invariant (= arguments are all symbols or constants).

Author:

I suppose so, but I don't generally like to create artificial limits on what expressions people can write. There are just too many holes left behind by static limitations (like syntax) that I didn't consider it worthwhile here. For example, a tuple might be a constant too, or x + 1, and so on. And I think that the useful examples are more numerous than that.

Member:

I agree something like @atomic a.x += x + 1 and @atomic a.x = (0, 0) can be handy. But I see @atomic as an expert feature, so I feel starting from a restrictive API makes sense. Since a line with @atomic will be a very delicate part of the code, keeping the expression as minimal as possible can promote readability.

Adding to my example above, it's also not clear to me what @atomic a.x += 2 * a.x should do. It's conceivable (= I'm not suggesting it) to lower this to

op(x, _) = x + 2 * x
modifyproperty!(a, :x, op, nothing, ...)

which in turn is compiled to a CAS loop.

Author:

Yeah, that seems a bit too magical to scan for in expressions. If you want it, you can write it with anonymous functions:

@atomic a.x = ((x, _) -> x + 2 * x)(a.x, nothing)

Otherwise, you just need to observe the rule that each atomic operation should appear only once in the original expression. Since a.x appeared twice, that read will happen twice. Both atomically, but twice (one sequentially_consistent, since it pairs with the store, and the other relaxed, since it doesn't have an ordering, either in the original expression or from a simultaneous store).

Member:

Otherwise, you just need to observe a rule that each atomic operation should appear only once in the original expression.

My point is that it sounds better to mechanically verify this at macro-expansion time.

Since a.x appeared twice, that expression will happen twice.

Yeah, but since this is not the case for @atomic a.x, z = y, a.x, I think it's hard to remember and be sure when a.x will be pattern-matched and "fused".

@vtjnash (Author) commented May 11, 2021

It would be better to ask those questions either on the document itself or in the minutes, as GitHub quickly swallows discussion. However:

Since any non-atomic data transfer between the tasks

No, that is still an atomic data transfer, and probably not the most common way to transfer data.

much cheaper than

I used to have a PR for this, since it is basically free on Intel CPUs.

implement auto-promotion afterwards

Auto-promotion to a stronger ordering is always fine. It happens most of the time in all compilers, since often there isn't a hardware instruction for the exact behavior specified by the standard. Doing so doesn't trigger UB or concerns of UB.

this is not enough when the field is boxed since the store of the pointer to a heap-allocated object can be reordered with respect to (potentially temporary invariance-breaking) manipulations "prior" to the store

Those are write-side issues, while here we're discussing read-side behaviors. Your intuition about missing invariants is spot-on, and is why this upgrades it merely to a relaxed read and not a sequentially consistent one. I think this is mentioned? The write side is already required to specify the ordering, so the concerns you mention are already required to be handled. Sure, you can specify them incorrectly, but then that is a problem for all reads and other writes. And precisely how simple we make the write-side defaults is the topic discussed elsewhere, as you noted earlier about the motivation section.

Multi-field atomics

This seems much too complicated, as this example isn't showing a multi-field atomic, but an atomic partial read of a single field. A compiler is legally allowed to avoid reading bytes which aren't observable, so it should be demoting the atomic load + extract into a smaller atomic load automatically. If we see that this optimization isn't happening now, we can implement it later.

it does not look like Julia Atomics Manifesto's Alternatives Considered section explicitly considers this pattern and reject it

Did you mean to say it does consider this and reject it? This is the first bullet point under the first heading in that section.

Atomic-declared fields can only be written with atomic orderings

As far as I know, this requirement is also still currently mandatory for C++ and C also. There was some discussion of relaxing it, e.g. in http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3710.html, since it would permit some algorithms that are currently illegal to specify, but I don't know if any consensus was reached.

Atomic getindex/setindex

Yes, these may sometimes be useful; that is why the document specifies they are out of scope of the current discussion, aside from noting that we don't anticipate anything here will interfere with or complicate their later development.

@vtjnash marked this pull request as ready for review on May 11, 2021 21:16
@tkf (Member) commented May 12, 2021

Personally, I find it difficult to find old comments in Google Docs. It's also hard to write code, format comments, and include images there. If I link back to these comments from the Google Docs document, would that solve the discoverability issue? Otherwise, if you prefer inline comments on the Google Docs document anyway, I can move my comments there next time.

much cheaper than

I used to have a PR for this, since it is basically free on Intel CPUs.

This is true for the CPU, but I believe not for the optimizer, which also has to obey the C/C++ (≈ LLVM) memory model. That's why I said "which can get in the way of the compiler." A release fence can impede the optimizer: https://godbolt.org/z/3n1a5zv46

It happens most of the time in all compilers, since often there isn't a hardware instruction for the actual behavior specified by the standard.

I agree that it's safe to compile to a stronger operation in the code the user writes. However, I think promoting @atomic :relaxed a.x += 1 is doing something different. The CAS of the pointer at a.x has hardware support, and there is no need to promote the instruction just to execute the user code (at the LLVM IR level) as-is. That is to say, we don't get a lock from this code https://godbolt.org/z/Me1G4cnbf

#include <atomic>

struct S {
    std::atomic<int*> p;
};

void f(S &s, int *p) {
    s.p.store(p, std::memory_order_relaxed);
}

IIUC, the auto-promotion you are describing that all compilers do is based solely on the size of the field (e.g., std::atomic<LargeStruct> a;). On the other hand, the promotion of @atomic :relaxed a.x += 1 discussed here requires understanding the representation of the type of the field.

It doesn't trigger UB or concerns of UB to do.

Just to be clear, I'm not talking about UB on the LLVM side, if that's what you are referring to. I was trying to note that if we want to define the behavior of a certain condition X later, we need to declare that it's the user's responsibility to avoid that condition X. That is to say, not-X has to be (a part of) the precondition of the API. Relaxing the precondition is always valid.

Those are write-side issues, while here we're discussing read-side behaviors.

We can't discuss them separately, right? Atomic operations always concern both the read and write sides (except for relaxed ordering). In this case, don't you need at least acquire on read and release on write?

But if we want safe generic code by default, I don't think there is a good way to handle atomic fields containing pointers. I think it'd be better to ignore such fields in generic code in this case. I agree that handling isbitsunion types with a relaxed load is still data-race-free.

This seems much too complicated, as this example isn't showing a multifield atomic, but an atomic partial read of a single field.

In the sample code, I tried to exemplify two single-field ("partial") reads but also a single two-field CAS. The two fields are loaded separately to avoid auto-promotion to cmpxchg16b.

Atomic-declared fields can only be written with atomic orderings

As far as I know, this requirement is also still currently mandatory for C++ and C

Doesn't atomic_ref of C++20 let you mix atomic and non-atomic operations on the same location? https://en.cppreference.com/w/cpp/atomic/atomic_ref

Also, Lahav et al. (2017) (which is the basis of C++20 atomics standard) has many examples of valid executions starting with non-atomic write (W^na), e.g.,

(figure omitted: example execution graphs from the paper)

I think initialization with non-atomic operations is a pretty useful (and valid) optimization. I don't remember if they discuss forbidding non-atomic operations after atomic operations on the same location. It sounds like an unnecessary restriction, as far as the memory model is concerned (it could be nice as an API design, though).

@vtjnash (Author) commented May 12, 2021

Release fence can impedes the optimizer

Yes, fences are entirely different beasts and really not at all comparable. Your example code is also missing the branches normally required at that load (null check) and store (write barrier), and is thus not relevant to the code fragment that would be affected here. As you noted, this is still a TODO item for later discussion, and not yet finalized for design (and therefore not yet part of this PR).

@atomic :relaxed a.x += 1

If .x is ::Int, then x86 would already convert this to LOCK XADD, IIUC. That LOCK prefix is TSO (total store order), which is much stronger than required for "relaxed" (IIRC, it is between sequentially_consistent and acquire_release). Other CPUs won't be this strong. And if .x is ::Any, then it will already be LOCK CMPXCHG. So the question is generally whether we should force other platforms and the compiler to match the behavior of x86. But currently I'm avoiding answering that, by avoiding promising anything.

don't you need at least acquire on read and release on write

Without a garbage collector, yes, but with a garbage collector, I don't believe that the problematic code pattern should be possible to write.

@vtjnash (Author) commented May 12, 2021

let you mix atomic and non-atomic operations on the same location

No, the page you linked notes that mixing is explicitly prohibited: "While any atomic_ref instances referencing an object exists, the object must be exclusively accessed through these atomic_ref instances"

I think initialization with non-atomic operation is a pretty useful (and valid) optimization

Yeah, I don't think I mentioned this optimization much, though it is implemented here. It should also be a valuable contribution (if not already implemented in LLVM) to weaken atomic declarations to non-atomic when the memory is known not to alias any memory accessible by another thread (for example, during initialization).

(Resolved review thread on base/expr.jl)
@tkf (Member) commented May 12, 2021

Release fence can impedes the optimizer

Yes, fences are entirely different beasts and really not at all comparable.

My bad, your wording was "release barrier." But a release store also impedes the optimizer anyway: https://godbolt.org/z/Masch6e9E

I guess this example is still technically misleading since, IIUC, the compiler is still allowed to move code into critical sections and use the vectorized instruction if it really wants to. But it'd be strictly disabled if there were a sequentially consistent load of an unrelated location in the loop body, while such loads alone can still allow the optimization (by the same argument that moving code into critical sections is OK).

don't you need at least acquire on read and release on write

Without a garbage collector, yes, but with a garbage collector, I don't believe that the problematic code pattern should be possible to write.

GC is great for avoiding problems like ABA but is it enough? For example

x = Dict()
merge!(x, pairs(1:64))  # internally breaks invariants temporarily
@atomic a.x = x

seems to require a release store, as otherwise we can't be sure x is well-formed on the load side. I don't think GC can help ensure user-defined invariants.

No, the page you linked notes that mixing is explicitly prohibited: "While any atomic_ref instances referencing an object exists, the object must be exclusively accessed through these atomic_ref instances"

I think the important point is "While any atomic_ref instances referencing an object exists." The documentation also says "For the lifetime of the atomic_ref object, the object it references is considered an atomic object." My reading is that this is allowed:

struct Counters { int a; int b; } counter;
non_atomic_use(counter);
{
    std::atomic_ref<Counters> cnt(counter);
    concurrent_atomic_use(cnt);
}
non_atomic_use(counter);

@vtjnash (Author) commented May 12, 2021

Here's a more realistic example, since the compiler would never emit your version (due to GC): https://godbolt.org/z/8jcefMdc8. Notice that some stores got moved around, but overall it is pretty much identical.

seems to require release store

Yes, it does require a release store (assuming we intended that to be visible to another thread), which is why I've proposed elsewhere to make that the default behavior, even if you are using it improperly. The example code you have there includes a sequentially_consistent store, which would be more than sufficient. This isn't the subject of this PR though, so these comments would be more appropriate on the design manifesto or on the PR that proposes it (#35535) to avoid veering off topic.

My reading is that this is allowed

Yes, re-using the memory location for other purposes is allowed in C/C++, as long as the accesses are not mixed (it is to be assumed concurrent_atomic_use does some sort of thread-join before return, and ensures no copies of cnt outlive that block)

@tkf (Member) commented May 14, 2021

Thanks for the example. The optimization I had in mind (sinking the store) won't work if there is a non-inlined function call like wb anyway. I moved the comments to HackMD.

@vtjnash force-pushed the jn/intrinsic-atomics branch 3 times, most recently from 05f8ee1 to 1ebe4a7, on May 28, 2021 20:26
@vtjnash force-pushed the jn/intrinsic-atomics branch 2 times, most recently from 02ac63b to e247459, on May 31, 2021 22:47
@vtjnash added this to the 1.7 milestone on Jun 1, 2021
@JeffBezanson added the "merge me" label (PR is reviewed; merge when all tests are passing) on Jun 2, 2021
@vtjnash merged commit 9f32653 into master on Jun 2, 2021
@vtjnash deleted the jn/intrinsic-atomics branch on June 2, 2021 22:25
@DilumAluthge removed the "merge me" label on Jun 18, 2021
Labels: multithreading (Base.Threads and related functionality)
Projects: none yet

Successfully merging this pull request may close these issues:

  • Atomic operations on array elements
  • Support Threads.Atomic{Ptr}

4 participants