
RFC: Reduce atomics macro API contracts for future extensions #41269

Closed
wants to merge 1 commit

Conversation

tkf
Member

@tkf tkf commented Jun 18, 2021

Currently, @atomicreplace a[1].x expected => desired is equivalent to

tmp = a[1]
@atomicreplace tmp.x expected => desired

However, it forbids future extensions such as lowering

@assert sizeof(K) == sizeof(V) == sizeof(UInt64)
xs = Vector{@NamedTuple{key::K, value::V}}(...)
k = @atomic xs[i].key

to a 64-bit (not 128-bit) load.

As I explained in #37847, this kind of "torn read" is a common technique when dealing with 128-bit entries (e.g., https://doi.org/10.1145/3309206 p. 6 and https://doi.org/10.1145/2442516.2442527 Fig. 3, line 37). It looks like this is discussed as "Tearable Atomics" in C++: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0690r1.html
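To make the "torn read" technique concrete, here is a minimal C++ sketch (names and layout are my own illustration, not code from the linked papers): each 128-bit entry is represented as two independently atomic 64-bit words, so reading just the key is a single 64-bit relaxed load, and reading both fields with two separate loads may observe halves from different logical writes.

```cpp
#include <atomic>
#include <cstdint>
#include <utility>

// Hypothetical 128-bit entry stored as two independently atomic 64-bit words.
struct Entry {
    std::atomic<uint64_t> key{0};
    std::atomic<uint64_t> value{0};
};

// Load only the key: one 64-bit relaxed load, instead of a 128-bit atomic
// load of the whole entry.
inline uint64_t load_key(const Entry& e) {
    return e.key.load(std::memory_order_relaxed);
}

// Load both halves with two independent 64-bit loads; the observed pair may
// mix values from different logical writes -- that is the "torn" part.
inline std::pair<uint64_t, uint64_t> torn_load(const Entry& e) {
    return {e.key.load(std::memory_order_relaxed),
            e.value.load(std::memory_order_relaxed)};
}
```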

Other expressions such as x.y.z and f(x).y also have a possible well-defined interpretation as a memory location, using the notion of a lens à la Accessors.jl.

The point of this PR is not to discuss whether the above suggestions are adequate (although I understand that dismissing them as out of scope for Julia is an unfortunate but possible outcome).

Rather, I suggest avoiding premature expansion of the API contracts and leaving these expressions unsupported, so that they remain available for possible extensions in the future.

@tkf tkf added the multithreading Base.Threads and related functionality label Jun 18, 2021
@vtjnash
Sponsor Member

vtjnash commented Jun 18, 2021

However, it forbids future extensions such as lowering

@assert sizeof(K) == sizeof(V) == sizeof(UInt64)
xs = Vector{@NamedTuple{key::K, value::V}}(...)
k = @atomic xs[i].key

It forbids it because those are not quite equivalent syntactic expressions, not because it would be invalid to use a 64-bit load. In particular, according to the C memory model, it sounds like you are thinking of the definition of volatile (which constrains the number of bytes loaded, and is not exposed in Julia), which is distinct from atomic (which constrains the happens-before ordering of the bytes loaded, and is the operation defined here).

As I also asserted in #37847, this already seems to be a legal way to describe a 64-bit atomic load:

k = (@atomic xs[i]).key

@tkf
Member Author

tkf commented Jun 18, 2021

I don't think volatile is relevant, because volatile in the C/C++ sense has nothing to do with atomics.

If you meant to point out that this is not incorporated into the C/C++ memory model yet, I agree. That's why I linked the ongoing discussion in the C++ committee. But my point is that it is a conceivable extension, and the current API does not allow adding it later in a non-breaking manner.

k = (@atomic xs[i]).key

This is a 128-bit load followed by field extraction of the first 64-bit component. It requires cmpxchg16b on x86-64 (https://godbolt.org/z/Mb3GTjzGj), even for a relaxed load.

It forbids it because those are not quite equivalent syntactical expressions

Expressions inside a macro can (and very commonly do) have non-standard semantics. I think it's reasonable to keep the opportunity to provide concise expressions for the useful atomic operations.

@vtjnash
Sponsor Member

vtjnash commented Jun 19, 2021

Huh, that seems like an LLVM problem. Even for

  %unused = load atomic i64, i64* %2 unordered, align 8

it refuses to eliminate that dead load instruction. But there is no happens-before or other ordering (it is even nosync) that requires it to exist.

@vtjnash vtjnash added backport 1.7 design Design of APIs or of the language itself labels Jun 19, 2021
@tkf
Member Author

tkf commented Jun 20, 2021

The C++ proposal I linked seems to insist that this is not optimizable:

struct alignas(sizeof(intptr_t) * 2) Node {
    intptr_t l, r;
};

Node action(const Node& old);

void do_action(std::atomic<Node> *n) {
    Node old = n->load(std::memory_order_relaxed); // Relaxed loads can’t tear.
    while (!n->compare_exchange_weak(old, action(old), std::memory_order_release,
                                     std::memory_order_acquire))
      ;
}

In this example, all lock-free operations (including load / store) must be implemented as a compare-and-exchange or load-linked / store-conditional.

--- http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0690r1.html

But re-reading your comments makes me realize that I don't understand exactly why this is the case.

@vtjnash
Sponsor Member

vtjnash commented Jun 20, 2021

The C++ proposal explains at the bottom that this is not a valid optimization (or, in fact, a valid portion of the proposal) since action(old) is unspecified. If it were pure and didn't dereference the argument, the optimization would be valid without the proposed changes to the spec. The other part of the document (the seqlock) is more interesting, as it may provide an improvement over the full lock used currently, and it might not be clear how to express it currently (particularly to TSan).

I also thought aligned AVX instructions would probably give a non-tearable read, but I can understand the reluctance to assume that is promised architecturally, and to instead use cx16.
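For reference, the seqlock pattern mentioned above can be sketched as follows. This is a minimal single-writer version of my own, not code from the proposal; the payload fields use relaxed atomics so the speculative reads that get retried are still defined behavior in the C++ memory model (making those reads well-defined at all is part of what p0690r1 discusses).

```cpp
#include <atomic>
#include <cstdint>
#include <utility>

// Minimal single-writer seqlock sketch (illustrative names). The writer bumps
// the counter to an odd value while mutating and back to even afterwards;
// readers retry whenever they observe an odd or changed counter.
class SeqLockPair {
    std::atomic<uint64_t> seq_{0};
    std::atomic<uint64_t> key_{0}, value_{0};  // payload, accessed relaxed

public:
    void write(uint64_t k, uint64_t v) {
        uint64_t s = seq_.load(std::memory_order_relaxed);
        seq_.store(s + 1, std::memory_order_relaxed);   // odd: write begins
        std::atomic_thread_fence(std::memory_order_release);
        key_.store(k, std::memory_order_relaxed);
        value_.store(v, std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);   // even: write done
    }

    std::pair<uint64_t, uint64_t> read() const {
        for (;;) {
            uint64_t s1 = seq_.load(std::memory_order_acquire);
            if (s1 & 1) continue;                        // writer active; retry
            uint64_t k = key_.load(std::memory_order_relaxed);
            uint64_t v = value_.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            if (seq_.load(std::memory_order_relaxed) == s1)
                return {k, v};                           // consistent snapshot
        }
    }
};
```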

@tkf
Member Author

tkf commented Jun 20, 2021

Thanks. So, in principle, it seems like the compiler could optimize the cmpxchg16b out of the load if I write this?

intptr_t old_l = n->load(std::memory_order_relaxed).l;
intptr_t old_r = n->load(std::memory_order_relaxed).r;
Node old(old_l, old_r);

But what about tearable stores? It seems hard to express tearable stores

@atomic :monotonic xs[i].key = key
@atomic :monotonic xs[i].value = value

by

p0 = @atomic :monotonic xs[i]
p1 = (key = key, value = p0.value)
@atomic :monotonic xs[i] = p1
p2 = @atomic :monotonic xs[i]
p3 = (key = p2.key, value = value)
@atomic :monotonic xs[i] = p3

(expecting the optimizations).
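For comparison, the two tearable stores above have a direct C++ analogue when the entry is represented as two independently atomic 64-bit words (an assumed layout for illustration, not code from the proposal): each field is written with its own relaxed 64-bit store, with no 128-bit atomic read-modify-write involved.

```cpp
#include <atomic>
#include <cstdint>

// Entry represented as two independently atomic 64-bit words (assumed layout
// for illustration), so each field can be stored separately.
struct KVSlot {
    std::atomic<uint64_t> key{0};
    std::atomic<uint64_t> value{0};
};

// The "tearable" store: two relaxed 64-bit stores. A concurrent reader may
// observe the key from one call and the value from another -- exactly the
// tearing this thread is discussing.
inline void tearable_store(KVSlot& s, uint64_t k, uint64_t v) {
    s.key.store(k, std::memory_order_relaxed);
    s.value.store(v, std::memory_order_relaxed);
}
```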

This is used in the work-stealing deque example. The C++ proposal mentions that the performance advantage "is more likely to be significant" there, which sounds plausible.

But, actually, going back to the main topic in the OP, if we have the direct way to express tearable atomics like

@atomic :tearable xs[i] = (; key, value)  # or `:nonatomic` as in C++

then arguably it's cleaner than assigning .key and .value separately anyway. Assuming the solution in C++ will be something like p0690r1, maybe we don't need to worry about the point I was raising in the OP.

Maybe a more problematic part is that C++'s nonatomic and our LLVM-based name :not_atomic would be easy to mix up, if the C++ committee goes with the plan in p0690r1.

@vtjnash
Sponsor Member

vtjnash commented Mar 11, 2022

This would be a breaking change now.
