[API Proposal]: Volatile barrier APIs #98837
Tagging subscribers to this area: @mangod9

Issue Details

Background and motivation

This API proposal exposes methods to perform non-atomic volatile memory operations. Our volatile semantics are explained in our memory model, but I will outline the tl;dr of the relevant parts here:
Currently, we expose APIs on Volatile for the atomic memory accesses, but there is no way to perform the equivalent operations for non-atomic types.

API Proposal

namespace System.Threading;
public static class Volatile
{
public static T ReadNonAtomic<T>(ref readonly T location)
{
//ldarg.0
//volatile.
//ldobj !!T
}
public static void WriteNonAtomic<T>(ref T location, T value)
{
//ldarg.0
//ldarg.1
//volatile.
//stobj !!T
}
public static T ReadUnalignedNonAtomic<T>(ref readonly T location)
{
//ldarg.0
//unaligned. 1
//volatile.
//ldobj !!T
}
public static void WriteUnalignedNonAtomic<T>(ref T location, T value)
{
//ldarg.0
//ldarg.1
//unaligned. 1
//volatile.
//stobj !!T
}
}

API Usage

An example where non-atomic volatile operations would be useful is as follows. Consider a game which wants to save its state, ideally while continuing to run; these are the most obvious options:
But there is actually another option which utilises non-atomic volatile semantics:
//main thread sets IsSaving to true and increments SavingVersion before starting the saving thread, and sets it back to false once it's definitely done (e.g., on next frame)
//saving thread performs a full memory barrier before starting (when required, since starting a brand new thread every time isn't ideal), to ensure that _value is up-to-date
//memory synchronisation works because _value is always read before any saving information, and it's always written after the saving information
//if the version we read on the saving thread is not the current version, then our read from _value is correct, otherwise our read from _savingValue will be correct
//in the rare case that we loop to saving version == 0, then we can manually write all _savingVersion values to 0, skip to version == 1, and go from there (excluded from here though for clarity)
static class SavingState
{
public static bool IsSaving { get; set; }
public static nuint SavingVersion { get; set; }
}
struct SaveableHolder<T>
{
nuint _savingVersion;
T _value;
T _savingValue;
//Called only from main thread
public T Value
{
get => _value;
set
{
if (SavingState.IsSaving)
{
if (_savingVersion != SavingState.SavingVersion)
{
_savingValue = _value;
//version must be written after the saving value, so the saving thread never sees the new version with a stale _savingValue
Volatile.WriteNonAtomic(ref _savingVersion, SavingState.SavingVersion);
}
//_value can only become torn or incorrect after we have written our saving value and version
Volatile.WriteNonAtomic(ref _value, value); //write must occur after prior operations
}
else
{
_value = value;
}
}
}
//Called only from saving thread while SavingState.IsSaving with a higher SavingState.SavingVersion than last time
public T SavingValue
{
get
{
var value = Volatile.ReadNonAtomic(in _value); //read must occur before subsequent code
//_savingVersion must be read after _value is, so if _value has visibly changed (or is changing) then we will catch it here
if (Volatile.ReadNonAtomic(in _savingVersion) != SavingState.SavingVersion) return value;
return _savingValue;
}
}
}

Alternative Designs
namespace System.Threading;
public static class Volatile
{
public static void ReadBarrier();
public static void WriteBarrier();
}
Risks

No more than other volatile APIs really, other than the lack of atomicity (which is in the name).
I think it is very rare to actually need volatile and misaligned memory together. It would make more sense to cover that scenario by
Another alternative design:

class Interlocked
{
    // Existing API
    void MemoryBarrier();

    // New APIs
    void ReadMemoryBarrier();
    void WriteMemoryBarrier();
}
@jkotas to clarify, do you mean something like |
It's not great that .NET's atomic APIs are split across Volatile and Interlocked. Unfortunate situation all around. Really makes me wish we could slap I guess I lean 51/49 in favor of putting them on
Right, I do not have an opinion about the exact names. The naming is all over the place as you have pointed out. I assume that the only difference between
That is the idea, yes. They just emit the barrier that
@mangod9 @kouvel (or anyone else who can mark it), what needs to be done to get this marked api-ready-for-review?
I have looked at your API usage example - it appears to be either incomplete or buggy. Consider this interleaving of foreground and saving thread:
You are correct - I needed the read/write to version to be volatile also, and the write to version/saving-value to be swapped. I'll update it shortly. I'm pretty sure the volatile operations on value are still needed though. Thanks for noticing that :)
@jkotas was there anything else I need to fix or change?
The area owners should take it from here.
I’d prefer read/write barriers. Ordering accesses that are themselves nonatomic would have complicated semantics. Reasoning about barriers is easier and they can be used to construct “nonatomic” volatile accesses.
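As a rough sketch of that construction (assuming the proposed Volatile.ReadBarrier/Volatile.WriteBarrier shapes; the helper names here are hypothetical, not part of the proposal):

```csharp
// Hypothetical helpers built from the proposed barriers.
// A "volatile" read of a possibly non-atomic T: perform the plain
// (possibly multi-instruction) read, then prevent later operations
// from moving before it.
static T VolatileReadNonAtomic<T>(ref readonly T location)
{
    T value = location;      // plain, possibly torn, read
    Volatile.ReadBarrier();  // reads above complete before later operations
    return value;
}

// A "volatile" write: prevent earlier operations from moving after the
// plain (possibly multi-instruction) write.
static void VolatileWriteNonAtomic<T>(ref T location, T value)
{
    Volatile.WriteBarrier(); // earlier operations complete before writes below
    location = value;        // plain, possibly torn, write
}
```

These give the acquire/release-style ordering of Volatile.Read/Volatile.Write without any atomicity guarantee, which is the non-atomic behaviour the original proposal asked for.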
We shouldn't need to emit a

Volatile.WriteBarrier(); //write barrier would be used on stlur for the first alignof(T) write of the following line
location = value;

And I would expect similar combining for read barrier where possible too. And on X86 it'd be even better assembly code, obviously. Other than that, I'd be 100% happy with the barriers solution, I changed it to the
I also prefer a barrier approach. I assume the barriers would be no-op on x86/x64.
That doesn't seem to be the case from what I can tell. The docs on "Barrier-ordered-before" say the following:
Which seems to suggest that

In any case, from a spec perspective we probably shouldn't specify any stronger ordering guarantees for the volatile barriers than are necessary.
As for this part:
Is that something specific to arm/arm64? I'm not sure at the moment if other archs offer the same guarantee with acquire/release loads/stores. My inclination is to not specify that, unless it's a typical thing to be expected. Anyway, it seems the barriers don't guarantee that behavior, only the acquire/release load/store instructions, so it may not be relevant here.
Optimization that fuses a barrier with the following/preceding ordinary access into an acquiring/releasing access does not work in general. That is why barriers are useful even when releasing/acquiring accesses are also available. Barriers order multiple accesses at once, while an acq/rel access orders just one access against all preceding/following (in program order). Example:
Here R4 is "ordered after" R2 and R1
Now R4 is not ordered with R2 and R1. Thus the optimization is invalid.
Still R4 is not ordered with R1. Same goes for writes.
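An illustrative sketch of the kind of transformation being ruled out (the reads of a..d below are placeholders, not the exact example from the comment):

```csharp
// Four plain reads with a read barrier between them:
{
    var r1 = a; var r2 = b;
    Volatile.ReadBarrier(); // orders the reads of a AND b before everything below
    var r3 = c; var r4 = d;
    // r4 is ordered after both r1 and r2.
}

// Invalid "optimization": fuse the barrier into the next read,
// turning it into a single acquiring read:
{
    var r1 = a; var r2 = b;
    var r3 = Volatile.Read(ref c); // orders only the read of c before later operations
    var r4 = d;
    // r4 is no longer ordered with r1 and r2, so the transformation changed behavior.
}
```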
Why not? I thought that ordering guarantee was exactly what VolatileRead was supposed to offer.
There is an effect on x86/x64. In the current implementation of the
The spec is terse and should be read very carefully to notice that it is not symmetrical. Arm64 does not have a barrier instruction that is ReadWrite-Write. I think RISC-V does.
I think it is another implementation quirk of the arm64 ISA that can be ignored here. What arm64 calls "Acquire" is slightly stronger than what the dotnet memory model specifies for volatile reads. "AcquirePC" is a closer match.
Agreed, I meant there wouldn't be any special instructions for these on x86/x64.
To summarise:
The barriers match semantics of acquire/release/volatile in terms of ordering reads or writes, correspondingly, relative to all operations. The barriers are stronger than acquire/release/volatile in terms of ordering multiple reads or writes, correspondingly, not just a particular single read or write. The actual implementation will depend on underlying platform.
Do ordinary memory operations on x86/x64 guarantee Read-ReadWrite and ReadWrite-Write ordering for acquire/release semantics? I believe stores are not reordered there, but I thought loads could be reordered after an ordinary store (which has release semantics). Wonder if these should just be Read-Read and Write-Write barriers? I see the benefits of the stronger guarantees, but it seems the implementation would not be much more efficient.
Is the main benefit of these APIs for unaligned or larger-than-atomic reads/writes?
One example of existing code using a read barrier could be found here: runtime/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastCache.cs Line 181 in 4cf19ee
In that scenario we do the following reads

The order between

To ensure that the first read happens before the source/targetResult tuple, we just need to Volatile.Read(version). So with only acquires we would have

That is 3 volatile reads. If we can have a read barrier, we can do:
That is one barrier replacing 2 volatile reads. Benchmarks showed that even that is profitable. However there is also a generic version of the same cache. runtime/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/GenericCache.cs Line 165 in 4cf19ee
In the generic version the entry (basically what goes into [ ... ] and needs to be "sandwiched" between two version reads) contains {_key, _value}, which are generic, so reading the entry takes an unknown number of reads.
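The version-sandwich pattern being described might look like this (a simplified sketch with hypothetical field names, not the actual CastCache code):

```csharp
// Simplified version-sandwiched read, hypothetical names:
int version1 = Volatile.Read(ref entry._version); // acquire: payload reads below cannot move above this
var source = entry._source;                       // plain reads of the payload
var result = entry._targetResult;
Volatile.ReadBarrier();                           // payload reads complete before the re-read below
int version2 = entry._version;                    // a plain re-read is now sufficient
bool consistent = version1 == version2;           // if versions match, no writer raced with the payload reads
```

One barrier covers however many payload reads there are, which is what makes it attractive for the generic cache where the entry size is unknown.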
@VSadov I will update it when I'm able later today :)
I do think the semantics of the operations need to be clearly defined. For instance, though it would be unfortunate, the difference indicated here would need to be clearly specified.
It may be a matter of documentation, but we should have a clear understanding of what the aim is for now from the OP.
For the desired semantics, I think we should start with:
The important difference from

The important difference from

The actual implementation will depend on the underlying platform.
My main aim is to enable the API usage I have as an example. It would also be nice if we could fix

Notably, I wouldn't actually need
@VSadov I've updated it, can you double-check that it's fine?
@hamarb123 Looks very good! Thanks! I will add the expected semantics to the proposed entry points. But we will see where we land with those after reviews.
Btw @VSadov, both my example and the runtime's usages only seem to need
An alternative may be to overload
For instance, there are already use cases in
Yes, I noticed. It is a common thing with volatile. While volatile orders relative to all accesses, some cases, typically involving a chain of several volatile accesses where you have just writes or just reads in a row, could use a weaker fence. This is the case in both scenarios that you mention. The main impact of a fence is forbidding optimizations at the hardware level. Fences do not necessarily make the memory accesses cost more. The level of cache that is being used is likely a lot more impactful than forcing a particular order of accesses. Figuring out the minimum strength required would be an even more difficult and error-prone task than figuring out when Volatile is needed. I think going all the way of
I think one datapoint that could be useful for the
The perf differences may be more apparent in memory-intensive situations where the extra ordering constraints would disable some optimizations and impose extra work on the processor / cache. It may be difficult to measure the difference in typical microbenchmarks, though perhaps it would become more apparent by somehow folding in some memory pressure and measuring not just the operation in question but also the latency of other memory operations.
I agree. I think we should start with simple barriers that are aligned with the .NET memory model, and wait for evidence that we need more. It is a non-goal for .NET programs to express everything that is possible. We strike a balance between simplicity and what may be possible in theory.
It is worth mentioning that my use case and the 2 uses of ReadBarrier in the runtime all only require

Edit: I'd still be happy if we just ended up with the ones that matched our memory model, but I'd obviously be more happy if we got the
Looks good as proposed. There was a very long discussion about memory models, what the barrier semantics are, and whether we want to do something more generalized in this release. In the end, we accepted the original proposal.

namespace System.Threading;

public static class Volatile
{
    public static void ReadBarrier();
    public static void WriteBarrier();
}
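As a sketch of how the approved pair composes (a minimal seqlock-style publisher/reader with hypothetical fields; a single writer is assumed, and even versions mean "stable"):

```csharp
// Writer: publish a possibly multi-word payload.
_version++;              // make version odd: write in progress
Volatile.WriteBarrier(); // version bump completes before the payload writes below
_payload = newPayload;   // plain, possibly torn, writes
Volatile.WriteBarrier(); // payload writes complete before the version bump below
_version++;              // make version even: stable again

// Reader: take a consistent snapshot, retrying on a race.
int v1 = Volatile.Read(ref _version); // acquire: payload reads cannot move above this
var snapshot = _payload;              // plain reads
Volatile.ReadBarrier();               // payload reads complete before the re-read below
int v2 = _version;
bool ok = v1 == v2 && (v1 & 1) == 0;  // if not ok, a write raced with the reads; retry
```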
Background and motivation

This API proposal exposes methods to perform non-atomic volatile memory operations. Our volatile semantics are explained in our memory model, but I will outline the tl;dr of the relevant parts here:

unaligned. is used), and either 1) the size of the type is at most the size of the pointer, or 2) a method on Volatile or Interlocked such as Volatile.Write(double&, double) has been called.

Currently, we expose APIs on Volatile for the atomic memory accesses, but there is no way to perform the equivalent operations for non-atomic types. If we have Volatile barrier APIs, they will be easy to write, and it should make it clear which memory operations can move past the barrier in which ways.

API Proposal

Desired semantics:
Volatile.ReadBarrier()

Provides a Read-ReadWrite barrier. All reads preceding the barrier will need to complete before any subsequent memory operation. Volatile.ReadBarrier() matches the semantics of Volatile.Read in terms of ordering reads, relative to all subsequent (in program order) operations. The important difference from Volatile.Read(ref x) is that Volatile.ReadBarrier() has an effect on all preceding reads and not just a particular single read of x.

Volatile.WriteBarrier()

Provides a ReadWrite-Write barrier. All memory operations preceding the barrier will need to complete before any subsequent write. Volatile.WriteBarrier() matches the semantics of Volatile.Write in terms of ordering writes, relative to all preceding (in program order) operations. The important difference from Volatile.Write(ref x) is that Volatile.WriteBarrier() has an effect on all subsequent writes and not just a particular single write of x.

The actual implementation will depend on the underlying platform.
API Usage

The runtime uses an internal API Interlocked.ReadMemoryBarrier() in 2 places (here and here) to batch multiple reads on both CoreCLR and NativeAOT, and it is supported on all platforms. This ability is also useful to third-party developers (such as me, in my example below), but is currently not possible to write efficiently.

An example where non-atomic volatile operations would be useful is as follows. Consider a game which wants to save its state, ideally while continuing to run; these are the most obvious options:

But there is actually another option which utilises non-atomic volatile semantics:
Alternative Designs

We do have IL instructions, but they're currently broken and not exposed, see #91530 - the proposal here was originally to expose APIs for volatile. ldobj and volatile. stobj + the unaligned variants (as seen above), and fix the instructions (or implement these without the instructions and have the instructions call these APIs - not much of a difference really). It was changed based on feedback to expose barrier APIs, which can provide equivalent semantics, but also allow additional scenarios. It is also clearer which memory operations can be reordered with the barrier APIs.

Unsafe instead:

volatile. - initblk and cpblk, people may have use for these also:
Open Questions

There is a question as to whether we should have Read-ReadWrite / ReadWrite-Write barriers or Read-Read / Write-Write barriers. I was initially in favour of the former (as it matches our current memory model), but now think the latter is probably better, since there are many scenarios (including in my example API usage, and the runtime's uses too) where the additional guarantees provided by the former are unnecessary, and thus may cause unnecessary overhead. We could also just provide both if we think they're both useful.

Risks

No more than other volatile/interlocked APIs really, other than potential misunderstanding of what they do.