- "Basically, they already know 'this step squeaks.'"
We previously made a few decisions deliberately to test them out in a prototype. It's been used internally and externally and we've gathered some feedback and have some recommended changes.
The preview implementation allows regular string literals "..."
to be target typed to the Utf8 representation types. This has several problems:
First of all it doesn't tell you that the implicit conversion causes a Utf8 encoding; the different string encoding is silently adopted.
More problematically it is a significant breaking change: There are in fact quite a lot of methods with overloads for both string
and e.g. ReadOnlySpan<byte>
, and existing calls may now either fail or silently pick a different overload.
If we want to mitigate, the choice seems to be between always requiring the u8
suffix for Utf8 literals or coming up with some tie-breaking shenanigans to reduce the risk of a break.
Let's require the u8
suffix in order for string literals to be Utf8 encoded.
In the preview implementation the natural type of a "..."u8
literal is byte[]
. This has the benefit that byte[]
is already implicitly convertible to the other types we would like to use to represent Utf8 strings, most notably ReadOnlySpan<byte>
.
However, this leads to some very subtle costs, where a byte[]
is allocated even just to be a receiver or argument of a call that doesn't need an array.
Also, we often offer fallback byte[]
overloads for the use of languages that don't know about spans, and with that being the natural type we end up preferring those more expensive overloads in C# as well.
The proposal is to instead make ReadOnlySpan<byte>
the natural type of Utf8 literals. We would then have a conversation about whether there should be target typing to some of the other byte sequence types.
Since the literals would be allocated as global constants, the lifetime of the resulting ReadOnlySpan<byte>
would not prevent it from being returned or passed around to elsewhere. However, certain contexts, most notably within async functions, do not allow locals of ref struct types, so there would be a usage penalty in those situations, with a ToArray()
call or similar being required.
We are ok changing the natural type to ReadOnlySpan<byte>
. See discussion on target typing next.
Should we alleviate the async scenario, described above, by allowing Utf8 literals to target type to byte[]
? That way, in an async method you could still have a local variable:
byte[] myU8 = "I am an async resilient Utf8 string"u8;
On the other hand, even though ReadOnlySpan<byte>
would win in an overload resolution scenario, there is some risk of unwanted implicit allocation.
Don't target type. It is ok to have to explicitly use ToArray()
in these scenarios, especially since it makes it clearer what you are doing.
Today, for string literals, we put a null terminator beyond the last character in memory (and outside the length of the string) in order to handle some interop scenarios where the string is being fixed
and the call expects null terminated strings. We do not do the same for other string values, so the solution is at best partial, and at worst a source of a false sense of security. However, it is what it is.
In the preview implementation we don't null-terminate the data from a Utf8 literal in the same way. However, an inconsistency with string literals here might lead to further bugs or misunderstandings, especially if interop code is copied and converted over from Unicode to Utf8.
While the current string literal behavior is debatable, we think the least risky approach is to be consistent and adopt the same behavior for Utf8 literals.
We looked at two ways in which you might want to be able to modify the default rules around escaping of refs and ref structs.
Up until now the default assumptions around escaping of refs and ref structs have been inconsistent. For every combination of incoming and outgoing refs and ref structs on a given method, we have assumed that incoming might escape through the outgoing. Except in one case: When a ref struct was returned, incoming ref
s were not assumed to be captured. This was because there was no safe way to create a ref struct from a ref
at the time, but we're looking to change that, e.g. with literally a new constructor overload on Span<T>
taking a ref T
to be wrapped by the span!
Changing the assumptions to remove the exception would of course be a breaking change, but it would also be a highly desirable simplification. We've therefore done an audit of all the places the break would affect the core libraries, and the result is that only two public methods (and a number of private methods) rely on being able to take refs and return ref structs. Those methods are already inherently unsafe and documented as such.
However they are also used extensively! We need a way to modify them to for them to continue to be callable in the same contexts (even if it's unsafe!). Such a modifier would also help other APIs out there which might have relied on the current assumptions to avoid a break for their users.
Given the general sophistication of developers trading in combination of refs and ref structs, and the availability of such a modifier, we assess that a break would be quite well contained.
In general we want to allow methods to "scope" the use of incoming refs and ref structs to signal that they will not be returned. We could use a keyword or an attribute for this purpose, and our previous leaning has been to use a keyword.
Today's proposal is to introduce a keyword scoped
as a modifier on ref and ref struct parameters.
Bless the breaking change in light of its small perceived impact, and adopt the scoped
modifier proposal as the mechanism to block "escaping" of incoming refs and ref structs.
Even with the above decision, one assumption that still deviates from the default expectation of escaping is around references to the receiver struct. The escape analysis does assume that ref or ref struct returning members do not share references to this
or to fields within this
.
The proposal therefore is to introduce a second modifier, unscoped
, to override the default and allow such escaping when desired.
There are several concerns with that:
- Why is the default for
this
different than for parameters? Is this something we could also change with a similarly well-contained break? - While we have a clear and present need for
scoped
, this seems to be pure additional expressiveness. Do we have enough motivation to add it at this point? - The
unscoped
keyword seems like a distinct misnomer, since it does not seem like the exact opposite of thescoped
modifier. The things it "unscopes" (this
and alsoout
parameters) are distinct from things thescoped
modifier "scopes" (ref and ref struct parameters). - If we adopt the feature it is unclear if it is "worthy" of a keyword. Its expected use might be rare enough that an attribute is a better choice. Attributes cannot go on local variables, but whereas that might be a useful future scenario for
scoped
it does not seem as useful forunscoped
.
On the other end we would like to allow ref capture where it's not allowed today.
Consider "unscoped" as a separate feature, which needs more thinking and fleshing out based on the above feedback. If we do want to change the default assumptions we should try to get it done in this same release, but if not, "unscoped" can potentiall wait for a future release.