RFC: Unicode and escape codes in literals #3349

Merged
merged 8 commits into from
Oct 18, 2023

Conversation

m-ou-se
Member

@m-ou-se m-ou-se commented Nov 15, 2022

@m-ou-se m-ou-se added T-lang Relevant to the language team, which will review and decide on the RFC. A-syntax Syntax related proposals & ideas labels Nov 15, 2022
@Diggsey
Contributor

Diggsey commented Nov 15, 2022

The first part seems like an obviously good thing. As for the second part, "Allow \x… escape codes in regular string literals, as long as they are valid UTF-8": I'm not strongly opposed to it, but I also can't really think of a use case that wouldn't be better served by a byte literal. Did you have one in mind?

The downsides of the second part are:

  • Possible confusion about why the \x** sometimes works in string literals but not other times (and not consistently for the same byte values).
  • Issues with concatenation (e.g. depending on how concat!() is implemented, splitting a Unicode character across two literals may or may not work in practice).
  • People accidentally using string literals when they should be using a byte literal, but not realising until they come to write a specific byte value which is not valid UTF-8.
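
As a rough illustration of the concatenation and validity concerns, here is a minimal sketch on stable Rust. It uses byte string literals, which already accept any \x escape, and checks validity at run time with `std::str::from_utf8`. The names `CRAB_BYTES`, `is_utf8` and `join` are made up for this example, and `join` is only a run-time stand-in for literal concatenation, not how `concat!()` works.

```rust
use std::str;

/// The four UTF-8 bytes of U+1F980, written with \x escapes in a byte
/// string literal, which already accepts any byte value today.
pub const CRAB_BYTES: &[u8; 4] = b"\xf0\x9f\xa6\x80";

/// Returns true if `bytes` form valid UTF-8 on their own.
pub fn is_utf8(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

/// Joins two byte fragments at run time (illustrative stand-in for
/// concatenating two literals).
pub fn join(a: &[u8], b: &[u8]) -> Vec<u8> {
    [a, b].concat()
}
```

Each half of the encoded character (`b"\xf0\x9f"` and `b"\xa6\x80"`) fails `is_utf8` on its own, while `join` of the two halves passes. Whether a split like this is accepted would therefore depend on whether validation happens before or after concatenation.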

@m-ou-se
Member Author

m-ou-se commented Nov 15, 2022

I don't have serious use cases in mind for \x-encoded UTF-8 in regular string literals, but I also don't think we should disallow it. That is, if we were to design the language from scratch today, I'd argue that forbidding \x in "" just makes things inconsistent and doesn't bring much value. And since it's a backwards-compatible change to make, I think we should make that change to Rust.

As mentioned in the RFC, it also helps with macros like cstr!("\xff"). (Although I suppose we could still disallow \x escape codes entirely at the point of converting it to a literal AST token.)
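
For context on why a byte like \xff matters for C strings, here is a small sketch that builds one with the standard library's `CStr` instead of a `cstr!` macro. The helper name `single_ff` is hypothetical, and this only shows what such a macro would need to produce, not how any particular `cstr!` implementation works.

```rust
use std::ffi::CStr;

/// Builds a C string holding the single byte 0xFF plus the NUL terminator.
/// The input is a byte string literal, so the \xff escape is accepted today.
pub fn single_ff() -> &'static CStr {
    CStr::from_bytes_with_nul(b"\xff\0").expect("exactly one trailing NUL")
}
```

The result holds one byte that is not valid UTF-8, so `single_ff().to_str()` returns an error. That is the kind of content a regular string literal cannot carry.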

@nnethercote

This RFC talks about byte string literals, e.g. b"foo". Does it also need to discuss byte literals, e.g. b'x'?

Also, the RFC as written makes it sound like \x escapes are never allowed in regular string literals.

Allow \x… escape codes in regular string literals, as long as they are valid UTF-8. E.g. "\xf0\x9f\xa6\x80"

Only extend b"", but still don't accept \x in regular string literals ("").

But something like "\x61" is fine, the escape just must be in the range \x00-\x7f. (This isn't an issue with the RFC's intent, just a point of clarification.)

@BurntSushi
Member

BurntSushi commented Nov 15, 2022

Allow \x… escape codes in regular string literals, as long as they are valid UTF-8. E.g. "\xf0\x9f\xa6\x80"

Only extend b"", but still don't accept \x in regular string literals ("").

But something like "\x61" is fine, the escape just must be in the range \x00-\x7f. (This isn't an issue with the RFC's intent, just a point of clarification.)

I think the RFC wording is precisely correct here. Today, \x escapes are only allowed if they're ASCII. This RFC expands it to allow everything that is valid UTF-8, which includes ASCII.

Edit: or maybe you're saying the RFC should call out what is supported today and phrase it as an expansion.
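
The status quo referred to here can be sketched with a few constants that compile on stable Rust today. The constant names are invented for this example.

```rust
/// ASCII-range \x escapes (\x00-\x7f) are accepted in regular string
/// and char literals today.
pub const LOWER_A: &str = "\x61";
pub const DEL: char = '\x7f';

/// Byte string literals accept the full \x00-\xff range.
pub const HIGH: &[u8; 2] = b"\x80\xff";

// By contrast, writing "\x80" as a regular string literal is rejected by
// the compiler today, so it cannot appear in this example.
```

`LOWER_A` is equal to `"a"`, and `HIGH` holds the two bytes 0x80 and 0xff.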

@nnethercote

I think both examples are ambiguously worded, and I read them both as meaning that no \x escapes are allowed at all.

Here are possible rewordings that I think would make things clearer.

Allow \x escape codes in the range \x80-\xff in regular string literals, as long as they are valid UTF-8. E.g. "\xf0\x9f\xa6\x80". (Escape codes in the range \x00-\x7f are currently allowed.)

Only extend b"", but still don't accept \x80-\xff in regular string literals ("").

@m-ou-se
Member Author

m-ou-se commented Nov 17, 2022

Thanks for all the feedback! I've updated the document. :)

@joshtriplett
Member

This seems ready to FCP, apart from incorporating @scottmcm's feedback.

@m-ou-se
Member Author

m-ou-se commented Nov 30, 2022

Consider splitting the tokenizing part of this out from the rest of the RFC. Given that lang just agreed with this rationale for what should be a tokenizing problem vs a semantic problem (rust-lang/rust#102944 (comment)), we could probably do the '"\u{D8D8}" and "\xFF" are valid tokens, but not valid values' part of this quickly, as it's in line with existing and recent precedent.

I've updated the RFC by moving the "validate later" part to the future possibilities section. I'll submit a separate RFC for that. (Edit: started a discussion on Zulip.)

@joshtriplett I think this is ready to FCP. :)

@nnethercote

I think this RFC should use the characters and strings table, because it's easy to overlook cases when dealing with prose.

Here's the status quo:

| | Example | # sets* | Characters | Escapes |
|---|---|---|---|---|
| Character | 'H' | 0 | All Unicode | Quote & ASCII & Unicode |
| String | "hello" | 0 | All Unicode | Quote & ASCII & Unicode |
| Raw string | r#"hello"# | <256 | All Unicode | N/A |
| Byte | b'H' | 0 | All ASCII | Quote & Byte |
| Byte string | b"hello" | 0 | All ASCII | Quote & Byte |
| Raw byte string | br#"hello"# | <256 | All ASCII | N/A |

Here is what the RFC is proposing, AIUI, with changes in bold, and uncertain changes with ?:

| | Example | # sets* | Characters | Escapes |
|---|---|---|---|---|
| Character | 'H' | 0 | All Unicode | Quote & ASCII [1] & Unicode |
| String | "hello" | 0 | All Unicode | Quote & **Byte** & Unicode |
| Raw string | r#"hello"# | <256 | All Unicode | N/A |
| Byte | b'H' | 0 | All ASCII [2] | Quote & Byte & **Unicode?** [3] |
| Byte string | b"hello" | 0 | All **Unicode** | Quote & Byte & **Unicode** |
| Raw byte string | br#"hello"# | <256 | All ASCII [4] | N/A |

I have numbered some inconsistencies.

  • [1] Should this be Byte? '¥' is already allowed. Why not '\xa5', its equivalent?
  • [2] Should this be All Unicode? b'\xa5' is already allowed. Why not b'¥'?
  • [3] (Already suggested) b'\xa5' is already allowed. Why not b'\u{a5}'?
  • [4] Should this be All Unicode? If we are going to allow b"¥¥¥", why not allow br"¥¥¥"?

Answering all of those questions in the affirmative gives the maximally permissive, maximally consistent table:

| | Example | # sets* | Characters | Escapes |
|---|---|---|---|---|
| Character | 'H' | 0 | All Unicode | Quote & Byte & Unicode |
| String | "hello" | 0 | All Unicode | Quote & Byte & Unicode |
| Raw string | r#"hello"# | <256 | All Unicode | N/A |
| Byte | b'H' | 0 | All Unicode | Quote & Byte & Unicode |
| Byte string | b"hello" | 0 | All Unicode | Quote & Byte & Unicode |
| Raw byte string | br#"hello"# | <256 | All Unicode | N/A |

One possible drawback is that it becomes more important to understand that vanilla string literals are UTF-8 encoded while char literals are not, which means that '\xa5' would be valid, while "\xa5" would not. But this is just a slight extension of the existing drawback of this RFC as written.

@nnethercote

An alternative, more minimal proposal would be this:

| | Example | # sets* | Characters | Escapes |
|---|---|---|---|---|
| Character | 'H' | 0 | All Unicode | Quote & ASCII & Unicode |
| String | "hello" | 0 | All Unicode | Quote & ASCII & Unicode |
| Raw string | r#"hello"# | <256 | All Unicode | N/A |
| Byte | b'H' | 0 | All ASCII | Quote & Byte |
| Byte string | b"hello" | 0 | All Unicode | Quote & Byte & Unicode |
| Raw byte string | br#"hello"# | <256 | All Unicode | N/A |

Arguments in favour:

  • Still solves the primary motivation: "byte string literals are currently not a superset of regular string literals".
  • A near-minimal change.
    • The minimal change would leave raw byte string literals alone, but that seems silly. Byte string literals and raw byte string literals should behave the same except for escape handling.
  • All other possible changes could cause confusion, without a clear benefit.

Arguments against:

  • Byte literals are the odd one out. But then, they already are the odd one out. E.g. they're not a superset of char literals the way byte string literals are a superset of string literals.

Right now, this is the direction I am leaning in.
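
To illustrate the "not a superset" motivation: under the status quo, non-ASCII characters cannot be typed directly in a byte string literal, so the bytes of a string like "¥¥¥" have to be spelled with escapes or borrowed from a regular string literal. A minimal sketch on stable Rust, with names invented for this example:

```rust
/// Today's spelling of the bytes of "¥¥¥" as a byte string literal:
/// only ASCII characters and \x escapes are available.
pub const YEN_ESCAPED: &[u8] = b"\xc2\xa5\xc2\xa5\xc2\xa5";

/// Alternative available today: borrow the UTF-8 bytes of a regular
/// string literal.
pub fn yen_from_str() -> &'static [u8] {
    "¥¥¥".as_bytes()
}
```

Both forms yield the same six bytes. The proposal above would let the character itself be written inside b"" and br"" to get that result.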

@m-ou-se
Member Author

m-ou-se commented Dec 6, 2022

  • [1] Should this be Byte? '¥' is already allowed. Why not '\xa5', its equivalent?

  • [2] Should this be All Unicode? b'\xa5' is already allowed. Why not b'¥'?

  • [3] (Already suggested) b'\xa5' is already allowed. Why not b'\u{a5}'?

  • [4] Should this be All Unicode? If we are going to allow b"¥¥¥", why not allow br"¥¥¥"?

@nnethercote These questions seem to be mixing up a character's codepoint with its UTF-8 representation.

\xa5 on its own is invalid UTF-8. '¥' has code point 0xa5, which in UTF-8 is encoded as two bytes: "\xc2\xa5".

b'¥' doesn't work because that's two bytes, not one. Same for b'\u{a5}'.

b'\u{30}' should be fine though. That's just a single byte as UTF-8 (so, ascii).
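
The distinction between a code point and its UTF-8 encoding can be checked directly on stable Rust. This is a small sketch and the function names are made up for illustration.

```rust
/// Code point of '¥' as an integer (0xa5).
pub fn yen_code_point() -> u32 {
    '¥' as u32
}

/// UTF-8 encoding of '¥' as a byte vector.
pub fn yen_utf8() -> Vec<u8> {
    let mut buf = [0u8; 4];
    '¥'.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// True if `c` encodes to exactly one UTF-8 byte, i.e. it is ASCII.
pub fn fits_in_one_byte(c: char) -> bool {
    c.len_utf8() == 1
}
```

Here `yen_code_point()` is 0xa5 while `yen_utf8()` is the two bytes 0xc2, 0xa5, and `fits_in_one_byte` is true for '\u{30}' but false for '¥'. The lone byte 0xa5 is rejected by `std::str::from_utf8`.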

I think this RFC should use the characters and strings table

My proposal is basically to just remove the "Characters" and "Escapes" column, and replace it by a requirement that some types of literals must be valid UTF-8. All literals then accept all escape codes, and validation is now about whether the result is valid UTF-8, after the escape codes have been processed.

The only open question is what to do with character literals, since multi-character literals are parsed as two lifetimes rather than as an opening and closing quote. But that seems fine to me, because allowing '\xc2\xa5' would be a bit weird anyway, considering that char doesn't store UTF-8. ('\x30' or '¥' is fine though.)
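
The "process escapes first, validate afterwards" model can be sketched roughly as follows. This is not the compiler's implementation: `unescape` and `is_valid_str_literal` are hypothetical helpers, only \xNN and \u{...} escapes are handled, and details such as the limit on the number of hex digits in \u{...} are ignored.

```rust
use std::str;

/// Hypothetical helper: expands `\xNN` and `\u{...}` escapes in `src` into
/// raw bytes. Other characters are copied as their UTF-8 bytes.
/// Returns None for malformed or unsupported escapes.
pub fn unescape(src: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut chars = src.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            let mut buf = [0u8; 4];
            out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            continue;
        }
        match chars.next()? {
            'x' => {
                // Exactly two hex digits; any byte value 0x00-0xff.
                let hi = chars.next()?.to_digit(16)?;
                let lo = chars.next()?.to_digit(16)?;
                out.push((hi * 16 + lo) as u8);
            }
            'u' => {
                if chars.next()? != '{' {
                    return None;
                }
                let mut value: u32 = 0;
                let mut digits = 0;
                loop {
                    let d = chars.next()?;
                    if d == '}' {
                        break;
                    }
                    value = value.checked_mul(16)?.checked_add(d.to_digit(16)?)?;
                    digits += 1;
                }
                if digits == 0 {
                    return None;
                }
                // Surrogates and out-of-range values are rejected here.
                let ch = char::from_u32(value)?;
                let mut buf = [0u8; 4];
                out.extend_from_slice(ch.encode_utf8(&mut buf).as_bytes());
            }
            _ => return None,
        }
    }
    Some(out)
}

/// Validation step for literal kinds that must hold UTF-8: the escapes are
/// expanded first, then the resulting bytes are checked as a whole.
pub fn is_valid_str_literal(src: &str) -> bool {
    match unescape(src) {
        Some(bytes) => str::from_utf8(&bytes).is_ok(),
        None => false,
    }
}
```

With this model, the source text `\xc2\xa5` passes validation and `\xa5` alone does not, while a byte-string-like literal would simply skip the `from_utf8` step.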

@nnethercote

I have been assuming that UTF-8 encoding is irrelevant for char literals and byte literals, and relevant for the other four kinds of literal.

You seem to agree on this for char literals:

The only open question is what to do with character literals... allowing '\xc2\xa5' would be a bit weird anyway, considering that char doesn't store UTF-8.

What about byte literals? A u8 also doesn't store UTF-8. With that in mind, I think my questions above do make sense.

My proposal is basically to just remove the "Characters" and "Escapes" column, and replace it by a requirement that some types of literals must be valid UTF-8. All literals then accept all escape codes, and validation is now about whether the result is valid UTF-8, after the escape codes have been processed.

The only open question is what to do with character literals...

This text says "some types of literals", then "all literals", then adds an exception for char literals. Also, "All literals then accept all escape codes" is clearly imprecise because raw string literals and raw byte string literals don't accept any escape codes.

The reason I like the table approach is that it forces us to be precise and look at every possibility; it shows there are lots of different possibilities. I find it easier and more precise to think about.

@joshtriplett
Member

I do think we should permit br"¥¥¥", but I don't think we should make any of the other changes proposed in that table, for the reasons @m-ou-se stated.

I'm going to go ahead and propose FCP for this. This does not preclude making further changes to how this information is presented.

@rfcbot merge

@rfcbot concern raw-byte-strings-with-unicode

@rfcbot
Collaborator

rfcbot commented Jan 19, 2023

Team member @joshtriplett has proposed to merge this. The next step is review by the rest of the tagged team members:

Concerns:

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

cc @rust-lang/lang-advisors: FCP proposed for lang, please feel free to register concerns.
See this document for info about what commands tagged team members can give me.

@rfcbot rfcbot added proposed-final-comment-period Currently awaiting signoff of all team members in order to enter the final comment period. disposition-merge This RFC is in PFCP or FCP with a disposition to merge it. labels Jan 19, 2023
@nikomatsakis
Contributor

@rfcbot fcp reviewed

This makes sense to me.

@m-ou-se
Member Author

m-ou-se commented Aug 10, 2023

Thanks for all the feedback and patience! I finally got around to updating the RFC. I think it's all much clearer now. :)

I think both of the concerns registered with the rfcbot are resolved now.

@m-ou-se m-ou-se changed the title from "RFC: UTF-8 characters and escape codes in (byte) string literals" to "RFC: Unicode and and escape codes in literals" Aug 10, 2023
@m-ou-se m-ou-se requested a review from a team August 14, 2023 09:31
@m-ou-se m-ou-se added A-string Proposals relating to strings. I-lang-nominated Indicates that an issue has been nominated for prioritizing at the next lang team meeting. labels Aug 14, 2023
@joshtriplett
Member

@rfcbot resolved raw-byte-strings-with-unicode

@joshtriplett
Member

@pnkfelix Could you please resolve waiting-on-update-re-using-char-and-string-tables?

@pnkfelix
Member

@rfcbot resolved waiting-on-update-re-using-char-and-string-tables

@joshtriplett
Member

Pinging @pnkfelix, @scottmcm, and @tmandry for checkboxes, now that concerns have been resolved.

@mattheww

Are numbers which don't represent a Unicode scalar value excluded from the definition of a Unicode escape (e.g. \u{DC00} or \u{FFFFFF})?

The Reference isn't currently very explicit about that (which it can get away with because at present those escapes can only appear in contexts where we promise valid UTF-8). I think \u{DC00} does have a natural interpretation in a byte string.

If they're excluded, I think the "Valid unicode code point" text for validation of character literals is unnecessary (and perhaps misleading), as I think there's no way to write a character literal that would fail that validation rule.
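
For reference, the "Unicode scalar value" condition is exactly what `char` enforces, so it can be probed on stable Rust with `char::from_u32`. The wrapper name `is_scalar_value` is invented for this sketch.

```rust
/// Returns true if `value` is a Unicode scalar value, i.e. a code point in
/// 0..=0x10FFFF that is not a surrogate (0xD800..=0xDFFF).
pub fn is_scalar_value(value: u32) -> bool {
    char::from_u32(value).is_some()
}
```

Both examples from the question, 0xDC00 (a surrogate) and 0xFFFFFF (beyond 0x10FFFF), return false, so neither could ever be the value of a `char`.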

@rfcbot rfcbot added the final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. label Sep 12, 2023
@rfcbot
Collaborator

rfcbot commented Sep 12, 2023

🔔 This is now entering its final comment period, as per the review above. 🔔

@rfcbot rfcbot removed the proposed-final-comment-period Currently awaiting signoff of all team members in order to enter the final comment period. label Sep 12, 2023
@shepmaster shepmaster changed the title from "RFC: Unicode and and escape codes in literals" to "RFC: Unicode and escape codes in literals" Sep 13, 2023
@tmandry tmandry removed the I-lang-nominated Indicates that an issue has been nominated for prioritizing at the next lang team meeting. label Sep 19, 2023
@rfcbot rfcbot added finished-final-comment-period The final comment period is finished for this RFC. to-announce and removed final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. labels Sep 22, 2023
@rfcbot
Collaborator

rfcbot commented Sep 22, 2023

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This will be merged soon.

@traviscross
Contributor

This RFC has been merged, and we've opened a tracking issue: rust-lang/rust#116907

Thanks go out to the authors of this RFC for making Rust better by drafting it and pushing it through to acceptance.

@m-ou-se m-ou-se deleted the mixed-utf8-literals branch December 7, 2023 12:07
bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 25, 2024
…, r=<try>

Implement RFC 3349, mixed utf8 literals

RFC: rust-lang/rfcs#3349
Tracking issue: rust-lang#116907

r? `@ghost`
Labels
A-string Proposals relating to strings. A-syntax Syntax related proposals & ideas disposition-merge This RFC is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this RFC. T-lang Relevant to the language team, which will review and decide on the RFC. to-announce