Allow escaping arbitrary characters #2360

laniakea64 · 2024-09-10T23:59:24Z

Not sure the error reporting is ideal, and not sure if the tests are covering enough cases?

casey · 2024-09-11T06:06:14Z

What do you think about introducing a State enum, which can be declared inside the parse string function, like so:

#[derive(PartialEq)] // needed for the `state != State::Initial` check below
enum State {
  Initial, // initial state
  Backslash, // after `\`
  Unicode, // after `\u`, but before the opening `{`
  UnicodeValue { value: String }, // after `\u{`
}

Then you would do:

let mut state = State::Initial;
while let Some(c) = chars.next() {
  match state {
    // one arm per state, advancing `state` based on `c`
  }
}

if state != State::Initial {
  // return error
}

I think this might make the code a little simpler.
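
To make the suggestion concrete, here is a minimal sketch of such a state machine. Only `State` and its variants come from the comment above; the function name `unescape`, the set of single-character escapes, the `String` error type, and the error wording are assumptions for illustration and are not the code that was merged. It delegates hex parsing to `u32::from_str_radix`, as first proposed.

```rust
// Hypothetical sketch: `unescape` and its error strings are illustrative only.
#[derive(Debug, PartialEq)]
enum State {
    Initial,                        // initial state
    Backslash,                      // after `\`
    Unicode,                        // after `\u`, but before the opening `{`
    UnicodeValue { value: String }, // after `\u{`
}

fn unescape(text: &str) -> Result<String, String> {
    let mut cooked = String::new();
    let mut state = State::Initial;

    for c in text.chars() {
        // Each arm consumes the old state and yields the next one.
        state = match (state, c) {
            (State::Initial, '\\') => State::Backslash,
            (State::Initial, c) => {
                cooked.push(c);
                State::Initial
            }
            (State::Backslash, 'u') => State::Unicode,
            (State::Backslash, c) => {
                cooked.push(match c {
                    'n' => '\n',
                    'r' => '\r',
                    't' => '\t',
                    '\\' => '\\',
                    '"' => '"',
                    other => return Err(format!("`\\{other}` is not a valid escape sequence")),
                });
                State::Initial
            }
            (State::Unicode, '{') => State::UnicodeValue { value: String::new() },
            (State::Unicode, c) => {
                return Err(format!("expected `{{` after `\\u`, found `{c}`"));
            }
            (State::UnicodeValue { value }, '}') => {
                // Delegate hex validation to the standard library.
                let codepoint = u32::from_str_radix(&value, 16).map_err(|e| e.to_string())?;
                let character = char::from_u32(codepoint)
                    .ok_or_else(|| format!("`{value}` is not a valid codepoint"))?;
                cooked.push(character);
                State::Initial
            }
            (State::UnicodeValue { mut value }, c) => {
                // Accept any character here; it is checked when `}` is seen.
                value.push(c);
                State::UnicodeValue { value }
            }
        };
    }

    // Ending anywhere but `Initial` means an escape was left unfinished.
    if state != State::Initial {
        return Err("unterminated escape sequence".to_string());
    }

    Ok(cooked)
}
```

With this shape, `unescape("a\\u{41}b")` yields `aAb`, and an input that ends inside an escape, such as `\u{41`, is rejected by the final state check.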

Also, I think we should see if the error messages from u32::from_str_radix are good enough to use without needing to define our own error type. If they're reasonable, then we could accept any characters between \u{ and }, and let u32::from_str_radix complain if they're not hex, or if they overflow a u32.

laniakea64 · 2024-09-11T19:01:56Z

Done. It does simplify the code and looks better, thanks for the suggestions!

> we could accept any characters between \u{ and }, and let u32::from_str_radix complain if ... they overflow a u32.

The syntax we're implementing only accepts up to 6 hex digits, while u32 is up to 8 hex digits, so the code needs to keep its own length check.
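
A small sketch of why the overflow check alone is not a length check. The names `MAX_DIGITS` and `within_length` are assumptions for illustration, not identifiers from the PR.

```rust
// Hypothetical helper: the escape syntax allows at most six hex digits.
const MAX_DIGITS: usize = 6;

fn within_length(hex: &str) -> bool {
    hex.chars().count() <= MAX_DIGITS
}

// What `u32::from_str_radix` alone accepts or rejects.
fn parses_as_u32(hex: &str) -> bool {
    u32::from_str_radix(hex, 16).is_ok()
}
```

Here `parses_as_u32("FFFFFFFF")` and `parses_as_u32("00000041")` are both true even though each has eight digits, while `within_length` rejects both. Leading zeros never overflow, so the integer parser puts no bound on the digit count at all.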

casey · 2024-09-14T07:53:31Z

Looking at the error message from std::num::ParseIntError, I think it's actually not great, so I'm coming around to doing more checking ourselves and not delegating the error message.
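
For reference, a short sketch that surfaces the standard library's messages for the three failure modes relevant here. The helper name `describe` is an assumption for illustration.

```rust
// Hypothetical helper: returns the `Display` text of the parse error, if any.
fn describe(hex: &str) -> String {
    match u32::from_str_radix(hex, 16) {
        Ok(value) => format!("ok: {value:#x}"),
        Err(error) => error.to_string(),
    }
}

// describe("")          -> "cannot parse integer from empty string"
// describe("xyz")       -> "invalid digit found in string"
// describe("100000000") -> "number too large to fit in target type"
```

None of these messages names the offending character or mentions escape sequences, which is consistent with the decision to report dedicated errors instead.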

I think we should have these errors:

UnicodeEscapeEmpty for \u{}
UnicodeEscapeCharacter { character: char } for \u{X} where X is anything other than 0-9a-fA-F
UnicodeEscapeLength for when there are more than six characters inside an escape sequence
UnicodeEscapeRange when the u32 is not in range for a char; the error message should mention that 10FFFF is the maximum valid codepoint
UnicodeEscapeUnterminated when we don't encounter } before the end of the string

This should allow us to just .unwrap when calling from_str_radix, since it will be guaranteed to be a valid hex string that's in range for a u32.
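
A minimal sketch of how these five errors could fit together. The variant names come from the list above; the enclosing `EscapeError` enum, the function `parse_unicode_escape`, its signature, and the order of the checks are assumptions for illustration and may differ from the merged code.

```rust
// Hypothetical error type using the variant names proposed above.
#[derive(Debug, PartialEq)]
enum EscapeError {
    UnicodeEscapeEmpty,                         // `\u{}`
    UnicodeEscapeCharacter { character: char }, // non-hex character inside the braces
    UnicodeEscapeLength,                        // more than six hex digits
    UnicodeEscapeRange,                         // value is not a valid `char`
    UnicodeEscapeUnterminated,                  // no `}` before the end of the string
}

// `rest` is assumed to be the text immediately following `\u{`.
fn parse_unicode_escape(rest: &str) -> Result<char, EscapeError> {
    let Some(end) = rest.find('}') else {
        return Err(EscapeError::UnicodeEscapeUnterminated);
    };
    let hex = &rest[..end];

    if hex.is_empty() {
        return Err(EscapeError::UnicodeEscapeEmpty);
    }

    if let Some(character) = hex.chars().find(|c| !c.is_ascii_hexdigit()) {
        return Err(EscapeError::UnicodeEscapeCharacter { character });
    }

    // All characters are ASCII hex digits, so byte length equals digit count.
    if hex.len() > 6 {
        return Err(EscapeError::UnicodeEscapeLength);
    }

    // One to six hex digits always fit in a `u32`, so this cannot fail.
    let codepoint = u32::from_str_radix(hex, 16).unwrap();

    // Rejects values above 10FFFF as well as surrogate codepoints.
    char::from_u32(codepoint).ok_or(EscapeError::UnicodeEscapeRange)
}
```

Because the earlier checks guarantee a non-empty string of at most six hex digits, the `unwrap` is safe, and only the `char` range check remains after parsing.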

Sorry for waffling!

laniakea64 · 2024-09-14T18:07:16Z

> I think we should have these errors:

But when just encounters an invalid escape sequence, it points at the entire string, which means the errors need more information than that to pinpoint. To make the exact suggested errors awesome, could just please point only at the malformed escape sequence? 🙂

In the meantime, pushed a commit trying to find a middle ground, since digging into why just points at the whole string is more than I have capacity for atm 😦

casey · 2024-09-15T09:02:24Z

Compilation error message context is based on tokens, so since a string is one token, by default it points at the whole string. This could definitely be improved. The easiest solution would be to construct a new, fake token which only points at the sub-section of the string containing the invalid escape sequence, but that's probably best left to a follow-up PR.
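
A rough sketch of the sub-span arithmetic such a fake token would need. The `Span` type and `escape_span` function are hypothetical and do not reflect just's actual token type, which would also have to carry line and column information.

```rust
// Hypothetical span: byte offset into the source plus byte length.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    offset: usize,
    length: usize,
}

// Narrow a string token's span to the escape sequence inside it.
// `escape_start` is relative to the start of the string token.
fn escape_span(string_token: Span, escape_start: usize, escape_length: usize) -> Option<Span> {
    let end = escape_start.checked_add(escape_length)?;

    // The escape must lie entirely within the string token.
    if end > string_token.length {
        return None;
    }

    Some(Span {
        offset: string_token.offset + escape_start,
        length: escape_length,
    })
}
```

For a string token at offset 10 with length 20, an escape starting 5 bytes in and 6 bytes long maps to `Span { offset: 15, length: 6 }`.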

casey · 2024-09-15T10:00:15Z

Merged! This is great. Tweaked the code a little, plus error messages.

laniakea64 · 2024-09-15T15:47:52Z

> Tweaked the code a little, plus error messages.

Nice, I like the tweaks, thanks! Shows a few Rust syntaxes I'm not sufficiently familiar with.

laniakea64 added 4 commits September 14, 2024 12:38

Initial implementation of \u{...} escape sequence

eeeacc9

Add tests

a5d3320

Use enum and use u32::str_from_radix error messages

4c58223

Clippy

aecee51

laniakea64 force-pushed the character-escape branch from 60b66ca to aecee51 Compare September 14, 2024 16:38

Don't use u32::from_str_radix error messages

5f6d0a9

casey added 15 commits September 15, 2024 16:15

Rename InvalidUEscapeSequence to UnicodeEscapeDelimiter

6246f25

Rename other to use shorthand syntax

e16462b

char_u32 -> codepoint

7436b6a

Use ok_or_else

43aaf46

Handle \u separately to avoid continue

c2cb6a5

Shorthand intitializer

ae0f383

Avoid continue

d796348

Move escape processing into dedicated function

894ee88

Use if instead of match

0abd691

Use & instead of as_str

0876183

Use pattern matching instead of contains

ae8b779

Tweak error messages and test names

2ea4f84

Reform

f27ab6d

Test maximum valid char

c568fd0

Test all hex digits

8b714b8

casey enabled auto-merge (squash) September 15, 2024 09:58

casey merged commit d2c66e9 into casey:master Sep 15, 2024
5 checks passed

laniakea64 mentioned this pull request Sep 15, 2024

Document \u{...} #2371

Merged
