2024 Additions --- a.k.a. the `v` flag

Background

The change set w.r.t the previous edition can be seen here. The discussion before merge can be seen here. Some more discussions can be found in the original google doc.

This edition incorporates two proposals:

« Support properties of strings (a.k.a. “sequence properties”) in Unicode property escapes »
« RegExp v flag with set notation + properties of strings ».

The latter technically subsumes the former, but these can be seen as two separate additions:

String properties;
Set notations.

Discussion of the new features

String properties

The biggest impact of this feature is the change of definition of CharSet. Whereas previously a character class was being compiled to a set of character (hence the name CharSet), it can now be compiled to a set of strings (or sequence of characters), depending on the presence of v.

This change could prove problematic, since the operations available on CharSetElement differ wildly depending on the flags. A solution might be to abstract the logic manipulating CharSets in CompileAtom, like we handled Canonicalize.

Set notations

These new features should require minimal changes. Character classes should be changed to accommodate nesting. Most of the work to implement the new operators can be delegated to operations on sets.

The changes, section per section

Note that the table numbers have been shifted.

22.2.1 Patterns

The changes here will cause some irrelevant changes in the subsequent sections, as some variables will be renamed accordingly.

22.2.1.1 Static Semantics: Early Errors

Safeguard preventing string properties if v is not toggled;
Negated character classes ([^...], \P{...}) are rejected.

22.2.1.6 Static Semantics: CharacterValue

New rules for special characters/escapes in ClassSetCharacter

22.2.1.7 Static Semantics: MayContainStrings

New analysis: detects whether a character class might match strings, i.e. whether it might consume more than one character from the input.

22.2.2.1 Notation

CaptureRange and MatchState are now said to be record rather than tuples; note that they were already being used (and were mechanized) as if they were records. Note that this make the diff look more bigger than it actually is.
CharSetElement is now defined "dependently" of the v flag as either a Character (as before) or a list Character.

22.2.2.1.1 RegExp Records

Add v flag to record.

22.2.2.3 Runtime Semantics: CompileSubpattern

Modularize Disjunction, Empty, Sequence matchers into functions (sections 22.2.2.3.[2-4])

22.2.2.7 Runtime Semantics: CompileAtom

AllCharacters is now a function

Atom :: CharacterClass and AtomEscape :: CharacterClassEscape become way more complex, mostly in order to implement the longest-match semantics of string properties. The behavior being wildly different depending on the flag (which is amplified by the wide difference on the definition of CharSetElement) make these functions candidates for abstraction. Additionally, the two implement the exact same logic. The differences are concentrated in the very first steps (- is Atom :: CharacterClass and + AtomEscape :: CharacterClassEscape):

 + 1. Let cs be CompileToCharSet of CharacterClassEscape with argument rer.

 - 1. Let cc be CompileCharacterClass of CharacterClass with argument rer.
 - 2. Let cs be cc.[[CharSet]].

 + 2. If rer.[[UnicodeSets]] is false, or if every CharSetElement of cs consists of a single character (including if cs is empty), return CharacterSetMatcher(rer, cs, false, direction).
 - 3. If rer.[[UnicodeSets]] is false, or if every CharSetElement of cs consists of a single character (including if cs is empty), return CharacterSetMatcher(rer, cs, cc.[[Invert]], direction).

 - 4. Assert: cc.[[Invert]] is false.

22.2.2.7.1 CharacterSetMatcher ( rer, A, invert, direction )

New assertions at the beginning if v; should hold by new early errors.

22.2.2.7.3 Canonicalize ( rer, ch )

Unicode mode check changed to use the new HasEitherUnicodeFlag function.

22.2.2.8 Runtime Semantics: CompileCharacterClass

Use CharacterComplement with v flag.

22.2.2.9 Runtime Semantics: CompileToCharSet

Use CharacterComplement.
Use MaybeSimpleCaseFolding.
Implement new set operators.
Implement new character classes.

22.2.2.9.2 HasEitherUnicodeFlag ( rer )

New function.

22.2.2.9.3 WordCharacters ( rer )

Unicode mode check changed to use the new HasEitherUnicodeFlag function.

22.2.2.9.[4-6] AllCharacters ( rer )

New functions.

22.2.2.9.7 UnicodeMatchProperty ( rer, p )

Add lookup into string properties table.

22.2.2.10 Runtime Semantics: CompileClassSetString

New function.

22.2.[3-...]

Not covered.

Comments and questions

What happens when v is enabled but not u (yes, it's valid; see this issue)?
The fact that the the MayContainStrings analysis is done before compiling the set forces it to be extremely pessimistic, e.g. [\p{emoji}--\p{emoji}] is analyzed as potentially containing strings, despite being empty.
In positive lookarounds: from Let y be r's MatchState. to Assert: r is a MatchState.. Makes more sense from spec perspective, but the previous wording was better for us.
AllCharacters has a special case for vi flags; why? Seems like the answer might be in this discussion.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vFlag.md

vFlag.md

2024 Additions --- a.k.a. the `v` flag

Background

Discussion of the new features

String properties

Set notations

The changes, section per section

22.2.1 Patterns

22.2.1.1 Static Semantics: Early Errors

22.2.1.6 Static Semantics: CharacterValue

22.2.1.7 Static Semantics: MayContainStrings

22.2.2.1 Notation

22.2.2.1.1 RegExp Records

22.2.2.3 Runtime Semantics: CompileSubpattern

22.2.2.7 Runtime Semantics: CompileAtom

22.2.2.7.1 CharacterSetMatcher ( rer, A, invert, direction )

22.2.2.7.3 Canonicalize ( rer, ch )

22.2.2.8 Runtime Semantics: CompileCharacterClass

22.2.2.9 Runtime Semantics: CompileToCharSet

22.2.2.9.2 HasEitherUnicodeFlag ( rer )

22.2.2.9.3 WordCharacters ( rer )

22.2.2.9.[4-6] AllCharacters ( rer )

22.2.2.9.7 UnicodeMatchProperty ( rer, p )

22.2.2.10 Runtime Semantics: CompileClassSetString

22.2.[3-...]

Comments and questions

Files

vFlag.md

Latest commit

History

vFlag.md

File metadata and controls

2024 Additions --- a.k.a. the v flag

Background

Discussion of the new features

String properties

Set notations

The changes, section per section

22.2.1 Patterns

22.2.1.1 Static Semantics: Early Errors

22.2.1.6 Static Semantics: CharacterValue

22.2.1.7 Static Semantics: MayContainStrings

22.2.2.1 Notation

22.2.2.1.1 RegExp Records

22.2.2.3 Runtime Semantics: CompileSubpattern

22.2.2.7 Runtime Semantics: CompileAtom

22.2.2.7.1 CharacterSetMatcher ( rer, A, invert, direction )

22.2.2.7.3 Canonicalize ( rer, ch )

22.2.2.8 Runtime Semantics: CompileCharacterClass

22.2.2.9 Runtime Semantics: CompileToCharSet

22.2.2.9.2 HasEitherUnicodeFlag ( rer )

22.2.2.9.3 WordCharacters ( rer )

22.2.2.9.[4-6] AllCharacters ( rer )

22.2.2.9.7 UnicodeMatchProperty ( rer, p )

22.2.2.10 Runtime Semantics: CompileClassSetString

22.2.[3-...]

Comments and questions

2024 Additions --- a.k.a. the `v` flag