VSCode's tmLanguage support cannot match zero-width `begin` and `end` correctly. #12

be5invis · 2016-05-18T19:16:52Z

The syntax is:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>fileTypes</key>
    <array>
        <string>testlang</string>
    </array>
    <key>name</key>
    <string>testlang</string>
    <key>patterns</key>
    <array>
        <dict>
            <key>include</key>
            <string>#indentedVerbatimOp</string>
        </dict>
    </array>
    <key>repository</key>
    <dict>
        <key>indentedVerbatimOp</key>
        <dict>
            <key>begin</key>
            <string>^([ \t]*)(?=(.*?)\|$)</string>
            <key>end</key>
            <string>^(?!\1[ \t])(?=[ \t]*\S)</string>
            <key>name</key>
            <string>string.unquoted.verbatim.youki</string>
        </dict>
    </dict>
    <key>scopeName</key>
    <string>text.testlang</string>
    <key>uuid</key>
    <string>159375af-d9b4-448c-a9da-d235eadf3556</string>
</dict>
</plist>

It matches an indented block whose leading line ends with a bar (|). I've found that Code cannot match the block's end, which is zero-width:

Sublime matches it correctly:

If I add some characters to the end pattern, Code works again:

            <key>end</key>
            <string>^(?!\1[ \t])(?=[ \t]*\S)xx</string>


alexdima · 2016-05-23T11:42:44Z

@be5invis This grammar produces an endless loop because the begin and end patterns do not consume anything. This is what happens when tokenizing [test]|:

The state machine is in the initial root state, textToTokenize = [test]|, position = 0.
The begin rule matches, since ^([ \t]*)(?=(.*?)\|$) matches [test]|.
In the begin rule, group 1 ([ \t]*) matches the empty string, and the look-ahead (?=(.*?)\|$) succeeds on [test]| (its inner group 2 captures [test]).
Since group 1 is empty and the rest of the pattern is a look-ahead, the entire begin regex does not consume any text, so position stays at 0.
The end regex ^(?!\1[ \t])(?=[ \t]*\S) is now resolved, since it contains a back-reference (\1, to group 1, which was empty), to: ^(?![ \t])(?=[ \t]*\S)
The state machine is now inside the indentedVerbatimOp state with the resolved end regex ^(?![ \t])(?=[ \t]*\S) and without any internal patterns, textToTokenize = [test]|, position = 0.
The end rule matches, since ^(?![ \t])(?=[ \t]*\S) matches [test]| at position 0 without consuming anything.
The state pops, and the state machine is back in the initial root state with textToTokenize = [test]|, position = 0.

We have code to detect endless loops, and the entire line is skipped when the grammar appears to trigger one. It is possible that Sublime handles endless loops differently than we do, but AFAIK the handling of "bad" grammars (ones that contain endless loops) is unspecified.

Might I suggest that you tweak your rules so they do not trigger endless loops and unspecified behaviour? For example, consume text in the begin rule by making begin = ^([ \t]*)((.*?)\|$). That would make your grammar loop-free and would guarantee the same behaviour across all editors that support TM grammars.

alexdima · 2016-05-23T13:17:05Z

I went further and made our handling of looping grammars not pop the bad state in this case, which gives these results in VS Code without any changes on your part:

But it is a good idea to remove the loop anyway :)

be5invis · 2016-05-23T14:00:58Z

@alexandrudima
The grammar is DESIGNED not to advance. The key difficulty is that the first line should be colorized as well (the [test] is a function call in Youki's syntax). Therefore the begin pattern should not consume any text, so that the first line is colorized by <patterns>. I do not know how to assign sub-patterns to the text matched by begin or end.

alexdima · 2016-05-26T06:41:30Z

@be5invis Here is how to colorize the begin line differently from the contents (using beginCaptures) while keeping the grammar non-looping (correct):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>fileTypes</key>
    <array>
        <string>testlang</string>
    </array>
    <key>name</key>
    <string>testlang</string>
    <key>patterns</key>
    <array>
        <dict>
            <key>include</key>
            <string>#indentedVerbatimOp</string>
        </dict>
    </array>
    <key>repository</key>
    <dict>
        <key>indentedVerbatimOp</key>
        <dict>
            <key>begin</key>
            <string>^([ \t]*)((.*?)\|$)</string>
            <key>beginCaptures</key>
            <dict>
                <key>3</key>
                <dict>
                    <key>name</key>
                    <string>meta.storage.type.function</string>
                </dict>
            </dict>
            <key>patterns</key>
            <array>
                <dict>
                    <key>match</key>
                    <string>.+</string>
                    <key>name</key>
                    <string>string</string>
                </dict>
            </array>
            <key>end</key>
            <string>^(?!\1[ \t])(?=[ \t]*\S)</string>
            <key>name</key>
            <string></string>
        </dict>
    </dict>
    <key>scopeName</key>
    <string>text.testlang</string>
    <key>uuid</key>
    <string>159375af-d9b4-448c-a9da-d235eadf3556</string>
</dict>
</plist>

be5invis · 2016-05-26T08:40:02Z

@alexandrudima However, the leader line can have nested sub-parts, like this:

[test "string"]|
    whatever

The "string" should be colored separately. beginCaptures is not enough.

alexdima · 2016-05-27T08:57:51Z

@be5invis You can (re)tokenize the captured text via beginCaptures:

Here is an example where this is done for the captures of a match rule; the principle is the same:
https://github.com/Microsoft/vscode-textmate/blob/master/test-cases/first-mate/fixtures/makefile.json#L350

Here the match captures the entire line which is then further (re)tokenized to assign more fine-grained tokens. You just need to convert that into equivalent plist.
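
A plist version of that idea might look like the sketch below, which goes inside the repository dict. It reuses the consuming begin pattern from the grammar earlier in this thread and re-tokenizes capture group 3 (the leader text before the bar) by giving the capture its own patterns array. The rule name leaderString and the scope string.quoted.double.testlang are hypothetical, chosen only for illustration; a real grammar would include its own function-call rules there. This has not been tested against VS Code.

```xml
<key>indentedVerbatimOp</key>
<dict>
    <key>begin</key>
    <string>^([ \t]*)((.*?)\|$)</string>
    <key>beginCaptures</key>
    <dict>
        <!-- Group 3 is the leader text before the trailing bar. -->
        <key>3</key>
        <dict>
            <key>name</key>
            <string>meta.storage.type.function</string>
            <!-- Re-tokenize the captured text with nested rules. -->
            <key>patterns</key>
            <array>
                <dict>
                    <key>include</key>
                    <string>#leaderString</string>
                </dict>
            </array>
        </dict>
    </dict>
    <!-- Body of the block, as in the earlier grammar. -->
    <key>patterns</key>
    <array>
        <dict>
            <key>match</key>
            <string>.+</string>
            <key>name</key>
            <string>string</string>
        </dict>
    </array>
    <key>end</key>
    <string>^(?!\1[ \t])(?=[ \t]*\S)</string>
</dict>
<!-- Hypothetical rule: a double-quoted string on the leader line. -->
<key>leaderString</key>
<dict>
    <key>match</key>
    <string>"[^"]*"</string>
    <key>name</key>
    <string>string.quoted.double.testlang</string>
</dict>
```

With this shape, the "string" in [test "string"]| would get its own scope while the rest of group 3 keeps the capture's name.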

be5invis · 2016-05-27T09:10:08Z

@alexandrudima Thank you.

octref · 2018-09-25T19:02:35Z

I still don't understand why we do this:

Why do begin and end have to consume anything? If the inner pattern consumes something, that should be enough to advance parsing without an infinite loop.
I think one step is wrong in the walkthrough in #12 (comment). @alexandrudima my understanding is that if begin matches at a zero-width location, the end pattern will go forward from that point on rather than start matching at that same point, which is what would cause begin and end to consume the same zero-width part.
There are many existing grammars that use lookahead or \G for boundaries. What's a migration path for them? I don't want to add (?=.) to each of them: microsoft/vscode@ccd3c1f.

/cc @aeschli

RedCMD · 2023-12-06T03:21:34Z

For anyone who comes across this: the "end" rule is attempted before anything in the "patterns" array.

To delay it, either use "applyEndPatternLast": true
OR
use (?!\\G) in your end rule:
"end": "(?!\\G)..."
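
For the plist grammars used in this issue, the two options might be written as in the sketch below; pick one or the other. The doubled backslash in \\G above is JSON string escaping, so the XML form uses a single backslash. TextMate grammars conventionally write the flag as the integer 1; this sketch assumes that convention is accepted here as well. The first option only helps when the rule has inner patterns to try, and neither variant was tested against the grammar in this thread.

```xml
<!-- Option 1: try the rules in "patterns" before the end pattern. -->
<key>applyEndPatternLast</key>
<integer>1</integer>
<key>end</key>
<string>^(?!\1[ \t])(?=[ \t]*\S)</string>

<!-- Option 2: keep end from matching at the spot where begin stopped. -->
<key>end</key>
<string>(?!\G)^(?!\1[ \t])(?=[ \t]*\S)</string>
```

In the second option, \G anchors to the position where the begin match ended, so the negative look-ahead blocks a zero-width end match at that same position.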

be5invis mentioned this issue May 18, 2016

VSCode's tmLanguage support cannot match zero-width begin and end correctly. microsoft/vscode#6493

Closed

alexdima closed this as completed in 8b9f58f May 23, 2016

This was referenced Apr 5, 2017

[c++] grammar in endless loop microsoft/vscode#23850

Closed

Grammar in endless loop errors #40

Closed

alexdima mentioned this issue Sep 19, 2018

anchor operation seems sparadic past first repetition on a line. #70

Closed

octref mentioned this issue Sep 19, 2018

CSS grammer in endless loop microsoft/vscode#57407

Closed

matter123 mentioned this issue Mar 22, 2019

if constexpr causes incorrect syntax highlighting for rest of line jeff-hykin/better-cpp-syntax#43

Closed
