VSCode's tmLanguage support cannot match zero-width begin and end correctly. #12

Closed
be5invis opened this issue May 18, 2016 · 9 comments

Comments

@be5invis

The syntax is:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>fileTypes</key>
    <array>
        <string>testlang</string>
    </array>
    <key>name</key>
    <string>testlang</string>
    <key>patterns</key>
    <array>
        <dict>
            <key>include</key>
            <string>#indentedVerbatimOp</string>
        </dict>
    </array>
    <key>repository</key>
    <dict>
        <key>indentedVerbatimOp</key>
        <dict>
            <key>begin</key>
            <string>^([ \t]*)(?=(.*?)\|$)</string>
            <key>end</key>
            <string>^(?!\1[ \t])(?=[ \t]*\S)</string>
            <key>name</key>
            <string>string.unquoted.verbatim.youki</string>
        </dict>
    </dict>
    <key>scopeName</key>
    <string>text.testlang</string>
    <key>uuid</key>
    <string>159375af-d9b4-448c-a9da-d235eadf3556</string>
</dict>
</plist>

It matches an indented block whose leading line ends with a bar (|). I've found that Code cannot match the block's end, which is zero-width:
image

Sublime can match it perfectly:
image

If I add some characters to the end pattern, Code works again:

            <key>end</key>
            <string>^(?!\1[ \t])(?=[ \t]*\S)xx</string>

image

@alexdima
Member

@be5invis This grammar produces an endless loop because the begin and end patterns do not consume anything. This is what happens when tokenizing [test]|:

  • state machine is initial root state, textToTokenize = [test]|, position = 0.
  • the begin rule matches since ^([ \t]*)(?=(.*?)\|$) matches [test]|
  • From the begin rule, group 1 ([ \t]*) matches the empty string, and the look-ahead (?=(.*?)\|$) succeeds against [test]| (its inner group 2, (.*?), captures [test])
  • Since group 1 is empty and the rest of the pattern is a look-ahead, the entire begin regex does not consume any text, so position stays at 0.
  • Because the end regex ^(?!\1[ \t])(?=[ \t]*\S) contains a back-reference (\1, to group 1, which was empty), it is now resolved to: ^(?![ \t])(?=[ \t]*\S)
  • state machine is now inside indentedVerbatimOp state with the resolved end regex ^(?![ \t])(?=[ \t]*\S) and without any internal patterns, textToTokenize = [test]|, position = 0.
  • the end rule matches since ^(?![ \t])(?=[ \t]*\S) matches [test]|
  • the state pops, and the state machine is again in initial root state with textToTokenize = [test]|, position = 0.

We have code to detect endless loops, and the entire line is skipped since it seems to trigger an endless loop driven by the grammar. It is possible that Sublime has implemented endless loop handling differently than we did, but AFAIK the handling of "bad" grammars (that contain endless loops) is unspecified.

Might I suggest that you tweak your rules so that they do not trigger endless loops and unspecified behaviour? For example, consume text in the begin rule: make begin = ^([ \t]*)((.*?)\|$). That would make your grammar loop-free and would guarantee the same behaviour across all the editors that support TM grammars.

@alexdima
Member

I've gone further and made our handling of looping grammars in this case not pop the bad state, which gives these results in vscode without any changes on your part:

But it is a good idea to remove the loop anyways :)

image

@be5invis
Author

be5invis commented May 23, 2016

@alexandrudima
The syntax is DESIGNED to make no advance. The key difficulty in this syntax is that the first line should be colorized too (the [test] is a function call in Youki's syntax). Therefore the begin pattern should not consume any text, so that the first line is colorized by <patterns>. I do not know how to assign sub-patterns to the text matched by begin or end.

@alexdima
Member

@be5invis Here is how to colorize the begin match differently from the contents (using beginCaptures), while having a non-looping (correct) grammar:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>fileTypes</key>
    <array>
        <string>testlang</string>
    </array>
    <key>name</key>
    <string>testlang</string>
    <key>patterns</key>
    <array>
        <dict>
            <key>include</key>
            <string>#indentedVerbatimOp</string>
        </dict>
    </array>
    <key>repository</key>
    <dict>
        <key>indentedVerbatimOp</key>
        <dict>
            <key>begin</key>
            <string>^([ \t]*)((.*?)\|$)</string>
            <key>beginCaptures</key>
            <dict>
                <key>3</key>
                <dict>
                    <key>name</key>
                    <string>meta.storage.type.function</string>
                </dict>
            </dict>
            <key>patterns</key>
            <array>
                <dict>
                    <key>match</key>
                    <string>.+</string>
                    <key>name</key>
                    <string>string</string>
                </dict>
            </array>
            <key>end</key>
            <string>^(?!\1[ \t])(?=[ \t]*\S)</string>
            <key>name</key>
            <string></string>
        </dict>
    </dict>
    <key>scopeName</key>
    <string>text.testlang</string>
    <key>uuid</key>
    <string>159375af-d9b4-448c-a9da-d235eadf3556</string>
</dict>
</plist>


@be5invis
Author

be5invis commented May 26, 2016

@alexandrudima However the leader line can have nested sub-parts, like this:

[test "string"]|
    whatever

The "string" should be colored separately. beginCaptures is not enough.

@alexdima
Member

alexdima commented May 27, 2016

@be5invis You can (re)tokenize the captured text via beginCaptures.

Here is an example where it is done for the captures of a match rule, but the principle is the same:
https://github.com/Microsoft/vscode-textmate/blob/master/test-cases/first-mate/fixtures/makefile.json#L350

There the match captures the entire line, which is then further (re)tokenized to assign more fine-grained tokens. You just need to convert that into the equivalent plist.
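For reference, a plist version of that idea might look roughly like the sketch below. It is meant as a replacement for the dict under the repository key, not as a complete grammar, and it has not been tested. The indentedVerbatimOp rule follows the non-looping grammar posted earlier in this thread; the leaderLine entry, its regex and the string.quoted.double.testlang scope name are hypothetical, chosen only to colour the "string" part of [test "string"]|.

```xml
<!-- Sketch: contents of the dict under the repository key. -->
<dict>
    <key>indentedVerbatimOp</key>
    <dict>
        <key>begin</key>
        <string>^([ \t]*)((.*?)\|$)</string>
        <key>beginCaptures</key>
        <dict>
            <!-- Group 3 is the leader text before the trailing bar. -->
            <key>3</key>
            <dict>
                <!-- Re-tokenize the captured text with other rules. -->
                <key>patterns</key>
                <array>
                    <dict>
                        <key>include</key>
                        <string>#leaderLine</string>
                    </dict>
                </array>
            </dict>
        </dict>
        <key>patterns</key>
        <array>
            <dict>
                <key>match</key>
                <string>.+</string>
                <key>name</key>
                <string>string</string>
            </dict>
        </array>
        <key>end</key>
        <string>^(?!\1[ \t])(?=[ \t]*\S)</string>
    </dict>
    <!-- Hypothetical rule: double-quoted strings inside the leader line. -->
    <key>leaderLine</key>
    <dict>
        <key>match</key>
        <string>"[^"]*"</string>
        <key>name</key>
        <string>string.quoted.double.testlang</string>
    </dict>
</dict>
```

Any rule that should apply to the leader line (function calls, strings and so on) would be added to leaderLine or included from it, while the begin pattern still consumes the whole line and so keeps the grammar loop-free.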

@be5invis
Author

@alexandrudima Thank you.

@octref

octref commented Sep 25, 2018

I still don't understand why we do this:

  • Why do begin and end have to consume anything? If the inner pattern consumes something, that should be enough to advance parsing without an infinite loop.
  • I think there is a wrong step in #12 (comment). @alexandrudima my understanding is that when begin matches a zero-width span at some position, the end pattern will search forward from that point on rather than start matching at that point, which is what causes begin and end to consume the same zero-width part.
  • There are many existing grammars that use lookahead or \G for boundaries. What's a migration path for them? I don't want to add (?=.) to each of them: microsoft/vscode@ccd3c1f.

/cc @aeschli

@RedCMD

RedCMD commented Dec 6, 2023

For anyone who comes across this:
the "end" rule is attempted before anything in the "patterns" array.

To delay it,
use "applyEndPatternLast": true
OR
use (?!\\G) in your end rule:
"end": "(?!\\G)..."
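
In a plist grammar such as the one in this issue, the same advice might be written as in the sketch below. It is untested and rests on a few assumptions: the rule is the original zero-width indentedVerbatimOp from the first post, the backslash before G is single because plist strings do not need JSON escaping, and whether this gives the desired highlighting for this particular grammar has not been verified.

```xml
<!-- Sketch of the original zero-width rule with the end pattern delayed. -->
<dict>
    <key>begin</key>
    <string>^([ \t]*)(?=(.*?)\|$)</string>
    <!-- (?!\G) refuses to match where the begin match ended, -->
    <!-- so end cannot fire at position 0 of the leader line. -->
    <key>end</key>
    <string>(?!\G)^(?!\1[ \t])(?=[ \t]*\S)</string>
    <key>name</key>
    <string>string.unquoted.verbatim.youki</string>
    <!-- The other option is this key on the same rule: -->
    <!-- <key>applyEndPatternLast</key> <true/> -->
</dict>
```

As far as I understand, applyEndPatternLast only changes the outcome when an inner pattern and the end pattern match at the same position, so on a rule without inner patterns, like this one, the (?!\G) form is probably the one that matters.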
