How to define begin and end patterns which span multiple lines? #41

Closed
kumarharsh opened this issue Apr 14, 2017 · 16 comments

@kumarharsh

kumarharsh commented Apr 14, 2017

Hi, I'm trying to extend my plugin, graphql-for-vscode, to support Gherkin feature files, but I'm stuck defining the grammar. The begin regexp doesn't seem to match anything when I specify a newline (\n).

Specifically:

With this syntax definition:

{
  "fileTypes": ["feature"],
  "scopeName": "text.gherkin.feature.graphql",
  "injectionSelector": "L:text -comment",
  "patterns": [
    {
      "begin": "graphql request\\s+\"\"\"",
      "end": "\"\"\"",
      "patterns": [
        {
          "name": "featureGQL",
          "include": "source.graphql"
        }
      ]
    }
  ]
}

I don't get any matches:
[screenshot]

but modifying the source text to bring the beginning """ to the same line as graphql request does the trick:

[screenshot]

I tried modifying the begin regexp to be: "begin": "graphql request\\n\\s+\"\"\"", but it didn't help - in fact, it stopped highlighting anything within quotes.

I've spent some time browsing other syntaxes in VS Code and TextMate, but I could only find \n used in match patterns, never in begin or end patterns.

kumarharsh changed the title from "Do begin and end not match newlines?" to "How to define begin and end patterns which span multiple lines?" on Apr 14, 2017
@kumarharsh
Author

I also tried running npm run inspect on my syntax definition, and I can see that the tokenizer takes one line at a time. So it seems it's not possible to define multi-line begin/end rules? If not, is there an alternative?

kumarharsh pushed a commit to kumarharsh/graphql-for-vscode that referenced this issue Apr 15, 2017
- new lines don't work in begin/end patterns, so there doesn't seem the be a way to match the intent: "start highlighting only when a line with `graphql request` is followed by `"""` in the next line, and end when the next set of `"""` is found.

Ref: microsoft/vscode-textmate#41
@kumarharsh
Author

@alexandrudima friendly ping

@ghost

ghost commented Nov 14, 2017

Regular expressions in TextMate grammars will not match across multiple lines. This is a fundamental limitation of the format. The only way to match content across multiple lines is to chain begin/end matches.

@alexdima
Member

@freebroccolo is correct. Each line (with \n appended at the end) will be evaluated, one at a time, in order, against a grammar...
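
A minimal sketch of why a begin pattern containing \n followed by more text never matches. It uses Python's re module as a stand-in for Oniguruma (an assumption; the engines differ in details, but the line-at-a-time behaviour is the point), and the sample text is made up for illustration:

```python
import re

# The begin pattern from the question, written as a plain regex.
BEGIN = re.compile(r'graphql request\n\s+"""')

# Hypothetical sample document, shaped like the feature file in the question.
text = 'Given I make a graphql request\n    """\n    mutation { }\n    """\n'

# Matching against the whole buffer works, but that is not what the tokenizer does.
whole_buffer_match = BEGIN.search(text) is not None

# The tokenizer sees one line at a time, each with its trailing "\n".
lines = text.splitlines(keepends=True)
per_line_matches = [BEGIN.search(line) is not None for line in lines]

print(whole_buffer_match)  # True
print(per_line_matches)    # [False, False, False, False]
```

Because nothing follows the "\n" within a single line, the `\s+"""` part of the pattern has no text left to match.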

@colinfang

Do you happen to know what the prefix L: means in the injectionSelector? I cannot find any docs on it.

@alexdima
Member

I think @aeschli would know.

@kumarharsh
Author

kumarharsh commented Nov 27, 2017

The L: part means left injection, i.e., the grammar rules are injected to the left of the existing rules for the scope being highlighted. When doing syntax highlighting, the left-most rule has higher precedence than the rules to its right, so L: ensures that this syntax highlighting overrides the default ones.

Ref: textmate/markdown.tmbundle#15 (comment)

@kumarharsh
Author

@aeschli I have a follow-up question on defining grammars that match across multiple lines. I've created a grammar for Gherkin which should match and highlight GraphQL defined like so:

    Given I make a graphql request
    """
    mutation {
      UserCreate(input: {...}) {
        clientMutationId
      }
    }
    """
    Then I expect the response
    """
    {
      "data": {...},
      "errors": [...],
    }
    """

The syntax definition should apply when the following conditions hold:

  • Line X should end with graphql request string.
  • Line X+1 should have the """ docstring marker
  • Line X+2 to X+Y should have the graphql query which would be highlighted using the graphql syntax.
  • Line X+Y+1 should end with a """ docstring marker.
  • Line X+Y+2 would have a newline or some other gherkin code, which should be highlighted with the gherkin syntax.

The syntax definition I tried was like this:

{
  "injectionSelector": "L:text -comment",
  "patterns": [
    {
      "begin": "graphql request\\s*$",  // STAGE_1
      "patterns": [
        {
          "begin": "^\\s*(\"\"\")$",  // STAGE_2
          "beginCaptures": {
            "1": { "name": "string.quoted.double.graphql.begin" }
          },
          "end": "^\\s*(\"\"\")$",
          "endCaptures": {
            "1": { "name": "string.quoted.double.graphql.end" }
          },
          "patterns": [
            { "include": "source.graphql" }
          ]
        }
      ]
    },
  ]
  ...
}

It highlights the GraphQL syntax correctly, but the highlighter doesn't revert to the Gherkin syntax after the closing """ (string.quoted.double.graphql.end). I believe this is because STAGE_1 has no end pattern. But how would I define one? Both docstring markers are consumed by the STAGE_2 patterns, so nothing is left over that STAGE_1 could use as its end.

@aeschli
Contributor

aeschli commented Jan 16, 2018

@kumarharsh Sorry, I'm no expert either, but I also believe that has to do with the missing end rule.
The markdown grammar uses the begin/while loop for something similar, maybe that helps (just guessing):
https://github.com/Microsoft/vscode/blob/1eca6b9817f1f44486cc966d8fc448ee95728b8f/extensions/markdown/syntaxes/markdown.tmLanguage.base#L75

@kumarharsh
Author

I have seen that, but I still can't work out how to go about it: as soon as I define the while part, the first """ will match and the syntax will end right there.

    Given I make a graphql request  // begin matches this
    """                             // while would end here(?)
    mutation {

I guess I'll drop this attempt here, as I feel the TextMate grammar is handicapped by default.
Also, there doesn't seem to be any definitive guide on how to construct such grammars. The original TextMate guide doesn't even describe while usage; there are just devs suffering a world of pain.

kumarharsh pushed a commit to kumarharsh/graphql-for-vscode that referenced this issue Jan 17, 2018
- remove support for syntax starting with "graphql request" followed by a new line with `"""`. Seems impossible (microsoft/vscode-textmate#41 (comment)).
@ghost

ghost commented Jan 17, 2018

@kumarharsh For situations like this you have to make use of oniguruma lookarounds.

Here is a modification of your original example using a lookbehind:

{
  "injectionSelector": "L:text -comment",
  "patterns": [
    {
      "begin": "graphql request\\s*$",  // STAGE_1
      "end": "(?<=\"\"\")", // end if the last token consumed was the closing """
      "patterns": [
        {
          "begin": "^\\s*(\"\"\")$",  // STAGE_2
          "end": "^\\s*(\"\"\")",
          "patterns": [
            { "include": "source.graphql" }
          ]
        }
      ]
    },
  ]
  ...
}
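
To see why the lookbehind end lets the highlighter return to Gherkin, here is a toy model of the rule stack. It is only a sketch under stated assumptions: Python's re stands in for Oniguruma, a rule without an end is modelled as never closing, the earliest match on a line wins with the end pattern winning ties, and the function name depth_per_line and the sample lines are hypothetical.

```python
import re

STAGE1_BEGIN = re.compile(r'graphql request\s*$')
STAGE2_BEGIN = re.compile(r'^\s*(""")$')
STAGE2_END = re.compile(r'^\s*(""")')

def depth_per_line(lines, stage1_end=None):
    """Return the rule-stack depth after each line (0 means back in plain Gherkin)."""
    outer_end = re.compile(stage1_end) if stage1_end else None
    depth, depths = 0, []
    for line in lines:
        pos = 0
        while True:
            # Candidates are (match, depth change). The end pattern is listed first
            # so that it wins ties; otherwise the earliest match start wins.
            if depth == 0:
                candidates = [(STAGE1_BEGIN.search(line, pos), +1)]
            elif depth == 1:
                candidates = [(outer_end.search(line, pos) if outer_end else None, -1),
                              (STAGE2_BEGIN.search(line, pos), +1)]
            else:
                candidates = [(STAGE2_END.search(line, pos), -1)]
            found = [c for c in candidates if c[0] is not None]
            if not found:
                break
            match, step = min(found, key=lambda c: c[0].start())
            depth, pos = depth + step, match.end()
        depths.append(depth)
    return depths

doc = [
    'Given I make a graphql request\n',
    '"""\n',
    'mutation { }\n',
    '"""\n',
    'Then I expect the response\n',
]
print(depth_per_line(doc))               # [1, 2, 2, 1, 1]  STAGE_1 never closes
print(depth_per_line(doc, r'(?<=""")'))  # [1, 2, 2, 0, 0]  lookbehind end closes it
```

In this model the zero-width lookbehind matches right after STAGE_2 has consumed the closing """, which is the only moment STAGE_1's own end pattern gets a chance to run.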

Alternatively you can use a lookahead (although I find this is usually a worse choice):

{
  "injectionSelector": "L:text -comment",
  "patterns": [
    {
      "begin": "graphql request\\s*$",  // STAGE_1
      "end": "(\"\"\")$",
      "patterns": [
        {
          "begin": "^\\s*(\"\"\")$",  // STAGE_2
          "end": "^\\s*(?=\"\"\")",
          "patterns": [
            { "include": "source.graphql" }
          ]
        }
      ]
    },
  ]
  ...
}

One pattern you will want to use to match multi-line syntactic constructs accurately is as follows. This is what I was referring to earlier about chaining "begin"/"end" matches. You might want to adapt your example to this style depending on how many stages there will be after the first quoted block.

{
  "patterns": [
    {
      "begin": "A",
      "end": "B",
    },
    {
      "begin": "(?<=B)",
      "end": "C",
    },
    {
      "begin": "(?<=C)",
      "end": "D",
    },
    …
  ]
}
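
A rough simulation of the chained style, again with Python's re standing in for Oniguruma and with a deliberately simplified tokenizer loop (both are assumptions, as are the rule names and the sample lines). Each rule begins with a zero-width lookbehind on the token that closed the previous rule, so the stages hand over to each other without nesting:

```python
import re

# Each rule is (name, begin, end); the three rules mirror the A/B/C/D chain above.
RULES = [
    ("first", re.compile(r"A"), re.compile(r"B")),
    ("second", re.compile(r"(?<=B)"), re.compile(r"C")),
    ("third", re.compile(r"(?<=C)"), re.compile(r"D")),
]

def active_rule_per_line(lines):
    """Return the name of the rule still open at the end of each line (None = root)."""
    current, trace = None, []
    for line in lines:
        pos = 0
        while True:
            if current is None:
                # At the root, the begin pattern with the earliest match wins.
                hits = [(rule[1].search(line, pos), rule) for rule in RULES]
                hits = [h for h in hits if h[0] is not None]
                if not hits:
                    break
                match, current = min(hits, key=lambda h: h[0].start())
            else:
                # Inside a rule, only that rule's end pattern can close it.
                match = current[2].search(line, pos)
                if match is None:
                    break
                current = None
            pos = match.end()
        trace.append(current[0] if current else None)
    return trace

sample = ["A\n", "x\n", "B\n", "y\n", "C\n", "z\n", "D\n", "w\n"]
print(active_rule_per_line(sample))
# ['first', 'first', 'second', 'second', 'third', 'third', None, None]
```

Every rule sits at the top level, so each one has an end the tokenizer can reach, which is what the nested STAGE_1/STAGE_2 attempt was missing.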

@kumarharsh
Author

Thanks a lot @freebroccolo. I was under the impression that lookbehinds were not supported by JS/TS. Didn't think of the second way though. The third example is great. I had misconstrued 'chaining' to mean 'nesting' 🤦‍♂️

@ghost

ghost commented Jan 18, 2018

Yeah, the regexp engine used for handling TextMate grammars is usually Oniguruma, so you have a lot more flexibility than you do with JS regexps.

@JoshCheek

To clarify: for this pattern to work, it must be able to identify confidently, from the first line, that it applies?

Eg there is no way to do a markdown header where the current line could be a paragraph or could be a h1 or h2, and we can't know which, until we see the next line.

I attempted it, expecting that if the end didn't match, then it would not apply the begin, but instead, I think it just entered that node and never left it: Everything afterwards continued to have the header on it.

$ git diff --cached
diff --git a/markdown.tmLanguage.base.yaml b/markdown.tmLanguage.base.yaml
index 15df966..bf78da0 100644
--- a/markdown.tmLanguage.base.yaml
+++ b/markdown.tmLanguage.base.yaml
@@ -3,12 +3,13 @@ keyEquivalent: ^~M
 name: Markdown
 patterns:
 - {include: '#frontMatter'}
+- {include: '#heading-atx'}
+- {include: '#heading-setext'}
 - {include: '#block'}
 repository:
   block:
     patterns:
     - {include: '#separator'}
-    - {include: '#heading'}
     - {include: '#blockquote'}
     - {include: '#lists'}
     - {include: '#fenced_code_block'}
@@ -22,6 +23,8 @@ repository:
       '2': {name: punctuation.definition.quote.begin.markdown}
     name: markup.quote.markdown
     patterns:
+    - {include: '#heading-atx'}
+    - {include: '#heading-setext'}
     - {include: '#block'}
     while: (^|\G)\s*(>) ?
 {{languageDefinitions}}
@@ -38,7 +41,7 @@ repository:
     endCaptures:
       '3': {name: punctuation.definition.markdown}
     name: markup.fenced_code.block.markdown
-  heading:
+  heading-atx:
     match: (?:^|\G)[ ]{0,3}(#{1,6}\s+(.*?)(\s+#{1,6})?\s*)$
     captures:
       '1':
@@ -83,9 +86,12 @@ repository:
     patterns:
     - {include: '#inline'}
   heading-setext:
-    patterns:
-    - {match: '^(={3,})(?=[ \t]*$\n?)', name: markup.heading.setext.1.markdown}
-    - {match: '^(-{3,})(?=[ \t]*$\n?)', name: markup.heading.setext.2.markdown}
+    name: 'heading.1.markdown'
+    begin: (?:^|\G)(\w[^\n]*)$\n
+    beginCaptures: {'1': {name: entity.name.section.markdown}}
+    end: \G^(={3,})[ \t]*$\n?
+    endCaptures: {'1': {name: markup.heading.setext.1.markdown}}
+    patterns: [{match: '(?<=^={3,}[ \t]*$\n?)\G'}]
   html:
     patterns:
     - begin: (^|\G)\s*(<!--)
@@ -154,7 +160,6 @@ repository:
     patterns:
     - {include: '#inline'}
     - {include: text.html.derivative}
-    - {include: '#heading-setext'}
     while: (^|\G)(?!\s*$|#|[ ]{0,3}([-*_>][ ]{2,}){3,}[ \t]*$\n?|[ ]{0,3}[*+->]|[
       ]{0,3}[0-9]+\.)
   lists:
@@ -182,7 +187,6 @@ repository:
     patterns:
     - {include: '#inline'}
     - {include: text.html.derivative}
-    - {include: '#heading-setext'}
     while: (^|\G)((?=\s*[-=]{3,}\s*$)|[ ]{4,}(?=\S))
   raw_block: {begin: '(^|\G)([ ]{4}|\t)', name: markup.raw.block.markdown, while: '(^|\G)([
       ]{4}|\t)'}

I'm guessing it's impossible? 😞 Specifically, on "this is a paragraph", there is just no way to handle that.

$ cat test/colorize-fixtures/h1.md
nice 1
======

* list 1
* list 2

this is a paragraph
still just a paragraph

# shitty 1

nice 2
------

## shitty 2

$ jq < test/colorize-results/h1_md.json 'map({c, t})[]' -c
{"c":"nice 1","t":"text.html.markdown heading.1.markdown entity.name.section.markdown"}
{"c":"======","t":"text.html.markdown heading.1.markdown markup.heading.setext.1.markdown"}
{"c":"*","t":"text.html.markdown markup.list.unnumbered.markdown punctuation.definition.list.begin.markdown"}
{"c":" ","t":"text.html.markdown markup.list.unnumbered.markdown"}
{"c":"list 1","t":"text.html.markdown markup.list.unnumbered.markdown meta.paragraph.markdown"}
{"c":"*","t":"text.html.markdown markup.list.unnumbered.markdown punctuation.definition.list.begin.markdown"}
{"c":" ","t":"text.html.markdown markup.list.unnumbered.markdown"}
{"c":"list 2","t":"text.html.markdown markup.list.unnumbered.markdown meta.paragraph.markdown"}
{"c":"this is a paragraph","t":"text.html.markdown heading.1.markdown entity.name.section.markdown"}
{"c":"still just a paragraph","t":"text.html.markdown heading.1.markdown"}
{"c":"# shitty 1","t":"text.html.markdown heading.1.markdown"}
{"c":"nice 2","t":"text.html.markdown heading.1.markdown"}
{"c":"------","t":"text.html.markdown heading.1.markdown"}
{"c":"## shitty 2","t":"text.html.markdown heading.1.markdown"}

@jeff-hykin

jeff-hykin commented Feb 16, 2021

To clarify: for this pattern to work, it must be able to identify confidently, from the first line, that it applies?

Yes, and that is actually a very succinct way of stating the overall fundamental limitation of the TextMate engine (not VS Code's implementation). Tree-sitter (used by Atom) was created precisely to solve this limitation.
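
For contrast, a small sketch of what becomes possible once the code can peek at the next line, which a line-at-a-time grammar cannot do. This is not a Markdown parser and not how Tree-sitter works internally; the underline pattern and the classify_lines helper are hypothetical, loosely based on the setext example above:

```python
import re

# Hypothetical underline pattern, loosely based on the setext rules in the diff above.
UNDERLINE = re.compile(r"^(={3,}|-{3,})[ \t]*$")

def classify_lines(lines):
    """Label each line, peeking one line ahead to decide whether it is a heading."""
    labels = []
    for index, line in enumerate(lines):
        next_line = lines[index + 1] if index + 1 < len(lines) else ""
        if labels and labels[-1] == "heading" and UNDERLINE.match(line):
            labels.append("underline")
        elif line.strip() and UNDERLINE.match(next_line):
            labels.append("heading")
        elif line.strip():
            labels.append("paragraph")
        else:
            labels.append("blank")
    return labels

sample = ["nice 1", "======", "", "this is a paragraph",
          "still just a paragraph", "", "nice 2", "------"]
print(classify_lines(sample))
# ['heading', 'underline', 'blank', 'paragraph', 'paragraph', 'blank', 'heading', 'underline']
```

The decision for "this is a paragraph" depends on the following line, which is exactly the information a begin pattern does not have when it has to commit.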

@JoshCheek

Thanks for confirming 🙏

For any future readers, note that VS Code apparently offers an addition to TextMate-based highlighting, called Semantic Highlighting. I haven't looked into it yet, but it is introduced like this:

Starting with release 1.43, VS Code also allows extensions to provide tokenization through a Semantic Token Provider. Semantic providers are typically implemented by language servers that have a deeper understanding of the source file and can resolve symbols in the context of the project. For example, a constant variable name can be rendered using constant highlighting throughout the project, not just at the place of its declaration.

Highlighting based on semantic tokens is considered an addition to the TextMate-based syntax highlighting. Semantic highlighting goes on top of the syntax highlighting. And as language servers can take a while to load and analyze a project, semantic token highlighting may appear after a short delay.
-- https://code.visualstudio.com/api/language-extensions/syntax-highlight-guide

seanwu1105 added a commit to seanwu1105/vscode-qt-for-python that referenced this issue Dec 17, 2021
TextMate grammer cannot move across multiple lines. Use the suggestion
workaround mentioned in this issue:
microsoft/vscode-textmate#41
michal-kapala added a commit to michal-kapala/vscode-jitterbit that referenced this issue Dec 22, 2022
Syntax highlight-level grammar support for:
+ string/int/float/bool constants
+ local/global/system variables
+ function call foundation
+ trans tag
+ line comments

- multiline comments don't work - refer to microsoft/vscode-textmate#41 (comment)
aceArt-GmbH pushed a commit to aceArt-GmbH/vscode-java that referenced this issue Nov 29, 2023
aceArt-GmbH pushed a commit to aceArt-GmbH/vscode-java that referenced this issue Dec 5, 2023
aceArt-GmbH pushed a commit to aceArt-GmbH/vscode-java that referenced this issue Dec 11, 2023
michal-kapala added a commit to michal-kapala/vscode-jitterbit that referenced this issue Mar 30, 2024
+ single-line multiline comments are properly highlighted
+ multiline comments are still broken likely due to textmate's limitation (see microsoft/vscode-textmate#41)