Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing to parse some sublime-syntax files #156

Closed
sharkdp opened this issue May 2, 2018 · 10 comments
Closed

Failing to parse some sublime-syntax files #156

sharkdp opened this issue May 2, 2018 · 10 comments

Comments

@sharkdp
Copy link
Contributor

sharkdp commented May 2, 2018

I'm currently using syntect with a custom SyntaxSet which is compiled from several sources.

I have experienced a few problems with some syntax files due to errors in the regex parsing (unfortunately resulting in a panic!, see #98).

I'm not sure if there is something that can be done about this here, but I thought it would be good to share (will #34 change something in this respect?).

Invalid pattern in look-behind

I downloaded an advanced Markdown syntax definition from https://github.com/jonschlinkert/sublime-markdown-extended and it results in a thread 'main' panicked at 'called Result::unwrap() on an Err value: Error(-122, invalid pattern in look-behind)', libcore/result.rs:945:5 when using it with an input like this:

Test

<div>
</div>

I tracked the error down to this section

    - match: '^(?=<(p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del)\b)(?!.*?</\1>)'
      comment: Markdown formatting is disabled inside block-level tags.
      push:
        - meta_scope: meta.disable-markdown
        - match: (?<=^</\1>$\n)
          pop: true
        - include: scope:text.html.basic
        - include: scope:text.html.handlebars

and this regular expression in particular: (?<=^</\1>$\n). I was able to fix the parsing-error by removing the trailing \n in the parens, but I'm not sure if this is the right way to go.

Target of repeat operator is invalid

The current version of the Javascript.sublime-syntax in https://github.com/sublimehq/Packages fails with target of repeat operator is invalid on any JavaScript file, for example:

var x = 2;

I tracked the error down to this section

variables:
  identifier_start: '[_$\p{L}\p{Nl}]'
  identifier_part: '[_$\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\x{200C}\x{200D}]'
  identifier_break: (?!{{identifier_part}})

  identifier: '{{identifier_start}}{{identifier_part}}*{{identifier_break}}'
  constant_identifier: '[[:upper:]]{{identifier_part}}*{{identifier_break}}'
  dollar_only_identifier: '\${{identifier_break}}'
  dollar_identifier: '(\$){{identifier_part}}*{{identifier_break}}+'
  func_lookahead: '\s*(async\s+)?function{{identifier_break}}'
  arrow_func_lookahead: '\s*(async\s*)?({{identifier}}|\(([^()]|\([^()]*\))*\))\s*=>'
  either_func_lookahead: (?:{{func_lookahead}}|{{arrow_func_lookahead}})
  binding_pattern_lookahead: (?:{{identifier}}|\[|\{)
  left_expression_end_lookahead: '(?!\s*[.\[\(])'

and the dollar_identifier regex in particular. The last part of this regex is {{identifier_break}}+ which expands to (?!...)+. The + here is superfluous and I have opened a pull request to remove it, but it would be great if syntect would not panic in such a case.

@trishume
Copy link
Owner

trishume commented May 3, 2018

Yah the fact that syntect panics on invalid regexes is unfortunate, and should probably be fixed for 3.0 when we next make breaking API changes. It may require a lot of bubbling though.

As for the specific errors, I wonder if we're using an older version of Oniguruma than Sublime or a newer one. Because Sublime should be parsing these using the same regex parser we are.

@robinst
Copy link
Collaborator

robinst commented May 3, 2018

Regarding the first problem, "Invalid pattern in look-behind":

The regex (?<=^</\1>$\n) as it is in the syntax definition works fine with Oniguruma. The problem is that we rewrite it to (?<=^</\1>$\z) for the "no-newlines" mode, which is invalid. That's a bug in syntect that we can fix (here somewhere).

But, I would recommend that you switch to "newlines" mode. To do that, do the following changes:

Some background: The regexes in Sublime Text are all written assuming that the lines that they match on include the trailing \n, because Sublime Text itself does that. Sometimes that's not simple to do and would mean you'd have to copy every line. That's why syntect offers a "no newlines" mode, where it tries to rewrite regexes so that they match on lines that don't contain the trailing \n. But that's not possible to do correctly in all cases, and as you can see with this example, it contains bugs too.

@robinst
Copy link
Collaborator

robinst commented May 3, 2018

@trishume: With regards to panicking, can we make it so that we try to compile the regexes when loading from YAML (just for checking them)? So loading them would fail, instead of only later when using them for parsing. I think that would reduce the bubbling a bit.

@trishume
Copy link
Owner

trishume commented May 3, 2018

@robinst yah that would help. But the trick there is regexes with capture interpolations. There I guess you have to compile them with some placeholder such that if that compiles then any validly escaped interpolation will compile.

@robinst
Copy link
Collaborator

robinst commented May 3, 2018

Yeah, replacing them with placeholders sounds good.

@sharkdp
Copy link
Contributor Author

sharkdp commented May 3, 2018

The regex (?<=^</\1>$\n) as it is in the syntax definition works fine with Oniguruma. The problem is that we rewrite it to (?<=^</\1>$\z) for the "no-newlines" mode, which is invalid. That's a bug in syntect that we can fix (here somewhere).

Oh, that's why I had problems finding that regular expression initially! I didn't remember that "no-newlines" option, thank you very much for pointing this out (and also for the detailed instructions - I will try this!).

@trishume: With regards to panicking, can we make it so that we try to compile the regexes when loading from YAML (just for checking them)?

Two things to (possibly) be aware of

  • This will probably have a significant impact on SyntaxSet loading performance and also overall performance (if someone only uses a subset of the SyntaxSet there is no need to check all language definitions)?
  • If the load_syntaxes function (with the current function signature) would fail with Err(..) if any regex is invalid, you would get either all or nothing. Right now, you get a partially-working SyntaxSet (not sure what is to be preferred).

@robinst
Copy link
Collaborator

robinst commented May 3, 2018

The idea is that it would only impact loading from YAML files, which I'd recommend to do infrequently anyway.

After that, you can serialize the syntax definition (using bincode) as a cache. Then later you load the cached version which does not do the checks.

@keith-hall
Copy link
Collaborator

I downloaded an advanced Markdown syntax definition from https://github.com/jonschlinkert/sublime-markdown-extended

You'll probably find that the default Markdown syntax definition is actually much better than all the old third party ones these days, and will likely have none of these compatibility problems. From a quick glance at the readme of Markdown Extended, the only feature you'd not get is the YAML "front matter" (and only the official SublimeHQ syntaxes would be highlighted in fenced code blocks).

robinst added a commit that referenced this issue May 6, 2018
For single lines without newlines, the end result is the same. The
advantage of `$` is that it can be used in lookbehinds. `\z` in a
lookbehind results in a regex compilation error, so rewriting would need
to be more complicated. See #156.
@robinst
Copy link
Collaborator

robinst commented May 6, 2018

Created a PR so that the regex (?<=^</\1>$\n) is rewritten to (?<=^</\1>$$) which Oniguruma is fine with: #157

@sharkdp
Copy link
Contributor Author

sharkdp commented May 6, 2018

You'll probably find that the default Markdown syntax definition is actually much better than all the old third party ones these days, and will likely have none of these compatibility problems. From a quick glance at the readme of Markdown Extended, the only feature you'd not get is the YAML "front matter" (and only the official SublimeHQ syntaxes would be highlighted in fenced code blocks).

@keith-hall Thank you very much for the feedback. I tried the default Markdown syntax definition but some of the highlighting is done with background colors (which I don't want to use in the terminal / my application) and I didn't get the syntax highlighting in code blocks to work(?).


With #157 merged, I think this can be closed. My second problem ("Target of repeat operator is invalid") is already addressed somewhere else and the panic! discussion should probably be moved to #98.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants