feat: use `regex` to enhance js engine support #762

antfu · 2024-08-30T17:54:42Z

netlify · 2024-08-30T17:55:00Z

✅ Deploy Preview for shiki-next ready!

Name	Link
🔨 Latest commit	`b878e53`
🔍 Latest deploy log	https://app.netlify.com/sites/shiki-next/deploys/66df217bd315590008851df1
😎 Deploy Preview	https://deploy-preview-762--shiki-next.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

netlify · 2024-08-30T17:55:54Z

✅ Deploy Preview for shiki-matsu ready!

Name	Link
🔨 Latest commit	`b878e53`
🔍 Latest deploy log	https://app.netlify.com/sites/shiki-matsu/deploys/66df217bad11ed0008accd4d
😎 Deploy Preview	https://deploy-preview-762--shiki-matsu.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

slevithan · 2024-08-31T09:07:23Z

Continuing the discussion from antfu/oniguruma-to-js#1 , here are some basic regex plugins that add support for more oniguruma-to-js features. I'm curious what level of support for existing TextMate grammars this would give:

import { regex } from 'regex'
import { Context, replaceUnescaped } from 'regex-utilities'

const TABLE_POSIX = {
  alnum: '0-9A-Za-z',
  alpha: 'A-Za-z',
  ascii: '\\0-\\x7F',
  blank: ' \\t',
  cntrl: '\\0-\\x1F\\x7F',
  digit: '0-9',
  graph: '!-~',
  lower: 'a-z',
  print: ' -~',
  punct: '!-/:-@[-`{-~',
  space: ' \\t\\r\\n\\v\\f',
  upper: 'A-Z',
  xdigit: '0-9A-Fa-f',
  word: '\\w',
}

function oniguruma(str) {
  str = replaceUnescaped(str, '\\\\(?:h|(?<literal>[\'"`#@]))', ({groups: {literal}}, details) => {
    if (literal) {
      return literal
    }
    let hex = TABLE_POSIX.xdigit
    return details.context === Context.DEFAULT ? `[${hex}]` : hex
  })
  str = replaceUnescaped(str, String.raw`\\(?<escape>[Az])|\(\?#[^)]*\)`, ({groups: {escape}}) => {
    if (escape) {
      return escape === 'A' ? '^' : '$'
    }
    return ''
  }, Context.DEFAULT)
  str = replaceUnescaped(str, String.raw`\[:(?<neg>\^?)(?<posix>[a-z]+):\]`, ({groups: {neg, posix}}) => {
    if (neg) {
      throw new Error('TODO: Support negated POSIX classes')
    }
    let resolved = TABLE_POSIX[posix]
    if (!resolved) {
      throw new Error(`Unknown POSIX class "${posix}"`)
    }
    return resolved
  }, Context.CHAR_CLASS)
  return str
}

regex({
  flags: 'dg',
  subclass: true,
  plugins: [oniguruma],
  unicodeSetsPlugin: null,
  disable: {
    x: true,
    n: true,
    v: true,
  },
})({ raw: [p] })

Note that this fixes a couple errors from oniguruma-to-js. Namely, it fixes the list of whitespace characters matched by [[:space:]], and it correctly does not support [:posix:] syntax outside of character classes (maybe I'm reading the code incorrectly and this is already true in oniguruma-to-js).

Edited to simplify and add support for (?#...), \A, \z, and escaped literal \@. It now depends on regex-utilities 2.3.0+.

The plugin is adding support for:

POSIX classes (in non-negated form, but allowed within a surrounding character class that is negated or non-negated).
\h, inside and outside of character classes.
\A and \z, intentionally limited to outside of character classes.
Enclosed comments, intentionally limited to outside of character classes.
Escaped literal ', ", `, #, and @, inside and outside of character classes (it's easy to add more characters if needed).

antfu · 2024-08-31T17:31:18Z

Thank you! I updated the code based on your suggestions, but unfortunately it seems to still have a lot errors. Meanwhile I am a little bit worried about the complexity here - I am still up to give regex a try if it managed to cover more cases that oniguruma-to-js couldn't do. But for now I want to move things faster, so I will get back to this a bit later. Thanks

slevithan · 2024-08-31T20:41:08Z

Thank you! I updated the code based on your suggestions, but unfortunately it seems to still have a lot errors.

I've just edited the code in my last comment to add support for \A, \z, and escaped literal \@. This resolves some but not all of the errors. It will still throw for things like \G and (?i:...).

Meanwhile I am a little bit worried about the complexity here

I've added a new feature in regex-utilities 2.3.0 and used it in the updated example code to significantly simplify. Hopefully this is now easier to follow and makes it clearer how to work with regex plugins.

I am still up to give regex a try if it managed to cover more cases that oniguruma-to-js couldn't do. But for now I want to move things faster, so I will get back to this a bit later. Thanks

For sure, this makes a lot of sense while you're in the rapid development stage. Even beyond that, though, I think the goals of these two projects might be just far enough apart that you will continue to prefer not to develop on top of regex. oniguruma-to-js is about doing whatever it takes (and tolerating incomplete solutions) to make the majority of or all TextMate grammars work, whereas regex is about modernizing JS regexes with just the most valuable features for readability/safety/power and providing complete/robust solutions for the features it offers. This puts them a bit at odds, even given its plugin system.

So no worries at all if you want to close this. It was interesting/valuable to discuss this with you and see how regex might be used in libraries like this that are about cross-flavor compatibility. It's possible that, in the future, regex will get more features and more powerful plugins that would make it worthwhile to reexamine.

codecov · 2024-09-05T20:09:54Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.32%. Comparing base (4769588) to head (b878e53).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #762      +/-   ##
==========================================
+ Coverage   92.30%   92.32%   +0.01%     
==========================================
  Files          70       70              
  Lines        4547     4558      +11     
  Branches     1009     1009              
==========================================
+ Hits         4197     4208      +11     
  Misses        345      345              
  Partials        5        5

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

package.json

scripts/report-engine-js-compat.ts

package.json

packages/core/package.json

antfu · 2024-09-06T04:09:46Z

docs/references/engine-js-compat.md

+|                 |                             Count |
+| :-------------- | --------------------------------: |
+| Total Languages |                               213 |
+| Fully Supported | [132](#fully-supported-languages) |


This gets a lot better, thanks for the updates!

Cool. I'll try to identify the issues and fix handling for grammars that used to be okay but aren't anymore. I already notice an issue with regex's support for multiple possessive quantifiers in the same pattern that should be easy to fix.

slevithan · 2024-09-08T00:39:44Z

Okay, I've run the report-engine-js script locally, and here's what I've found.

There were two possessive quantifier related bugs in regex v4.3.0 that were being triggered, that are now fixed in v4.3.2. This significantly raises the number of grammars passing.
I'm not sure what the purpose of the following code is in the latest version of this PR:

    pattern = pattern
      .replace(/\\�/g, '\\G')
      .replace(/\(\{\)/g, '(\\{)')

Removing this increases the number of grammars passing, whether using regex or not. With those lines removed, using regex 4.3.2 gives the following results:

| Fully Supported | [172](#fully-supported-languages) |
| Mismatched | [24](#mismatched-languages) |
| Unsupported | [17](#unsupported-languages) |

This is still slightly lower than the number reported when not using regex, but the report is misrepresenting the facts based on the limited grammar samples, and in fact the number of correctly supported grammars is now significantly higher than before. As a result, I think at this point this is a significantly positive change. As far as I can tell, all of the cases remaining where regex 4.3.2 flips a grammar in this report from supported to unsupported are cases where oniguruma-to-js is already getting things wrong even though it passes for the provided samples.

The culprits for the number of "supported" grammars going slightly down are the following errors that regex is throwing:

"Subroutine \g<name> followed a recursive reference"
"Invalid named capture referenced by subroutine"
- For example, given \g<1> or \g<-1>.

These are subroutines that Oniguruma supports but regex doesn't, as explained in antfu/oniguruma-to-js#1 (comment):

regex doesn't support the following subroutine features (they will throw if encountered):

Subroutines that reference groups by absolute or relative number via \g<N>.

Recursive subroutines. These can't be emulated using JS regexes.

Even though oniguruma-to-js doesn't throw on these patterns and the tests are not catching the resulting problems (given the specific grammar samples being tested), all of these patterns are currently being interpreted incorrectly. So IMO it is better that regex is throwing for them. What is happening currently (without regex) is that, since the regexes are not using flag u or v, the \g is being interpreted as matching a literal 'g' (since \g is not a known escape), and therefore e.g. \g<1> is being treated as equivalent to g<1>, which matches only the four literal characters g, <, 1, >, in sequence.

Looking at the remaining failing cases beyond the currently unsupported subroutines described above, the main culprits seem to be:

Nested character classes [...[...]] and character class intersection [...&&...].
- These would be supported if using JS's flag v, but then you'd have to deal with flag v's strict errors and different escaping rules, which would cause a LOT more errors unless you first transpiled a bunch of things to match flag v's stricter rules.
- These differences in syntax (e.g., allowing unescaped literal '[' within a character class) don't directly throw errors when not using flag u/v, but sometimes you get other errors as a side effect like "Range out of order in character class" if an unescaped - is the first char in the nested class. And in cases where nested class and intersection syntax doesn't throw, it is currently always making the pattern non-equivalent to Oniguruma (so they match the wrong things) even though this might not be caught by the specific grammar samples being tested.
Unicode categories. Ex: \p{L}, \P{Mn}, etc.
- These would be supported if using JS's flag u or v, but again you'd then have to deal with the strict errors and different escaping rules that result from these flags. When not using these flags, Unicode property syntax doesn't throw an error, but it is always matching the wrong things.

Here's what I'd personally do:

Bump to regex 4.3.2.
Remove the pattern = pattern.replace... lines I mentioned at the top of this comment, unless they are serving some important purpose.
Land the PR.

Even though the reported number of "supported" grammars would go down slightly (to 172), this should only affect grammars that are already not working correctly in at least some cases. And the number of actually supported grammars would go up significantly, since:

As a result of the improvements in the regex library that came from this discussion (thanks so much for your help and ideas!), this change should not be adding errors for any regexes that are actually already working as intended.
Many grammars currently "supported" that are not actually working correctly would gain correct support. This will be true for any grammars that use atomic groups or possessive quantifiers (which oniguruma-to-js currently neuters, which is a serious performance/security risk), as well as subroutines referencing named groups (which, since you're not using flag u or v, don't throw but never match what's intended).

If desired, after landing this change I'd recommend doing additional work in oniguruma-to-js to transpile non-strict regex syntax into strict v mode syntax, and use the resulting patterns with flag v. This work is potentially a bit tricky (and would limit support to Node.js 20 and 2023-era browsers), but I'd be happy to help with it (if desired) and it would add native support for nested character classes, set intersection/subtraction, Unicode categories, etc. (all possible but complicated and data-heavy to emulate with Unicode character tables, etc., and emulated versions would not be as fast as the native support). And after this work, more grammars would be supported than currently are, since these advanced Oniguruma features would then just work (and fast!) without any need for emulation. Additionally, the options used with regex's rewrite function could then be simplified.

antfu · 2024-09-09T13:25:46Z

Thanks a lot for your deep investigation, that looks very promising!

This is still slightly lower than the number reported when not using regex, but the report is misrepresenting the facts based on the limited grammar samples, and in fact the number of correctly supported grammars is now significantly higher than before.

Yes, I totally agree with that. It's surely better to surface the errors instead of silently eating them.

With the current result, I am more than happy to land it, and then we can iterate as we go. However, as the JS engine is mostly designed to work on browsers, the bundle size is something we need to be concerned about. If regex can do the work much better and correctly than oniguruma-to-js, I'd be more than happy to deprecate it and use regex exclusively.

The current test runs with regex -> oniguruma-to-js, where if I remove the oniguruma-to-js part, it will throw many more syntax errors. Is it possible we port those fixes to regex so shiki can solely deps on regex then?

This work is potentially a bit tricky (and would limit support to Node.js 20 and 2023-era browsers

For Shiki, in this particular case, I think it's acceptable as we consider this to be an advanced usage. If it's tricky to provide back-compact, we can just document the minimal runtime requirement for using the feature.

Thanks so much for your work and help!

antfu · 2024-09-09T16:29:46Z

I'll merge this for now, let's iterate it on main banch.

slevithan · 2024-09-09T17:49:30Z

Amazing to see this merged. 🎉🙌

as the JS engine is mostly designed to work on browsers, the bundle size is something we need to be concerned about. If regex can do the work much better and correctly than oniguruma-to-js, I'd be more than happy to deprecate it and use regex exclusively.

If I/we create a new library that extends regex, I could definitely help improve Oniguruma translation further and with greater accuracy.

It wouldn’t make sense to add extensive Oniguruma translation to the base regex lib since regex’s syntax/flags/features are not based on Oniguruma. Its much more based on PCRE, and even then it is not trying to support all PCRE features — it's only adding a minimal set of features that are most relevant for elevating native JS regexes to be one of the best and most readable/modern regex flavors in its own right.

But, it absolutely makes sense to me to create a new library (tentative name: textmate-grammar-regex; or maybe regex-plugins which would bundle a bunch of plugins under an "Oniguruma" flavor set) that builds on top of regex and can take over responsibility from oniguruma-to-js (since you mentioned you're okay with deprecating it), with Shiki as a first-class client library. Then, at the same time, I could offer new exports in the base regex lib that make it much more tree shake-able when used in this way, so that the new lib can be as lightweight as possible.

I have a bunch of ideas for this, but I won't be able to start development for it until sometime next week. I’d be happy to add you as a maintainer for this new project (totally up to you!) after the basics are set up.

What do you think? Does this seem like a decent plan?

antfu · 2024-09-09T17:54:05Z

Sounds great, looking forward to it! (and take your time ofc). Thanks!

slevithan · 2024-09-09T18:15:53Z

Cool. I think it will be a fun project so I'm looking forward to it, too.

This work is potentially a bit tricky (and would limit support to Node.js 20 and 2023-era browsers

For Shiki, in this particular case, I think it's acceptable as we consider this to be an advanced usage. If it's tricky to provide back-compact, we can just document the minimal runtime requirement for using the feature.

Okay, in that case I'm going to require native flag v support in the new lib. Transpiling the new features from flags u/v is indeed tricky, but there's a great existing lib that already does this: regexpu-core (which is used under the hood by Babel to transpile u/v). The problem, and the reason I wouldn't recommend using it, is that it would make the resulting package very large (you mentioned that small bundle size is important) because it includes a ton of necessary Unicode character data. It would also make the resulting regexes at least somewhat slower (and much larger) than just relying on the native versions of these features that you get from flag v.

…#1244) This PR contains the following updates: | Package | Change | Age | Adoption | Passing | Confidence | |---|---|---|---|---|---| | [@shikijs/vitepress-twoslash](https://redirect.github.com/shikijs/shiki) ([source](https://redirect.github.com/shikijs/shiki/tree/HEAD/packages/vitepress-twoslash)) | [`1.16.2` -> `1.16.3`](https://renovatebot.com/diffs/npm/@shikijs%2fvitepress-twoslash/1.16.2/1.16.3) | [![age](https://developer.mend.io/api/mc/badges/age/npm/@shikijs%2fvitepress-twoslash/1.16.3?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/npm/@shikijs%2fvitepress-twoslash/1.16.3?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/npm/@shikijs%2fvitepress-twoslash/1.16.2/1.16.3?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/npm/@shikijs%2fvitepress-twoslash/1.16.2/1.16.3?slim=true)](https://docs.renovatebot.com/merge-confidence/) | --- ### Release Notes <details> <summary>shikijs/shiki (@shikijs/vitepress-twoslash)</summary> ### [`v1.16.3`](https://redirect.github.com/shikijs/shiki/releases/tag/v1.16.3) [Compare Source](https://redirect.github.com/shikijs/shiki/compare/v1.16.2...v1.16.3) ##### 🚀 Features - Make `createCssVariablesTheme` no longer experimental - by [@antfu](https://redirect.github.com/antfu) [<samp>(ac10b)</samp>](https://redirect.github.com/shikijs/shiki/commit/ac10b3ac) - Use `regex` to enhance js engine support - by [@antfu](https://redirect.github.com/antfu) in [https://github.com/shikijs/shiki/issues/762](https://redirect.github.com/shikijs/shiki/issues/762) [<samp>(ed362)</samp>](https://redirect.github.com/shikijs/shiki/commit/ed362960) ##### [View changes on GitHub](https://redirect.github.com/shikijs/shiki/compare/v1.16.2...v1.16.3) </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] If you want to rebase/retry this PR, check this box --- This PR was generated by [Mend Renovate](https://mend.io/renovate/). View the [repository job log](https://developer.mend.io/github/cap-js/docs).  Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

antfu added 2 commits August 30, 2024 19:50

chore: try regex

9ca43d3

Merge branch 'main' into feat/regex

3c44714

slevithan mentioned this pull request Aug 31, 2024

Thoughts on building on top of the regex library? antfu/oniguruma-to-js#1

Closed

3 tasks

feat: update implemtation based on suggestions

0257aca

antfu added 4 commits September 5, 2024 18:35

Merge branch 'main' into feat/regex

fecdee1

chore: update

8389ddb

Merge branch 'main' into feat/regex

b8da04e

chore: update

4191f90

slevithan reviewed Sep 6, 2024

View reviewed changes

package.json Outdated Show resolved Hide resolved

scripts/report-engine-js-compat.ts Outdated Show resolved Hide resolved

slevithan reviewed Sep 6, 2024

View reviewed changes

package.json Outdated Show resolved Hide resolved

slevithan reviewed Sep 6, 2024

View reviewed changes

packages/core/package.json Outdated Show resolved Hide resolved

antfu added 2 commits September 6, 2024 06:06

Merge branch 'main' into feat/regex

dfd31c0

chore: update

aad33b4

antfu commented Sep 6, 2024

View reviewed changes

antfu added 5 commits September 9, 2024 15:08

chore: use local oniguruma-to-js for test when presents

007f612

chore: update deps

387589f

Merge branch 'main' into feat/regex

dcfac0c

chore: update regex

b71b674

chore: update

cf5c930

feat: update

b878e53

antfu changed the title ~~Try use regex~~ feat: use regex to enhance js engine support Sep 9, 2024

antfu mentioned this pull request Sep 9, 2024

Sync Shiki and JS engine unreasonably long blocking of js threads in some code renderings #772

Closed

5 tasks

antfu merged commit ed36296 into main Sep 9, 2024
13 checks passed

antfu deleted the feat/regex branch September 9, 2024 16:30

josecelano mentioned this pull request Oct 2, 2024

[Snyk] Upgrade shiki from 1.6.1 to 1.16.3 torrust/torrust-website#64

Closed

slevithan mentioned this pull request Nov 7, 2024

Switch JS engine to Oniguruma-To-ES for much improved emulation #828

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: use `regex` to enhance js engine support #762

feat: use `regex` to enhance js engine support #762

antfu commented Aug 30, 2024

netlify bot commented Aug 30, 2024 •

edited

Loading

netlify bot commented Aug 30, 2024 •

edited

Loading

slevithan commented Aug 31, 2024 •

edited

Loading

antfu commented Aug 31, 2024

slevithan commented Aug 31, 2024

codecov bot commented Sep 5, 2024 •

edited

Loading

antfu Sep 6, 2024

slevithan Sep 6, 2024 •

edited

Loading

slevithan commented Sep 8, 2024 •

edited

Loading

antfu commented Sep 9, 2024

antfu commented Sep 9, 2024

slevithan commented Sep 9, 2024

antfu commented Sep 9, 2024

slevithan commented Sep 9, 2024

feat: use regex to enhance js engine support #762

feat: use regex to enhance js engine support #762

Conversation

antfu commented Aug 30, 2024

netlify bot commented Aug 30, 2024 • edited Loading

✅ Deploy Preview for shiki-next ready!

netlify bot commented Aug 30, 2024 • edited Loading

✅ Deploy Preview for shiki-matsu ready!

slevithan commented Aug 31, 2024 • edited Loading

antfu commented Aug 31, 2024

slevithan commented Aug 31, 2024

codecov bot commented Sep 5, 2024 • edited Loading

Codecov Report

antfu Sep 6, 2024

Choose a reason for hiding this comment

slevithan Sep 6, 2024 • edited Loading

Choose a reason for hiding this comment

slevithan commented Sep 8, 2024 • edited Loading

antfu commented Sep 9, 2024

antfu commented Sep 9, 2024

slevithan commented Sep 9, 2024

antfu commented Sep 9, 2024

slevithan commented Sep 9, 2024

feat: use `regex` to enhance js engine support #762

feat: use `regex` to enhance js engine support #762

netlify bot commented Aug 30, 2024 •

edited

Loading

netlify bot commented Aug 30, 2024 •

edited

Loading

slevithan commented Aug 31, 2024 •

edited

Loading

codecov bot commented Sep 5, 2024 •

edited

Loading

slevithan Sep 6, 2024 •

edited

Loading

slevithan commented Sep 8, 2024 •

edited

Loading