Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use regex to enhance js engine support #762

Merged
merged 15 commits into from
Sep 9, 2024
Merged

feat: use regex to enhance js engine support #762

merged 15 commits into from
Sep 9, 2024

Conversation

antfu
Copy link
Member

@antfu antfu commented Aug 30, 2024

Copy link

netlify bot commented Aug 30, 2024

Deploy Preview for shiki-next ready!

Name Link
🔨 Latest commit b878e53
🔍 Latest deploy log https://app.netlify.com/sites/shiki-next/deploys/66df217bd315590008851df1
😎 Deploy Preview https://deploy-preview-762--shiki-next.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Copy link

netlify bot commented Aug 30, 2024

Deploy Preview for shiki-matsu ready!

Name Link
🔨 Latest commit b878e53
🔍 Latest deploy log https://app.netlify.com/sites/shiki-matsu/deploys/66df217bad11ed0008accd4d
😎 Deploy Preview https://deploy-preview-762--shiki-matsu.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@slevithan
Copy link
Collaborator

slevithan commented Aug 31, 2024

Continuing the discussion from antfu/oniguruma-to-js#1 , here are some basic regex plugins that add support for more oniguruma-to-js features. I'm curious what level of support for existing TextMate grammars this would give:

import { regex } from 'regex'
import { Context, replaceUnescaped } from 'regex-utilities'

const TABLE_POSIX = {
  alnum: '0-9A-Za-z',
  alpha: 'A-Za-z',
  ascii: '\\0-\\x7F',
  blank: ' \\t',
  cntrl: '\\0-\\x1F\\x7F',
  digit: '0-9',
  graph: '!-~',
  lower: 'a-z',
  print: ' -~',
  punct: '!-/:-@[-`{-~',
  space: ' \\t\\r\\n\\v\\f',
  upper: 'A-Z',
  xdigit: '0-9A-Fa-f',
  word: '\\w',
}

function oniguruma(str) {
  str = replaceUnescaped(str, '\\\\(?:h|(?<literal>[\'"`#@]))', ({groups: {literal}}, details) => {
    if (literal) {
      return literal
    }
    let hex = TABLE_POSIX.xdigit
    return details.context === Context.DEFAULT ? `[${hex}]` : hex
  })
  str = replaceUnescaped(str, String.raw`\\(?<escape>[Az])|\(\?#[^)]*\)`, ({groups: {escape}}) => {
    if (escape) {
      return escape === 'A' ? '^' : '$'
    }
    return ''
  }, Context.DEFAULT)
  str = replaceUnescaped(str, String.raw`\[:(?<neg>\^?)(?<posix>[a-z]+):\]`, ({groups: {neg, posix}}) => {
    if (neg) {
      throw new Error('TODO: Support negated POSIX classes')
    }
    let resolved = TABLE_POSIX[posix]
    if (!resolved) {
      throw new Error(`Unknown POSIX class "${posix}"`)
    }
    return resolved
  }, Context.CHAR_CLASS)
  return str
}

regex({
  flags: 'dg',
  subclass: true,
  plugins: [oniguruma],
  unicodeSetsPlugin: null,
  disable: {
    x: true,
    n: true,
    v: true,
  },
})({ raw: [p] })

Note that this fixes a couple errors from oniguruma-to-js. Namely, it fixes the list of whitespace characters matched by [[:space:]], and it correctly does not support [:posix:] syntax outside of character classes (maybe I'm reading the code incorrectly and this is already true in oniguruma-to-js).

Edited to simplify and add support for (?#...), \A, \z, and escaped literal \@. It now depends on regex-utilities 2.3.0+.

The plugin is adding support for:

  • POSIX classes (in non-negated form, but allowed within a surrounding character class that is negated or non-negated).
  • \h, inside and outside of character classes.
  • \A and \z, intentionally limited to outside of character classes.
  • Enclosed comments, intentionally limited to outside of character classes.
  • Escaped literal ', ", `, #, and @, inside and outside of character classes (it's easy to add more characters if needed).

@antfu
Copy link
Member Author

antfu commented Aug 31, 2024

Thank you! I updated the code based on your suggestions, but unfortunately it seems to still have a lot errors. Meanwhile I am a little bit worried about the complexity here - I am still up to give regex a try if it managed to cover more cases that oniguruma-to-js couldn't do. But for now I want to move things faster, so I will get back to this a bit later. Thanks

@slevithan
Copy link
Collaborator

Thank you! I updated the code based on your suggestions, but unfortunately it seems to still have a lot errors.

I've just edited the code in my last comment to add support for \A, \z, and escaped literal \@. This resolves some but not all of the errors. It will still throw for things like \G and (?i:...).

Meanwhile I am a little bit worried about the complexity here

I've added a new feature in regex-utilities 2.3.0 and used it in the updated example code to significantly simplify. Hopefully this is now easier to follow and makes it clearer how to work with regex plugins.

I am still up to give regex a try if it managed to cover more cases that oniguruma-to-js couldn't do. But for now I want to move things faster, so I will get back to this a bit later. Thanks

For sure, this makes a lot of sense while you're in the rapid development stage. Even beyond that, though, I think the goals of these two projects might be just far enough apart that you will continue to prefer not to develop on top of regex. oniguruma-to-js is about doing whatever it takes (and tolerating incomplete solutions) to make the majority of or all TextMate grammars work, whereas regex is about modernizing JS regexes with just the most valuable features for readability/safety/power and providing complete/robust solutions for the features it offers. This puts them a bit at odds, even given its plugin system.

So no worries at all if you want to close this. It was interesting/valuable to discuss this with you and see how regex might be used in libraries like this that are about cross-flavor compatibility. It's possible that, in the future, regex will get more features and more powerful plugins that would make it worthwhile to reexamine.

Copy link

codecov bot commented Sep 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.32%. Comparing base (4769588) to head (b878e53).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #762      +/-   ##
==========================================
+ Coverage   92.30%   92.32%   +0.01%     
==========================================
  Files          70       70              
  Lines        4547     4558      +11     
  Branches     1009     1009              
==========================================
+ Hits         4197     4208      +11     
  Misses        345      345              
  Partials        5        5              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

package.json Outdated Show resolved Hide resolved
scripts/report-engine-js-compat.ts Outdated Show resolved Hide resolved
package.json Outdated Show resolved Hide resolved
| | Count |
| :-------------- | --------------------------------: |
| Total Languages | 213 |
| Fully Supported | [132](#fully-supported-languages) |
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets a lot better, thanks for the updates!

Copy link
Collaborator

@slevithan slevithan Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. I'll try to identify the issues and fix handling for grammars that used to be okay but aren't anymore. I already notice an issue with regex's support for multiple possessive quantifiers in the same pattern that should be easy to fix.

@slevithan
Copy link
Collaborator

slevithan commented Sep 8, 2024

Okay, I've run the report-engine-js script locally, and here's what I've found.

  • There were two possessive quantifier related bugs in regex v4.3.0 that were being triggered, that are now fixed in v4.3.2. This significantly raises the number of grammars passing.
  • I'm not sure what the purpose of the following code is in the latest version of this PR:
    pattern = pattern
      .replace(/\\�/g, '\\G')
      .replace(/\(\{\)/g, '(\\{)')

Removing this increases the number of grammars passing, whether using regex or not. With those lines removed, using regex 4.3.2 gives the following results:

| Fully Supported | [172](#fully-supported-languages) |
| Mismatched | [24](#mismatched-languages) |
| Unsupported | [17](#unsupported-languages) |

This is still slightly lower than the number reported when not using regex, but the report is misrepresenting the facts based on the limited grammar samples, and in fact the number of correctly supported grammars is now significantly higher than before. As a result, I think at this point this is a significantly positive change. As far as I can tell, all of the cases remaining where regex 4.3.2 flips a grammar in this report from supported to unsupported are cases where oniguruma-to-js is already getting things wrong even though it passes for the provided samples.

The culprits for the number of "supported" grammars going slightly down are the following errors that regex is throwing:

  • "Subroutine \g<name> followed a recursive reference"
  • "Invalid named capture referenced by subroutine"
    • For example, given \g<1> or \g<-1>.

These are subroutines that Oniguruma supports but regex doesn't, as explained in antfu/oniguruma-to-js#1 (comment):

  • regex doesn't support the following subroutine features (they will throw if encountered):
    • Subroutines that reference groups by absolute or relative number via \g<N>.
    • Recursive subroutines. These can't be emulated using JS regexes.

Even though oniguruma-to-js doesn't throw on these patterns and the tests are not catching the resulting problems (given the specific grammar samples being tested), all of these patterns are currently being interpreted incorrectly. So IMO it is better that regex is throwing for them. What is happening currently (without regex) is that, since the regexes are not using flag u or v, the \g is being interpreted as matching a literal 'g' (since \g is not a known escape), and therefore e.g. \g<1> is being treated as equivalent to g<1>, which matches only the four literal characters g, <, 1, >, in sequence.

Looking at the remaining failing cases beyond the currently unsupported subroutines described above, the main culprits seem to be:

  • Nested character classes [...[...]] and character class intersection [...&&...].
    • These would be supported if using JS's flag v, but then you'd have to deal with flag v's strict errors and different escaping rules, which would cause a LOT more errors unless you first transpiled a bunch of things to match flag v's stricter rules.
    • These differences in syntax (e.g., allowing unescaped literal '[' within a character class) don't directly throw errors when not using flag u/v, but sometimes you get other errors as a side effect like "Range out of order in character class" if an unescaped - is the first char in the nested class. And in cases where nested class and intersection syntax doesn't throw, it is currently always making the pattern non-equivalent to Oniguruma (so they match the wrong things) even though this might not be caught by the specific grammar samples being tested.
  • Unicode categories. Ex: \p{L}, \P{Mn}, etc.
    • These would be supported if using JS's flag u or v, but again you'd then have to deal with the strict errors and different escaping rules that result from these flags. When not using these flags, Unicode property syntax doesn't throw an error, but it is always matching the wrong things.

Here's what I'd personally do:

  • Bump to regex 4.3.2.
  • Remove the pattern = pattern.replace... lines I mentioned at the top of this comment, unless they are serving some important purpose.
  • Land the PR.

Even though the reported number of "supported" grammars would go down slightly (to 172), this should only affect grammars that are already not working correctly in at least some cases. And the number of actually supported grammars would go up significantly, since:

  1. As a result of the improvements in the regex library that came from this discussion (thanks so much for your help and ideas!), this change should not be adding errors for any regexes that are actually already working as intended.
  2. Many grammars currently "supported" that are not actually working correctly would gain correct support. This will be true for any grammars that use atomic groups or possessive quantifiers (which oniguruma-to-js currently neuters, which is a serious performance/security risk), as well as subroutines referencing named groups (which, since you're not using flag u or v, don't throw but never match what's intended).

If desired, after landing this change I'd recommend doing additional work in oniguruma-to-js to transpile non-strict regex syntax into strict v mode syntax, and use the resulting patterns with flag v. This work is potentially a bit tricky (and would limit support to Node.js 20 and 2023-era browsers), but I'd be happy to help with it (if desired) and it would add native support for nested character classes, set intersection/subtraction, Unicode categories, etc. (all possible but complicated and data-heavy to emulate with Unicode character tables, etc., and emulated versions would not be as fast as the native support). And after this work, more grammars would be supported than currently are, since these advanced Oniguruma features would then just work (and fast!) without any need for emulation. Additionally, the options used with regex's rewrite function could then be simplified.

@antfu
Copy link
Member Author

antfu commented Sep 9, 2024

Thanks a lot for your deep investigation, that looks very promising!

This is still slightly lower than the number reported when not using regex, but the report is misrepresenting the facts based on the limited grammar samples, and in fact the number of correctly supported grammars is now significantly higher than before.

Yes, I totally agree with that. It's surely better to surface the errors instead of silently eating them.

With the current result, I am more than happy to land it, and then we can iterate as we go. However, as the JS engine is mostly designed to work on browsers, the bundle size is something we need to be concerned about. If regex can do the work much better and correctly than oniguruma-to-js, I'd be more than happy to deprecate it and use regex exclusively.

The current test runs with regex -> oniguruma-to-js, where if I remove the oniguruma-to-js part, it will throw many more syntax errors. Is it possible we port those fixes to regex so shiki can solely deps on regex then?

This work is potentially a bit tricky (and would limit support to Node.js 20 and 2023-era browsers

For Shiki, in this particular case, I think it's acceptable as we consider this to be an advanced usage. If it's tricky to provide back-compact, we can just document the minimal runtime requirement for using the feature.

Thanks so much for your work and help!

@antfu antfu changed the title Try use regex feat: use regex to enhance js engine support Sep 9, 2024
@antfu
Copy link
Member Author

antfu commented Sep 9, 2024

I'll merge this for now, let's iterate it on main banch.

@antfu antfu merged commit ed36296 into main Sep 9, 2024
13 checks passed
@antfu antfu deleted the feat/regex branch September 9, 2024 16:30
@slevithan
Copy link
Collaborator

Amazing to see this merged. 🎉🙌

as the JS engine is mostly designed to work on browsers, the bundle size is something we need to be concerned about. If regex can do the work much better and correctly than oniguruma-to-js, I'd be more than happy to deprecate it and use regex exclusively.

If I/we create a new library that extends regex, I could definitely help improve Oniguruma translation further and with greater accuracy.

It wouldn’t make sense to add extensive Oniguruma translation to the base regex lib since regex’s syntax/flags/features are not based on Oniguruma. Its much more based on PCRE, and even then it is not trying to support all PCRE features — it's only adding a minimal set of features that are most relevant for elevating native JS regexes to be one of the best and most readable/modern regex flavors in its own right.

But, it absolutely makes sense to me to create a new library (tentative name: textmate-grammar-regex; or maybe regex-plugins which would bundle a bunch of plugins under an "Oniguruma" flavor set) that builds on top of regex and can take over responsibility from oniguruma-to-js (since you mentioned you're okay with deprecating it), with Shiki as a first-class client library. Then, at the same time, I could offer new exports in the base regex lib that make it much more tree shake-able when used in this way, so that the new lib can be as lightweight as possible.

I have a bunch of ideas for this, but I won't be able to start development for it until sometime next week. I’d be happy to add you as a maintainer for this new project (totally up to you!) after the basics are set up.

What do you think? Does this seem like a decent plan?

@antfu
Copy link
Member Author

antfu commented Sep 9, 2024

Sounds great, looking forward to it! (and take your time ofc). Thanks!

@slevithan
Copy link
Collaborator

Cool. I think it will be a fun project so I'm looking forward to it, too.

This work is potentially a bit tricky (and would limit support to Node.js 20 and 2023-era browsers

For Shiki, in this particular case, I think it's acceptable as we consider this to be an advanced usage. If it's tricky to provide back-compact, we can just document the minimal runtime requirement for using the feature.

Okay, in that case I'm going to require native flag v support in the new lib. Transpiling the new features from flags u/v is indeed tricky, but there's a great existing lib that already does this: regexpu-core (which is used under the hood by Babel to transpile u/v). The problem, and the reason I wouldn't recommend using it, is that it would make the resulting package very large (you mentioned that small bundle size is important) because it includes a ton of necessary Unicode character data. It would also make the resulting regexes at least somewhat slower (and much larger) than just relying on the native versions of these features that you get from flag v.

chgeo referenced this pull request in cap-js/docs Sep 11, 2024
…#1244)

This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
|
[@shikijs/vitepress-twoslash](https://redirect.github.com/shikijs/shiki)
([source](https://redirect.github.com/shikijs/shiki/tree/HEAD/packages/vitepress-twoslash))
| [`1.16.2` ->
`1.16.3`](https://renovatebot.com/diffs/npm/@shikijs%2fvitepress-twoslash/1.16.2/1.16.3)
|
[![age](https://developer.mend.io/api/mc/badges/age/npm/@shikijs%2fvitepress-twoslash/1.16.3?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|
[![adoption](https://developer.mend.io/api/mc/badges/adoption/npm/@shikijs%2fvitepress-twoslash/1.16.3?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|
[![passing](https://developer.mend.io/api/mc/badges/compatibility/npm/@shikijs%2fvitepress-twoslash/1.16.2/1.16.3?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|
[![confidence](https://developer.mend.io/api/mc/badges/confidence/npm/@shikijs%2fvitepress-twoslash/1.16.2/1.16.3?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|

---

### Release Notes

<details>
<summary>shikijs/shiki (@&#8203;shikijs/vitepress-twoslash)</summary>

###
[`v1.16.3`](https://redirect.github.com/shikijs/shiki/releases/tag/v1.16.3)

[Compare
Source](https://redirect.github.com/shikijs/shiki/compare/v1.16.2...v1.16.3)

#####    🚀 Features

- Make `createCssVariablesTheme` no longer experimental  -  by
[@&#8203;antfu](https://redirect.github.com/antfu)
[<samp>(ac10b)</samp>](https://redirect.github.com/shikijs/shiki/commit/ac10b3ac)
- Use `regex` to enhance js engine support  -  by
[@&#8203;antfu](https://redirect.github.com/antfu) in
[https://github.com/shikijs/shiki/issues/762](https://redirect.github.com/shikijs/shiki/issues/762)
[<samp>(ed362)</samp>](https://redirect.github.com/shikijs/shiki/commit/ed362960)

#####     [View changes on
GitHub](https://redirect.github.com/shikijs/shiki/compare/v1.16.2...v1.16.3)

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR was generated by [Mend Renovate](https://mend.io/renovate/).
View the [repository job
log](https://developer.mend.io/github/cap-js/docs).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzOC41OS4yIiwidXBkYXRlZEluVmVyIjoiMzguNTkuMiIsInRhcmdldEJyYW5jaCI6Im1haW4iLCJsYWJlbHMiOltdfQ==-->

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants