Skip to content
This repository has been archived by the owner on Nov 15, 2024. It is now read-only.

Thoughts on building on top of the regex library? #1

Closed
3 tasks done
slevithan opened this issue Aug 30, 2024 · 7 comments
Closed
3 tasks done

Thoughts on building on top of the regex library? #1

slevithan opened this issue Aug 30, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@slevithan
Copy link

slevithan commented Aug 30, 2024

Clear and concise description of the problem

Great project! Idea: Building on regex (which has some overlap; it's a library for extending native JS syntax and emitting native JS regexes) would give you free support for several complex features that are hard to reimplement robustly. Additionally, regex has an existing and easy to use plugin system that would allow extending it with additional Oniguruma features.

Suggested solution

Extended Oniguruma syntax and flags supported by regex:

  • Flag x. On by default, but can be disabled via an option. The handling is more robust/correct than oniguruma-to-js's current handling since whitespace and comments still separate tokens (e.g. with \0 1), etc. See the collapsed details under "Show more details" here, but note that this is modeled on PCRE/Perl's flag xx and doesn't work the same as Oniguruma within character classes.
  • Atomic groups.
  • Possessive quantifiers.
  • Subroutines via \g<name>. This is the same syntax as Oniguruma, but note that it's modelled on PCRE/Perl subroutines which work a little differently (better) than Oniguruma for edge cases related to things like backreferences to groups defined within the subroutine. But for probably 95% of use, it works the same.

Extended flags and syntax supported by regex but not by Oniguruma:

  • Flag n ("named capture only" mode). On by default, but can be disabled via an option. This makes (...) noncapturing, and additionally makes numbered backreferences to named groups an error (which Oniguruma does even without flag n).
  • Subroutine definition groups. No issue with supporting this since the syntax is an error in native JS and Oniguruma.

Existing oniguruma-to-js features not supported by regex:

  • POSIX bracket expressions. These would be trivial to add via a plugin, supporting them only inside character classes. And by using JS's flag v and emitting them as nested classes, they would correctly error when they're on a range boundary, and correctly be treated as a set if on a set subtraction/intersection boundary.
  • \h. Again trivial to add via a plugin, with support both inside and outside character classes, and correctly handling them as an error when on a character class range or set operation boundary.
  • Mode modifiers. Not trivial but still doable via a plugin. regex also already supports this via interpolation of RegExp instances that have different flags than the outer regex.

If you think building on regex could be a good fit, it would also be possible to extend the regex library with new options that would make its features work more similarly to Oniguruma/Onigmo.

Alternative

No response

Additional context

No response

Validations

@antfu
Copy link
Owner

antfu commented Aug 30, 2024

Hey, thanks for bringing this up, I am totally up to collaborate! I also found regex early today on Twitter.

regex seems to be a lot more thoughtful and robust in many ways, while my lib is mostly trying to get Shiki (TextMate grammars) working with JavaScript.

I tried to replace oniguruma-to-js with regex in Shiki, but it seems not producing expected result. Maybe I used it wrong?

https://github.com/shikijs/shiki/pull/762/files?short_path=245c1a2#diff-245c1a2a1d3ff31538c571eb7f586a6838013a7193aed1bd4c7c468ef45a0f83

For oniguruma-to-js, I wanted it to be fast and lightweight as much as possible, as my goal is to correctly transpile the regexes in Shiki's TextMate grammars in order to replace Onigumra's WASM entirely.

@slevithan
Copy link
Author

slevithan commented Aug 30, 2024

Cool! I'm also totally up to collaborate if/where it makes sense! regex is totally aligned on the goals of being fast and lightweight.

I'll say up front though that it's possible regex isn't the right fit since its goal is modernizing JS regexes (although I'd likely be happy to adopt features that don't interfere with this goal while making it more broadly useful, in libraries like this one). One way it achieves this goal is by being a strict superset of JS regex syntax, with support for all ES2025 regex features. But Oniguruma isn't a strict superset of JS (lots of big and small differences). So regex could be very useful for quickly improving beyond the current version of oniguruma-to-js, but it might end up not being the right foundation if the goal is to strive for full Oniguruma compatibility (at least where that matters to get to 100% support for Shiki's TextMate grammars, since this goal is actually impossible if outputting native JS regexes).

I tried to replace oniguruma-to-js with regex in Shiki, but it seems not producing expected result. Maybe I used it wrong?

You used it right! You're getting a bunch of "Invalid escape" and "Invalid character in character class" errors because regex uses JS's flag v syntax, and flag v has particular/strict escaping rules within character classes. regex always uses either flag v or u (depending on whether your environment natively supports flag v), but it uses flag v escaping rules in both cases, since v is a best practice for all new JS regexes, so by using v syntax, regex simplifies down to the one most-modern parsing rules for regex syntax, rather than maintaining native JS's 4 (!) different parsing rules depending on flags and features used.

Extending from the setup in your linked diff, you can change regex to not use flag v's escaping rules like this:

const re = regex({
  flags: 'dg',
  subclass: true,
  unicodeSetsPlugin: null,
  disable: {
    x: true,
    n: true,
    v: true,
  },
})({ raw: [p] })

This adds disable: {v: true} to switch to flag u, and prevents applying flag v's escaping rules via unicodeSetsPlugin: null. These changes make it work for some but not all of the currently failing regexes in the diff you linked to. Some still throw because they escape characters that flag u (with its strict errors for unreserved escapes) doesn't allow you to escape. Allowing those would require using "Unicode-unaware" mode (no flag u or v), which is not great because it would drop support for \p{...} among other regex features and safeguards. BUT, support for needless escapes can easily be modded in with a simple plugin, like this:

regex({
  flags: 'dg',
  subclass: true,
  plugins: [str => str.replace(/(\\\\)|\\(['"`#])/g, '$1$2')],
  unicodeSetsPlugin: null,
  disable: {
    x: true,
    n: true,
    v: true,
  },
})({ raw: [p] })

The plugin used above un-escapes the characters ', ", `, and #. You could easily add more, but these were the needlessly escaped characters I noticed in the diff you linked to that would throw with flag u. It handles them whether they are inside or outside of character classes, and it correctly accounts for escaped backslashes that might precede these characters. I tested this with a bunch (but not all) of the regexes in your linked diff, and it worked with all those I tested it with.

Edit: I've built on the above code in a comment on the relevant Shiki PR.

@slevithan
Copy link
Author

slevithan commented Aug 31, 2024

Aside: Note that oniguruma-to-js's current strategy of neutering possessive quantifiers and atomic groups (by turning them into greedy quantifiers and noncapturing groups) is quite dangerous. In addition to how in some cases it might change which strings are matched (which you already acknowledge within comments), in other cases it might pass tests but lead to ReDoS with certain inputs, which could hang people's servers, etc. In fact, this is quite likely, since protecting against this outcome with certain inputs is probably why the regex author used an atomic group or possessive quantifier in the first place.

I would therefore advise against this. IMO it's much better to just not support such regexes than to make them dangerous. But using regex allows you to support them since it has perfect/complete support for possessive quantifiers and atomic groups, which is nontrivial to do correctly with native JS regexes (since to emulate them you have to add capturing groups along with other tricks, so then you need to also rewrite following numbered backreferences in the pattern and use a RegExp subclass to not let these "emulation groups" alter backreference values in replacement strings/functions, etc.).

@antfu
Copy link
Owner

antfu commented Sep 1, 2024

Yeah, I agree. Would regex support a processor version/utility (return source string instead of RegExp instance) where I can then do further process and contract the RegExp on my own (as you mention in shikijs/shiki#762 (comment) about (?i: etc, so I can workaround it)

@slevithan
Copy link
Author

slevithan commented Sep 1, 2024

Would regex support a processor version/utility (return source string instead of RegExp instance) where I can then do further process and contract the RegExp on my own

That's a great idea, and yes. Supporting an option to return a string instead of a regex is totally doable. (Tentative option name returnString or processOnly, but I'd love better suggestions.)

I'm traveling at the moment but will try to add such a feature in the next few days. There are a few things I need to review as part of adding this to ensure it works well.

Question: Do TextMate grammars ever do search and replace with backreferences in a user-provided replacement string? If not (and regexes are only used to search), this would be much easier. If yes, regex's automatic rewriting of numbered backreferences could create an issue (solvable but with more complexity).

@antfu
Copy link
Owner

antfu commented Sep 2, 2024

Do you aim to have a single exported API for regex? It's a great way to limit the public interface for sure. But from my point of view, I care more about tree shaking and composability. Exposing features via options through a single entry point makes everything non-treeshakable, meaning that even if a user doesn't need all the features it provides, they still need to pay the cost of the full package - do we consider expose some functions, like function convertRegex(input: string): string?

It's it's a bit awkward to use regex({})({ raw: [pattern] }) then a function version createRegex(pattern, {}) - as in our cases, we don't need the string template functionality either.

Question: Do TextMate grammars ever do search and replace with backreferences

I am not very sure if they would have backreferences or not. But for Shiki it only needs regex.match() and get the indices of each group.


Thanks a lot for all the information and the willingness to collaborate, even though my usage can be a bit tricky. Enjoy your travel! We can talk more about it after that.

@slevithan
Copy link
Author

slevithan commented Sep 4, 2024

Good feedback. I've just published a new version of regex that includes the new function rewrite(str, options). It returns an object with two properties (expression, flags), and if you passed those properties to RegExp you'd get the same result as using the regex tag. rewrite doesn't give you context-aware interpolation and isn't a raw string template, but your use case doesn't need those things anyway. It supports all of regex's options except subclass, so you could even use plugins with it if you wanted.

I'd recommend something like this:

import { rewrite } from 'regex'
str = rewrite(str, {
  flags: 'dg',
  unicodeSetsPlugin: null,
  disable: {
    n: true,
    v: true,
    x: true,
  },
}).expression

This leaves the pattern alone apart from transforming atomic groups, possessive quantifiers, subroutines, and subroutine definition groups.

Differences from Oniguruma (all are edge cases):

  • Unfortunately, although Oniguruma's ?+, *+, and ++ are possessive, {N}+ is not (same for {N,}+ and {N,N}+). 😢 Instead, the + confusingly quantifies the {N} quantifier, so there are two greedy quantifiers in a row ({2}+ means repeat 2, 4, 6, 8, ... times). This is a quirk of Oniguruma that doesn't follow the behavior and prior art of possessive {N}+ in Perl, PCRE, Java, Python, C++, etc.
    • If desired, you could do separate processing (or use a regex plugin) to support the Oniguruma logic for {N}+, etc.
  • Unfortunately, in Oniguruma, subroutines via \g<name> don't restore capturing groups upon exiting the subroutine (unlike subroutines in PCRE and Perl). This means that subsequent \k<name> backrefereces to such groups (or to named groups defined within the group that the subroutine references) will work like PCRE and not like Oniguruma.
    • Mixing subroutines and backreferences to the same groups referenced by subroutines is probably an extreme edge case, so this behavior difference is unlikely to cause problems. Also, it's not possible anyway to perfectly emulate Oniguruma's subroutines in JS because of JS's rules about duplicate named capture groups.
    • regex doesn't support the following subroutine features (they will throw if encountered):
      • Subroutines that reference groups by absolute or relative number via \g<N>.
      • Recursive subroutines. These can't be emulated using JS regexes.

If you wanted, you could disable subroutine and subroutine definition group processing via option disable: {subroutines: true}. That would also prevent the errors when encountering unsupported subroutine features.

@antfu antfu closed this as completed in e40e961 Sep 12, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants