Named capture group syntax: `(?<name>exp)` #955

01mf02 · 2023-02-08T10:58:56Z

Would it be possible to support in addition to the existing syntax (?P<name>exp) for named capture groups also the syntax (?<name>exp)?

My use case for this is that I am currently writing a jq clone called jaq. Recently, I have added support for regular expressions to jaq using the regex crate, which works very well. However, because jq supports only the (?<name>exp) syntax (because of the oniguruma library) and jaq only the (?P<name>exp) syntax (because of the regex crate), it is currently impossible to write regexes with named capture groups that are valid in both jaq and jq.

Beyond my special use case, the (?<name>exp) syntax seems reasonably popular, so it might make sense to add support for it regardless. :)


BurntSushi · 2023-02-08T14:26:38Z

To get this out of the way: this would be a backwards compatible change because currently all forms of (?<name>exp) are invalid syntax due to < being interpreted as a flag. Since < is of course not a flag, it always fails.

I think the original reasons I didn't do this were that 1) I wanted to match RE2 syntax and 2) I didn't want two ways of doing the same thing. But my stance has softened somewhat over time. A related example is that I plan to relax the escaping rules in order to shrink at least the surface-level syntax differences with other regex engines. (For example, right now \/ is forbidden.)

I'll also say though that what you're facing here is a surface level problem. There are assuredly many other differences between this regex engine and Oniguruma. How will you deal with those? If compatibility is your ultimate goal, then you probably just need to use Oniguruma itself. Or do you see this more as a "let's get some compatibility, but not all of it" sort of situation? The problem there is that there may be many incompatibilities that are totally silent. (I don't have an Oniguruma environment that I can easily test with at the moment.)

I'll note that RE2 specifically only implements (?P<name>exp) syntax and there is this comment in the parser:

  // Check for named captures, first introduced in Python's regexp library.
  // As usual, there are three slightly different syntaxes:
  //
  //   (?P<name>expr)   the original, introduced by Python
  //   (?<name>expr)    the .NET alteration, adopted by Perl 5.10
  //   (?'name'expr)    another .NET alteration, adopted by Perl 5.10
  //
  // Perl 5.10 gave in and implemented the Python version too,
  // but they claim that the last two are the preferred forms.
  // PCRE and languages based on it (specifically, PHP and Ruby)
  // support all three as well.  EcmaScript 4 uses only the Python form.
  //
  // In both the open source world (via Code Search) and the
  // Google source tree, (?P<expr>name) is the dominant form,
  // so that's the one we implement.  One is enough.

I am quite sympathetic to this line of reasoning personally. And chasing this sort of "let's just keep adding alternative forms of everything until we capture all the different ways other regex engines do things" will lead us into undesirable territory.

I also wonder whether you could easily work around this by looking for a (?< and replacing it with a (?P<. You would need to deal with escapes, but I think that might be it? I don't think you'd need to write a full parser. I might be wrong though, I haven't given this a lot of thought.

I'm undecided on this personally. @junyer do you have any thoughts here?

junyer · 2023-02-08T16:01:09Z

It sounds like you and I (and @rsc) are aligned here at least philosophically. And now speaking pragmatically, adding support for (?<name>exp) – or anything else – to RE2 shouldn't happen without initiating a three-phase commit protocol with the Go regexp package, RE2/J et cetera. I won't presume to speak for the Rust regex crate, of course, but various Google-related projects won't ever support this unless someone herds those cats successfully... and that someone is very unlikely to be me.

rsc · 2023-02-08T16:29:42Z

I still basically agree with what I wrote in the RE2 comment long ago. I could change my mind given evidence of (1) significant usage of .NET forms or (2) significant environments that only support the .NET forms. It sounds like jq might be one such environment. Reading the other link, maybe Java or Boost has (?...) without (?P...)? It's unclear to me.

On the surface syntax issue and \/, RE2 and Go follow the general convention originally set by egrep of backslash-letter being special (so you must know what it means or reject it) and backslash-punctuation always being literal punctuation. So \/ and \_ fall out of that rule without being handled explicitly. The code in RE2 looks like:

  if (c < Runeself && !isalpha(c) && !isdigit(c)) {
    // Escaped non-word characters are always themselves.
    // PCRE is not quite so rigorous: it accepts things like
    // \q, but we don't.  We once rejected \_, but too many
    // programs and people insist on using it, so allow \_.
    *rp = c;
    return true;
  }
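As a rough illustration of the egrep-style rule described above, here is a minimal Rust sketch. The function name `is_literal_escape` is made up for this example and does not come from RE2, Go, or the regex crate; it only mirrors the condition in the quoted C++ snippet (ASCII, and neither a letter nor a digit).

```rust
/// Hypothetical helper: returns true if `\c` would be accepted as the
/// literal character `c` under the "backslash-punctuation is always
/// literal" convention. Mirrors `c < Runeself && !isalpha(c) && !isdigit(c)`.
fn is_literal_escape(c: char) -> bool {
    // `Runeself` is 0x80 in RE2, so `c < Runeself` means "c is ASCII".
    c.is_ascii() && !c.is_ascii_alphanumeric()
}
```

Under this rule `\/` and `\_` are accepted without special handling, while `\q` is rejected unless the engine assigns it a meaning.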

01mf02 · 2023-02-08T16:30:28Z

Thanks for your very detailed answers.

> I'll also say though that what you're facing here is a surface level problem. There are assuredly many other differences between this regex engine and Oniguruma. How will you deal with those? If compatibility is your ultimate goal, then you probably just need to use Oniguruma itself. Or do you see this more as a "let's get some compatibility, but not all of it" sort of situation? The problem there is that there may be many incompatibilities that are totally silent. (I don't have an Oniguruma environment that I can easily test with at the moment.)

Yes, I see the situation as "let's get some compatibility, but not all of it". (Using Oniguruma from Rust is not really an option for me, all the more because I already have an implementation of regexes using the regex crate.) At least most regexes that I have seen in jq snippets in the wild are fairly simple, so I believe that regex should interpret them the way a jq user expects it to. By far the largest problem, however, is named capture groups, because there are some jq functions that crucially depend on them, in particular capture. Without the named capture group syntax, it is not possible to use capture the same way in jq and jaq.

> I am quite sympathetic to this line of reasoning personally. And chasing this sort of "let's just keep adding alternative forms of everything until we capture all the different ways other regex engines do things" will lead us into undesirable territory.

I agree in principle; however, when searching for "regex named capture group", all of the first four matches mention the syntax (?<, whereas only one site (the first one) additionally (not exclusively) mentions the existence of (?P<. This at least suggests that there might not be such a strong consensus towards the syntax (?P< as the one and only syntax to rule them all.

> I also wonder whether you could easily work around this by looking for a (?< and replacing it with a (?P<. You would need to deal with escapes, but I think that might be it? I don't think you'd need to write a full parser. I might be wrong though, I haven't given this a lot of thought.

There might be a lot of tricky cases to handle. Consider:

  [(?<]
  \[(?<
  \\[(?<]
  \(?<
  \\(?<
  ...

Given that I am not a regex expert, I would not trust myself to get this right.
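For illustration only, here is a minimal sketch of the textual rewrite discussed above, written against the Rust standard library alone. The function `rewrite_named_groups` is hypothetical and not part of the regex crate or jaq. It assumes that a backslash escapes exactly one following character, that `(` is literal inside a bracketed class, and that `(?<=` and `(?<!` (look-behind in other engines) must be left alone. It does not handle verbose-mode comments or a literal `]` at the start of a nested class, so it is a sketch of the idea rather than a complete solution.

```rust
/// Hypothetical helper: rewrite `(?<name>` to `(?P<name>` outside of
/// escapes and bracketed character classes.
fn rewrite_named_groups(pattern: &str) -> String {
    let chars: Vec<char> = pattern.chars().collect();
    let mut out = String::with_capacity(pattern.len() + 8);
    let mut class_depth = 0usize; // > 0 while inside `[...]`
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c == '\\' {
            // Copy the backslash and the escaped character untouched.
            out.push(c);
            if let Some(&next) = chars.get(i + 1) {
                out.push(next);
            }
            i += 2;
            continue;
        }
        if class_depth > 0 {
            // Inside a class, only track nesting; `(` is a literal here.
            match c {
                '[' => class_depth += 1,
                ']' => class_depth -= 1,
                _ => {}
            }
            out.push(c);
            i += 1;
            continue;
        }
        if c == '[' {
            class_depth = 1;
            out.push(c);
            i += 1;
            // A leading `^`, and a `]` directly after it, belong to the class.
            if chars.get(i) == Some(&'^') {
                out.push('^');
                i += 1;
            }
            if chars.get(i) == Some(&']') {
                out.push(']');
                i += 1;
            }
            continue;
        }
        // `(?<` opens a named group unless it is `(?<=` or `(?<!`.
        let is_named_group = c == '('
            && chars.get(i + 1) == Some(&'?')
            && chars.get(i + 2) == Some(&'<')
            && !matches!(chars.get(i + 3).copied(), Some('=') | Some('!'));
        if is_named_group {
            out.push_str("(?P<");
            i += 3;
            continue;
        }
        out.push(c);
        i += 1;
    }
    out
}
```

On the cases listed above, this sketch leaves `[(?<]`, `\\[(?<]` and `\(?<` unchanged, and rewrites `\[(?<` and `\\(?<` because the `(` there is neither escaped nor inside a class.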

> And now speaking pragmatically, adding support for (?<name>exp) – or anything else – to RE2 shouldn't happen without initiating a three-phase commit protocol with the Go regexp package, RE2/J et cetera.

Why is there such a need for synchronisation? Is there some kind of agreement between the Rust regex crate and RE2 to implement precisely the same syntax?

Would it perhaps be possible to have some opt-in option, for example in ParserBuilder, to enable parsing (?< syntax?

01mf02 · 2023-02-08T16:41:11Z

> I could change my mind given evidence of (1) significant usage of .NET forms or (2) significant environments that only support the .NET forms. It sounds like jq might be one such environment. Reading the other link, maybe Java or Boost has (?...) without (?P...)? It's unclear to me.

Regarding Java, I read at least three sites, all of which exclusively mentioned the (?< syntax.

For Boost, the documentation says that the Perl syntax is the default behaviour, and details that this supports (?< and (?'. Again, no mention of (?P<.

BurntSushi · 2023-02-08T16:43:02Z

> Why is there such a need for synchronisation? Is there some kind of agreement between the Rust regex crate and RE2 to implement precisely the same syntax?

To clarify here, @junyer and @rsc are RE2 maintainers, and RE2, RE2/J and Go's regexp package are all maintained by folks at Google. So those packages I think generally try to stay very strictly aligned.

There is no synchronization promise with those three and the regex crate though. The regex crate does actually have some substantial differences (like the escaping strategy, although I expect that to change in the direction of RE2's) and also support for character class set operations and nested classes and probably a few other minor things. Still though, I value their input and "consistent with RE2" is, overall, something I value. But not over everything else.

01mf02 · 2023-02-09T10:51:06Z

I see. Thanks for clarifying your synchronisation policy.

Just on the side: I believe that implementing the (?< syntax implies changing only one line in the code, namely replacing if self.bump_if("?P<") by if self.bump_if("?P<") || self.bump_if("?<"). I would gladly volunteer to submit a PR with this change where I would also write a few tests for the new behaviour. But of course only if you agree that this feature is worth having.
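To make the proposed one-line change concrete, here is a toy sketch. `ToyParser` and `parse_capture_name` are invented for this example; only the `bump_if("?P<") || bump_if("?<")` condition comes from the text above, and the real parser in regex-syntax is considerably more involved.

```rust
/// Hypothetical toy parser that only recognises the opener of a named group.
struct ToyParser<'a> {
    input: &'a str,
    pos: usize,
}

impl<'a> ToyParser<'a> {
    fn new(input: &'a str) -> Self {
        ToyParser { input, pos: 0 }
    }

    /// If the remaining input starts with `prefix`, consume it and return true.
    fn bump_if(&mut self, prefix: &str) -> bool {
        if self.input[self.pos..].starts_with(prefix) {
            self.pos += prefix.len();
            true
        } else {
            false
        }
    }

    /// Parse `(?P<name>` or `(?<name>` and return the group name, if any.
    fn parse_capture_name(&mut self) -> Option<&'a str> {
        if !self.bump_if("(") {
            return None;
        }
        // The suggested change: accept either spelling of the opener.
        if self.bump_if("?P<") || self.bump_if("?<") {
            let input: &'a str = self.input;
            let rest = &input[self.pos..];
            let end = rest.find('>')?;
            self.pos += end + 1; // skip the name and the closing `>`
            Some(&rest[..end])
        } else {
            None
        }
    }
}
```

Note that this toy version forgets which spelling was used, so it could not print the pattern back exactly as written.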

If I can do anything else to convince you of the utility of supporting (?<, please let me know. As an aside, I also checked that JavaScript supports only (?<. Furthermore, of two of the most popular regex websites, https://regexr.com/ supports only the (?< syntax and https://regex101.com/ supports both (?< and (?P<. From my research, I have gained the impression that the (?P< syntax is actually more the exception than the norm.

01mf02 · 2023-02-09T16:55:17Z

I found an interesting bit of history from the Python project that explains, among other things, how the (? syntax came about. It goes on to explain:

> Python supports several of Perl's extensions and adds an extension syntax to Perl's extension syntax. If the first character after the question mark is a P, you know that it's an extension that's specific to Python.

So the P in (?P< stands for a Python-specific extension. In that sense, it reminds me of browser-specific extensions such as -moz-animation, which was later standardised and turned into just animation. I suppose that, in the same way that people dropped the -moz- prefix, people dropped the P from (?P< as named capture groups proved to be useful beyond Python. Continuing to allow the P in the syntax may be justified for compatibility reasons (just like -moz-animation is still accepted in some browsers). At the same time, it would be great to also have a way to express named capture groups without the capital P, which perpetuates the idea that they are a Python-specific extension (which they ceased to be a long time ago).

BurntSushi · 2023-02-09T17:06:55Z

@01mf02 The history and original reason for the (?P syntax is indeed interesting, but I think it has almost exactly zero weight on my decision here. Here are the things that matter to me, in no particular order:

1. Consistency with other regex engines, especially RE2, given the common ancestry.
2. Keeping the syntax "simple," for some definition of "simple." Having two different syntaxes for accomplishing the same thing is a negative IMO. Basically, what this results in in my experience is that someone learns one syntax, then sees the other syntax and wonders, "wait was I doing it the wrong way? should I switch? what's the difference between them?" We can of course mitigate such things by answering such questions in the docs, but it is remarkably difficult to make such a thing discoverable. It's certainly not something you want to plaster across the introduction, so it tends to get buried in the syntax details. Which is fine... But people are going to get confused. As with other things in this list, I do not value this above everything else. It's just something I consider.
3. Making the syntax flexible enough to fit into other environments. This is a net positive because it means there's more knowledge transfer from past experience and things tend to "just work" more often than not. I think this is basically what describes your use case here.
4. There is an overall downside of trying to "make the syntax match other regex engines," because basically other than regex engines that closely and strictly follow an existing specification, no two regex engines behave the same. And so trying to "just make things work" is a long path that doesn't really have an end. I don't think there is a positive or negative here, but it's something to consider.

I think (1) and (2) are where I am at the moment. Unfortunately, there's no real objective criteria to evaluate here.

I am overall leaning towards doing this.

> Just on the side: I believe that implementing the (?< syntax implies changing only one line in the code, namely replacing if self.bump_if("?P<") by if self.bump_if("?P<") || self.bump_if("?<"). I would gladly volunteer to submit a PR with this change where I would also write a few tests for the new behaviour. But of course only if you agree that this feature is worth having.

I agree that the patch here is likely quite simple, but it is probably not this simple. Whether (?P<name>expr) or (?<name>expr) is used or not needs to show up in the AST somewhere. So there may be some type definition changes here, and potentially even a breaking change for the regex-syntax crate. (Which is okay. I don't like to do it too often, but I am planning to do one soon.)

rsc · 2023-02-09T20:36:34Z

Have we identified any regexp implementations other than Oniguruma that don't implement (?P<name>...)?

Also, is the suggestion to allow both (?<name>...) and (?'name'...) or just the first?

BurntSushi · 2023-02-10T01:15:19Z

Not sure about (?'name'...) but I found these with some quick searching:

JavaScript only supports (?<name>...).
.NET only supports (?<name>...) and (?'name'...) syntax.
Java only supports (?<name>...).
Ruby only supports (?<name>...) and (?'name'...) syntax. Although I think Ruby uses Oniguruma, so this might be a dupe? Well, actually, it looks like Ruby these days uses Onigmo which is a fork of Oniguruma. But it's the same syntax support with respect to named groups.
Unbelievably, I can't find any authoritative reference for Boost's regex library about what kind of named capture support it has, but some examples in the wild suggest it at least supports (?<name>...) syntax.

I think that's all I could find at the moment. I think the closest thing to a consensus among non-RE2 engines is "support both (?P<name>...) and (?<name>...)." That seemed like the most common thing, but it's not ubiquitous. A lot of engines support one or the other too. The "support both" is perhaps inflated a bit by the ubiquity of PCRE, which is used as the default regex engine in at least a few places (PHP and Julia come to mind).

BurntSushi · 2023-02-10T01:16:14Z

> Also, is the suggestion to allow both (?<name>...) and (?'name'...) or just the first?

I think the suggestion on the table is just the first, as that's what is used by Oniguruma in the context of jq scripts.

The (?'name'...) syntax is one that I very rarely see. I don't think there are any regex engines (that I can recall in my search) that only support (?'name'...).

c-git · 2023-02-10T02:10:53Z

I hope this comment doesn't distract too much but I really appreciate how @BurntSushi addresses issues raised, explaining his reasoning and so on. I learn so much from just following along and it usually causes me to think about considerations I might have otherwise missed. I just want to say thank you, I really appreciate the time you put into your responses.

01mf02 · 2023-02-10T11:20:19Z

I second @c-git in that I also value your very detailed responses, @BurntSushi.

And of course I'm happy to read that you are leaning towards implementing my suggestion.
I second your observation that (?'name'...) is something that you very rarely see. I've probably seen this syntax more often in documentation than used in actual code. So I am for not implementing this, also to keep the syntax "simple".

> Unbelievably, I can't find any authoritative reference for Boost's regex library about what kind of named capture support it has, but some examples in the wild suggest it at least supports (?<name>...) syntax.

The documentation of the boost regex module mentions named capture groups only in the Perl syntax flavour, which says that ?< and ?' are supported. No mention of ?P< here.

> I agree that the patch here is likely quite simple, but it is probably not this simple. Whether (?P<name>expr) or (?<name>expr) is used or not needs to show up in the AST somewhere. So there may be some type definition changes here, and potentially even a breaking change for the regex-syntax crate. (Which is okay. I don't like to do it too often, but I am planning to do one soon.)

Ah, I see. I suppose it is for round-tripping? If you wish, I could tackle this. I understand that the ast::Group::CaptureName variant would either need to be extended by some bit that indicates the presence of P (would you consider a boolean?), or a new variant (something like CapturePName) could be introduced. What do you think about this?

BurntSushi · 2023-02-10T11:32:31Z

> Ah, I see. I suppose it is for round-tripping? If you wish, I could tackle this. I understand that the ast::Group::CaptureName variant would either need to be extended by some bit that indicates the presence of P (would you consider a boolean?), or a new variant (something like CapturePName) could be introduced. What do you think about this?

Yes, round-tripping. The point of the AST is that it exhaustively describes the syntax as it is. Lowering it into something simpler and easier to analyze happens in a second pass. (You'll need to make what is likely a trivial change to the AST->HIR translator, also inside of regex-syntax, to accommodate your changes to the AST.)

I think a new variant for GroupKind seems okay? So rename the existing CaptureName to CapturePName and introduce a new CaptureName variant.
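A simplified sketch of what the two-variant idea could look like follows. This is a hypothetical, cut-down `GroupKind` written for illustration and is not the actual regex-syntax definition: the `opener` and `capture_name` methods are invented, names are plain `String`s, and flags on non-capturing groups are omitted. The point is that the AST keeps the two spellings apart so the pattern can be printed back as written, while a later lowering pass only needs the name.

```rust
/// Hypothetical, simplified group kinds (not the real regex-syntax types).
#[derive(Debug, Clone, PartialEq, Eq)]
enum GroupKind {
    /// `(expr)`
    CaptureIndex,
    /// `(?<name>expr)`
    CaptureName(String),
    /// `(?P<name>expr)`
    CapturePName(String),
    /// `(?:expr)` (flags omitted in this sketch)
    NonCapturing,
}

impl GroupKind {
    /// Print the group opener exactly as it was written (round-tripping).
    fn opener(&self) -> String {
        match self {
            GroupKind::CaptureIndex => "(".to_string(),
            GroupKind::CaptureName(name) => format!("(?<{}>", name),
            GroupKind::CapturePName(name) => format!("(?P<{}>", name),
            GroupKind::NonCapturing => "(?:".to_string(),
        }
    }

    /// What a lowering pass would care about: the name, not its spelling.
    fn capture_name(&self) -> Option<&str> {
        match self {
            GroupKind::CaptureName(name) | GroupKind::CapturePName(name) => {
                Some(name.as_str())
            }
            GroupKind::CaptureIndex | GroupKind::NonCapturing => None,
        }
    }
}
```

A boolean field on a single named-capture variant would carry the same information; the choice between the two is mostly a matter of how much existing matching code has to change.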

rsc · 2023-02-10T16:36:10Z

Talked to @junyer a bit, and I think this change makes sense to do in RE2 and Go as well. I filed golang/go#58458, and assuming it goes through we'll update RE2 and Go in about a month.

BurntSushi · 2023-02-10T16:42:15Z

@rsc SGTM! If y'all add support for it then I definitely will as well. We might not line up timing wise, but I think that's okay!

01mf02 · 2023-02-10T17:30:44Z

Great! I'm very happy that we seem to have reached a consensus on this issue. :) I have opened a PR with my proposed changes. Have a good weekend!

It turns out that both '(?P<name>...)' and '(?<name>...)' are rather common among regex engines. There are several that support just one or the other. Until this commit, the regex crate only supported the former, along with RE2, RE2/J and Go's regexp package. There are also several regex engines that only supported the latter, such as Onigmo, Oniguruma, Java, Ruby, Boost, .NET and Javascript. To decrease friction, and because there is somewhat little cost to doing so, we elect to support both. It looks like perhaps RE2 and Go's regexp package will go the same route, but it isn't fully decided yet: golang/go#58458

Closes #955


1.8.0 (2023-04-20)
==================
This is a sizeable release that will be soon followed by another sizeable release. Both of them combined will close over 40 existing issues and PRs.

This first release, despite its size, essentially represents preparatory work for the second release, which will be even bigger. Namely, this release:

* Increases the MSRV to Rust 1.60.0, which was released about 1 year ago.
* Upgrades its dependency on `aho-corasick` to the recently released 1.0 version.
* Upgrades its dependency on `regex-syntax` to the simultaneously released `0.7` version. The changes to `regex-syntax` principally revolve around a rewrite of its literal extraction code and a number of simplifications and optimizations to its high-level intermediate representation (HIR).

The second release, which will follow ~shortly after the release above, will contain a soup-to-nuts rewrite of every regex engine. This will be done by bringing [`regex-automata`](https://github.com/BurntSushi/regex-automata) into this repository, and then changing the `regex` crate to be nothing but an API shim layer on top of `regex-automata`'s API.

These tandem releases are the culmination of about 3 years of on-and-off work that [began in earnest in March 2020](#656).

Because of the scale of changes involved in these releases, I would love to hear about your experience. Especially if you notice undocumented changes in behavior or performance changes (positive *or* negative).

Most changes in the first release are listed below. For more details, please see the commit log, which reflects a linear and decently documented history of all changes.

New features:

* [FEATURE #501](#501): Permit many more characters to be escaped, even if they have no significance. More specifically, any ASCII character except for `[0-9A-Za-z<>]` can now be escaped. Also, a new routine, `is_escapeable_character`, has been added to `regex-syntax` to query whether a character is escapeable or not.
* [FEATURE #547](#547): Add `Regex::captures_at`. This fills a hole in the API, but doesn't otherwise introduce any new expressive power.
* [FEATURE #595](#595): Capture group names are now Unicode-aware. They can now begin with either a `_` or any "alphabetic" codepoint. After the first codepoint, subsequent codepoints can be any sequence of alpha-numeric codepoints, along with `_`, `.`, `[` and `]`. Note that replacement syntax has not changed.
* [FEATURE #810](#810): Add `Match::is_empty` and `Match::len` APIs.
* [FEATURE #905](#905): Add an `impl Default for RegexSet`, with the default being the empty set.
* [FEATURE #908](#908): A new method, `Regex::static_captures_len`, has been added which returns the number of capture groups in the pattern if and only if every possible match always contains the same number of matching groups.
* [FEATURE #955](#955): Named captures can now be written as `(?<name>re)` in addition to `(?P<name>re)`.
* FEATURE: `regex-syntax` now supports empty character classes.
* FEATURE: `regex-syntax` now has an optional `std` feature. (This will come to `regex` in the second release.)
* FEATURE: The `Hir` type in `regex-syntax` has had a number of simplifications made to it.
* FEATURE: `regex-syntax` has support for a new `R` flag for enabling CRLF mode. This will be supported in `regex` proper in the second release.
* FEATURE: `regex-syntax` now has proper support for "regex that never matches" via `Hir::fail()`.
* FEATURE: The `hir::literal` module of `regex-syntax` has been completely re-worked. It now has more documentation, examples and advice.
* FEATURE: The `allow_invalid_utf8` option in `regex-syntax` has been renamed to `utf8`, and the meaning of the boolean has been flipped.

Performance improvements:

* PERF: The upgrade to `aho-corasick 1.0` may improve performance in some cases. It's difficult to characterize exactly which patterns this might impact, but if there are a small number of longish (>= 4 bytes) prefix literals, then it might be faster than before.

Bug fixes:

* [BUG #514](#514): Improve `Debug` impl for `Match` so that it doesn't show the entire haystack.
* BUGS [#516](#516), [#731](#731): Fix a number of issues with printing `Hir` values as regex patterns.
* [BUG #610](#610): Add explicit example of `foo|bar` in the regex syntax docs.
* [BUG #625](#625): Clarify that `SetMatches::len` does not (regrettably) refer to the number of matches in the set.
* [BUG #660](#660): Clarify "verbose mode" in regex syntax documentation.
* BUG [#738](#738), [#950](#950): Fix `CaptureLocations::get` so that it never panics.
* [BUG #747](#747): Clarify documentation for `Regex::shortest_match`.
* [BUG #835](#835): Fix `\p{Sc}` so that it is equivalent to `\p{Currency_Symbol}`.
* [BUG #846](#846): Add more clarifying documentation to the `CompiledTooBig` error variant.
* [BUG #854](#854): Clarify that `regex::Regex` searches as if the haystack is a sequence of Unicode scalar values.
* [BUG #884](#884): Replace `__Nonexhaustive` variants with `#[non_exhaustive]` attribute.
* [BUG #893](#893): Optimize case folding since it can get quite slow in some pathological cases.
* [BUG #895](#895): Reject `(?-u:\W)` in `regex::Regex` APIs.
* [BUG #942](#942): Add a missing `void` keyword to indicate "no parameters" in C API.
* [BUG #965](#965): Fix `\p{Lc}` so that it is equivalent to `\p{Cased_Letter}`.
* [BUG #975](#975): Clarify documentation for `\pX` syntax.


BurntSushi · 2023-08-11T19:05:13Z

This ended up being a very effective feature request. It caused RE2, Go's regexp package and this crate to all start supporting (?<name>expr) syntax in addition to (?P<name>expr). Nicely done @01mf02!
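
For anyone still targeting an engine or crate version that predates this change and only accepts `(?P<name>expr)`, the two spellings can be bridged by rewriting the pattern before compiling it. The following is only a minimal sketch using the standard library; the function name `to_p_syntax` is hypothetical and not part of the regex crate. It assumes `(?<=` and `(?<!` are lookbehinds that must be left alone, and it tracks character classes only in a simplified way (no nested classes, no leading literal `]`).

```rust
/// Rewrite `(?<name>` group openers to `(?P<name>`.
/// Hypothetical helper for illustration; not an API of the regex crate.
fn to_p_syntax(pattern: &str) -> String {
    let chars: Vec<char> = pattern.chars().collect();
    let mut out = String::with_capacity(pattern.len() + 8);
    let mut in_class = false; // simplified: ignores nested classes
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c == '\\' {
            // Copy an escape sequence verbatim so `\(?<` is never rewritten.
            out.push(c);
            if let Some(&next) = chars.get(i + 1) {
                out.push(next);
            }
            i += 2;
            continue;
        }
        if in_class {
            // Inside `[...]` a `(` is a literal, so copy until the class closes.
            if c == ']' {
                in_class = false;
            }
            out.push(c);
            i += 1;
            continue;
        }
        if c == '[' {
            in_class = true;
            out.push(c);
            i += 1;
            continue;
        }
        // `(?<` opens a named group unless it is followed by `=` or `!`,
        // which would make it a lookbehind in engines that support those.
        let opens_named_group = c == '('
            && chars.get(i + 1) == Some(&'?')
            && chars.get(i + 2) == Some(&'<')
            && !matches!(chars.get(i + 3), Some(&'=') | Some(&'!'));
        if opens_named_group {
            out.push_str("(?P<");
            i += 3;
            continue;
        }
        out.push(c);
        i += 1;
    }
    out
}

fn main() {
    let converted = to_p_syntax(r"(?<year>\d{4})-(?<month>\d{2})");
    // prints (?P<year>\d{4})-(?P<month>\d{2})
    println!("{}", converted);
}
```

With regex 1.8.0 or later no such rewriting is needed, since both forms are accepted directly.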

01mf02 · 2023-08-14T04:55:23Z

Thanks, @BurntSushi!

BurntSushi added the question label Feb 8, 2023

rsc mentioned this issue Feb 10, 2023

regexp/syntax: accept (?<name>...) in addition to (?P<name>...) golang/go#58458

Closed

BurntSushi added the enhancement and help wanted labels and removed the question label Feb 10, 2023

01mf02 mentioned this issue Feb 10, 2023

Support (?< syntax for named capture groups #956

Closed

BurntSushi added the fix-incoming label Feb 10, 2023

BurntSushi closed this as completed in 961d6f0 Apr 17, 2023

BurntSushi mentioned this issue Apr 20, 2023

release: 1.8.0 #979

Merged

torsteingrindvik mentioned this issue Apr 21, 2023

Regex 1.8.0: Valid ?P<foo> -> ?<foo> triggers lint rust-lang/rust-clippy#10680

Closed

InSyncWithFoo mentioned this issue Oct 2, 2023

Rust flavor produces error for valid regex firasdib/Regex101#2151

Closed
