multiple named capturing groups #44
Comments
Hi, thanks for taking the time to write up this suggestion. This proposal is already at Stage 4 and shipped in browsers. Because of that, it's no longer open to further revisions. Any changes from here should be part of a new proposal. Maybe we can pursue this as a needs-consensus pull request. For more information, see https://github.com/tc39/ecma262/blob/master/CONTRIBUTING.md I'm wondering, have you run into the need for this case in practice? |
Actually, I have run into the need. One example is when I needed to create a RegExp for date recognition, as in the example I wrote in my previous message. I think there are more examples of using this. May I ask why you didn't implement this feature when you considered it before? I mean, I'm sure you knew about it. |
@Konrud Thanks for the report. I'll think about this some more and chat about it with colleagues. It's possible that it was an error on my part to include this early error, and that nobody caught the design flaw. |
This is definitely an oversight and it's really really frustrating. |
@tophf Sorry about this! How does it come up for you? |
Not sure we should defend an acknowledged use case implemented in PCRE - if it's too hard to implement, why not just document the difference and mark it as WAI? Anyway, similarly to the example above, I have a list of something like this. Since the alternatives are so different, I can't just aggregate them in encompassing named groups like
const rules = [
  {a: /rx1a/, b: /rx1b/, c: /rx1c/},
  {a: /rx2a/, b: /rx2b/, c: /rx2c/},
  // ..............
];
new RegExp(
  `(?<a>${rules.map(_ => _.a.source).join('|')})` +
  `(?<b>${rules.map(_ => _.b.source).join('|')})` +
  `(?<c>${rules.map(_ => _.c.source).join('|')})`,
  'g')
and then process the matched groups - because this would produce bad pairings like a[1]b[3], leading to a wider/narrower/incorrect match. With the current limitation I have to write a parser-like evaluator or embed lengthy protections into each rule against capturing another rule's stuff. |
Sorry, I don't understand how relaxing the restriction on reusing group names would solve that problem. Could you give an example in code of making use of this feature, and what you expect the semantics to be? |
Not sure you should rely on my explanations. My point was that we are just a few devs who bothered to bring this up here. Even though it's kind of nice to feel important, as if I can influence a decision, the discussed use case already has lots of examples over its long history. The ideal thing to do initially would have been to investigate and reuse the existing behavior; for now, please at least investigate it instead of relying on me, a random dude. Moreover, English is not my native language and I'm not good at explaining things. |
I appreciate the time you're putting into this issue, and I would like to make sure your feedback is well-represented in our decision-making process. If you can give a few more details, it'd be helpful. I only see one example here, so a second would be really useful in motivating a change. (There was another issue about not including properties for groups that aren't hit, but I think that amounts to a different proposal from that of the OP here.) |
My example is just one case out of the thousands of existing ones, but okay, here's how it would look with duplicates allowed:
const rules = [
  /(?:foo)?(?<a>\w+)\s*,\s*(?<b>\d+)(?:bar)?/,
  /\W*(?<a>herp|derp)\s*:\s*(?<b>one|two|three)/,
  // ..............
];
// Backslashes are doubled because this part is a string literal, not a regex literal.
const rx = new RegExp('(' + rules.map(r => r.source).join('|') + ')\\s*\\|\\s*', 'g');
for (let m; (m = rx.exec(text));) {
  const {a, b} = m.groups;
  // do something with a and b
}
If the rules are produced from user input, with the current JS implementation I would have to scan the entire text once per rule, which could be a lot of times. If the rules are handcrafted, I could combine the first group into a "decider" expression which would be used to scan the entire text once, and on each exec I would choose a corresponding "tail" expression (with the sticky flag) which would produce its named group and advance the decider's lastIndex upon success. The second approach is what I meant by "parser-like" in my previous comment. |
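The "decider plus sticky tail" workaround described above can be sketched roughly as follows. This is only an illustration under assumptions: the `head`/`tail` split of each rule, the sample patterns, and the `scan` helper are hypothetical names made up for this sketch, not code from the thread.

```javascript
// Hypothetical rules: "head" only decides which rule applies (no capturing
// groups inside it); "tail" is sticky and produces the named groups a and b.
const rules = [
  { head: /\bfoo\b/, tail: /foo\s+(?<a>\w+)\s*,\s*(?<b>\d+)/y },
  { head: /\b(?:herp|derp)\b/, tail: /(?<a>herp|derp)\s*:\s*(?<b>one|two|three)/y },
];

// One unnamed capturing group per rule: the index of the group that
// participated in the match identifies the rule.
const decider = new RegExp(rules.map(r => `(${r.head.source})`).join('|'), 'g');

function scan(text) {
  const results = [];
  for (let m; (m = decider.exec(text)); ) {
    const ruleIndex = m.slice(1).findIndex(g => g !== undefined);
    const tail = rules[ruleIndex].tail;
    tail.lastIndex = m.index;               // sticky: must match exactly here
    const t = tail.exec(text);
    if (t) {
      results.push({ ...t.groups });
      decider.lastIndex = tail.lastIndex;   // resume after the consumed text
    }
    // On failure the decider simply continues after the head it just matched.
  }
  return results;
}
```

With these sample rules, `scan('foo abc, 12 | derp: two')` yields one `{a, b}` object per recognized rule while the text is scanned only once by the decider.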
OK, thanks for explaining, I can see how this comes up in that case. If you can bear with me just slightly longer, I'm curious, can you say a little more about the context in which this sort of issue has come up in a code base you're aware of in the past? |
I don't think there are any JS repos worth mentioning that stumble on this, since everyone knows how limited the JS regexp engine is compared to PCRE, so people either use a custom extended regexp library or switch to another language altogether. In the future, though, implementing this feature would allow all kinds of customizable scraping of text forms, documents, etc. Personally I think any regexp engine should strive to be as close to PCRE as possible within the constraints of effort/performance/size. |
Hi, I am an old Perl user and as such I am often puzzled by discussions like this one. From my POV multiple occurrences are mandatory. One big use case is in parser-like situations, especially where you combine alternate syntax rules in one regexp. In this parsing use case, you could actually match each parsing rule separately and sequentially. In most cases you have subexpressions where you can distinguish the paths, e.g. for git-like commands you usually have a command name and sometimes even subcommand names. You often use a switch on these values and then simply use the other names in that rule to process its parameters. Another use case is parsing alternate sequences of the same data, like the date example. You often find this in natural languages or human input. |
https://www.regular-expressions.info/named.html has a section about this topic: "Multiple Groups with The Same Name" that is a nice overview. Though, I mostly used these two variants:
The last situation I remember was parsing output lines of compilers and other tools,
so I could match all with one expression (simply joined with "|" only once and then cached) and use the three groups. |
I'm surprised that other languages give the first group which participated, rather than the last. I would expect later ones to clobber earlier ones, as happens when you hit the same group multiple times ( |
But your example isn't "Multiple Groups with The Same Name", it's a numbered group. |
I know my example is numbered groups. I was drawing an analogy: if you execute a numbered group multiple times, the match object ends up with the value from the last time it is executed. It's surprising, then, that if you execute the groups with the same name on different occasions, the match object ends up with the value from the first match. |
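The "last iteration wins" behavior of a repeated group that this analogy relies on can be checked with a small sketch. The pattern and input are made up for illustration; the variable names are hypothetical.

```javascript
// A capturing group inside a quantifier runs once per iteration; the match
// result keeps only the value from the final iteration.
const numbered = /(\w)+/.exec('abc');
console.log(numbered[0]); // whole match: 'abc'
console.log(numbered[1]); // last iteration of the group: 'c'

// The same holds for a single named group that is executed repeatedly.
const named = /(?<ch>\w)+/.exec('abc');
console.log(named.groups.ch); // 'c'
```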
now I understand, I had a similar thought...
|
This proposal is finished and the repo is being archived, so discussion can't continue here. I've created a new repo to discuss this proposal, and I invite further discussion and contributions there: https://github.com/bakkot/proposal-duplicate-named-capturing-groups |
Perl, Ruby and .NET all allow multiple named capturing groups to share the same name in a regular expression. As of 07.2018, current implementations of named capturing groups in browsers (I've checked it in Chrome 67 and FF 61) don't allow this. So this regular expression for strict date analysis is invalid:
Would you consider adding support for multiple named capturing groups?
I think it may help a lot. If we have started to implement it as it is in the other languages, why not implement it thoroughly, with all the features available?
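Until duplicate names are available, one possible workaround is to give each alternative its own suffixed group names and merge them after matching. The sketch below assumes two hypothetical date formats; the suffixed names (`year1`, `year2`, and so on) and the `parseDate` helper are inventions for this illustration, not part of any proposal.

```javascript
// Each alternative uses its own group names, since reusing `year`, `month`
// and `day` in both branches is rejected by engines without this feature.
const dateRx = /^(?:(?<year1>\d{4})-(?<month1>\d{2})-(?<day1>\d{2})|(?<day2>\d{2})\.(?<month2>\d{2})\.(?<year2>\d{4}))$/;

function parseDate(text) {
  const m = dateRx.exec(text);
  if (!m) return null;
  const g = m.groups;
  // Groups from the branch that did not participate are undefined,
  // so ?? picks whichever branch actually matched.
  return {
    year: g.year1 ?? g.year2,
    month: g.month1 ?? g.month2,
    day: g.day1 ?? g.day2,
  };
}
```

Both `parseDate('2018-07-15')` and `parseDate('15.07.2018')` produce the same `{year, month, day}` shape, at the cost of the manual merging step that shared group names would make unnecessary.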