Nested repetitions parsed potentially incorrectly #3

calfeld-zz · 2013-01-31T17:45:33Z

Summary: /o{2}{5}/ matches 10 o's (in ruby) but the {2} quantifier is lost in the parse tree.

Example code:

#!/usr/bin/env ruby

require 'rubygems'
require 'regexp_parser'
require 'pp'

re = /o{2}{5}/
pp Regexp::Parser.parse(re)

puts "o" if "o" =~ re
puts "2 o" if "o"*2 =~ re
puts "5 o" if "o"*5 =~ re
puts "10 o" if "o"*10 =~ re

Output:

#<Regexp::Expression::Root:0x105fd9d40
 @expressions=
  [#<Regexp::Expression::Literal:0x105fd6398
    @expressions=[],
    @options=nil,
    @quantifier=
     #<Regexp::Expression::Quantifier:0x105fd5e20
      @max=5,
      @min=5,
      @mode=:greedy,
      @text="{5}",
      @token=:interval>,
    @text="o",
    @token=:literal,
    @type=:literal>],
 @options=nil,
 @text="",
 @token=:root,
 @type=:expression>
10 o

Comments:

As far as I can tell, the nested quantifier syntax isn't documented in ruby and is illegal in pcre. Grep for example, will not match any number of o's for the given regexp. As such, I'd be content with a will-not-fix verdict. But I thought you might be interested.

Thank you for your time.

The text was updated successfully, but these errors were encountered:

dkubb · 2013-01-31T20:13:49Z

While it's possible this is a bug in ruby's own regexp parser, if it's valid syntax it would be good to have it supported by regexp_parser so that it could handle this should it occur in the wild.

I'm working with @mbj on a project called mutant, which is a mutation tester. The idea is that it changes something about your code, and then runs the tests to make sure they fail. If they do not then you have code that is untested. We plan to bring this to regexps, since it is such a compact representation of logic and there's a really good chance most regexps are only partially tested, if at all.

Anyway, we're looking to use regexp_parser to parse regexps we encounter in the code, mutant them, and then run the tests to see if they fail. It would be really nice to have regexp_parser be able to handle all valid syntax (even this, although it's arguably a bit crazy and I would never write this in real code). I think we're still a few months off from having regexp_parser integrated, but I wanted to give you the head's up.

FWIW, we've been running mutant against all our own projects, and larger projects like rubinius. We will be starting on rubyspec soon. Once regexp_parser is integrated, if we find any issues we'll definately submit issues and (ideally) PRs to regexp_parser.

mbj · 2013-01-31T20:20:40Z

@dkubb Lets hope I get some additional funding to focus on mutant. I'd like not to wait month for the feature!

I'd like to add the following reference that explains the idea of mutant very well:

http://solnic.eu/2013/01/23/mutation-testing-with-mutant.html

ammar · 2013-02-05T16:21:08Z

@calfeld Good catch. I've never seen that before.

It is interesting that ruby's engine does the math no matter how many quantifiers follow. It accepted:

/x{2}{5}{10}{3}{2}/

and it started to match at exactly 'x' * 600.

I doubt this behavior is intentional. In the strange example above, I was able to add + quantifiers between the intervals, like:

 /x{2}+{5}+{10}+{3}+{2}+/

Depending on how many +s where added and where, sometimes it worked and sometimes it went into what appeared to be an infinite loop.

If it's intentional, it's buggy, and not documented. @calfeld, since you found it would you like to submit a ticket to ruby's redmine?

@dkubb and @mbj, that's a very interesting topic and project. I hope regexp_parser helps.

I've been meaning to give this old project some time, to merge a couple of minor fixes from the dev branch, complete some cleanups I've been working on, check for regex changes in ruby 2.0, and release a new gem version.

In the meantime, if you have questions, comments, or pull requests, I will be happy to receive them.

calfeld-zz · 2013-02-05T19:33:19Z

Submitted (https://bugs.ruby-lang.org/issues/7787)

There are some bizarre cases, e.g., /x{2}{5}+{8}{3}/ never seems to end but change that 8 to a 7, /x{2}{5}+{7}{3}/ and it runs fine, but add a space to the end, /x{2}{5}+{7}{3} / and it takes forever.

I'll leave this issue open for the time being, as regexp_parser's handling of such quantifiers is very different than ruby's.

ammar · 2013-02-07T14:27:43Z

@calfeld Thanks for submitting the issue. I saw the 'not a bug' response and I tend to agree with it. This behavior is not documented or defined, so people should not use or depend on it. All one could hope for, as a change, is an error, or at least a warning, that says that stacking quantifiers is not supported.

Currently the scanner parses all the quantifiers correctly, but it is the parser that only takes the very last one. Not sure what to do about that yet. Any thoughts?

calfeld-zz · 2013-02-07T15:22:31Z

I also agree with the response.

As for the parser, my suggestion is to parse it similar to with non-capturing groups. E.g., parse /x{2}{3}/ similarly to /(?:x{2}){3}/, i.e., root->passive group {3}->literal {2}. For clarity, you could add a new class, e.g., Group::Implicit, to distinguish the two cases. Such a solution would have no changes for regexps that don't contain such weirdness. However, it does mean more nodes that do not directly correspond to tokens.

jaynetics · 2020-06-06T20:08:11Z

@ammar in light of the discussion in #63 we might want to deal with this at well. i've actually seen this in use, e.g. to validate css hex colors: /\A\h{3}{1,2}\z/.

should i make an attempt?

if so, would passive groups or some new node be more appropriate to accomodate the extra quantifiers? the latter would warrant a major version bump in my opinion, so i'm more inclined to use passive groups, and maybe add Group::Passive#implicit? to allow differentiating.

calfeld-zz · 2020-06-06T21:13:43Z

I applaud your consideration of a 7 year old bug.

ammar · 2020-06-07T09:10:56Z

@jaynetics Sure, if you can. Happy to support and review too.

if so, would passive groups or some new node be more appropriate to accomodate the extra quantifiers?

That would work, but I haven't thought about it much. I wonder if it makes sense for quantifier nodes to have an internal quantifier stack. It might be instructive to find out how Onigmo represents these cases in its parse tree.

[#3]

jaynetics · 2020-11-25T12:44:48Z

the solution proposed by @calfeld is now live with v2.0.0 - better late than never 😆

these characters were wrongly treated as possessive or reluctant mode flags for the interval quantifier. Ruby/Onigmo does not support these modes for intervals, so it treats them as extra, chained quantifiers instead. c.f. #3, #69

ammar mentioned this issue Apr 21, 2016

Regexp::Scanner::PrematureEndError: Premature end of pattern at #{str} #15

Closed

jaynetics added a commit that referenced this issue Sep 20, 2020

Add simple handling/workaround for chained quantifiers

7693999

[#3]

jaynetics added a commit that referenced this issue Nov 25, 2020

Add simple handling/workaround for chained quantifiers

89f5c87

[#3]

jaynetics added a commit that referenced this issue Nov 25, 2020

Add simple handling/workaround for chained quantifiers

f14a36b

[#3]

jaynetics closed this as completed Nov 25, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nested repetitions parsed potentially incorrectly #3

Nested repetitions parsed potentially incorrectly #3

calfeld-zz commented Jan 31, 2013

dkubb commented Jan 31, 2013

mbj commented Jan 31, 2013

ammar commented Feb 5, 2013

calfeld-zz commented Feb 5, 2013

ammar commented Feb 7, 2013

calfeld-zz commented Feb 7, 2013

jaynetics commented Jun 6, 2020

calfeld-zz commented Jun 6, 2020

ammar commented Jun 7, 2020 •

edited

Loading

jaynetics commented Nov 25, 2020

Nested repetitions parsed potentially incorrectly #3

Nested repetitions parsed potentially incorrectly #3

Comments

calfeld-zz commented Jan 31, 2013

dkubb commented Jan 31, 2013

mbj commented Jan 31, 2013

ammar commented Feb 5, 2013

calfeld-zz commented Feb 5, 2013

ammar commented Feb 7, 2013

calfeld-zz commented Feb 7, 2013

jaynetics commented Jun 6, 2020

calfeld-zz commented Jun 6, 2020

ammar commented Jun 7, 2020 • edited Loading

jaynetics commented Nov 25, 2020

ammar commented Jun 7, 2020 •

edited

Loading