Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nested repetitions parsed potentially incorrectly #3

Closed
calfeld-zz opened this issue Jan 31, 2013 · 10 comments
Closed

Nested repetitions parsed potentially incorrectly #3

calfeld-zz opened this issue Jan 31, 2013 · 10 comments
Labels

Comments

@calfeld-zz
Copy link

Summary: /o{2}{5}/ matches 10 o's (in ruby) but the {2} quantifier is lost in the parse tree.

Example code:

#!/usr/bin/env ruby

require 'rubygems'
require 'regexp_parser'
require 'pp'

re = /o{2}{5}/
pp Regexp::Parser.parse(re)

puts "o" if "o" =~ re
puts "2 o" if "o"*2 =~ re
puts "5 o" if "o"*5 =~ re
puts "10 o" if "o"*10 =~ re

Output:

#<Regexp::Expression::Root:0x105fd9d40
 @expressions=
  [#<Regexp::Expression::Literal:0x105fd6398
    @expressions=[],
    @options=nil,
    @quantifier=
     #<Regexp::Expression::Quantifier:0x105fd5e20
      @max=5,
      @min=5,
      @mode=:greedy,
      @text="{5}",
      @token=:interval>,
    @text="o",
    @token=:literal,
    @type=:literal>],
 @options=nil,
 @text="",
 @token=:root,
 @type=:expression>
10 o

Comments:

As far as I can tell, the nested quantifier syntax isn't documented in ruby and is illegal in pcre. Grep for example, will not match any number of o's for the given regexp. As such, I'd be content with a will-not-fix verdict. But I thought you might be interested.

Thank you for your time.

@dkubb
Copy link

dkubb commented Jan 31, 2013

While it's possible this is a bug in ruby's own regexp parser, if it's valid syntax it would be good to have it supported by regexp_parser so that it could handle this should it occur in the wild.

I'm working with @mbj on a project called mutant, which is a mutation tester. The idea is that it changes something about your code, and then runs the tests to make sure they fail. If they do not then you have code that is untested. We plan to bring this to regexps, since it is such a compact representation of logic and there's a really good chance most regexps are only partially tested, if at all.

Anyway, we're looking to use regexp_parser to parse regexps we encounter in the code, mutant them, and then run the tests to see if they fail. It would be really nice to have regexp_parser be able to handle all valid syntax (even this, although it's arguably a bit crazy and I would never write this in real code). I think we're still a few months off from having regexp_parser integrated, but I wanted to give you the head's up.

FWIW, we've been running mutant against all our own projects, and larger projects like rubinius. We will be starting on rubyspec soon. Once regexp_parser is integrated, if we find any issues we'll definately submit issues and (ideally) PRs to regexp_parser.

@mbj
Copy link

mbj commented Jan 31, 2013

@dkubb Lets hope I get some additional funding to focus on mutant. I'd like not to wait month for the feature!

I'd like to add the following reference that explains the idea of mutant very well:

http://solnic.eu/2013/01/23/mutation-testing-with-mutant.html

@ammar
Copy link
Owner

ammar commented Feb 5, 2013

@calfeld Good catch. I've never seen that before.

It is interesting that ruby's engine does the math no matter how many quantifiers follow. It accepted:

/x{2}{5}{10}{3}{2}/

and it started to match at exactly 'x' * 600.

I doubt this behavior is intentional. In the strange example above, I was able to add + quantifiers between the intervals, like:

 /x{2}+{5}+{10}+{3}+{2}+/

Depending on how many +s where added and where, sometimes it worked and sometimes it went into what appeared to be an infinite loop.

If it's intentional, it's buggy, and not documented. @calfeld, since you found it would you like to submit a ticket to ruby's redmine?

@dkubb and @mbj, that's a very interesting topic and project. I hope regexp_parser helps.

I've been meaning to give this old project some time, to merge a couple of minor fixes from the dev branch, complete some cleanups I've been working on, check for regex changes in ruby 2.0, and release a new gem version.

In the meantime, if you have questions, comments, or pull requests, I will be happy to receive them.

@calfeld-zz
Copy link
Author

Submitted (https://bugs.ruby-lang.org/issues/7787)

There are some bizarre cases, e.g., /x{2}{5}+{8}{3}/ never seems to end but change that 8 to a 7, /x{2}{5}+{7}{3}/ and it runs fine, but add a space to the end, /x{2}{5}+{7}{3} / and it takes forever.

I'll leave this issue open for the time being, as regexp_parser's handling of such quantifiers is very different than ruby's.

@ammar
Copy link
Owner

ammar commented Feb 7, 2013

@calfeld Thanks for submitting the issue. I saw the 'not a bug' response and I tend to agree with it. This behavior is not documented or defined, so people should not use or depend on it. All one could hope for, as a change, is an error, or at least a warning, that says that stacking quantifiers is not supported.

Currently the scanner parses all the quantifiers correctly, but it is the parser that only takes the very last one. Not sure what to do about that yet. Any thoughts?

@calfeld-zz
Copy link
Author

I also agree with the response.

As for the parser, my suggestion is to parse it similar to with non-capturing groups. E.g., parse /x{2}{3}/ similarly to /(?:x{2}){3}/, i.e., root->passive group {3}->literal {2}. For clarity, you could add a new class, e.g., Group::Implicit, to distinguish the two cases. Such a solution would have no changes for regexps that don't contain such weirdness. However, it does mean more nodes that do not directly correspond to tokens.

@jaynetics
Copy link
Collaborator

@ammar in light of the discussion in #63 we might want to deal with this at well. i've actually seen this in use, e.g. to validate css hex colors: /\A\h{3}{1,2}\z/.

should i make an attempt?

if so, would passive groups or some new node be more appropriate to accomodate the extra quantifiers? the latter would warrant a major version bump in my opinion, so i'm more inclined to use passive groups, and maybe add Group::Passive#implicit? to allow differentiating.

@calfeld-zz
Copy link
Author

I applaud your consideration of a 7 year old bug.

@ammar
Copy link
Owner

ammar commented Jun 7, 2020

@jaynetics Sure, if you can. Happy to support and review too.

if so, would passive groups or some new node be more appropriate to accomodate the extra quantifiers?

That would work, but I haven't thought about it much. I wonder if it makes sense for quantifier nodes to have an internal quantifier stack. It might be instructive to find out how Onigmo represents these cases in its parse tree.

@jaynetics
Copy link
Collaborator

the solution proposed by @calfeld is now live with v2.0.0 - better late than never 😆

jaynetics added a commit that referenced this issue May 1, 2022
these characters were wrongly treated as possessive or reluctant mode flags for the interval quantifier.

Ruby/Onigmo does not support these modes for intervals, so it treats them as extra, chained quantifiers instead.

c.f. #3, #69
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants