Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul of Set#members needed #47

Closed
jaynetics opened this issue Nov 18, 2017 · 3 comments
Closed

Overhaul of Set#members needed #47

jaynetics opened this issue Nov 18, 2017 · 3 comments

Comments

@jaynetics
Copy link
Collaborator

Right now, handling the content of character sets with regexp_parser is hard:

What I have in mind as a general solution is the following:

  • removing the :subset token type, leaving #set_level to differentiate between sets and subsets
  • using the :set token type only for tokens that are particular to sets ([, ^, &&, ] and ranges)
  • removing the :member, :member_hex, :range and :range_hex tokens
  • treating set ranges and members like any other sub-expression instead
  • removing the attr Set#members, leaving #expressions to access members, ranges and subsets

Thus, parsing /a[bc-d]/ could yield something like

#<Root @expressions=[
  #<Literal @type=:literal, @token=:literal, @text="a" >,
  #<CharacterSet @expressions=[
    #<Literal @type=:literal, @token=:literal, @text="b" >,
    #<Range @type=:set, @token=:range, @expressions=[
      #<Literal @type=:literal, @token=:literal, @text="c" >,
      #<Literal @type=:literal, @token=:literal, @text="d" >
    ]>
  ]>
]>

The only tricky bit is rewiring the ragel machines in the right way and catching all ranges.
On the other hand, it would probably lead to less code, as special treatment is only needed for a few things within sets: the set tokens plus ., \b, and [:...:] if I am not mistaken.

What do you think, @ammar?

@ammar
Copy link
Owner

ammar commented Nov 19, 2017

Thanks for this @Janosch-x.

Characters sets are the poorest implemented part of the scanner. Barely satisfied the most basic of cases.

I like the approach you outlined. Removing :subset and treating :set members as sub-expressions is a great idea. It should simplify the scanner and its use.

I wonder if extracting the character set scanning logic into a sub-machine, like properties.rl, would make it easier to implement the changes. it might require breaking scanner.rl into smaller parts, which could be a good thing. I'd like to explore that a little, soon hopefully.

@jaynetics
Copy link
Collaborator Author

I wonder if extracting the character set scanning logic into a sub-machine, like properties.rl, would make it easier to implement the changes.

Thats what I thought, too. I might be able to help a bit with the whole thing, at least by setting up tests beforehand.

Another thing that just crossed my mind is that it could make sense to have an Intersection expression with subexpressions as well. Right now, you have to keep track of the preceding and succeeding set member yourself to find out what is being intersected. We could do something like this instead:

Regexp::Parser.parse(/[a-c&&b]/) # =>
#<Root @expressions=[
  #<CharacterSet @expressions=[
    #<Intersection @type=:set, @token=:intersection, @expressions=[
      #<Range @type=:set, @token=:range, @expressions=[
        #<Literal @type=:literal, @token=:literal, @text="a" >,
        #<Literal @type=:literal, @token=:literal, @text="c" >
      ]>,
      #<Literal @type=:literal, @token=:literal, @text="b" >
    ]>
  ]>
]>

The least complicated way to achieve this might be doing it purely in parser.rb, kind of the same way alternation expressions are handled.

@jaynetics
Copy link
Collaborator Author

resolved by #55

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants