Overhaul of Set#members needed #47

jaynetics · 2017-11-18T11:36:05Z

Right now, handling the content of character sets with regexp_parser is hard:

The Scanner detects only a few ranges successfully, as detailed in issue #29 ("Only alphanumeric character set ranges are detected as ranges").
The Scanner returns inconclusive information about member tokens because they all have the type :set. Issue #28 ("Inconsistent scanning of properties within sets") describes this for properties, but it also affects \a, \e, \n, \t, \u, \v and more.
The Parser then "throws away" even this limited information as it only relays the Token#text to Set#members. (Re-running Parser#parse on individual Set#members is a poor workaround for this.)

What I have in mind as a general solution is the following:

removing the :subset token type, leaving #set_level to differentiate between sets and subsets
using the :set token type only for tokens that are particular to sets ([, ^, &&, ] and ranges)
removing the :member, :member_hex, :range and :range_hex tokens
treating set ranges and members like any other sub-expression instead
removing the attr Set#members, leaving #expressions to access members, ranges and subsets

Thus, parsing /a[bc-d]/ could yield something like

#<Root @expressions=[
  #<Literal @type=:literal, @token=:literal, @text="a" >,
  #<CharacterSet @expressions=[
    #<Literal @type=:literal, @token=:literal, @text="b" >,
    #<Range @type=:set, @token=:range, @expressions=[
      #<Literal @type=:literal, @token=:literal, @text="c" >,
      #<Literal @type=:literal, @token=:literal, @text="d" >
    ]>
  ]>
]>
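To illustrate how such a tree could be consumed, here is a minimal sketch built from plain Ruby Structs rather than the real regexp_parser classes. The `Node` struct, the `literal` helper, the `each_leaf` method and the `:character` / `:root` token names are all hypothetical and only mirror the shape shown above.

```ruby
# Hypothetical stand-in for the proposed expression nodes; these are
# NOT the actual regexp_parser classes.
Node = Struct.new(:type, :token, :text, :expressions) do
  # Yield every node that has no sub-expressions, depth-first.
  def each_leaf(&block)
    return enum_for(:each_leaf) unless block_given?

    if expressions.nil? || expressions.empty?
      yield self
    else
      expressions.each { |exp| exp.each_leaf(&block) }
    end
  end
end

# Hypothetical helper to build a literal node.
def literal(text)
  Node.new(:literal, :literal, text, [])
end

# Shape of the proposed tree for /a[bc-d]/
range = Node.new(:set, :range, nil, [literal('c'), literal('d')])
set   = Node.new(:set, :character, nil, [literal('b'), range])
root  = Node.new(:expression, :root, nil, [literal('a'), set])

# Members, ranges and subsets are all reachable through #expressions,
# so no separate #members accessor is needed.
set.expressions.map(&:token) # => [:literal, :range]
root.each_leaf.map(&:text)   # => ["a", "b", "c", "d"]
```

With this shape, code that inspects set contents can use the same traversal as for any other expression instead of re-parsing member strings.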

The only tricky bit is rewiring the ragel machines in the right way and catching all ranges.
On the other hand, it would probably lead to less code, as special treatment is only needed for a few things within sets: the set tokens plus ., \b, and [:...:] if I am not mistaken.

What do you think, @ammar?


ammar · 2017-11-19T13:57:02Z

Thanks for this @Janosch-x.

Character sets are the most poorly implemented part of the scanner. They barely satisfy the most basic of cases.

I like the approach you outlined. Removing :subset and treating :set members as sub-expressions is a great idea. It should simplify the scanner and its use.

I wonder if extracting the character set scanning logic into a sub-machine, like properties.rl, would make it easier to implement the changes. It might require breaking scanner.rl into smaller parts, which could be a good thing. I'd like to explore that a little, hopefully soon.

jaynetics · 2017-11-23T20:34:51Z

> I wonder if extracting the character set scanning logic into a sub-machine, like properties.rl, would make it easier to implement the changes.

That's what I thought, too. I might be able to help a bit with the whole thing, at least by setting up tests beforehand.

Another thing that just crossed my mind is that it could make sense to have an Intersection expression with subexpressions as well. Right now, you have to keep track of the preceding and succeeding set member yourself to find out what is being intersected. We could do something like this instead:

Regexp::Parser.parse(/[a-c&&b]/) # =>
#<Root @expressions=[
  #<CharacterSet @expressions=[
    #<Intersection @type=:set, @token=:intersection, @expressions=[
      #<Range @type=:set, @token=:range, @expressions=[
        #<Literal @type=:literal, @token=:literal, @text="a" >,
        #<Literal @type=:literal, @token=:literal, @text="c" >
      ]>,
      #<Literal @type=:literal, @token=:literal, @text="b" >
    ]>
  ]>
]>

The least complicated way to achieve this might be doing it purely in parser.rb, kind of the same way alternation expressions are handled.
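As a rough sketch of that parser-side grouping, the snippet below starts a new operand sequence at every `&&`, much like a new alternative is started at every `|`. `Member`, `Intersection` and `group_intersections` are hypothetical names for illustration and do not reflect the actual parser.rb code.

```ruby
# Hypothetical stand-ins; these are NOT the actual regexp_parser classes.
Member       = Struct.new(:token, :text)
Intersection = Struct.new(:expressions)

# Group a flat list of set members into one Intersection node whose
# expressions are the operand sequences found between && tokens.
# Returns the list unchanged when it contains no && token.
def group_intersections(members)
  return members unless members.any? { |m| m.token == :intersection }

  operands = [[]]
  members.each do |member|
    if member.token == :intersection
      operands << [] # start the next operand sequence
    else
      operands.last << member
    end
  end

  [Intersection.new(operands)]
end

# Flat member list as it might arrive for /[a-c&&b]/
flat = [
  Member.new(:range, 'a-c'),
  Member.new(:intersection, '&&'),
  Member.new(:literal, 'b')
]

grouped = group_intersections(flat)
grouped.first.expressions.map { |ops| ops.map(&:text) } # => [["a-c"], ["b"]]
```

Doing this after scanning means consumers no longer have to track the preceding and succeeding members themselves.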

jaynetics · 2018-09-07T18:45:19Z

resolved by #55

jaynetics mentioned this issue Apr 30, 2018

Improve set handling #55

Merged

jaynetics closed this as completed Sep 7, 2018
