Overhaul of Set#members needed #47
Comments
Thanks for this @Janosch-x. Character sets are the most poorly implemented part of the scanner; they barely satisfy the most basic of cases. I like the approach you outlined. Removing I wonder if extracting the character set scanning logic into a sub-machine, like |
That's what I thought, too. I might be able to help a bit with the whole thing, at least by setting up tests beforehand. Another thing that just crossed my mind is that it could make sense to have an
The least complicated way to achieve this might be doing it purely in |
resolved by #55 |
Right now, handling the content of character sets with `regexp_parser` is hard:

- `Scanner` detects only a few ranges successfully, as detailed in issue #29 ("Only alphanumeric character set ranges are detected as ranges").
- `Scanner` returns inconclusive information about member tokens because they all have the type `:set`. Issue #28 ("Inconsistent scanning of properties within sets") describes this for properties, but it also affects `\a`, `\e`, `\n`, `\t`, `\u`, `\v` and more.
- `Parser` then "throws away" even this limited information, as it only relays the `Token#text` to `Set#members`. (Re-running `Parser#parse` on individual `Set#members` is a poor workaround for this.)

What I have in mind as a general solution is the following:

- `:subset` token type, leaving `#set_level` to differentiate between sets and subsets
- `:set` token type only for tokens that are particular to sets (`[`, `^`, `&&`, `]` and ranges)
- `:member`, `:member_hex`, `:range` and `:range_hex` tokens
- `Set#members`, leaving `#expressions` to access members, ranges and subsets

Thus, parsing `/a[bc-d]/` could yield something like

The only tricky bit is rewiring the Ragel machines in the right way and catching all ranges.
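
As a rough illustration of the kind of result described above for `/a[bc-d]/`, here is a minimal sketch in plain Ruby. It does not use `regexp_parser`; the class names (`Root`, `Literal`, `CharSet`, `SetRange`) and the `describe` helper are hypothetical and chosen only for this example. The one idea borrowed from the proposal is that members, ranges and nested sets are all reachable through a single `#expressions` accessor, so nothing has to be re-parsed from member text.

```ruby
# Hypothetical node classes for illustration only; these names are NOT
# taken from regexp_parser.
Root     = Struct.new(:expressions)
Literal  = Struct.new(:text)
SetRange = Struct.new(:from, :to)
CharSet  = Struct.new(:negated, :expressions)

# One possible shape for the result of parsing /a[bc-d]/.
tree = Root.new([
  Literal.new('a'),
  CharSet.new(false, [
    Literal.new('b'),
    SetRange.new(Literal.new('c'), Literal.new('d')),
  ]),
])

# Walk the tree through the single #expressions accessor and return one
# indented line per node.
def describe(node, depth = 0)
  label =
    case node
    when Literal  then "literal #{node.text.inspect}"
    when SetRange then "range #{node.from.text}-#{node.to.text}"
    when CharSet  then node.negated ? 'negated set' : 'set'
    when Root     then 'root'
    end
  children = node.respond_to?(:expressions) ? node.expressions : []
  ["#{'  ' * depth}#{label}"] + children.flat_map { |child| describe(child, depth + 1) }
end

puts describe(tree)
```

With a structure like this, a consumer can tell a plain member from a range by node class alone, which is the information the `:set`-typed tokens currently do not carry.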
On the other hand, it would probably lead to less code, as special treatment is only needed for a few things within sets: the set tokens plus `.`, `\b`, and `[:...:]`, if I am not mistaken.

What do you think, @ammar?
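
To make the "catching all ranges" part more concrete, here is a small sketch of range detection over the content of a character set, written with Ruby's standard `StringScanner` rather than Ragel. It is not how `regexp_parser` works; the token type names (`:literal`, `:escape`, `:type`, `:posixclass`, `:range`) and both method names are assumptions made for this example. It also deliberately ignores negation (`^`), intersections (`&&`) and nested sets. The point is only that ranges can be found by looking at "atom, dash, atom" triples, regardless of whether the endpoints are alphanumeric.

```ruby
require 'strscan'

# Atom types that may serve as a range endpoint in this sketch.
ENDPOINT_TYPES = %i[literal escape].freeze

def endpoint?(atom)
  ENDPOINT_TYPES.include?(atom[0])
end

# Merge "atom - atom" triples into [:range, from, to] tokens.
# A dash at the start or end of the content stays a literal.
def fold_ranges(atoms)
  tokens = []
  i = 0
  while i < atoms.size
    from, dash, to = atoms[i, 3]
    if dash == [:literal, '-'] && to && endpoint?(from) && endpoint?(to)
      tokens << [:range, from[1], to[1]]
      i += 3
    else
      tokens << from
      i += 1
    end
  end
  tokens
end

# Split the text between "[" and "]" into typed atoms, then fold ranges.
def scan_set_content(content)
  scanner = StringScanner.new(content)
  atoms = []
  until scanner.eos?
    if (text = scanner.scan(/\[:\^?\w+:\]/))
      atoms << [:posixclass, text]            # e.g. [:alpha:]
    elsif (text = scanner.scan(/\\[dDhHsSwW]/))
      atoms << [:type, text]                  # shorthand classes, never endpoints
    elsif (text = scanner.scan(/\\x\h{1,2}|\\u\h{4}|\\./m))
      atoms << [:escape, text]                # hex, unicode and simple escapes
    else
      atoms << [:literal, scanner.getch]
    end
  end
  fold_ranges(atoms)
end

p scan_set_content('bc-d')
p scan_set_content('!-/\\t[:alpha:]-')
```

Because escapes are consumed as whole atoms before ranges are folded, an escaped dash (`\-`) is never mistaken for a range operator, and escapes such as `\t` or `\x41` keep their own type instead of collapsing into a generic set member.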