Stop instantiating character ranges #99
Merged
Future work: it might be possible to simply eliminate the concept of a "negated character class". Instead of |
Closed
Part of #81.
- `Fsm` no longer tolerates missing states or transitions - every state, including an "oblivion" state if desired, must be explicitly part of the set of states provided, and every transition must also be provided. It no longer raises `OblivionError`s or tolerates `ANYTHING_ELSE`. This was done to make the rest of this logic less agonisingly painful.
- `Fsm` now requires that states always be integers and that its alphabet consist of `Charclass`es - it no longer tolerates strings or other values as symbols. This had some ramifications for testing. Note that this rules out `ANYTHING_ELSE` as a possible symbol - `ANYTHING_ELSE` has been removed entirely.
- `Fsm` is no longer intended to serve any generic finite state machine functionality and is instead specifically dedicated to handling strings for regular expressions.
- The relationship between `Fsm` and `Charclass` has been inverted. Previously, `Charclass` had a `to_fsm` method. Now, `Fsm` has a `from_charclass` static function.
- `Fsm.derive`, `Fsm.accepts` and `Fsm.strings` still accept/return strings (Python values of type `str`), not sequences of `Charclass`es.
- `Fsm` now additionally requires that its alphabet of `Charclass`es fully partition the space of all possible Unicode characters. This means that instead of `ANYTHING_ELSE`, it requires some kind of negated `Charclass`. A sample alphabet is `(Charclass("a"), Charclass("b"), ~Charclass("ab"))`.
- Since every `Fsm` essentially has the same "alphabet", we no longer need to gather or unify the set of all in-use characters from regular expression elements when constructing those `Fsm`s. All of those `alphabet()` methods are now gone. `epsilon(alphabet)` and `null(alphabet)` can simply become constants `EPSILON` and `NULL`.
- There is a new `Charclass` function, `repartition`, for rewriting those alphabets of `Charclass`es, and `Fsm` has a new method `replace_alphabet` - this is used during manipulations in order to unify alphabets among disparate `Fsm`s and make it possible to combine them with relative ease.

All of the above makes it possible for an `Fsm` to cover a relatively large collection of characters using only a relatively small collection of individual `Charclass` symbols. For example, `[\t\n\r -\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]` is now a single symbol, and an `Fsm` making use of it will have a single transition for that symbol.

So far so good. All of that paves the way for the next part:
- `Charclass` no longer stores a collection of characters. Instead it stores a list of "ord ranges", which are inclusive ranges of Unicode character numbers. `Charclass("abce")`, for example, stores `((97, 99), (101, 101))`. Some sophisticated, moderately efficient methods `negate` and `add_ord_range` have been added to make it possible to sanely manage large collections of these ranges, maintaining their sequence, merging or separating them when appropriate.
- [...] `Charclass` is much simplified.
- [...] `Charclass`es is also modified slightly.
- We no longer access `charclass.chars` or `charclass.ord_ranges` directly - instead we use new helper methods `get_chars`, `num_chars` and `accepts` to determine what the `Charclass` has inside of it.
- Character classes such as `[1a\\D]` do still work.

All of this in turn means that a `Charclass` like `[\t\n\r -\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]` no longer instantiates a `chars` collection with over 1,000,000 individual characters in it. Instead it maintains just a few ranges internally, and it can be combined with other `Charclass`es relatively efficiently.

This was a total nightmare taking multiple solid days of work. I decided there was no way to do this piecemeal; it had to be done all in one shot. I'm likely to spend a little while longer looking over this code to see if it can be improved, and I expect folks might want to lint it a little. There may be lingering performance hangups for these nasty cases, but I tackled all the obvious stuff.
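
For readers unfamiliar with the "ord ranges" idea, here is a minimal standalone sketch of it. It is not the code from this pull request: the free functions `to_ord_ranges`, `negate`, `num_chars` and `accepts` are hypothetical stand-ins that only borrow the names mentioned above, and the plain list-of-tuples representation is an assumption made for illustration.

```python
from typing import Iterable, List, Tuple

OrdRange = Tuple[int, int]  # inclusive (first, last) code points
MAX_ORD = 0x10FFFF  # highest Unicode code point


def to_ord_ranges(chars: Iterable[str]) -> List[OrdRange]:
    """Collapse individual characters into sorted, merged inclusive ranges."""
    ranges: List[OrdRange] = []
    for o in sorted({ord(c) for c in chars}):
        if ranges and ranges[-1][1] + 1 == o:
            ranges[-1] = (ranges[-1][0], o)  # extend the previous range
        else:
            ranges.append((o, o))  # start a new range
    return ranges


def negate(ranges: List[OrdRange]) -> List[OrdRange]:
    """Return ranges covering every code point NOT in `ranges`.

    Assumes `ranges` is sorted and non-overlapping.
    """
    result: List[OrdRange] = []
    next_free = 0  # lowest code point not yet accounted for
    for first, last in ranges:
        if first > next_free:
            result.append((next_free, first - 1))
        next_free = last + 1
    if next_free <= MAX_ORD:
        result.append((next_free, MAX_ORD))
    return result


def num_chars(ranges: List[OrdRange]) -> int:
    """Count the characters covered, without enumerating them."""
    return sum(last - first + 1 for first, last in ranges)


def accepts(ranges: List[OrdRange], char: str) -> bool:
    """True if `char` falls inside any of the ranges."""
    o = ord(char)
    return any(first <= o <= last for first, last in ranges)


abce = to_ord_ranges("abce")
not_abce = negate(abce)
# The large class from the description, written directly as five ranges.
big = [(9, 10), (13, 13), (32, 0xD7FF), (0xE000, 0xFFFD), (0x10000, MAX_ORD)]
```

In this sketch, negating a class or counting its members touches only a handful of tuples, even for `big`, which covers over a million code points.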
The public API of `greenery` is unchanged. This is essentially a performance uplift.
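
As a rough illustration of the "alphabet must fully partition Unicode" requirement and of what repartitioning overlapping classes could look like, here is a hedged sketch over plain lists of inclusive ord ranges. The functions `is_partition` and `repartition` below are hypothetical: they do not reproduce the signatures or return types of the real `Charclass.repartition` or `Fsm.replace_alphabet`, and the boundary-cutting approach is an assumption, not necessarily how this pull request does it.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

OrdRange = Tuple[int, int]  # inclusive (first, last) code points
MAX_ORD = 0x10FFFF  # highest Unicode code point


def is_partition(alphabet: Sequence[Sequence[OrdRange]]) -> bool:
    """True if the symbols' ranges are disjoint and cover 0..MAX_ORD."""
    flat = sorted(r for symbol in alphabet for r in symbol)
    expected = 0  # next code point that must be covered
    for first, last in flat:
        if first != expected:
            return False  # a gap (first > expected) or an overlap
        expected = last + 1
    return expected == MAX_ORD + 1


def repartition(classes: Sequence[Sequence[OrdRange]]) -> List[List[OrdRange]]:
    """Split possibly overlapping classes into disjoint blocks.

    Code points belonging to exactly the same set of input classes end
    up in the same block; code points in no class form one extra block.
    """
    # Every point where membership in some class can change.
    cuts = {0, MAX_ORD + 1}
    for ranges in classes:
        for first, last in ranges:
            cuts.add(first)
            cuts.add(last + 1)
    points = sorted(cuts)

    groups: Dict[FrozenSet[int], List[OrdRange]] = {}
    for start, end in zip(points, points[1:]):
        # Membership is constant on [start, end - 1], so test `start` only.
        signature = frozenset(
            i
            for i, ranges in enumerate(classes)
            if any(first <= start <= last for first, last in ranges)
        )
        block = groups.setdefault(signature, [])
        if block and block[-1][1] + 1 == start:
            block[-1] = (block[-1][0], end - 1)  # merge adjacent pieces
        else:
            block.append((start, end - 1))
    return list(groups.values())


# The sample alphabet from the description: "a", "b", and everything else.
sample = [[(97, 97)], [(98, 98)], [(0, 96), (99, MAX_ORD)]]
# Two overlapping classes, [ab] and [bc], rewritten into disjoint blocks.
blocks = repartition([[(97, 98)], [(98, 99)]])
```

Here `blocks` has four entries: everything outside `a`-`c`, then `a`, `b` and `c` individually, so each original class is a union of blocks and the blocks together satisfy `is_partition`.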