Stop instantiating character ranges #99

Merged
merged 53 commits into from
Aug 6, 2023


@qntm qntm commented Aug 6, 2023

Part of #81.

  • Fsm no longer tolerates missing states or transitions - every state, including an "oblivion" state if desired, must be explicitly part of the set of states provided, and every transition must also be provided. It no longer raises OblivionErrors or tolerates ANYTHING_ELSE. This was done to make the rest of this logic less agonisingly painful.
  • Fsm now requires that states always be integers and that its alphabet consist of Charclasses - it no longer tolerates strings or other values as symbols. This had some ramifications for testing. Note that this rules out ANYTHING_ELSE as a possible symbol - ANYTHING_ELSE has been removed entirely. Fsm is no longer intended to serve any generic finite state machine functionality and is instead specifically dedicated to handling strings for regular expressions.
  • To support this, the dependency between Fsm and Charclass has been inverted. Previously, Charclass had a to_fsm method. Now, Fsm has a from_charclass static function.
  • Note that the methods Fsm.derive, Fsm.accepts and Fsm.strings still accept/return strings (Python values of type str), not sequences of Charclasses.
  • Fsm now additionally requires that its alphabet of Charclasses fully partition the space of all possible Unicode characters. This means that instead of ANYTHING_ELSE, it requires some kind of negated Charclass. A sample alphabet is (Charclass("a"), Charclass("b"), ~Charclass("ab")).
  • Because every Fsm essentially has the same "alphabet", we no longer need to gather or unify the set of all in-use characters from regular expression elements when constructing those Fsms. All of those alphabet() methods are now gone.
  • This also means constructors epsilon(alphabet) and null(alphabet) can simply become constants EPSILON and NULL.
  • In place of the alphabet() methods, we have a sophisticated new Charclass function, repartition, for rewriting those alphabets of Charclasses, and Fsm has a new method replace_alphabet. These are used during manipulations to unify alphabets among disparate Fsms, which makes it possible to combine them with relative ease.
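
To illustrate the idea of unifying alphabets, here is a minimal standalone sketch of one way to compute a common partition. It is not greenery's repartition: the name refine_partition, the representation of a class as a bare tuple of inclusive (lo, hi) ord ranges, and the grouping strategy are all assumptions made for this example.

```python
from collections import defaultdict

MAX_ORD = 0x10FFFF  # highest Unicode code point


def refine_partition(classes):
    """Hypothetical helper: split 0..MAX_ORD into blocks such that every
    input class is exactly a union of some of the returned blocks.

    Each class is a tuple of inclusive (lo, hi) ord ranges.
    """
    # Membership in any class can only change at these code points.
    cuts = {0, MAX_ORD + 1}
    for ord_ranges in classes:
        for lo, hi in ord_ranges:
            cuts.add(lo)
            cuts.add(hi + 1)
    points = sorted(cuts)

    blocks = defaultdict(list)
    for start, end in zip(points, points[1:]):
        # Every code point in [start, end - 1] belongs to the same classes,
        # so testing `start` alone is enough.
        signature = tuple(
            any(lo <= start <= hi for lo, hi in ord_ranges)
            for ord_ranges in classes
        )
        block = blocks[signature]
        if block and block[-1][1] == start - 1:
            # Adjacent to the previous range in this block: extend it.
            block[-1] = (block[-1][0], end - 1)
        else:
            block.append((start, end - 1))
    return [tuple(block) for block in blocks.values()]


# "a" and "ab" are split into "a", "b" and "everything else".
print(refine_partition([((97, 97),), ((97, 98),)]))
```

Two Fsms whose alphabets have both been rewritten in terms of the same blocks can then be combined symbol by symbol.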

All of the above makes it possible for an Fsm to cover a relatively large collection of characters using only a relatively small collection of individual Charclass symbols. For example, [\t\n\r -\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF] is now a single symbol, and an Fsm making use of it will have a single transition for that symbol.
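
A rough standalone sketch of what this looks like in practice, assuming a toy machine rather than greenery's actual Fsm class. The string labels, the dictionaries and the helper names below are all hypothetical stand-ins; in greenery the symbols are Charclass objects.

```python
MAX_ORD = 0x10FFFF

# Hypothetical alphabet: three symbols that together partition all of
# Unicode, mirroring (Charclass("a"), Charclass("b"), ~Charclass("ab")).
ALPHABET = {
    "a": ((97, 97),),
    "b": ((98, 98),),
    "other": ((0, 96), (99, MAX_ORD)),
}

# Integer states, with every transition spelled out, including those
# leading to the explicit "oblivion" state 2. Accepts "a" followed by
# any number of "b"s.
TRANSITIONS = {
    0: {"a": 1, "b": 2, "other": 2},
    1: {"a": 2, "b": 1, "other": 2},
    2: {"a": 2, "b": 2, "other": 2},
}
INITIAL = 0
FINALS = {1}


def symbol_for(char):
    """Find the single alphabet symbol whose ranges contain this character."""
    code = ord(char)
    for symbol, ord_ranges in ALPHABET.items():
        if any(lo <= code <= hi for lo, hi in ord_ranges):
            return symbol
    raise ValueError("alphabet does not partition Unicode")


def is_complete():
    """Check that no state or transition is missing."""
    return all(
        set(row) == set(ALPHABET) and set(row.values()) <= set(TRANSITIONS)
        for row in TRANSITIONS.values()
    )


def accepts(string):
    """Inputs are still plain str values; symbols are looked up per character."""
    state = INITIAL
    for char in string:
        state = TRANSITIONS[state][symbol_for(char)]
    return state in FINALS


print(is_complete(), accepts("abbb"), accepts("ab\u20ac"))
```

The "other" symbol stands for more than a million characters, yet each state carries exactly one transition for it.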

So far so good. All of that paves the way for the next part:

  • Charclass no longer stores a collection of characters. Instead it stores a list of "ord ranges", which are inclusive ranges of Unicode character numbers. Charclass("abce"), for example, stores ((97, 99), (101, 101)). Some sophisticated, moderately efficient methods negate and add_ord_range have been added to make it possible to sanely manage large collections of these ranges, maintaining their sequence, merging or separating them when appropriate.
  • Due to this, stringification of Charclass is much simplified.
  • Parsing of Charclasses is also modified slightly.
  • We no longer directly reference charclass.chars or charclass.ord_ranges - instead we use new helper methods get_chars, num_chars and accepts to determine what the Charclass has inside of it.
  • Yes, negations inside of negations like [^a\\D] do still work.
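
As a sketch of how such ranges can be kept sorted and merged, the following standalone example rebuilds the ((97, 99), (101, 101)) result quoted above. The function names echo the ones mentioned in the list, but the signatures and the implementation here are assumptions, not greenery's code.

```python
def add_ord_range(ord_ranges, new_range):
    """Return a sorted tuple of disjoint inclusive ranges with new_range
    merged in. Overlapping or adjacent ranges are coalesced."""
    lo, hi = new_range
    before = []
    after = []
    for cur_lo, cur_hi in ord_ranges:
        if cur_hi + 1 < lo:
            # Strictly before the new range, with a gap in between.
            before.append((cur_lo, cur_hi))
        elif hi + 1 < cur_lo:
            # Strictly after the new range, with a gap in between.
            after.append((cur_lo, cur_hi))
        else:
            # Overlaps or touches: absorb it into the new range.
            lo = min(lo, cur_lo)
            hi = max(hi, cur_hi)
    return tuple(before + [(lo, hi)] + after)


def chars_to_ord_ranges(chars):
    """Hypothetical helper: build ord ranges from individual characters."""
    ord_ranges = ()
    for char in chars:
        ord_ranges = add_ord_range(ord_ranges, (ord(char), ord(char)))
    return ord_ranges


print(chars_to_ord_ranges("abce"))  # ((97, 99), (101, 101))
print(add_ord_range(chars_to_ord_ranges("abce"), (100, 100)))  # ((97, 101),)
```

Adding "d" closes the gap, so the two ranges collapse into one.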

All of this in turn means that a Charclass like [\t\n\r -\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF] no longer instantiates a chars collection with over 1,000,000 individual characters in it. Instead it maintains just a few ranges internally, and it can be combined with other Charclasses relatively efficiently.
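
To make the saving concrete, here is a small standalone sketch that counts and tests membership over the ranges for that character class without ever materialising the characters. The free functions num_chars and accepts below are hypothetical stand-ins for the helper methods of the same names, written only for this illustration.

```python
from bisect import bisect_right

# Ord ranges for [\t\n\r -\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF].
# \t and \n are adjacent code points, so they share one range.
ORD_RANGES = (
    (0x09, 0x0A),
    (0x0D, 0x0D),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def num_chars(ord_ranges):
    """Count the characters covered, using arithmetic on the endpoints only."""
    return sum(hi - lo + 1 for lo, hi in ord_ranges)


def accepts(ord_ranges, char):
    """Binary-search the sorted, disjoint ranges for the character's ord."""
    code = ord(char)
    starts = [lo for lo, _ in ord_ranges]
    index = bisect_right(starts, code) - 1
    return index >= 0 and code <= ord_ranges[index][1]


print(len(ORD_RANGES), num_chars(ORD_RANGES))  # 5 1112033
print(accepts(ORD_RANGES, "a"), accepts(ORD_RANGES, "\x00"))  # True False
```

Five ranges stand in for 1,112,033 characters.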

This was a total nightmare taking multiple solid days of work. I decided there was no way to do this piecemeal; it had to be done all in one shot. I'm likely to spend a little while longer looking over this code to see if it can be improved, and I expect folks might want to lint it a little. There may be lingering performance hangups for these nasty cases, but I tackled all the obvious stuff.

The public API of greenery is unchanged. This is essentially a performance uplift.

@qntm qntm self-assigned this Aug 6, 2023
@qntm qntm merged commit a1d49d6 into main Aug 6, 2023
@qntm qntm deleted the cc3 branch August 6, 2023 23:15

qntm commented Aug 6, 2023

Future work: it might be possible to simply eliminate the concept of a "negated character class". Instead of ~Charclass("a"), which internally stores ord_ranges of ((97, 97),) and a self.negated flag, we could internally store ord_ranges of ((0, 96), (98, 1114111)) and scrap the flag. This could potentially simplify a great deal of logic.
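
A minimal sketch of that conversion, assuming sorted, disjoint, inclusive ranges as input. The name complement_ord_ranges is hypothetical and this is not code from the pull request.

```python
MAX_ORD = 0x10FFFF  # 1114111, the highest Unicode code point


def complement_ord_ranges(ord_ranges):
    """Return the ranges covering every code point NOT in ord_ranges."""
    result = []
    next_lo = 0  # first code point not yet accounted for
    for lo, hi in ord_ranges:
        if next_lo < lo:
            # The gap before this range belongs to the complement.
            result.append((next_lo, lo - 1))
        next_lo = hi + 1
    if next_lo <= MAX_ORD:
        # Whatever remains after the last range.
        result.append((next_lo, MAX_ORD))
    return tuple(result)


print(complement_ord_ranges(((97, 97),)))  # ((0, 96), (98, 1114111))
```

Applying the function twice returns the original ranges, which is what would let the explicit form replace the negated flag.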

@qntm qntm mentioned this pull request Jan 8, 2024