Stop instantiating character ranges #99

qntm · 2023-08-06T23:00:05Z

Part of #81.

Fsm no longer tolerates missing states or transitions - every state, including an "oblivion" state if desired, must be explicitly part of the set of states provided, and every transition must also be provided. It no longer raises OblivionErrors or tolerates ANYTHING_ELSE. This was done to make the rest of this logic less agonisingly painful.
Fsm now requires that states always be integers and that its alphabet consist of Charclasses - it no longer tolerates strings or other values as symbols. This had some ramifications for testing. Note that this rules out ANYTHING_ELSE as a possible symbol - ANYTHING_ELSE has been removed entirely. Fsm is no longer intended to serve any generic finite state machine functionality and is instead specifically dedicated to handling strings for regular expressions.
To support this, the dependency between Fsm and Charclass has been inverted. Previously, Charclass had a to_fsm method. Now, Fsm has a from_charclass static function.
Note that the methods Fsm.derive, Fsm.accepts and Fsm.strings still accept/return strings (Python values of type str), not sequences of Charclasses.
Fsm now additionally requires that its alphabet of Charclasses fully partition the space of all possible Unicode characters. This means that instead of ANYTHING_ELSE, it requires some kind of negated Charclass. A sample alphabet is (Charclass("a"), Charclass("b"), ~Charclass("ab")).
Because every Fsm essentially has the same "alphabet", we no longer need to gather or unify the set of all in-use characters from regular expression elements when constructing those Fsms. All of those alphabet() methods are now gone.
This also means constructors epsilon(alphabet) and null(alphabet) can simply become constants EPSILON and NULL.
Replacing this, we have a sophisticated new Charclass function, repartition, for rewriting those alphabets of Charclasses, and Fsm has a new method replace_alphabet - this is used during manipulations in order to unify alphabets among disparate Fsms and make it possible to combine them with relative ease.
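To illustrate the repartitioning idea described above, here is a minimal sketch. It is not the code in this PR: `repartition_sketch` and its plain-tuple representation of character classes are hypothetical, and the real Charclass function may work quite differently. It takes several classes, each given as inclusive ranges of code points, and splits the whole Unicode space into blocks so that every input class is an exact union of blocks.

```python
from collections import defaultdict

MAX_ORD = 0x10FFFF  # highest Unicode code point


def repartition_sketch(classes):
    """Split 0..MAX_ORD into blocks keyed by which input classes contain them.

    `classes` is a list of tuples of inclusive (first, last) code point ranges.
    Hypothetical helper for illustration only, not greenery's API.
    """
    # Membership can only change at the start of a range or just after its end.
    cuts = {0, MAX_ORD + 1}
    for ranges in classes:
        for first, last in ranges:
            cuts.add(first)
            cuts.add(last + 1)
    points = sorted(cuts)

    blocks = defaultdict(list)
    for start, end in zip(points, points[1:]):
        # One boolean per input class: does it contain this interval?
        signature = tuple(
            any(first <= start <= last for first, last in ranges)
            for ranges in classes
        )
        block = blocks[signature]
        if block and block[-1][1] == start - 1:
            # Extend the previous range instead of storing an adjacent one.
            block[-1] = (block[-1][0], end - 1)
        else:
            block.append((start, end - 1))
    return {sig: tuple(ranges) for sig, ranges in blocks.items()}


# [ab] and [bc] as ord ranges
parts = repartition_sketch([((97, 98),), ((98, 99),)])
for signature, ranges in parts.items():
    print(signature, ranges)
```

With `[ab]` and `[bc]` as input, the blocks are `a` alone, `b` alone, `c` alone, and everything else, which is the kind of shared alphabet two Fsms would need before they can be combined.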

All of the above makes it possible for an Fsm to handle a relatively large collection of characters using only a relatively small collection of individual Charclass symbols. For example, [\t\n\r -\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF] is now a single symbol, and an Fsm making use of it will have a single transition for that symbol.

So far so good. All of that paves the way for the next part:

Charclass no longer stores a collection of characters. Instead it stores a list of "ord ranges", which are inclusive ranges of Unicode character numbers. Charclass("abce"), for example, stores ((97, 99), (101, 101)). Some sophisticated, moderately efficient methods negate and add_ord_range have been added to make it possible to sanely manage large collections of these ranges, maintaining their sequence, merging or separating them when appropriate.
Due to this, stringification of Charclass is much simplified.
Parsing of Charclasses is also modified slightly.
We no longer directly reference charclass.chars or charclass.ord_ranges - instead we use new helper methods get_chars, num_chars and accepts to determine what the Charclass has inside of it.
Yes, negations inside of negations like [1a\\D] do still work.
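The "ord ranges" representation described above can be sketched as follows. This is an assumption-laden illustration, not the Charclass implementation: `to_ord_ranges` is a hypothetical name, and it only shows how a set of characters collapses into sorted, inclusive ranges with adjacent code points merged.

```python
def to_ord_ranges(chars):
    """Collapse characters into sorted, inclusive (first, last) code point ranges.

    Hypothetical helper for illustration only, not greenery's API.
    """
    ranges = []
    for code in sorted({ord(c) for c in chars}):
        if ranges and code == ranges[-1][1] + 1:
            # Directly follows the previous range, so extend it.
            ranges[-1] = (ranges[-1][0], code)
        else:
            ranges.append((code, code))
    return tuple(ranges)


print(to_ord_ranges("abce"))  # ((97, 99), (101, 101))
```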

All of this in turn means that a Charclass like [\t\n\r -\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF] no longer instantiates a chars collection with over 1,000,000 individual characters in it. Instead it maintains just a few ranges internally, and it can be combined with other Charclasses relatively efficiently.
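As a rough sketch of why this is cheap, the size and membership of such a class can be computed from the ranges alone. The function names below echo the helpers mentioned earlier, but these are standalone, hypothetical versions that assume a plain tuple of inclusive ranges; the real methods may differ.

```python
# [\t\n\r -\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF] written as inclusive ord ranges
XML_CHAR_RANGES = (
    (0x09, 0x0A),
    (0x0D, 0x0D),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def num_chars(ord_ranges):
    """Count the characters covered without materialising any of them."""
    return sum(last - first + 1 for first, last in ord_ranges)


def accepts(ord_ranges, char):
    """Check membership by scanning the (few) ranges."""
    code = ord(char)
    return any(first <= code <= last for first, last in ord_ranges)


print(num_chars(XML_CHAR_RANGES))     # 1112033
print(accepts(XML_CHAR_RANGES, "a"))  # True
print(accepts(XML_CHAR_RANGES, "\x00"))  # False
```

Five tuples stand in for more than a million characters, and both operations touch only those five tuples.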

This was a total nightmare taking multiple solid days of work. I decided there was no way to do this piecemeal; it had to be done all in one shot. I'm likely to spend a little while longer looking over this code to see if it can be improved, and I expect folks might want to lint it a little. There may be lingering performance hangups for these nasty cases, but I tackled all the obvious stuff.

The public API of greenery is unchanged. This is essentially a performance uplift.

qntm · 2023-08-06T23:26:54Z

Future work: it might be possible to simply eliminate the concept of a "negated character class". Instead of ~Charclass("a"), which internally stores ord_ranges of ((97, 97),) and a self.negated flag, we could internally store ord_ranges of ((0, 96), (98, 1114111)) and scrap the flag. This could potentially simplify a great deal of logic.
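A minimal sketch of what that flag-free representation would involve, assuming ranges are sorted, non-overlapping and inclusive. `complement` is a hypothetical helper, not something in this PR:

```python
MAX_ORD = 0x10FFFF  # 1114111, the highest Unicode code point


def complement(ord_ranges):
    """Return the ranges covering every code point NOT in `ord_ranges`.

    Assumes the input is sorted, non-overlapping and inclusive.
    Hypothetical helper for illustration only, not greenery's API.
    """
    result = []
    next_start = 0  # first code point not yet accounted for
    for first, last in ord_ranges:
        if first > next_start:
            result.append((next_start, first - 1))
        next_start = last + 1
    if next_start <= MAX_ORD:
        result.append((next_start, MAX_ORD))
    return tuple(result)


print(complement(((97, 97),)))  # ((0, 96), (98, 1114111))
```

Applying `complement` twice returns the original ranges, so negation stays reversible without a separate flag.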

qntm added 30 commits July 31, 2023 21:33

Stop allowing missing states in the map

cbf883c

No more oblivion states

b6c284d

Eliminate some complex logic for handling missing transitions

e9785dd

Introduce a function for character class repartitioning

f42949f

A few more

b17b948

Merges?

b453496

Merges.

b210e8e

Always require ANYTHING_ELSE in the alphabet

d03ca08

Eliminate logic accounting for ANYTHING_ELSE being missing

f4c0cc8

Disallow a None state

5f0e1c0

Flip some logic

c16995f

Allow an Fsm's alphabet to be replaced

cc96197

Split out ANYTHING_ELSE, move Charclass Fsm constructor

538d8da

Move tests over

a29abbb

Fsms now use Charclasses internally instead of single characters

8350680

Introducing combine_alphabets, the worst code ever

03d2e79

This is actually working, countdown to hitting a brick wall

b4a7d6d

This is mad but we no longer use ANYTHING_ELSE inside Fsm

627bdbd

Stop allowing ANYTHING_ELSE in Fsm constructions

fa8a976

ANYTHING_ELSE is dead

68d6406

Some simplifications

77c9e0e

All symbols must now be Charclasses

fcf2a91

Some simplifications

146c127

Rename get_chars back to alphabet

564a6e3

Fix Charclass sorting

269a587

Only allow integer states

87cba49

Formatting

1df2cf5

Fix all type errors

2a9babf

Placate the linter

1459cc0

Final

2de274f

qntm added 18 commits August 5, 2023 22:17

Stop requiring an alphabet for null and epsilon

618b8fb

Make those constants

fc1203b

Apply some simplifications

3021065

Allow larger Charclasses and fix tests

33712e8

Guess we don't need this!

ec18773

Charclasses now use single-character ranges internally

c5b901b

Alter the Charclass constructor to require ranges

0c6e513

Private API for Charclass

c6fef8a

Halfway there

3e68d9c

2/3rds of the way there

2d79228

Nightmare code

8004f2f

THAT WAS HAAARD

b6598db

It works.

07d5001

Simplify tests again

e3cc855

Well it's working...

466b5a6

Fix some performance issues

566c5ee

A final performance thing?

09222f9

Penultimateness

2d4274b

qntm self-assigned this Aug 6, 2023

qntm added 5 commits August 7, 2023 00:00

Lint

b2b8e6c

Lint 2

3a72904

Lint 3

7ff30be

Lint 4 (?)

a2ffff8

Lint 5

ef14e4e

qntm merged commit a1d49d6 into main Aug 6, 2023

qntm deleted the cc3 branch August 6, 2023 23:15

qntm mentioned this pull request Aug 6, 2023

fsm: Adding "CharClass" to FSM #93

Closed

qntm mentioned this pull request Jan 8, 2024

Make FSM public #105

Closed
