first phase of migrating to regex-automata #977
Merged
Conversation
This sets 'rust-version' to 1.60 and also increases the pinned Rust version that we test against in CI to 1.60.0. Rust 1.60.0 was released over a year ago and contains some important stuff. Notably, it includes namespaced and weak dependency features that are used in the (soon to be) released aho-corasick 1.0. They will also be extensively used in regex-automata 0.3, which is coming to a rust-lang/regex repository near you Real Soon Now.
Apparently in C, an empty parameter list means "the function takes an unspecified number of arguments." (lol.) But an explicit void means "the function takes zero arguments." The latter is indeed what we want here. Ref: https://softwareengineering.stackexchange.com/questions/286490/what-is-the-difference-between-function-and-functionvoid Closes #942
This is justified by the fact that a RegexSet is, after all, a set. And a set has a very obvious default value: the empty set. Plus, this is exactly what you get by passing a default `Vec` or an empty iterator to the `RegexSet::new` constructor. We specifically do not add a `Default` impl for Regex because it has no obvious default value. Fixes #905, Closes #906
There will be a new 'regex-cli' tool that will supplant this (and more).
'sc' refers to the 'Currency_Symbol' general category, but is also the abbreviation for the 'Script' property. So when going through the canonicalization process, it would get normalized to 'Script' before being checked as a general category. We fix it by special casing it. See also #719 Fixes #835, #899
This is more similar to the \p{Cf} bug than the \p{Sc} bug, but basically, 'lc' is an abbreviation for both 'Cased_Letter' and 'Lowercase_Mapping'. Since we don't support the latter (currently), we make 'lc' map to 'Cased_Letter'. If we do ever add 'Lowercase_Mapping' in the future, then we will just require users to type out its full form. Fixes #965
Previously this was only defined on 'ClassUnicode', but since 'Class' might contain a 'ClassUnicode', it should be defined here too. We don't need to update any call sites since this crate doesn't actually use 'Class::case_fold_simple' directly, and instead manipulates the underlying 'ClassUnicode' or 'ClassBytes'.
This effectively bumps the MSRV of 'regex' to Rust 1.56, which was released in Oct 2021. It's not quite a year at the time of writing, but I expect it will be a year by the time this change is released.
It turns out that all uses of 'as' in the regex-syntax crate can be replaced with either explicitly infallible routines (like 'u32::from(char)'), or with routines that will panic on failure. These panics are strictly better than truncating casts that might otherwise lead to subtle bugs in the context of this crate. (Namely, we never really care about the perf effects here, since regex parsing is just never a bottleneck.)
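To illustrate the distinction (a hedged sketch, not code from this crate): `u32::from(char)` is explicitly infallible, while `TryFrom` surfaces narrowing failures instead of silently truncating the way `as` does.

```rust
fn main() {
    // 'u32::from(char)' can never fail or truncate: every char fits in a u32.
    let snowman = '☃';
    assert_eq!(u32::from(snowman), 0x2603);

    // For narrowing conversions, 'try_from' makes failure explicit...
    assert_eq!(u8::try_from(97u32), Ok(97u8));
    assert!(u8::try_from(300u32).is_err());

    // ...whereas an 'as' cast silently truncates, which is the kind of
    // subtle bug the commit above is avoiding.
    assert_eq!(300u32 as u8, 44);
}
```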
This method was deprecated a while ago, but we kept it around because it wasn't worth a breaking release to remove it. This also simplifies some imports.
This marks the various error types as '#[non_exhaustive]' instead of using a __Nonexhaustive variant hack. Closes #884
An empty character class is effectively a way to write something that can never match anything. The regex crate has pretty much always returned an error for such things because it was never taught how to handle "always fail" states. Partly because I just didn't think about it when initially writing the regex engines and partly because it isn't often useful. With that said, it should be supported for completeness and because there is no real reason to not support it. Moreover, it can be useful in certain contexts where regexes are generated and you want to insert an expression that can never match. It's somewhat contrived, but it happens when the interface is a regex pattern. Previously, the ban on empty character classes was implemented in the regex-syntax crate. But with the rewrite in #656 getting closer and closer to landing, it's now time to relax this restriction. However, we do keep the overall restriction in the 'regex' API by returning an error in the NFA compiler. Once #656 is done, the new regex engines will permit this case.
BurntSushi force-pushed the ag/prepare-for-regex-automata branch from 3132a91 to a46751b on April 17, 2023 at 18:37.
When Unicode mode is disabled (i.e., (?-u)), the Perl character classes (\w, \d and \s) revert to their ASCII definitions. The negated forms of these classes are also derived from their ASCII definitions, and this means that they may actually match bytes outside of ASCII and thus possibly invalid UTF-8. For this reason, when the translator is configured to only produce HIR that matches valid UTF-8, '(?-u)\W' should be rejected. Previously, it was not being rejected, which could actually lead to matches that produced offsets that split codepoints, and thus lead to panics when match offsets are used to slice a string. For example, this code:

    fn main() {
        let re = regex::Regex::new(r"(?-u)\W").unwrap();
        let haystack = "☃";
        if let Some(m) = re.find(haystack) {
            println!("{:?}", &haystack[m.range()]);
        }
    }

panics with:

    byte index 1 is not a char boundary; it is inside '☃' (bytes 0..3) of `☃`

That is, it reports a match at 0..1, which is technically correct, but the regex itself should have been rejected in the first place since the top-level Regex API always has UTF-8 mode enabled.

Also, many of the replacement tests were using '(?-u)\W' (or similar) for some reason. I'm not sure why, so I just removed the '(?-u)' to make those tests pass. Whether Unicode is enabled or not doesn't seem to be an interesting detail for those tests. (All haystacks and replacements appear to be ASCII.)

Fixes #895, Partially addresses #738
In effect, this adds support for no_std by depending on only core and alloc. There is still currently some benefit to enabling std support, namely, getting the 'std::error::Error' trait impls for the various error types. (Although, it seems like the 'Error' trait is going to get moved to 'core' finally.) Otherwise, the only 'std' things we use are in tests for tweaking stack sizes. This is the first step in an effort to make 'regex' itself work without depending on 'std'. 'regex' itself will be more precarious since it uses things like HashMap and Mutex that we'll need to find a way around. Getting around HashMap is easy (just use BTreeMap), but figuring out how to synchronize the threadpool will be interesting. Ref #476, Ref #477
I wish this feature were stable and enabled by default. I suspect that it maybe doesn't work correctly 100% of the time, but it's super useful. And manually annotating APIs is a huge pain, so it's worth at least attempting.
Get rid of those old crusty HTML links! Also, if an intradoc link is used that is bunk, fail the build.
I'm not sure exactly why I used three variants instead of two like how I've defined it in this patch. Possibly because the AST uses three variants? (The AST needs to do a little more work to store a span associated with where the name actually is in the expression, so it maybe makes a little more sense there.) In any case, this is the first step of many in simplifying the HIR.
This is apparently not used anywhere. So drop it. Also motivated by wanting to squash look-around assertions into a single enum. So 'is_negated' won't make sense on its own anymore.
Instead of having both 'HirKind::Anchor' and 'HirKind::WordBoundary', this patch flattens them into one 'HirKind::Look'. Why do this? I think they make more sense grouped together. Namely, they are all simplistic look-around assertions and they all tend to be handled with very similar logic.
This greatly simplifies how repetitions are represented in the HIR, from a sprawling set of variants down to just a simple `(u32, Option<u32>)`. This is much simpler and still permits us to specialize all of the cases we did before if necessary. This also simplifies some of the HIR printer's output; e.g., 'a{1}' is just 'a'.
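As a rough sketch (toy type, not the actual regex-syntax definition), every repetition operator collapses into one `(min, max)` pair, where `max: None` means "unbounded":

```rust
// A sketch of representing all repetition operators with a single
// (u32, Option<u32>) pair of (min, max).
#[derive(Debug, PartialEq)]
struct Repetition {
    min: u32,
    max: Option<u32>, // None means "unbounded"
}

fn from_operator(op: &str) -> Repetition {
    match op {
        "*" => Repetition { min: 0, max: None },
        "+" => Repetition { min: 1, max: None },
        "?" => Repetition { min: 0, max: Some(1) },
        _ => unimplemented!("counted forms parse into min/max directly"),
    }
}

fn main() {
    assert_eq!(from_operator("*"), Repetition { min: 0, max: None });
    assert_eq!(from_operator("+"), Repetition { min: 1, max: None });
    assert_eq!(from_operator("?"), Repetition { min: 0, max: Some(1) });
    // 'a{1}' becomes (1, Some(1)), which a printer can emit as plain 'a'.
    let exactly_once = Repetition { min: 1, max: Some(1) };
    assert_eq!(exactly_once, Repetition { min: 1, max: Some(1) });
}
```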
This fixes some corner cases in the HIR printer where it would print the concrete syntax of a regex that does not match the natural interpretation of the HIR. One such example of this is:

    concat(a, alt(b, c))

This would get printed as:

    ab|c

But clearly, it should be printed as:

    a(?:b|c)

The issue here is that the printer only considers the current HirKind when determining how to print it. Sometimes a group is needed to print an alt (and even a concat, in the case of 'rep(+, concat(a, b))'), but sometimes it isn't. We could address this in a few different ways:

1) Always print concats and alts inside a non-capturing group.
2) Make the printer detect precisely the cases where a non-capturing group is needed.
3) Make the HIR smart constructors insert non-capturing groups when needed.
4) Do some other thing to change the HIR to prevent these sorts of things by construction.

This patch goes with (1). The reason in favor of it is that the HIR printer was always about printing an equivalent regex and never about trying to print a "nice" regex. Indeed, the HIR printer can't print a nice regex, because the HIR represents a rigorously simplified view of a regex to make analysis easier. (The most obvious such example is Unicode character classes. For example, the HIR printer never prints '\w'.) So inserting some extra groups (which it already does) even when they aren't strictly needed is perfectly okay. But still, it's useful to say why we didn't do the other choices:

2) Modifying the printer to only print groups when they're actually needed is pretty difficult. I tried this briefly, and handling this case requires some notion of what the parent expression is. This winds up being a possible but hairy change.
3) Making the HIR more complicated to make the printer correct seems like it's optimizing for the wrong thing. Inserting extra groups in places just obfuscates HIR values that already have clear semantics. That is, use concat(a, alt(b, c)) over concat(a, group(alt(b, c))).
4) It's not clear how we would change the HIR to guarantee this sort of thing wouldn't happen. At the very least, it seems likely it would require a more complex data type.

At first, I had thought (1) seemed inelegant. But the more I thought about it, the more it seemed quite consistent with how the HIR printer already worked. So that's the path I took here.

Closes #516, Closes #731
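A toy printer (hypothetical types, not the real regex-syntax ones) shows how option (1) works: wrapping every concat and alt in a non-capturing group sometimes emits redundant groups, but the printed regex is always equivalent to the HIR.

```rust
// A toy HIR and printer illustrating "always wrap concats and alts in a
// non-capturing group."
enum Hir {
    Lit(char),
    Concat(Vec<Hir>),
    Alt(Vec<Hir>),
}

fn print(hir: &Hir) -> String {
    match hir {
        Hir::Lit(c) => c.to_string(),
        Hir::Concat(subs) => {
            let body: String = subs.iter().map(|s| print(s)).collect();
            format!("(?:{})", body)
        }
        Hir::Alt(subs) => {
            let body = subs.iter().map(|s| print(s)).collect::<Vec<_>>().join("|");
            format!("(?:{})", body)
        }
    }
}

fn main() {
    // concat(a, alt(b, c)) must not print as 'ab|c'.
    let hir = Hir::Concat(vec![
        Hir::Lit('a'),
        Hir::Alt(vec![Hir::Lit('b'), Hir::Lit('c')]),
    ]);
    // The outer group is redundant, but the regex is still equivalent.
    assert_eq!(print(&hir), "(?:a(?:b|c))");
}
```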
No matter what 'a' is, 'a{0}' is always equivalent to an empty regex.
This gets rid of the old 'Literal' type:

    enum Literal {
        Unicode(char),
        Byte(u8),
    }

and replaces it with:

    struct Literal(Box<[u8]>);

I did this primarily because I perceive the new version to be a bit simpler and very likely more space efficient given some of the changes I have in mind (upcoming in subsequent commits). Namely, I want to include more analysis information beyond just simple booleans, and this means using up more space. Putting that analysis information on every single byte/char seems gratuitous, but putting it on every single sequence of bytes/chars seems more justifiable. I also have a hand-wavy idea that this might make analysis a bit easier, and another hand-wavy idea that debug-printing such an HIR will make it a bit more comprehensible.

Overall, this isn't a completely obvious win and I do wonder whether I'll regret it. For one thing, the translator is now a fair bit more complicated in exchange for not creating a 'Vec<u8>' for every 'ast::Literal' node. This also gives up the Unicode vs. byte distinction and just commits to "all bytes." Instead, we do a UTF-8 check on every 'Hir::literal' call, and that in turn sets the UTF-8 property. This does seem a bit wasteful, and indeed, we do another UTF-8 check in the compiler (even though we could use 'unsafe' correctly and avoid it). However, once the new NFA compiler lands from regex-automata, it operates purely in byte-land and will not need to do another UTF-8 check. Moreover, a UTF-8 check, even on every literal, is likely barely measurable in the grand scheme of things.

I do also worry that this is overwrought. In particular, the AST creates a node for each character. Then the HIR smooths them out to sequences of characters (that is, Vec<u8>). And then NFA compilation splits them back out into states where a state handles at most one character (or range of characters). But I am taking somewhat of a leap of judgment here that this will make analysis easier and overall use less space. But we'll see.
This makes the Debug impls for Literal and ClassBytesRange a bit better. The former in particular. Instead of just printing a sequence of decimal numbers, we now print them as characters. Given the lackluster support for Vec<u8> as a string in the standard library, we copy a little bit of code from regex-automata to make the debug print for the Vec<u8> basically as nice as a String.
This commit completely rewrites how HIR properties are computed inductively. Firstly, 'Properties' is now boxed, so that it contributes less space to each HIR value. This does add an allocation for each HIR expression, but most HIR expressions already require at least one alloc anyway. And there should be far fewer of them now that we collapse literals together. Secondly, 'Properties' now computes far more general attributes instead of hyper-specific things. For example, instead of 'is_match_empty', we now have 'minimum_len' and 'maximum_len'. Similarly, instead of 'is_anchored_start' and 'is_anchored_end', we now compute sets of look-around assertions found anywhere, only as a prefix and only as a suffix. We also remove 'is_line_anchored_{start,end}'. They were only used in the 'grep-regex' crate and they seem unnecessary. They were otherwise fairly weird properties to compute.
Instead of using a boolean parameter, we just split them into dot_char, dot_byte, any_char, any_byte. Another path would be to use an enum, but this appeals to me a little more.
It turns out they are completely superfluous in the HIR, so we can drop them completely. We only need to explicitly represent capturing groups.
This makes it so 'a{1}' is rewritten as 'a' and '[a]' is rewritten as 'a'. A lot of the tests expected '[a]' to get preserved as a class in the HIR, so this required a bit of surgery.
This is a transitory commit that will need to be updated once aho-corasick 1.0 is actually released. Its purpose is to make it so the regex crate, the "old" regex crate and regex-automata all agree on the same version of aho-corasick to use while in development.
Now that it *only* represents a capturing group, it makes sense to give it a more specific name.
Where 'sub' is short for 'sub-expression.'
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
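The key property above, that `^`/`$` never match between the `\r` and `\n` of a `\r\n` pair, can be sketched with a small helper (a hypothetical function, not the regex-automata implementation):

```rust
// Sketch of the intended CRLF semantics for '^' in multi-line mode: it
// matches at the start of the haystack, after '\n', or after '\r', except
// *between* the '\r' and '\n' of a '\r\n' pair, which is treated as a
// single line terminator.
fn is_crlf_start_anchor(haystack: &[u8], at: usize) -> bool {
    if at == 0 {
        return true;
    }
    match haystack[at - 1] {
        b'\n' => true,
        b'\r' => haystack.get(at) != Some(&b'\n'),
        _ => false,
    }
}

fn main() {
    let h = b"a\r\nb";
    assert!(is_crlf_start_anchor(h, 0));  // start of haystack
    assert!(!is_crlf_start_anchor(h, 2)); // between \r and \n: no match
    assert!(is_crlf_start_anchor(h, 3));  // after the full \r\n terminator
    let h2 = b"a\rb";
    assert!(is_crlf_start_anchor(h2, 2)); // a lone \r is a terminator too
}
```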
This updates docs in a number of places, including adding examples. We also make it so zero-width matches never impact the 'utf8' property. In practice, this means '(?-u:\B)' is now considered to match valid UTF-8, which is consistent with the fact that 'a*' is considered to match valid UTF-8 too. We also do a refresh of the 'Look' and 'LookSet' APIs.
This resolves a long-standing (but somewhat minor) complaint that folks have with the regex crate: it does not permit escaping punctuation characters in cases where those characters do not need to be escaped. So things like \/, \" and \! would result in parse errors. Most other regex engines permit these, even in cases where they aren't needed. I had been against doing this for future evolution purposes, but it's incredibly unlikely that we're ever going to add a new meta character to the syntax. I literally cannot think of any conceivable future in which that might happen. However, we do continue to ban escapes for [0-9A-Za-z<>], because it is conceivable that we might add new escape sequences for those characters. (And 0-9 are already banned by virtue of them looking too much like backreferences, which aren't supported.) For example, we could add \Q...\E literal syntax. Or \< and \> as start and end word boundaries, as found in POSIX regex engines. Fixes #501
This changes the rules for capture names to be much less restrictive. Namely, the requirements are now: 1. Must begin with an `_` or any alphabetic codepoint. 2. After the first codepoint, the name may contain any sequence of alpha-numeric codepoints along with `_`, `.`, `[` and `]`. Closes #595
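The two rules above can be sketched as a small validator (a hypothetical helper, not the actual parser code):

```rust
// Sketch of the relaxed capture-name rules: the first codepoint must be '_'
// or alphabetic; later codepoints may be alphanumeric or one of '_.[]'.
fn is_valid_capture_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c == '_' || c.is_alphabetic() => {}
        _ => return false, // empty, or starts with a disallowed codepoint
    }
    chars.all(|c| c.is_alphanumeric() || matches!(c, '_' | '.' | '[' | ']'))
}

fn main() {
    assert!(is_valid_capture_name("year"));
    assert!(is_valid_capture_name("_x.y[0]"));
    assert!(is_valid_capture_name("héllo")); // any alphabetic codepoint works
    assert!(!is_valid_capture_name("1abc")); // may not start with a digit
    assert!(!is_valid_capture_name(""));
}
```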
This adds a new routine for computing the static number of capture groups that will appear in every match. If the number of groups is not invariant across all matches, then there is no static capture length. This is meant to help implement higher level convenience APIs for extracting capture groups, such as the one described in #824. We may wind up including such APIs in the regex crate itself, but this commit stops short of that. Instead, we just add this new property which should permit those APIs to exist outside of this crate for now. Closes #908
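The invariance idea can be sketched over a toy pattern type (not the real HIR): a static count exists only if every alternation branch yields the same number of groups.

```rust
// Sketch of the idea behind a static capture length: Some(n) only when
// every possible match contains exactly n capture groups.
enum Pat {
    Lit,
    Group(Box<Pat>),
    Concat(Vec<Pat>),
    Alt(Vec<Pat>),
}

fn static_captures_len(p: &Pat) -> Option<usize> {
    match p {
        Pat::Lit => Some(0),
        Pat::Group(sub) => Some(1 + static_captures_len(sub)?),
        Pat::Concat(subs) => {
            let mut n = 0;
            for s in subs {
                n += static_captures_len(s)?;
            }
            Some(n)
        }
        Pat::Alt(subs) => {
            let mut counts = subs.iter().map(static_captures_len);
            let first = counts.next()??;
            for c in counts {
                // Branches with differing counts mean no static length.
                if c? != first {
                    return None;
                }
            }
            Some(first)
        }
    }
}

fn main() {
    // (a)|(b): each match has exactly one group, so the length is static.
    let same = Pat::Alt(vec![
        Pat::Group(Box::new(Pat::Lit)),
        Pat::Group(Box::new(Pat::Lit)),
    ]);
    assert_eq!(static_captures_len(&same), Some(1));
    // (a)|b: one branch has a group and the other doesn't, so it's None.
    let differ = Pat::Alt(vec![Pat::Group(Box::new(Pat::Lit)), Pat::Lit]);
    assert_eq!(static_captures_len(&differ), None);
}
```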
And do the same for 'static_captures_len'. The motivation for this is that the top-level Regex API had equivalently named methods 'captures_len' and 'static_captures_len', except those included the implicit group and were therefore always 1 more than the same APIs on Hir. We distinguish them by renaming the routines on HIR.
It turns out that it's not too hard to get HIR translation to run pretty slowly with some carefully crafted regexes. For example:

    (?i:[[:^space:]------------------------------------------------------------------------])

This regex is actually a [:^space:] class that has an empty class subtracted from it 36 times. For each subtraction, the resulting class, despite it not having changed, goes through Unicode case folding again. This in turn slows things way down.

We introduce a fairly basic optimization that basically keeps track of whether an interval set has been folded or not. The idea was taken from PR #893, but was tweaked slightly. The magic of how it works is that if two interval sets have already been folded, then they retain that property after any of the set operations: negation, union, difference, intersection and symmetric difference. So case folding should generally only need to be run once for each "base" class, but then not again as operations are performed.

Some benchmarks were added to rebar (which isn't public yet at time of writing).

Closes #893
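A minimal sketch of the "already folded" flag (toy type, not the real interval set): folding marks the set, set operations between folded sets preserve the mark, and a second fold becomes a no-op.

```rust
// Sketch of the "already folded" optimization.
#[derive(Clone)]
struct IntervalSet {
    ranges: Vec<(char, char)>,
    folded: bool,
}

impl IntervalSet {
    fn case_fold_simple(&mut self) {
        if self.folded {
            return; // the key optimization: skip redundant folding work
        }
        // ... expensive Unicode case folding would happen here ...
        self.folded = true;
    }

    fn union(&mut self, other: &IntervalSet) {
        self.ranges.extend_from_slice(&other.ranges);
        // Foldedness is preserved: folded ∪ folded is still folded.
        self.folded = self.folded && other.folded;
    }
}

fn main() {
    let mut a = IntervalSet { ranges: vec![('a', 'z')], folded: false };
    a.case_fold_simple();
    let b = a.clone();
    a.union(&b);
    assert!(a.folded); // no refold needed after the set operation
    a.case_fold_simple(); // and this second fold is now a no-op
}
```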
I'm overall coming around to the opinion that these tend to make the code harder to read. So I've been steadily dropping the Result aliases.
This rewrites how Unicode simple case folding worked. Instead of just defining a single function and expecting callers to deal with the fallout, we now define a stateful type that "knows" about the structure of the case folding table. For example, it now knows enough to avoid binary search lookups in most cases. All we really have to do is require that callers look up codepoints in sequence, which is perfectly fine for our use case. Ref #893
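The "lookups in sequence" trick can be sketched like so (tiny made-up table and hypothetical type, not the generated Unicode data): because queries arrive in ascending order, the folder scans forward from its last position instead of binary searching each time.

```rust
// Sketch of a stateful case folder that exploits in-order lookups.
struct SimpleCaseFolder<'a> {
    table: &'a [(char, char)], // sorted by the first element
    last: usize,               // index of the next candidate entry
}

impl<'a> SimpleCaseFolder<'a> {
    fn new(table: &'a [(char, char)]) -> SimpleCaseFolder<'a> {
        SimpleCaseFolder { table, last: 0 }
    }

    // Callers must query codepoints in non-decreasing order.
    fn mapping(&mut self, c: char) -> Option<char> {
        while self.last < self.table.len() && self.table[self.last].0 < c {
            self.last += 1;
        }
        match self.table.get(self.last) {
            Some(&(from, to)) if from == c => Some(to),
            _ => None,
        }
    }
}

fn main() {
    let table = &[('A', 'a'), ('B', 'b'), ('K', 'k')];
    let mut folder = SimpleCaseFolder::new(table);
    assert_eq!(folder.mapping('A'), Some('a'));
    assert_eq!(folder.mapping('C'), None);
    assert_eq!(folder.mapping('K'), Some('k'));
}
```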
Previously, classes would show up in the debug representation as very deeply nested things, making them more difficult to read than they need to be. This removes at least a few pretty redundant layers and uses a more compact range notation.
The contract of this function says that any invalid group offset should result in a return value of None. In general, it worked fine, unless the offset was so big that some internal multiplication overflowed. That could in turn produce an incorrect result or a panic. So we fix that here with checked arithmetic. Fixes #738, Fixes #950
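The shape of the fix looks like this (a hedged sketch with a hypothetical two-slots-per-group layout, not the crate's actual code): checked arithmetic turns a would-be overflow into `None`.

```rust
// Guard a slot computation with checked arithmetic so a huge group index
// yields None instead of overflowing (and panicking or wrapping).
fn slots_for_group(group: usize, ngroups: usize) -> Option<(usize, usize)> {
    if group >= ngroups {
        return None; // invalid group offset
    }
    let start = group.checked_mul(2)?; // None on overflow
    let end = start.checked_add(1)?;
    Some((start, end))
}

fn main() {
    assert_eq!(slots_for_group(1, 3), Some((2, 3)));
    assert_eq!(slots_for_group(5, 3), None); // out of bounds
    // Huge index: the multiplication would overflow, so we get None
    // instead of an incorrect result or a panic.
    assert_eq!(slots_for_group(usize::MAX - 1, usize::MAX), None);
}
```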
This makes it clearer that the regex engine works by *logically* treating a haystack as a sequence of codepoints. Or more specifically, Unicode scalar values. Fixes #854
The existing docs were pretty paltry, and it turns out we can be a bit more helpful for folks when they hit this error. Fixes #846
Adding these methods has almost no cost and they can be convenient to have in some cases. Closes #810
The name is somewhat unfortunate, but it's actually kind of difficult to capture the right semantics in the name. The key bit is that the function returns the offset at the point at which a match is known, and that point might vary depending on which internal regex engine was used. Fixes #747
This clarifies that `x` is "verbose mode," and that whitespace becomes insignificant everywhere, including in character classes. We also add guidance for how to insert a space: either escape it or use a hex literal. Fixes #660
It is really unfortunate, but SetMatches::len and SetMatches::iter().count() do not correspond to the same thing. It's not clear why I even added the SetMatches::len method in the first place, but it always returns the number of regexes in the set, and not the number of regexes that matched. We can't change the name (or remove the method) obviously, but we do add a warning to the docs. Fixes #625
And we make it an interesting example, i.e., one that demonstrates preference order semantics. Closes #610
This isn't *strictly* needed because of the existence of Regex::captures_read_at, but it does fill out the singular missing method. Namely, all other search routines have an *_at variant, so we might as well add it for Regex::captures too. Closes #547
This makes it so the Debug impl for Match only shows the actual matched text. Otherwise, the Match shows the entire haystack, which is likely to be misleading. Fixes #514
This is useful when doing structural recursion on a '&Hir' to produce a new 'Hir' derived from it.
Since it uses heap memory and because it's something you typically hang on to in a regex engine, we expose a routine for computing heap memory. We might consider doing this for other types in regex-syntax, but there hasn't been a strong need for it yet.
The wording appears to be a little unclear, so we switch it around a bit. Fixes #975
This will need to be updated again to add a date (maybe today?), but this should cover everything from the commit log.
BurntSushi force-pushed the ag/prepare-for-regex-automata branch from a46751b to 82b0f0d on April 17, 2023 at 18:52.
This gigantic PR is best explained by some bits of the CHANGELOG:
This is a sizeable release that will be soon followed by another sizeable
release. Both of them combined will close over 40 existing issues and PRs.
This first release, despite its size, essentially represents preparatory work
for the second release, which will be even bigger. Namely, this release:

* Updates the aho-corasick dependency to the recently released 1.0 version.
* Updates the regex-syntax dependency to the simultaneously released 0.7
version. The changes to regex-syntax principally revolve around a rewrite of
its literal extraction code and a number of simplifications and optimizations
to its high-level intermediate representation (HIR).

The second release, which will follow ~shortly after the release above, will
contain a soup-to-nuts rewrite of every regex engine. This will be done by
bringing regex-automata into this repository, and then changing the regex
crate to be nothing but an API shim layer on top of regex-automata's API.

These tandem releases are the culmination of about 3 years of on-and-off work
that began in earnest in March 2020.

Ref #656