first phase of migrating to regex-automata #977

Merged
merged 79 commits into master from ag/prepare-for-regex-automata on Apr 17, 2023

Conversation

BurntSushi
Member

This gigantic PR is best explained by some bits of the CHANGELOG:


This is a sizeable release that will soon be followed by another sizeable
release. Combined, the two will close over 40 existing issues and PRs.

This first release, despite its size, essentially represents preparatory work
for the second release, which will be even bigger. Namely, this release:

  • Increases the MSRV to Rust 1.60.0, which was released about 1 year ago.
  • Upgrades its dependency on aho-corasick to the recently released 1.0
    version.
  • Upgrades its dependency on regex-syntax to the simultaneously released
    0.7 version. The changes to regex-syntax principally revolve around a
    rewrite of its literal extraction code and a number of simplifications and
    optimizations to its high-level intermediate representation (HIR).

The second release, which will follow ~shortly after the release above, will
contain a soup-to-nuts rewrite of every regex engine. This will be done by
bringing regex-automata into this repository, and then changing the regex
crate to be nothing but an API shim layer on top of regex-automata's API.

These tandem releases are the culmination of about 3 years of on-and-off work
that began in earnest in March 2020.


Ref #656

BurntSushi and others added 3 commits April 17, 2023 12:22
This sets 'rust-version' to 1.60 and also increases the pinned Rust
version that we test against in CI to 1.60.0.

Rust 1.60.0 was released over a year ago and contains some important
stuff. Notably, it includes namespaced and weak dependency features that
are used in the (soon to be) released aho-corasick 1.0. They will also
be extensively used in regex-automata 0.3, which is coming to a
rust-lang/regex repository near you Real Soon Now.
Apparently in C, an empty parameter list means "the function takes an
unspecified number of arguments." (lol.) But an explicit void means
"the function takes zero arguments." The latter is indeed what we want
here.

Ref: https://softwareengineering.stackexchange.com/questions/286490/what-is-the-difference-between-function-and-functionvoid

Closes #942
This is justified by the fact that a RegexSet is, after all, a set. And
a set has a very obvious default value: the empty set. Plus, this is
exactly what you get by passing a default `Vec` or an empty iterator to
the `RegexSet::new` constructor.

We specifically do not add a `Default` impl for Regex because it has no
obvious default value.

Fixes #905, Closes #906
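
As a quick illustration of the intended equivalence (a minimal sketch,
assuming the new impl as described above):

  use regex::RegexSet;

  fn main() {
      // The default value is the empty set...
      let empty = RegexSet::default();
      // ...which is exactly what an empty iterator gives you.
      let also_empty = RegexSet::new(std::iter::empty::<&str>()).unwrap();

      assert_eq!(empty.len(), 0);
      assert_eq!(also_empty.len(), 0);
      // An empty set can never match anything.
      assert!(!empty.is_match("anything"));
  }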
BurntSushi and others added 9 commits April 17, 2023 14:37
There will be a new 'regex-cli' tool that will supplant this (and more).
'sc' refers to the 'Currency_Symbol' general category, but is also
the abbreviation for the 'Script' property. So when going through the
canonicalization process, it would get normalized to 'Script' before
being checked as a general category. We fix it by special casing it.

See also #719

Fixes #835, #899
This is more similar to the \p{Cf} bug than the \p{Sc} bug, but
basically, 'lc' is an abbreviation for both 'Cased_Letter' and
'Lowercase_Mapping'. Since we don't support the latter (currently), we
make 'lc' map to 'Cased_Letter'.

If we do ever add 'Lowercase_Mapping' in the future, then we will just
require users to type out its full form.

Fixes #965
Previously this was only defined on 'ClassUnicode', but since 'Class'
might contain a 'ClassUnicode', it should be defined here too.

We don't need to update any call sites since this crate doesn't
actually use 'Class::case_fold_simple' directly, and instead
manipulates the underlying 'ClassUnicode' or 'ClassBytes'.
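
A minimal usage sketch, assuming the regex-syntax hir API ('ClassUnicode',
'ClassUnicodeRange') as it exists at this point:

  use regex_syntax::hir::{Class, ClassUnicode, ClassUnicodeRange};

  fn main() {
      // Build [a-z] as a Unicode class wrapped in the general 'Class' type.
      let ranges = [ClassUnicodeRange::new('a', 'z')];
      let mut class = Class::Unicode(ClassUnicode::new(ranges));
      // Folding can now be applied directly to 'Class' without first
      // matching on the Unicode/bytes variant.
      class.case_fold_simple();
  }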
This effectively bumps the MSRV of 'regex' to Rust 1.56, which was
released in Oct 2021. It's not quite a year at the time of writing, but
I expect it will be a year by the time this change is released.
It turns out that all uses of 'as' in the regex-syntax crate can be
replaced with either explicitly infallible routines (like
'u32::from(char)'), or with routines that will panic on failure. These
panics are strictly better than truncating casts that might otherwise
lead to subtle bugs in the context of this crate. (Namely, we never
really care about the perf effects here, since regex parsing is just
never a bottleneck.)
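
For example (an illustrative sketch, not code from this crate; the function
name is made up):

  fn widths(c: char, n: usize) -> (u32, u16) {
      // Explicitly infallible: every char fits in a u32.
      let codepoint = u32::from(c);
      // Panics loudly on overflow instead of silently truncating like 'n as u16'.
      let small = u16::try_from(n).expect("value fits in u16");
      (codepoint, small)
  }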
This method was deprecated a while ago, but we kept it around because it
wasn't worth a breaking release to remove it.

This also simplifies some imports.
This marks the various error types as '#[non_exhaustive]' instead of
using a __Nonexhaustive variant hack.

Closes #884
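
The shape of the change, sketched with a couple of the crate's error
variants:

  mod before {
      // The old hack: a hidden variant to discourage exhaustive matching.
      pub enum Error {
          Syntax(String),
          CompiledTooBig(usize),
          #[doc(hidden)]
          __Nonexhaustive,
      }
  }

  mod after {
      // The attribute expresses the same intent without a fake variant.
      #[non_exhaustive]
      pub enum Error {
          Syntax(String),
          CompiledTooBig(usize),
      }
  }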
An empty character class is effectively a way to write something that
can never match anything. The regex crate has pretty much always
returned an error for such things because it was never taught how to
handle "always fail" states. Partly because I just didn't think about it
when initially writing the regex engines and partly because it isn't
often useful.

With that said, it should be supported for completeness and because
there is no real reason to not support it. Moreover, it can be useful in
certain contexts where regexes are generated and you want to insert an
expression that can never match. It's somewhat contrived, but it
happens when the interface is a regex pattern.

Previously, the ban on empty character classes was implemented in the
regex-syntax crate. But with the rewrite in #656 getting closer and
closer to landing, it's now time to relax this restriction. However, we
do keep the overall restriction in the 'regex' API by returning an error
in the NFA compiler. Once #656 is done, the new regex engines will
permit this case.
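
Concretely, at this point in the migration the split looks roughly like this
(a sketch; the exact error surface may differ):

  fn main() {
      // A negated class covering every codepoint can never match anything.
      let pattern = r"[^\x00-\x{10FFFF}]";

      // regex-syntax now accepts it and produces an HIR that always fails...
      assert!(regex_syntax::Parser::new().parse(pattern).is_ok());

      // ...but the regex crate itself still rejects it at NFA compilation,
      // until the new engines from #656 land.
      assert!(regex::Regex::new(pattern).is_err());
  }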
@BurntSushi force-pushed the ag/prepare-for-regex-automata branch from 3132a91 to a46751b on April 17, 2023 at 18:37
When Unicode mode is disabled (i.e., (?-u)), the Perl character classes
(\w, \d and \s) revert to their ASCII definitions. The negated forms
of these classes are also derived from their ASCII definitions, and this
means that they may actually match bytes outside of ASCII and thus
possibly invalid UTF-8. For this reason, when the translator is
configured to only produce HIR that matches valid UTF-8, '(?-u)\W'
should be rejected.

Previously, it was not being rejected, which could actually lead to
matches that produced offsets that split codepoints, and thus lead to
panics when match offsets are used to slice a string. For example, this
code

  fn main() {
      let re = regex::Regex::new(r"(?-u)\W").unwrap();
      let haystack = "☃";
      if let Some(m) = re.find(haystack) {
          println!("{:?}", &haystack[m.range()]);
      }
  }

panics with

  byte index 1 is not a char boundary; it is inside '☃' (bytes 0..3) of `☃`

That is, it reports a match at 0..1, which is technically correct, but
the regex itself should have been rejected in the first place since the
top-level Regex API always has UTF-8 mode enabled.

Also, many of the replacement tests were using '(?-u)\W' (or similar)
for some reason. I'm not sure why, so I just removed the '(?-u)' to make
those tests pass. Whether Unicode is enabled or not doesn't seem to be
an interesting detail for those tests. (All haystacks and replacements
appear to be ASCII.)

Fixes #895, Partially addresses #738
In effect, this adds support for no_std by depending on only core and
alloc. There is still currently some benefit to enabling std support,
namely, getting the 'std::error::Error' trait impls for the various
error types. (Although, it seems like the 'Error' trait is going to get
moved to 'core' finally.) Otherwise, the only 'std' things we use are in
tests for tweaking stack sizes.

This is the first step in an effort to make 'regex' itself work without
depending on 'std'. 'regex' itself will be more precarious since it uses
things like HashMap and Mutex that we'll need to find a way around.
Getting around HashMap is easy (just use BTreeMap), but figuring out how
to synchronize the threadpool will be interesting.

Ref #476, Ref #477
I wish this feature were stable and enabled by default. I suspect that
it maybe doesn't work correctly 100% of the time, but it's super useful.
And manually annotating APIs is a huge pain, so it's worth at least
attempting.
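
The feature in question appears to be rustdoc's doc_auto_cfg (an assumption
on my part), in which case the usual opt-in looks like this; the 'docsrs'
cfg name is the docs.rs convention and also an assumption here:

  // In src/lib.rs: enable the nightly-only rustdoc feature only when the
  // 'docsrs' cfg is set (docs.rs passes it via RUSTDOCFLAGS).
  #![cfg_attr(docsrs, feature(doc_auto_cfg))]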
Get rid of those old crusty HTML links!

Also, if an intradoc link is used that is bunk, fail the build.
I'm not sure exactly why I used three variants instead of two like how
I've defined it in this patch. Possibly because the AST uses three
variants? (The AST needs to do a little more work to store a span
associated with where the name actually is in the expression, so it
maybe makes a little more sense there.)

In any case, this is the first step of many in simplifying the HIR.
This is apparently not used anywhere. So drop it.

Also motivated by wanting to squash look-around assertions into a single
enum. So 'is_negated' won't make sense on its own anymore.
Instead of having both 'HirKind::Anchor' and 'HirKind::WordBoundary',
this patch flattens them into one 'HirKind::Look'.

Why do this? I think they make more sense grouped together. Namely, they
are all simplistic look-around assertions and they all tend to be
handled with very similar logic.
This greatly simplifies how repetitions are represented in the HIR from
a sprawling set of variants down to just a simple `(u32, Option<u32>)`.
This is much simpler and still permits us to specialize all of the cases
we did before if necessary.

This also simplifies some of the HIR printer's output. e.g., 'a{1}' is
just 'a'.
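
Roughly, the new representation looks like this (field names follow the
eventual regex-syntax API and are an assumption at this commit):

  use regex_syntax::hir::Hir;

  pub struct Repetition {
      pub min: u32,          // 'a*' => 0, 'a+' => 1, 'a{2,5}' => 2
      pub max: Option<u32>,  // 'a*' => None, 'a{2,5}' => Some(5)
      pub greedy: bool,      // false for lazy repetitions like 'a*?'
      pub sub: Box<Hir>,     // the repeated sub-expression
  }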
This fixes some corner cases in the HIR printer where it would print the
concrete syntax of a regex that does not match the natural
interpretation of the HIR. One such example of this is:

    concat(a, alt(b, c))

This would get printed as

    ab|c

But clearly, it should be printed as:

    a(?:b|c)

The issue here is that the printer only considers the current HirKind
when determining how to print it. Sometimes a group is needed to print
an alt (and even a concat, in the case of 'rep(+, concat(a, b))'), but
sometimes it isn't.

We could address this in a few different ways:

1) Always print concats and alts inside a non-capturing group.
2) Make the printer detect precisely the cases where a non-capturing
   group is needed.
3) Make the HIR smart constructors insert non-capturing groups when
   needed.
4) Do some other thing to change the HIR to prevent these sorts of
   things by construction.

This patch goes with (1). The reason in favor of it is that the HIR printer
was always about printing an equivalent regex and never about trying to
print a "nice" regex. Indeed, the HIR printer can't print a nice regex,
because the HIR represents a rigorously simplified view of a regex to
make analysis easier. (The most obvious such example is Unicode
character classes. For example, the HIR printer never prints '\w'.) So
inserting some extra groups (which it already does) even when they
aren't strictly needed is perfectly okay.

But still, it's useful to say why we didn't do the other choices:

2) Modifying the printer to only print groups when they're actually
   needed is pretty difficult. I tried this briefly, and handling this
   case requires some notion of what the parent expression is. This
   winds up being a possible but hairy change.
3) Making the HIR more complicated to make the printer correct seems
   like it's optimizing for the wrong thing. Inserting extra groups in
   places just obfuscates HIR values that already have clear semantics.
   That is, use concat(a, alt(b, c)) over concat(a, group(alt(b, c))).
4) It's not clear how we would change the HIR to guarantee this sort of
   thing wouldn't happen. At the very least, it seems likely it would
   require a more complex data type.

At first, I had thought (1) seemed inelegant. But the more I thought
about it, the more it seemed quite consistent with how the HIR printer
already worked. So that's the path I took here.

Closes #516, Closes #731
No matter what 'a' is, 'a{0}' is always equivalent to an empty regex.
This gets rid of the old 'Literal' type:

  enum Literal {
    Unicode(char),
    Byte(u8),
  }

and replaces it with

  struct Literal(Box<[u8]>);

I did this primarily because I perceive the new version to be a bit
simpler and very likely more space efficient given some of the
changes I have in mind (upcoming in subsequent commits). Namely, I want
to include more analysis information beyond simple booleans, and
this means using up more space. Putting that analysis information on
every single byte/char seems gratuitous. But putting it on every single
sequence of bytes/chars seems more justifiable.

I also have a hand-wavy idea that this might make analysis a bit easier.
And another hand-wavy idea that debug-printing such an HIR will make it
a bit more comprehensible.

Overall, this isn't a completely obvious win and I do wonder whether
I'll regret this. For one thing, the translator is now a fair bit
more complicated in exchange for not creating a 'Vec<u8>' for every
'ast::Literal' node.

This also gives up the Unicode vs byte distinction and just commits to "all
bytes." Instead, we do a UTF-8 check on every 'Hir::literal' call, and
that in turn sets the UTF-8 property. This does seem a bit wasteful, and
indeed, we do another UTF-8 check in the compiler (even though we could
use 'unsafe' correctly and avoid it). However, once the new NFA compiler
lands from regex-automata, it operates purely in byte-land and will not
need to do another UTF-8 check. Moreover, a UTF-8 check, even on every
literal, is likely barely measurable in the grand scheme of things.

I do also worry that this is overwrought. In particular, the AST creates
a node for each character. Then the HIR smooths them out to sequences of
characters (that is, Vec<u8>). And then NFA compilation splits them back
out into states where a state handles at most one character (or range of
characters). I am taking somewhat of a leap of judgment here that
this will make analysis easier and will overall use less space, but
we'll see.
This makes the Debug impls for Literal and ClassRangeBytes a bit better.
The former in particular. Instead of just printing a sequence of decimal
numbers, we now print them as characters.

Given the lackluster support for Vec<u8> as a string in the standard
library, we copy a little bit of code from regex-automata to make the
debug print for the Vec<u8> basically as nice as a String.
This commit completely rewrites how HIR properties are computed
inductively.

Firstly, 'Properties' is now boxed, so that it contributes less space to
each HIR value. This does add an allocation for each HIR expression, but
most HIR expressions already require at least one alloc anyway. And
there should be far fewer of them now that we collapse literals
together.

Secondly, 'Properties' now computes far more general attributes instead
of hyper-specific things. For example, instead of 'is_match_empty', we
now have 'minimum_len' and 'maximum_len'. Similarly, instead of
'is_anchored_start' and 'is_anchored_end', we now compute sets of
look-around assertions found anywhere, only as a prefix and only as a
suffix.

We also remove 'is_line_anchored_{start,end}'. They were only used in
the 'grep-regex' crate and they seem unnecessary. They were otherwise
fairly weird properties to compute.
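
A small usage sketch of the new shape of the API (method names are per
regex-syntax 0.7 as released, so treat them as an assumption at this commit):

  use regex_syntax::hir::Look;
  use regex_syntax::Parser;

  fn main() {
      let hir = Parser::new().parse(r"^foo(?:bar)?").unwrap();
      let props = hir.properties();

      // General attributes instead of one-off booleans:
      assert_eq!(props.minimum_len(), Some(3)); // "foo"
      assert_eq!(props.maximum_len(), Some(6)); // "foobar"

      // 'is_anchored_start' becomes "is '^' in the prefix look-around set?"
      assert!(props.look_set_prefix().contains(Look::Start));
  }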
Instead of using a boolean parameter, we just split them into dot_char,
dot_byte, any_char, any_byte.

Another path would be to use an enum, but this appeals to me a little
more.
It turns out they are completely superfluous in the HIR, so we can drop
them completely. We only need to explicitly represent capturing groups.
This makes it so 'a{1}' is rewritten as 'a' and '[a]' is rewritten as
'a'.

A lot of the tests expected '[a]' to get preserved as a class in the
HIR, so this required a bit of surgery.
This is a transitory commit that will need to be updated once
aho-corasick 1.0 is actually released. Its purpose is to make it so the
regex crate, the "old" regex crate and regex-automata all agree on the
same version of aho-corasick to use while in development.
Now that it *only* represents a capturing group, it makes sense to give
it a more specific name.
Where 'sub' is short for 'sub-expression.'
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag,
'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also
causes '.' to *not* match \r in addition to \n (unless the 's' flag is
enabled of course).

The intended semantics are that CRLF mode makes \r\n, \r and \n line
terminators but with one key property: \r\n is treated as a single line
terminator. That is, ^/$ do not match between \r and \n.

This partially addresses #244 by adding syntax support. Currently, if
you try to use this new flag, the regex compiler will report an error.
We intend to finish support for this once #656 is complete. (Indeed, at
time of writing, CRLF matching works in regex-automata.)
This updates docs in a number of places, including adding examples.

We also make it so zero-width matches never impact the 'utf8' property.
In practice, this means '(?-u:\B)' is now considered to match valid
UTF-8, which is consistent with the fact that 'a*' is considered to
match valid UTF-8 too.

We also do a refresh of the 'Look' and 'LookSet' APIs.
This resolves a long-standing (but somewhat minor) complaint that folks
have with the regex crate: it does not permit escaping punctuation
characters in cases where those characters do not need to be escaped. So
things like \/, \" and \! would result in parse errors. Most other regex
engines permit these, even in cases where they aren't needed.

I had been against doing this for future evolution purposes, but it's
incredibly unlikely that we're ever going to add a new meta character to
the syntax. I literally cannot think of any conceivable future in which
that might happen.

However, we do continue to ban escapes for [0-9A-Za-z<>], because it is
conceivable that we might add new escape sequences for those characters.
(And 0-9 are already banned by virtue of them looking too much like
backreferences, which aren't supported.) For example, we could add
\Q...\E literal syntax. Or \< and \> as start and end word boundaries,
as found in POSIX regex engines.

Fixes #501
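
For example (assuming the release containing this change):

  fn main() {
      // These superfluous escapes used to be parse errors; now they just
      // mean the literal punctuation character.
      assert!(regex::Regex::new(r"a\/b").is_ok());
      assert!(regex::Regex::new(r#"\"quoted\""#).is_ok());
      assert!(regex::Regex::new(r"\!").is_ok());

      // Escapes of [0-9A-Za-z<>] remain reserved for future escape sequences.
      assert!(regex::Regex::new(r"\i").is_err());
  }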
This changes the rules for capture names to be much less restrictive.
Namely, the requirements are now:

1. Must begin with an `_` or any alphabetic codepoint.
2. After the first codepoint, the name may contain any sequence of
   alpha-numeric codepoints along with `_`, `.`, `[` and `]`.

Closes #595
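
A few examples of what the relaxed rules allow (a sketch assuming the
release containing this change):

  fn main() {
      // Dots (and brackets) are now allowed after the first codepoint,
      // as are non-ASCII alphabetic codepoints.
      assert!(regex::Regex::new(r"(?P<value.0>\d+)").is_ok());
      assert!(regex::Regex::new(r"(?P<straße>\w+)").is_ok());

      // A name still may not begin with a digit.
      assert!(regex::Regex::new(r"(?P<0value>\d+)").is_err());
  }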
This adds a new routine for computing the static number of capture
groups that will appear in every match. If the number of groups is not
invariant across all matches, then there is no static capture length.

This is meant to help implement higher level convenience APIs for
extracting capture groups, such as the one described in #824. We may
wind up including such APIs in the regex crate itself, but this commit
stops short of that. Instead, we just add this new property which should
permit those APIs to exist outside of this crate for now.

Closes #908
And do the same for 'static_captures_len'.

The motivation for this is that the top-level Regex API had equivalently
named methods 'captures_len' and 'static_captures_len', except those
included the implicit group and were therefore always 1 more than the
same APIs on Hir. We distinguish them by renaming the routines on HIR.
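
A usage sketch with the renamed methods (the exact names follow regex-syntax
0.7 as released and are an assumption at this point in the series):

  use regex_syntax::Parser;

  fn main() {
      // Exactly two explicit groups participate in every match.
      let hir = Parser::new().parse(r"(a)(b)|(x)(y)").unwrap();
      assert_eq!(hir.properties().static_explicit_captures_len(), Some(2));

      // Here the count depends on which branch matches, so there is no
      // static capture length.
      let hir = Parser::new().parse(r"(a)|(x)(y)").unwrap();
      assert_eq!(hir.properties().static_explicit_captures_len(), None);
  }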
It turns out that it's not too hard to get HIR translation to run pretty
slowly with some carefully crafted regexes. For example:

    (?i:[[:^space:]------------------------------------------------------------------------])

This regex is actually a [:^space:] class that has an empty class
subtracted from it 36 times. For each subtraction, the resulting
class, despite not having changed, goes through Unicode case folding
again. This in turn slows things way down.

We introduce a fairly basic optimization that basically keeps track of
whether an interval set has been folded or not. The idea was taken from
PR #893, but was tweaked slightly. The magic of how it works is that if
two interval sets have already been folded, then they retain that
property after any of the set operations: negation, union, difference,
intersection and symmetric difference. So case folding should generally
only need to be run once for each "base" class, but then not again as
operations are performed.

Some benchmarks were added to rebar (which isn't public yet at time of
writing).

Closes #893
I'm overall coming around to the opinion that these tend to make the
code harder to read. So I've been steadily dropping the Result aliases.
This rewrites how Unicode simple case folding worked. Instead of just
defining a single function and expecting callers to deal with the
fallout, we now define a stateful type that "knows" about the structure
of the case folding table. For example, it now knows enough to avoid
binary search lookups in most cases. All we really have to do is require
that callers lookup codepoints in sequence, which is perfectly fine for
our use case.

Ref #893
Previously, classes would show up in the debug representation as very
deeply nested things, making them more difficult to read than they need
to be. This removes at least a few pretty redundant layers and uses a
more compact range notation.
The contract of this function says that any invalid group offset should
result in a return value of None. In general, it worked fine, unless the
offset was so big that some internal multiplication overflowed. That
could in turn produce an incorrect result or a panic. So we fix that
here with checked arithmetic.

Fixes #738, Fixes #950
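
The shape of the fix, as a hypothetical sketch (names are illustrative, not
the crate's internals):

  // Compute the slot range for a capture group with checked arithmetic so
  // that an absurdly large group offset yields None instead of overflowing.
  fn slot_range(group: usize) -> Option<(usize, usize)> {
      let start = group.checked_mul(2)?;
      let end = start.checked_add(1)?;
      Some((start, end))
  }

  fn main() {
      assert_eq!(slot_range(1), Some((2, 3)));
      // Near usize::MAX the multiplication overflows; checked_mul turns
      // that into None instead of a panic or a wrapped (wrong) answer.
      assert_eq!(slot_range(usize::MAX), None);
  }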
This makes it clearer that the regex engine works by *logically*
treating a haystack as a sequence of codepoints. Or more specifically,
Unicode scalar values.

Fixes #854
The existing docs were pretty paltry, and it turns out we can be a bit
more helpful for folks when they hit this error.

Fixes #846
Adding these methods has almost no cost and they can be convenient to
have in some cases.

Closes #810
The name is somewhat unfortunate, but it's actually kind of difficult to
capture the right semantics in the name. The key bit is that the
function returns the offset at the point at which a match is known, and
that point might vary depending on which internal regex engine was used.

Fixes #747
This clarifies that `x` is "verbose mode," and that whitespace becomes
insignificant everywhere, including in character classes. We also add
guidance for how to insert a space: either escape it or use a hex
literal.

Fixes #660
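
For example:

  fn main() {
      // In verbose mode, unescaped whitespace is insignificant everywhere,
      // including inside character classes.
      let re = regex::Regex::new(r"(?x)[ a b ]").unwrap();
      assert!(!re.is_match(" ")); // the spaces are not part of the class

      // To match a literal space, escape it or use a hex literal.
      let re = regex::Regex::new(r"(?x)[\ b \x20 ]").unwrap();
      assert!(re.is_match(" "));
  }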
It is really unfortunate, but SetMatches::len and
SetMatches::iter().count() do not correspond to the same thing. It's not
clear why I even added the SetMatches::len method in the first place,
but it always returns the number of regexes in the set, and not the
number of regexes that matched.

We can't change the name (or remove the method) obviously, but we do add
a warning to the docs.

Fixes #625
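
To make the footgun concrete:

  fn main() {
      let set = regex::RegexSet::new([r"\d+", r"[a-z]+", r"^$"]).unwrap();
      let matches = set.matches("abc123");

      // len() is the number of regexes in the set...
      assert_eq!(matches.len(), 3);
      // ...while iter().count() is the number of regexes that matched.
      assert_eq!(matches.iter().count(), 2);
  }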
And we make it an interesting example, i.e., one that demonstrates
preference order semantics.

Closes #610
This isn't *strictly* needed because of the existence of
Regex::captures_read_at, but it does fill out the singular missing
method. Namely, all other search routines have an *_at variant, so we
might as well add it for Regex::captures too.

Closes #547
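
A quick sketch of the new method (assuming it mirrors the other *_at
routines):

  fn main() {
      let re = regex::Regex::new(r"(\w+)").unwrap();
      let hay = "foo bar";

      // Like find_at, the search begins at the given byte offset, but the
      // haystack is still the full string.
      let caps = re.captures_at(hay, 4).unwrap();
      assert_eq!(&caps[1], "bar");
  }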
This makes it so the Debug impl for Match only shows the actual matched
text. Otherwise, the Match shows the entire haystack, which is likely to
be misleading.

Fixes #514
This is useful when doing structural recursion on a '&Hir' to produce a
new 'Hir' derived from it.
Since it uses heap memory and because it's something you typically hang
on to in a regex engine, we expose a routine for computing heap memory.

We might consider doing this for other types in regex-syntax, but there
hasn't been a strong need for it yet.
The wording appears to be a little unclear, so we switch it around a
bit.

Fixes #975
This will need to be updated again to add a date (maybe today?), but
this should cover everything from the commit log.
@BurntSushi force-pushed the ag/prepare-for-regex-automata branch from a46751b to 82b0f0d on April 17, 2023 at 18:52
@BurntSushi merged commit 8161ac8 into master on Apr 17, 2023
@BurntSushi deleted the ag/prepare-for-regex-automata branch on April 17, 2023 at 19:42