-
Notifications
You must be signed in to change notification settings - Fork 445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ironing out regex-lite
#961
Comments
Thank for filing this! So I want to be clear that I'm not convinced we should actually do this. And if we do, it's probably wise to wait to land it until after #656 is done. Although one could start work on it now. (Keeping in mind that I'm still not sold on the idea.) The high level motivation here is that the I don't really see how to fix that other than just splintering off and creating a totally new and separate "lite" version of OK, so how do we actually simplify things? Here's what I think:
Those are my initial thoughts. |
Thanks for the great write-up! I'm fine with either waiting for #656 or starting on things now (although I'm very excited for the refactor 🚀)
Having
My preferred would be 👍 to everything else |
I think you can get started whenever. I think the thing you shouldn't do is come up with a huge test suite. We'll be able to reuse the one that I have as part of #656. Obviously you can write tests as you build, but don't go out of your way to build something huge and comprehensive. Otherwise, I don't think #656 and this project would really intersect much, by design. I think the UTF-8 issue is going to make this gnarly unfortunately. Ideally we would just use |
I've been thinking more about this, and I feel like it would be helpful to write down some core data types. It will help structure thought, discussion and work on this I think. I think the two primary data types are an enum Hir {
Empty,
Char(char),
Class(Vec<(char, char)>),
Repetition {
min: u32,
max: Option<u32>,
greedy: bool,
child: Box<Hir>,
},
Capture {
index: u32,
name: Option<Box<str>>,
child: Box<Hir>,
},
Concat(Vec<Hir>),
Alternation(Vec<Hir>),
} And now for enum State {
Range {
start: char,
end: char,
},
Split {
alt1: StateID,
alt2: StateID,
},
Goto {
target: StateID,
},
Fail,
Match,
} And that's pretty much it. The core matching loop would then proceed by decoding a single rune from the haystack and applying it to the NFA state graph above. This should all work fine for The problem comes up with the I do wonder if we might be able to take Go's approach to this problem. Go's regexp engine doesn't support the This feels like an optimal place to land for me. It keeps the implementation very simple, makes searching Once difference from Go is that we should probably use "substitution of maximal subparts" as our lossy UTF-8 decoding strategy. So for example, Please ask questions if any of this doesn't make sense! |
This comment was marked as off-topic.
This comment was marked as off-topic.
IMO |
That is indeed the case for
Yes of course. But if you're going to make
Basically, as soon as you start talking about matching individual bytes, you wind up with an NFA state that instead looks like this: enum State {
RangeByte {
start: u8,
end: u8,
},
RangeChar {
start: char,
end: char,
},
Split {
alt1: StateID,
alt2: StateID,
},
Goto {
target: StateID,
},
Fail,
Match,
} And now your regex engine needs to know how to deal with both and you need to write an abstraction over what is considered a "character." And mixing Basically, I'm trying to simplify things here so that:
|
I ended up moving forward with this. My plan is to release it at the same time as
I'm not too happy about the size increase in the baseline that we're going to see in the My plan is for You can browse the source of And the API docs here (which aren't written yet, and the API is meant to be a clone of I think I have three main concerns right now:
|
@CosmicHorrorDev Also, I know you said you were planning to work on this, but it's been a few months. And I was able to get an end-to-end |
I'm just happy that it's being created in the first place :) I worked on it originally for about a week, and then set it down with the promise that I would get back to it sometime (which evidently never happened 😅) |
@BurntSushi I think regex-lite can be used in unicode-linebreak's build.rs. It uses regex to match against utf-8 string so it should be able to switch to regex-lite. Although a better solution might be to generate the tables.rs and upload it as part of the crate. |
Yes, the UCD data files are generally all ASCII. So you should be able to get away with using non-Unicode regexes to parse them. And yeah, personally, I'd suggest generating and committing Unicode table source files. That's what |
@NobodyXu Note for example that your regex uses |
@BurntSushi Thanks for the advice, I've opened an issue there to discuss with the maintainer on which solution they would like to pick. |
Those compile time and binary size improvements look pretty nice! But I think it would be good to track compile times / binary sizes for crates that include both |
@nicoburns can you say what I would do with that information? Like how would it be actionable? |
I usually close tickets on a commit-by-commit basis, but this refactor was so big that it wasn't feasible to do that. So ticket closures are marked here. Closes #244 Closes #259 Closes #476 Closes #644 Closes #675 Closes #824 Closes #961 Closes #68 Closes #510 Closes #787 Closes #891 Closes #429 Closes #517 Closes #579 Closes #779 Closes #850 Closes #921 Closes #976 Closes #1002 Closes #656
I usually close tickets on a commit-by-commit basis, but this refactor was so big that it wasn't feasible to do that. So ticket closures are marked here. Closes #244 Closes #259 Closes #476 Closes #644 Closes #675 Closes #824 Closes #961 Closes #68 Closes #510 Closes #787 Closes #891 Closes #429 Closes #517 Closes #579 Closes #779 Closes #850 Closes #921 Closes #976 Closes #1002 Closes #656
This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.8.4` -> `1.9.1` | --- ### Release Notes <details> <summary>rust-lang/regex (regex)</summary> ### [`v1.9.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#191-2023-07-07) [Compare Source](rust-lang/regex@1.9.0...1.9.1) \================== This is a patch release which fixes a memory usage regression. In the regex 1.9 release, one of the internal engines used a more aggressive allocation strategy than what was done previously. This patch release reverts to the prior on-demand strategy. Bug fixes: - [BUG #​1027](rust-lang/regex#1027): Change the allocation strategy for the backtracker to be less aggressive. ### [`v1.9.0`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#190-2023-07-05) [Compare Source](rust-lang/regex@1.8.4...1.9.0) \================== This release marks the end of a [years long rewrite of the regex crate internals](rust-lang/regex#656). Since this is such a big release, please report any issues or regressions you find. We would also love to hear about improvements as well. In addition to many internal improvements that should hopefully result in "my regex searches are faster," there have also been a few API additions: - A new `Captures::extract` method for quickly accessing the substrings that match each capture group in a regex. - A new inline flag, `R`, which enables CRLF mode. This makes `.` match any Unicode scalar value except for `\r` and `\n`, and also makes `(?m:^)` and `(?m:$)` match after and before both `\r` and `\n`, respectively, but never between a `\r` and `\n`. - `RegexBuilder::line_terminator` was added to further customize the line terminator used by `(?m:^)` and `(?m:$)` to be any arbitrary byte. - The `std` Cargo feature is now actually optional. That is, the `regex` crate can be used without the standard library. - Because `regex 1.9` may make binary size and compile times even worse, a new experimental crate called `regex-lite` has been published. It prioritizes binary size and compile times over functionality (like Unicode) and performance. It shares no code with the `regex` crate. New features: - [FEATURE #​244](rust-lang/regex#244): One can opt into CRLF mode via the `R` flag. e.g., `(?mR:$)` matches just before `\r\n`. - [FEATURE #​259](rust-lang/regex#259): Multi-pattern searches with offsets can be done with `regex-automata 0.3`. - [FEATURE #​476](rust-lang/regex#476): `std` is now an optional feature. `regex` may be used with only `alloc`. - [FEATURE #​644](rust-lang/regex#644): `RegexBuilder::line_terminator` configures how `(?m:^)` and `(?m:$)` behave. - [FEATURE #​675](rust-lang/regex#675): Anchored search APIs are now available in `regex-automata 0.3`. - [FEATURE #​824](rust-lang/regex#824): Add new `Captures::extract` method for easier capture group access. - [FEATURE #​961](rust-lang/regex#961): Add `regex-lite` crate with smaller binary sizes and faster compile times. - [FEATURE #​1022](rust-lang/regex#1022): Add `TryFrom` implementations for the `Regex` type. Performance improvements: - [PERF #​68](rust-lang/regex#68): Added a one-pass DFA engine for faster capture group matching. - [PERF #​510](rust-lang/regex#510): Inner literals are now used to accelerate searches, e.g., `\w+@​\w+` will scan for `@`. - [PERF #​787](rust-lang/regex#787), [PERF #​891](rust-lang/regex#891): Makes literal optimizations apply to regexes of the form `\b(foo|bar|quux)\b`. (There are many more performance improvements as well, but not all of them have specific issues devoted to them.) Bug fixes: - [BUG #​429](rust-lang/regex#429): Fix matching bugs related to `\B` and inconsistencies across internal engines. - [BUG #​517](rust-lang/regex#517): Fix matching bug with capture groups. - [BUG #​579](rust-lang/regex#579): Fix matching bug with word boundaries. - [BUG #​779](rust-lang/regex#779): Fix bug where some regexes like `(re)+` were not equivalent to `(re)(re)*`. - [BUG #​850](rust-lang/regex#850): Fix matching bug inconsistency between NFA and DFA engines. - [BUG #​921](rust-lang/regex#921): Fix matching bug where literal extraction got confused by `$`. - [BUG #​976](rust-lang/regex#976): Add documentation to replacement routines about dealing with fallibility. - [BUG #​1002](rust-lang/regex#1002): Use corpus rejection in fuzz testing. </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNi4wLjAiLCJ1cGRhdGVkSW5WZXIiOiIzNi44LjExIiwidGFyZ2V0QnJhbmNoIjoiZGV2ZWxvcCJ9--> Co-authored-by: cabr2-bot <cabr2.help@gmail.com> Co-authored-by: crapStone <crapstone01@gmail.com> Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1957 Reviewed-by: crapStone <crapstone01@gmail.com> Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org> Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Just giving a more convenient place to iron out the details behind a potential
regex-lite
crate that aims to be light on compile time and binary sizeI'll let @BurntSushi start things off since they're more qualified and have likely given this a great deal more thought
The text was updated successfully, but these errors were encountered: