Finding matches with capturing subgroups anchored at beginning depends on length of haystack? #215
OK. This is an awesome (and valid) bug report. Thank you. I'll start off by baking your noodle. I take it you decided to stop benchmarking at 20x, because clearly, a pattern had formed! But, alas, you were deceived! If you continued, you'd see this:
Note the last three. Wat. :-)

So it turns out that there are two distinct matching engines in this crate that can compute capture locations. One of them is based on the Thompson NFA construction (the "Pike VM") and the other uses backtracking. It turns out that backtracking is actually pretty fast---when it doesn't go exponential. The problem is, we really want to guarantee matching in linear time, so we actually use a bounded backtracking approach, where the same state at the same position in the input is visited at most once. The problem with this approach is that it takes some serious memory to keep track of where we've been, so we can only use the backtracking engine on small inputs. As the input grows, it eventually reaches a point where the backtracker says, "nope, too big!" and the Pike VM kicks in and handles it.

OK, but, but, the pattern is anchored! Shouldn't the backtracker quit once it knows it can't find a match? Indeed it should, and it does. The issue here is that the backtracker must spend time initializing (i.e., zeroing) its state table before searching begins. This state table has size proportional to the input, which means search times will vary as the input grows, even if the search itself doesn't examine any more input.

So I think the actual issue here is that the heuristic for picking the backtracking engine is, in this case, simply wrong. There will always be cases where it's wrong, because picking the size limit is a bit of a black art. In any case, it seems like the backtracker is probably overextending itself, so I opened #216. With that change, the benchmarks now look like this:
Still not quite perfect, but better. At least, you would have noticed the plateau in this case, I think. :-)
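To make the setup cost described above concrete, here is a minimal sketch of bounded backtracking, assuming a visited table with one entry per (NFA state, haystack position) pair. This is not the regex crate's actual code; it only illustrates where the length-dependent cost comes from.

```rust
/// Minimal sketch (assumed structure, not the crate's implementation):
/// a bounded backtracker tracks which (state, position) pairs it has
/// already explored so each pair is visited at most once, which keeps
/// the search linear in the input size.
struct BoundedBacktracker {
    num_states: usize,
    visited: Vec<bool>,
}

impl BoundedBacktracker {
    fn search(&mut self, haystack: &[u8]) {
        // The table must be (re)initialized before every search. Its size
        // is proportional to the haystack length, so even an anchored
        // search that gives up after a few bytes pays a setup cost that
        // scales with the input.
        self.visited.clear();
        self.visited
            .resize(self.num_states * (haystack.len() + 1), false);
        // ... the actual bounded backtracking search would go here,
        // marking `visited[state * (haystack.len() + 1) + pos]` so each
        // pair is explored at most once ...
    }
}

fn main() {
    let mut bt = BoundedBacktracker { num_states: 8, visited: Vec::new() };
    bt.search(b"some haystack");
}
```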
Relatedly, might I interest you in `RegexSet`? I will note that #186 outlines a performance bug with it.
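A sketch of how that suggestion might slot into a scanner that otherwise loops over a list of anchored regexes (assuming the elided suggestion above is `RegexSet`; `next_token` and the toy rules are made up for illustration):

```rust
use regex::{Regex, RegexSet};

/// Hypothetical helper: ask the set which anchored rules match at the
/// current position in one pass, then run only the winning individual
/// `Regex` to get its capture groups (a `RegexSet` alone doesn't report
/// captures or positions).
fn next_token<'t>(
    set: &RegexSet,
    rules: &[Regex],
    rest: &'t str,
) -> Option<regex::Captures<'t>> {
    // Indices are yielded in ascending order, and the scanner wants the
    // first rule in the list that matches, so the first index wins.
    let idx = set.matches(rest).iter().next()?;
    rules[idx].captures(rest)
}

fn main() {
    let patterns = [r"\A\s+", r"\A[a-z]+"]; // toy rules for the sketch
    let set = RegexSet::new(&patterns).unwrap();
    let rules: Vec<Regex> = patterns.iter().map(|p| Regex::new(p).unwrap()).collect();
    if let Some(caps) = next_token(&set, &rules, "hello world") {
        println!("matched: {:?}", &caps[0]);
    }
}
```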
Awesome, thanks for the quick explanation and the fix. The timings do look better if you go higher, and more so after the fix. This is how it looks for me afterwards:
However, thinking of the length problem, I thought of another trick. It gives me:
As you have probably guessed, it involves the DFA. I'm doing this:
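(The code block from the gist is not reproduced here; based on the discussion that follows, the trick is presumably along these lines. This is a sketch using the current `regex` crate API, not the gist's actual code.)

```rust
use regex::{Captures, Regex};

/// Sketch of the described trick: let the DFA-backed `find()` locate the
/// match first, then run `captures()` on only the matched slice, so the
/// capture engine never has to set up state for the full haystack.
fn captures_via_find<'t>(re: &Regex, text: &'t str) -> Option<Captures<'t>> {
    let m = re.find(text)?; // fast DFA search for the match bounds
    // Note: group positions in the returned `Captures` are relative to the
    // slice, i.e. offset by `m.start()` (zero here, since the patterns in
    // this issue are anchored with `\A`).
    re.captures(&text[m.start()..m.end()])
}

fn main() {
    let re = Regex::new(r"\A(foo)(bar)").unwrap();
    let caps = captures_via_find(&re, "foobar plus lots of trailing text").unwrap();
    assert_eq!(&caps[2], "bar");
}
```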
Even for very small inputs, where the overhead of the extra `find()` call would be most noticeable, this still seems to be a win. Also, would it make sense to use this strategy directly in regex, with some preconditions (minimum input length, DFA available, ...)?
Ah, and yes, I did try `RegexSet`.
That's strange... That's exactly what `captures` already does internally: (1) run the DFA forward to find the end of the match, (2) run the DFA in reverse to find the start of the match, (3) run the NFA over just that span to resolve the capture locations.
Looking at the code, I think I know the problem. When (3) runs, it doesn't actually limit the text handed to the NFA to the match span found by the DFA, so the backtracker still sizes (and zeroes) its state table for the whole input.
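A sketch of that flow, with invented helper names standing in for the crate's internal engines, just to make step (3) and the shape of the fix concrete:

```rust
/// Illustration only: `dfa_forward`, `dfa_reverse`, and `nfa_captures` are
/// invented stand-ins for the crate's internal engines.
fn captures_flow(text: &str) {
    let end = dfa_forward(text);           // (1) find where the match ends
    let start = dfa_reverse(&text[..end]); // (2) find where the match starts
    // (3) resolve capture positions with the NFA. Handing it all of `text`
    // makes the bounded backtracker size (and zero) its table for the whole
    // haystack; the fix is to hand it only the match span.
    nfa_captures(&text[start..end]);
}

// Dummy bodies so the sketch compiles; the real logic lives inside the
// regex crate.
fn dfa_forward(text: &str) -> usize { text.len() }
fn dfa_reverse(_text: &str) -> usize { 0 }
fn nfa_captures(_span: &str) {}

fn main() {
    captures_flow("foobar and the rest of a long haystack");
}
```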
That's sad to hear about `RegexSet`.
That's great! An easy fix with large gains.
Indeed! Before:
After:
Yay! PR incoming...
Please refer to the following gist for the test code: https://gist.github.com/birkenfeld/ffc2f0da8b93da33f9fbd5a08a8790f3
In short, I'm experimenting with porting Pygments to Rust. The main task is implementing a scanner/lexer using lists of regexes for each state of the scanner. The scanner is an `Iterator`, and on each iteration it goes through the regex list in order and tries to match every regex against the remaining text (all regexes are anchored at the beginning with `\A`; in Python, `re.match()` is used). When a match is found, the remaining string is updated, and the search begins at the beginning of the list on the next iteration. The scanner uses `captures()` instead of `find()` because I want to use subgroups to assign different token types within one match, without having to use scanner states for that.

Now, I've noticed that making the haystack longer, e.g. by appending copies of the test input, makes the matching slower and slower. You can see that in the benchmark output in the gist, where the timing is for scanning through the same string every time, but within a slice that is 1x, 2x, etc. of this base string. I.e., the timings should stay constant (and they do when using `.find()`).

It seems strange to me that the time to match depends on the haystack length when the patterns are anchored at the start of the haystack and are structured like the ones in the example (i.e., there are no patterns that have to scan through to the end of the haystack). Is this a problem of optimization (e.g., looking in the whole haystack for literals without applying the special constraints due to the `\A` assertion)?

Sorry for the long issue and code, hope it's understandable.