
Finding matches with capturing subgroups anchored at beginning depends on length of haystack? #215

Closed
birkenfeld opened this issue Apr 26, 2016 · 8 comments


@birkenfeld

Please refer to the following gist for the test code: https://gist.github.com/birkenfeld/ffc2f0da8b93da33f9fbd5a08a8790f3

In short, I'm experimenting with porting Pygments to Rust. The main task is implementing a scanner/lexer using a list of regexes for each state of the scanner. The scanner is an Iterator; on each iteration, it goes through the regex list in order and tries to match each regex against the remaining text (all regexes are anchored at the beginning with \A; in Python, re.match() is used). When a match is found, the remaining string is updated, and on the next iteration the search begins again at the start of the list. The scanner uses captures() instead of find() because I want to use subgroups to assign different token types within one match, without having to introduce extra states for that.

Now, I've noticed that making the haystack longer, e.g. by appending copies of the test input, makes the matching slower and slower. You can see that in the benchmark output in the gist, where the timing is for scanning through the same string every time, but within a slice that is 1x, 2x, etc. the size of the base string. I.e. the timings should stay constant (and they do when using .find()).

It seems to me that when the patterns are anchored at the start of the haystack, and are structured like the ones in the example (i.e. no pattern has to scan through to the end of the haystack), the time to match should not depend on the haystack length, yet it does. Is this a problem of optimization (e.g. looking in the whole haystack for literals without applying the special constraints implied by the \A assertion)?

Sorry for the long issue and code, hope it's understandable.

@BurntSushi
Member

BurntSushi commented Apr 26, 2016

OK. This is an awesome (and valid) bug report. Thank you.

I'll start off by baking your noodle. I take it you decided to stop benchmarking at 20x, because clearly, a pattern had formed! But, alas, you were deceived! If you continued, you'd see this:

test bench_001x ... bench:     445,333 ns/iter (+/- 1,398)
test bench_002x ... bench:     563,531 ns/iter (+/- 5,479)
test bench_003x ... bench:     670,828 ns/iter (+/- 3,312)
test bench_005x ... bench:     901,192 ns/iter (+/- 2,839)
test bench_010x ... bench:   1,566,748 ns/iter (+/- 4,172)
test bench_020x ... bench:   3,538,074 ns/iter (+/- 5,703)
test bench_100x ... bench:     669,923 ns/iter (+/- 708)
test bench_200x ... bench:     671,756 ns/iter (+/- 8,083)
test bench_300x ... bench:     672,581 ns/iter (+/- 23,761)

Note the last three. Wat. :-)

So it turns out that there are two distinct matching engines in this crate that can compute capture locations. One of them is based on the Thompson NFA construction (the "Pike VM") and the other uses backtracking. It turns out that backtracking is actually pretty fast---when it doesn't go exponential. The problem is, we really want to guarantee matching in linear time, so we actually use a bounded backtracking approach, where each state for each position in the input is visited at most once.

The problem with this approach is that it takes some serious memory to keep track of where we've been, so we can only use the backtracking engine on small inputs. As the input grows, it will eventually reach a point where the backtracker says, "nope, too big!" and the Pike VM kicks in and handles it.
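As a rough model of why the bounded backtracker is input-size-sensitive (this is an illustration, not the crate's actual data structure or constants): it marks each (instruction, input position) pair as visited at most once, so its bookkeeping table scales with both program size and input length, and all of it must be zeroed before a search starts.

```rust
// Rough model: one visited slot per (instruction, input position) pair,
// so bookkeeping is O(num_instructions * input_length).
fn visited_slots(num_insts: usize, input_len: usize) -> usize {
    num_insts * (input_len + 1)
}

fn main() {
    // Doubling the input doubles the table that must be zeroed up front,
    // even for an anchored pattern that never looks past its match.
    let small = visited_slots(100, 1_000);
    let large = visited_slots(100, 2_001);
    assert_eq!(large, 2 * small);
}
```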

OK, but, but, the pattern is anchored! Shouldn't the backtracker quit once it knows it can't find a match? Indeed it should, and it does. The issue here is that the backtracker must spend time initializing (i.e., zeroing) its state table before searching begins. This state table has size proportional to the input, which means search times will vary as the input grows, even if the search itself doesn't examine any more input.

So I think the actual issue here is that the heuristic for picking the backtracking engine is, in this case, simply wrong. There will always be cases where it's wrong, because picking the size limit is a bit of a black art. In any case, it seems like it's probably overextending itself, so I opened #216. With that change, the benchmarks now look like this:

test bench_001x ... bench:     432,289 ns/iter (+/- 934)
test bench_002x ... bench:     550,904 ns/iter (+/- 3,360)
test bench_003x ... bench:     659,164 ns/iter (+/- 2,237)
test bench_005x ... bench:     889,168 ns/iter (+/- 2,246)
test bench_010x ... bench:   1,556,030 ns/iter (+/- 13,166)
test bench_020x ... bench:     908,140 ns/iter (+/- 2,301)
test bench_100x ... bench:     682,278 ns/iter (+/- 832)
test bench_200x ... bench:     683,504 ns/iter (+/- 2,103)
test bench_300x ... bench:     683,351 ns/iter (+/- 1,068)

Still not quite perfect, but better. At least, you would have noticed the plateau in this case I think. :-)

@BurntSushi
Member

BurntSushi commented Apr 26, 2016

Relatedly, might I interest you in RegexSet? It will match multiple regexes at the same time in a way that is a bit more ergonomic (and faster, because it's the DFA all the way) than using an alternation of capture groups.

I will note that #186 outlines a performance bug with RegexSet, which unfortunately looks applicable to your use case. :-(

@birkenfeld
Author

Awesome, thanks for the quick explanation and the fix! The timings do indeed look better if you go higher, and more so after the fix. This is how it looks for me afterwards:

running 8 tests
test bench_001x ... bench:     294,531 ns/iter (+/- 13,629)
test bench_002x ... bench:     401,605 ns/iter (+/- 32,247)
test bench_003x ... bench:     523,468 ns/iter (+/- 27,907)
test bench_005x ... bench:     738,912 ns/iter (+/- 32,982)
test bench_010x ... bench:   1,292,472 ns/iter (+/- 32,124)
test bench_020x ... bench:     785,822 ns/iter (+/- 29,604)
test bench_100x ... bench:     540,670 ns/iter (+/- 32,480)
test bench_200x ... bench:     540,817 ns/iter (+/- 34,137)

However, thinking about the length problem, I came up with another trick. It gives me:

running 8 tests
test bench_001x ... bench:     275,929 ns/iter (+/- 15,897)
test bench_002x ... bench:     275,433 ns/iter (+/- 11,595)
test bench_003x ... bench:     275,596 ns/iter (+/- 10,879)
test bench_005x ... bench:     276,505 ns/iter (+/- 12,485)
test bench_010x ... bench:     275,981 ns/iter (+/- 15,354)
test bench_020x ... bench:     275,364 ns/iter (+/- 8,838)
test bench_100x ... bench:     276,624 ns/iter (+/- 14,581)
test bench_200x ... bench:     276,974 ns/iter (+/- 19,627)

As you have probably guessed, it involves the DFA. I'm doing this:

// Locate the match first (DFA-fast), then run captures() only on
// the matched prefix instead of the whole remaining input.
if let Some((_, idx)) = rx.find(self.0) {
    if let Some(cap) = rx.captures(&self.0[..idx]) {
        ...
    }
}

Even for very small inputs, where the overhead of the extra find() makes matching slower overall, the slowdown is minimal for my use case (since most inputs start at a fairly large size, the speedup will dominate). Of course, I'll have to watch my regexes to see whether they all allow DFA matching (is there an easy way to ensure that?).

Also, would it make sense to use this strategy directly in regex, with some preconditions (minimum input length, DFA available, ...)?

@birkenfeld
Author

Ah, and yes, I did try RegexSet before, but it was pretty slow, so I guess I did hit #186.

@BurntSushi
Member

That's strange... That's exactly what captures should be doing in the first place. In fact, your code translates to:

  1. DFA scan to find start/end (your explicit call to find).
  2. DFA scan to find start/end (done again internally by captures).
  3. NFA scan to fill in the capture locations.

Looking at the code, I think I know the problem. When (3) runs, it doesn't actually limit the text searched in the call to the NFA. I think this line should have &text[..e] instead of text in the call to captures_nfa.

@BurntSushi
Member

That's sad to hear about RegexSet. I hope I can figure out a solution!

@birkenfeld
Author

> I think this line should have &text[..e] instead of text in the call to captures_nfa.

That's great! An easy fix with large gains.

@BurntSushi
Member

Indeed!

Before:

test bench_001x ... bench:     614,417 ns/iter (+/- 2,192)
test bench_002x ... bench:     552,771 ns/iter (+/- 2,860)
test bench_003x ... bench:     659,380 ns/iter (+/- 2,894)
test bench_005x ... bench:     891,484 ns/iter (+/- 1,319)
test bench_010x ... bench:   1,553,615 ns/iter (+/- 6,965)
test bench_020x ... bench:     912,122 ns/iter (+/- 2,181)
test bench_100x ... bench:     678,077 ns/iter (+/- 1,564)
test bench_200x ... bench:     678,091 ns/iter (+/- 1,352)
test bench_300x ... bench:     677,813 ns/iter (+/- 1,244)

After:

test bench_001x ... bench:     386,715 ns/iter (+/- 2,780)
test bench_002x ... bench:     386,620 ns/iter (+/- 1,339)
test bench_003x ... bench:     386,450 ns/iter (+/- 1,335)
test bench_005x ... bench:     386,718 ns/iter (+/- 1,761)
test bench_010x ... bench:     386,515 ns/iter (+/- 2,116)
test bench_020x ... bench:     386,390 ns/iter (+/- 1,537)
test bench_100x ... bench:     386,624 ns/iter (+/- 1,153)
test bench_200x ... bench:     386,017 ns/iter (+/- 540)
test bench_300x ... bench:     386,157 ns/iter (+/- 836)

Yay! PR incoming...
