-
Notifications
You must be signed in to change notification settings - Fork 12.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
str::contains
is very slow for substring checks
#25483
Comments
StrSearcher has a lot of panic branches that can probably avoided (slicing and unwrapping that will never happen). |
I wonder wether LLVM can move the comparison out of the loop and if not, why? |
I think this is fundamentally algorithmic problem, since |
Oh wow, I thought that was something that we had already fixed. |
The algorithmic problem was fixed, but this fix got disabled in the switch to the pattern API. The two-way search code is still present as dead code. |
For comparison, I just did a straight translation of a C implementation of BMH. Results for the test in the SO question were:
Where |
triage: I-nominated |
@DanielKeep Can it be adapted to the pattern API? |
@bluss I don't see why not; the signature is just (Not that my BMH implementation is likely to be an improvement: it was literally just copy, pasted, and translated from the C reference code. Hell, it uses... indexing. I was in a hurry. :P) |
Yeah, after implementing the pattern API I haven't found the time to reintegrate the old fast searcher code yet. Part of the issue why I haven't done it directly is that the fast string searcher only supports single ended search, while with the Pattern API you'd want double ended search. One way to resolve this is to just hack the naive and the fast search together into a hybrid that uses the fast search for forward iteration (which is all that happened before anyway), and the naive for backwards iteration, but due to me not fully understanding the fast algorithm I was reluctant to do that as part of the original patch. Wouldn't harm to just do this now and see if any bugs appear though... |
triage: P-medium |
This is allegedly a very interesting improvement upon our two-way search algorithm. http://drops.dagstuhl.de/opus/volltexte/2011/3355/ |
To improve our substring search performance, revive the two way searcher and adapt it to the Pattern API. Fixes rust-lang#25483, a performance bug: that particular case now completes faster in optimized rust than in ruby (but they share the same order of magnitude). Much thanks to @gereeter who helped me understand the reverse case better and wrote the comment explaining `next_back` in the code. I had quickcheck to fuzz test forward and reverse searching thoroughly. The two way searcher implements both forward and reverse search, but not double ended search. The forward and reverse parts of the two way searcher are completely independent. The two way searcher algorithm has very small, constant space overhead, requiring no dynamic allocation. Our implementation is relatively fast, especially due to the `byteset` addition to the algorithm, which speeds up many no-match cases. A bad case for the two way algorithm is: ``` let haystack = (0..10_000).map(|_| "dac").collect::<String>(); let needle = (0..100).map(|_| "bac").collect::<String>()); ``` For this particular case, two way is not much faster than the naive implementation it replaces.
Update substring search to use the Two Way algorithm To improve our substring search performance, revive the two way searcher and adapt it to the Pattern API. Fixes #25483, a performance bug: that particular case now completes faster in optimized rust than in ruby (but they share the same order of magnitude). Many thanks to @gereeter who helped me understand the reverse case better and wrote the comment explaining `next_back` in the code. I had quickcheck to fuzz test forward and reverse searching thoroughly. The two way searcher implements both forward and reverse search, but not double ended search. The forward and reverse parts of the two way searcher are completely independent. The two way searcher algorithm has very small, constant space overhead, requiring no dynamic allocation. Our implementation is relatively fast, especially due to the `byteset` addition to the algorithm, which speeds up many no-match cases. A bad case for the two way algorithm is: ``` let haystack = (0..10_000).map(|_| "dac").collect::<String>(); let needle = (0..100).map(|_| "bac").collect::<String>()); ``` For this particular case, two way is not much faster than the naive implementation it replaces.
Originally found in SO question.
This Rust code
Executes for up to three minutes on my machine (even with
-C opt-level=3
), while the equivalent Ruby or Python code:finishes in several seconds. Java variant finishes almost instantly (but that seems to be due to JIT optimizations).
It looks like that pattern searching for substrings could be severely optimized.
The text was updated successfully, but these errors were encountered: