Commit
api: add memmem implementation, initially from bstr
This commit primarily adds vectorized substring search routines in a new memmem sub-module. They were originally taken from bstr, but heavily modified to incorporate a variant of the "generic SIMD" algorithm[1]. The main highlights:

* We guarantee `O(m + n)` time complexity and constant space complexity.
* Two-Way is the primary implementation that can handle all cases.
* Vectorized variants handle a number of common cases.
* Vectorized code uses a heuristic informed by a frequency background distribution of bytes, originally devised inside the regex crate. This makes it more likely that searching will spend more time in the fast vector loops.

While adding memmem to this crate is perhaps a bit of a scope increase, I think it fits well. It also puts a core primitive, substring search, very low in the dependency DAG, thereby making it widely available. For example, the intent is to use these new routines in the regex, aho-corasick and bstr crates.

This commit does a number of other things, mainly as a result of convenience. It drastically improves test coverage for substring search (as compared to what bstr had), completely overhauls the benchmark suite to make it more comprehensive and adds `cargo fuzz` support for all API items in the crate.

[1] - http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd
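
The frequency heuristic mentioned in the highlights can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code added by this commit: `rank`, `rarest_offset` and `find` are hypothetical names, the rank values are made up, a plain scalar scan stands in for the vectorized byte search, and the naive verification step does not provide the `O(m + n)` guarantee that Two-Way gives.

```rust
/// Toy background frequency rank: a higher value means the byte is assumed
/// to be more common. These values are invented for illustration only.
fn rank(b: u8) -> u8 {
    match b {
        b' ' => 255,
        b'e' | b't' | b'a' | b'o' | b'i' | b'n' => 200,
        b'a'..=b'z' => 150,
        b'A'..=b'Z' | b'0'..=b'9' => 100,
        _ => 50,
    }
}

/// Offset within `needle` of the byte assumed to be rarest (first on ties).
/// Returns `None` for an empty needle.
fn rarest_offset(needle: &[u8]) -> Option<usize> {
    needle
        .iter()
        .enumerate()
        .min_by_key(|&(_, &b)| rank(b))
        .map(|(i, _)| i)
}

/// Substring search that scans for the assumed-rare byte and only verifies
/// the full needle at positions where that byte occurs.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    let off = match rarest_offset(needle) {
        Some(off) => off,
        None => return Some(0), // an empty needle matches at offset 0
    };
    if needle.len() > haystack.len() {
        return None;
    }
    let rare = needle[off];
    // Last haystack index at which the rare byte of a match could sit.
    let end = haystack.len() - needle.len() + off;
    let mut i = off;
    while i <= end {
        // Scalar stand-in for a fast single-byte search.
        let p = haystack[i..=end].iter().position(|&b| b == rare)?;
        let start = i + p - off;
        if &haystack[start..start + needle.len()] == needle {
            return Some(start);
        }
        i += p + 1;
    }
    None
}
```

The rarer the chosen byte actually is in the haystack, the fewer candidate positions need verification, which is why the choice of byte matters for how much time is spent in the fast scanning loop.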
1 parent 78cc45d, commit 7865405
Showing 84 changed files with 141,308 additions and 793 deletions.
@@ -0,0 +1,12 @@
+These were downloaded and derived from the Open Subtitles data set:
+https://opus.nlpl.eu/OpenSubtitles-v2018.php
+
+The specific way in which they were modified has been lost to time, but it's
+likely they were just a simple truncation based on target file sizes for
+various benchmarks.
+
+The main reason why we have them is that it gives us a way to test similar
+inputs on non-ASCII text. Normally this wouldn't matter for a substring search
+implementation, but because of the heuristics used to pick a priori determined
+"rare bytes" to base a prefilter on, it's possible for this heuristic to do
+more poorly on non-ASCII text than one might expect.
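
To make the concern about non-ASCII text concrete, here is a minimal sketch that assumes a prefilter which stops at every occurrence of one chosen byte. `candidate_count` is a hypothetical helper, not part of the crate. In UTF-8, Cyrillic letters in the range U+0400 to U+047F start with lead byte 0xD0 or 0xD1, so those bytes occur constantly in Russian text. Whether any particular fixed frequency table ranks such a byte as rare is an assumption here; the point is only that a ranking fixed ahead of time cannot adapt to the haystack.

```rust
/// Number of positions at which a single-byte prefilter keyed on `byte`
/// would stop and hand off to full needle verification.
fn candidate_count(haystack: &[u8], byte: u8) -> usize {
    haystack.iter().filter(|&&b| b == byte).count()
}

/// Small Russian sample ("hello world"); every letter is two bytes in UTF-8.
fn sample() -> &'static [u8] {
    "привет мир".as_bytes()
}
```

On this 19-byte sample, keying the prefilter on the lead byte 0xD0 yields six candidate positions, while keying on the continuation byte 0xBF (the second byte of "п") yields one, so the choice of byte changes how often the fast loop is interrupted.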