Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use rustc_layout_scalar_valid_range_end(usize::MAX - 1) for the index #72

Closed
jyn514 opened this issue Jun 14, 2020 · 1 comment
Closed

Comments

@jyn514
Copy link

jyn514 commented Jun 14, 2020

From rust-lang/rust#73139 (comment):

The index is always less than the length. So even if the length is usize::MAX, the index will be at most MAX - 1 and so cannot overflow.

It would be great to tell rustc this is the case by using rustc_layout_scalar_valid_range_end(usize::MAX - 1). That would allow storing Option<index> in usize instead of needing an extra bit, which in some cases could double the size of the struct due to alignment requirements.

Failing that (since rustc_layout_scalar_valid_range_end is unstable and likely will never be stabilized), would it be possible to document that the index is always less than usize::MAX?

@BurntSushi
Copy link
Owner

I guess I'm not opposed to documenting this, although it kind of seems self evident to me? memchr returns the position of a particular byte in a slice, if it exists. It it doesn't exist, then None is returned. Since the largest possible size of a slice is usize::MAX and since memchr can never possibly return an index greater than or equal to the length of the given slice (since it would be an invalid index), it follows immediately that the value returned is guaranteed to be less than usize::MAX.

BurntSushi added a commit that referenced this issue Apr 30, 2021
This commit primarily adds vectorized substring search routines in
a new memmem sub-module. They were originally taken from bstr, but
heavily modified to incorporate a variant of the "generic SIMD"
algorithm[1]. The main highlights:

* We guarantee `O(m + n)` time complexity and constant space
  complexity.
* Two-Way is the primary implementation that can handle all cases.
* Vectorized variants handle a number of common cases.
* Vectorized code uses a heuristic informed by a frequency background
  distribution of bytes, originally devised inside the regex crate.
  This makes it more likely that searching will spend more time in the
  fast vector loops.

While adding memmem to this crate is perhaps a bit of a scope increase,
I think it fits well. It also puts a core primitive, substring
search, very low in the dependency DAG and therefore making it widely
available. For example, it is intended to use these new routines in the
regex, aho-corasick and bstr crates.

This commit does a number of other things, mainly as a result of
convenience. It drastically improves test coverage for substring search
(as compared to what bstr had), completely overhauls the benchmark suite
to make it more comprehensive and adds `cargo fuzz` support for all API
items in the crate.

Closes #58, Closes #72

[1] - http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd
BurntSushi added a commit that referenced this issue Apr 30, 2021
This commit primarily adds vectorized substring search routines in
a new memmem sub-module. They were originally taken from bstr, but
heavily modified to incorporate a variant of the "generic SIMD"
algorithm[1]. The main highlights:

* We guarantee `O(m + n)` time complexity and constant space
  complexity.
* Two-Way is the primary implementation that can handle all cases.
* Vectorized variants handle a number of common cases.
* Vectorized code uses a heuristic informed by a frequency background
  distribution of bytes, originally devised inside the regex crate.
  This makes it more likely that searching will spend more time in the
  fast vector loops.

While adding memmem to this crate is perhaps a bit of a scope increase,
I think it fits well. It also puts a core primitive, substring
search, very low in the dependency DAG and therefore making it widely
available. For example, it is intended to use these new routines in the
regex, aho-corasick and bstr crates.

This commit does a number of other things, mainly as a result of
convenience. It drastically improves test coverage for substring search
(as compared to what bstr had), completely overhauls the benchmark suite
to make it more comprehensive and adds `cargo fuzz` support for all API
items in the crate.

Closes #58, Closes #72

[1] - http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd
BurntSushi added a commit that referenced this issue Apr 30, 2021
This commit primarily adds vectorized substring search routines in
a new memmem sub-module. They were originally taken from bstr, but
heavily modified to incorporate a variant of the "generic SIMD"
algorithm[1]. The main highlights:

* We guarantee `O(m + n)` time complexity and constant space
  complexity.
* Two-Way is the primary implementation that can handle all cases.
* Vectorized variants handle a number of common cases.
* Vectorized code uses a heuristic informed by a frequency background
  distribution of bytes, originally devised inside the regex crate.
  This makes it more likely that searching will spend more time in the
  fast vector loops.

While adding memmem to this crate is perhaps a bit of a scope increase,
I think it fits well. It also puts a core primitive, substring
search, very low in the dependency DAG and therefore making it widely
available. For example, it is intended to use these new routines in the
regex, aho-corasick and bstr crates.

This commit does a number of other things, mainly as a result of
convenience. It drastically improves test coverage for substring search
(as compared to what bstr had), completely overhauls the benchmark suite
to make it more comprehensive and adds `cargo fuzz` support for all API
items in the crate.

Closes #58, Closes #72

[1] - http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd
BurntSushi added a commit that referenced this issue Apr 30, 2021
This commit primarily adds vectorized substring search routines in
a new memmem sub-module. They were originally taken from bstr, but
heavily modified to incorporate a variant of the "generic SIMD"
algorithm[1]. The main highlights:

* We guarantee `O(m + n)` time complexity and constant space
  complexity.
* Two-Way is the primary implementation that can handle all cases.
* Vectorized variants handle a number of common cases.
* Vectorized code uses a heuristic informed by a frequency background
  distribution of bytes, originally devised inside the regex crate.
  This makes it more likely that searching will spend more time in the
  fast vector loops.

While adding memmem to this crate is perhaps a bit of a scope increase,
I think it fits well. It also puts a core primitive, substring
search, very low in the dependency DAG and therefore making it widely
available. For example, it is intended to use these new routines in the
regex, aho-corasick and bstr crates.

This commit does a number of other things, mainly as a result of
convenience. It drastically improves test coverage for substring search
(as compared to what bstr had), completely overhauls the benchmark suite
to make it more comprehensive and adds `cargo fuzz` support for all API
items in the crate.

Closes #58, Closes #72

[1] - http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants