Optimize is_ascii for str and [u8]. #74066
Conversation
r? @shepmaster (rust_highfive has picked a reviewer for you, use r? to override)
That's over 400 GB/s, are you sure that's measuring the correct thing?
Good point!
You should have a look at `slice::align_to`.
@ecstatic-morse: The algorithm I'm using is different (and more efficient on most machines) than what I'd get with `slice::align_to`. @jonas-schievink: Not sure I know how to debug this really -- all the timings from that module seem dodgy. Well over 30 GB/s for most of them, and those also contain …
Simple code is good code, and people running ARM may also want to check whether their strings are ASCII-only. If your more complex version is meaningfully more efficient, then fine, but I suspect it's not.
For example:

```rust
fn is_ascii_slice_align_to(bytes: &[u8]) -> bool {
    fn contains_nonascii(v: usize) -> bool {
        const NONASCII_MASK: usize = 0x80808080_80808080u64 as usize;
        (NONASCII_MASK & v) != 0
    }
    let (head, body, tail) = unsafe { bytes.align_to::<usize>() };
    head.iter().all(|b| b.is_ascii())
        && body.iter().all(|w| !contains_nonascii(*w))
        && tail.iter().all(|b| b.is_ascii())
}
```

Of course, with the standard caveat of "I spent far longer on my version than this version". Also: on modern ARM (armv7+) `read_unaligned` is only slow if the address is actually unaligned, and even then it's really not that bad in my experience. Like, I don't have numbers for this, but it seems around the same as just reading the bytes. There are platforms where this is worse, though I'd expect those to compile `read_unaligned` down to loading each byte, which is what the … Anyway, I actually noticed …
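For readers following along, here is a rough sketch of the word-at-a-time shape being discussed (unaligned reads at both ends, aligned reads in the middle). This is my own reconstruction, not the PR's exact code; the function name and structure are illustrative:

```rust
use core::mem::size_of;

const NONASCII_MASK: usize = 0x80808080_80808080u64 as usize;

fn contains_nonascii(v: usize) -> bool {
    (NONASCII_MASK & v) != 0
}

fn is_ascii_unaligned_reads(bytes: &[u8]) -> bool {
    let len = bytes.len();
    let word = size_of::<usize>();
    if len < word {
        return bytes.iter().all(|b| b.is_ascii());
    }
    let ptr = bytes.as_ptr();
    // SAFETY: len >= word, so both whole-word reads stay in bounds.
    // The first and last words may be unaligned; they may also overlap
    // bytes the aligned loop below re-checks, which is fine.
    let (first, last) = unsafe {
        (
            (ptr as *const usize).read_unaligned(),
            (ptr.add(len - word) as *const usize).read_unaligned(),
        )
    };
    if contains_nonascii(first | last) {
        return false;
    }
    // Aligned middle: start at the first word-aligned offset.
    let mut i = ptr.align_offset(word);
    if i == usize::MAX {
        // `align_offset` could not compute an offset; fall back.
        return bytes.iter().all(|b| b.is_ascii());
    }
    while i + word <= len {
        // SAFETY: `ptr + i` is word-aligned and `i + word <= len`.
        if contains_nonascii(unsafe { (ptr.add(i) as *const usize).read() }) {
            return false;
        }
        i += word;
    }
    // Bytes before the first aligned offset were covered by `first`;
    // any remaining tail bytes were covered by `last`.
    true
}
```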
I would have added a fast path for when …
Beyond my knowledge, let's try... r? @sfackler /cc @BurntSushi as I know your interest in these matters.
Consider `slice::align_to` instead of `align_offset`? It also takes care of various other asserts you have in the rest of this function, removes the necessity for the "thoroughly" test case, and significantly reduces the amount of `unsafe` in the code...
EDIT: wrote this before I saw the other comments...
> The algorithm I'm using is different (and more efficient on most machines) than what I'd get with `slice::align_to`.

That likely means `slice::align_to` can be improved instead. I'll experiment with it.
This is discussed in other comments above; it's measurably less performant (on my machine, anyway). I suspect in practice it will do a little worse than it does on the benchmark I linked in #74066 (comment), at least for the case that caused me to file this PR -- mostly medium-length strings. That is, the benchmarks don't really hit the cases where `align_to` does worse. To demonstrate this, if I modify the code* so that the first and last bytes of the test string are skipped (to force unaligned on both ends), you see the following for medium (small is smaller than a usize, and long is too large for something like this to matter):
Note that previously medium would have been aligned to usize (it came right out of an allocator) and was 32 bytes -- e.g. it would totally avoid the problem cases for `align_to`. In practice a case like this bit me before (it was with SIMD vectors, which exacerbates the issue, since they have longer tail/head sequences, but...). Which is... admittedly why I hold a little bit of a grudge against `align_to`. That said, I do feel like I'm getting a signal of "this is not worth the complexity", and I'd rather have an … That said, IMO the stdlib is a better-than-average place to make a complexity/performance trade-off like this, so I'm probably going to wait for the reviewer to tell me to switch to `align_to`. * Uh, I'll have the …
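For illustration, the kind of modification described (skipping the first and last bytes to force both ends unaligned) can be as simple as the following sketch; the sizes are made up:

```rust
fn main() {
    // In practice the allocation starts word-aligned, so trimming one
    // byte from each end forces an unaligned head and tail.
    let data = vec![b'a'; 32 + 2];
    let unaligned: &[u8] = &data[1..data.len() - 1];
    assert_eq!(unaligned.len(), 32);
    assert!(unaligned.iter().all(|b| b.is_ascii()));
}
```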
EDIT: (My comment was posted before I saw your response too)
@nagisa IMO it's kind of a problem with the API rather than something like a missed optimization (although, to be clear, `align_to` compiles weird -- LLVM often aggressively unrolls the loops on the head/tail, e.g.: https://godbolt.org/z/N7t96u). In particular, a big issue for me is that coming from … That said, I can't deny it's much more readable and maintainable than what I came up with.
I extracted the implementations into a local crate, wrote a criterion-based benchmark, and tested the three implementations on both x86_64 and aarch64. From what I can see, the proposed implementation is (noticeably) better than the `align_to` implementation at input sizes from ~8 bytes to ~1kB, after which point the … This makes sense to me given that … So ultimately it's probably a question of what kind of input we want to optimize for (and I think the medium-length case is the most interesting one, favouring the currently proposed implementation). That said, I also see extreme (>100%) run-to-run variance in some of the benchmark results, even though I did make an effort to avoid most of the well-known pitfalls in benchmarking, so whatever data my benchmark is generating is gibberish anyway… This also makes it super difficult to just experiment with minor changes to the code...
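For reference, a minimal criterion harness of the kind described might look like the sketch below. The sizes are illustrative, and `is_ascii_iter_all` stands in for the three extracted implementations; this is not the commenter's actual benchmark:

```rust
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};

// Placeholder: stand-in for the extracted implementations under test.
fn is_ascii_iter_all(bytes: &[u8]) -> bool {
    bytes.iter().all(|b| b.is_ascii())
}

fn bench_is_ascii(c: &mut Criterion) {
    let mut group = c.benchmark_group("is_ascii");
    for size in [8usize, 64, 1024, 64 * 1024] {
        let data = vec![b'a'; size];
        group.throughput(Throughput::Bytes(size as u64));
        group.bench_with_input(BenchmarkId::new("iter_all", size), &data, |bencher, d| {
            bencher.iter(|| is_ascii_iter_all(black_box(d)))
        });
        // ...repeat for the `align_to`-based and proposed implementations.
    }
    group.finish();
}

criterion_group!(benches, bench_is_ascii);
criterion_main!(benches);
```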
…rks into own file
One idea I had was something like:

```rust
fn is_ascii_align_to_unrolled(bytes: &[u8]) -> bool {
    // Not really clear if this should be testing size_of [usize; 2] or size_of usize still...
    if bytes.len() < core::mem::size_of::<[usize; 2]>() {
        return bytes.iter().all(|b| b.is_ascii());
    }
    // SAFETY: transmuting a sequence of `u8` to `[usize; 2]` is always fine
    let (head, body, tail) = unsafe { bytes.align_to::<[usize; 2]>() };
    // `contains_nonascii` as defined in the earlier snippet.
    head.iter().all(|b| b.is_ascii())
        && body.iter().all(|w| !contains_nonascii(w[0] | w[1]))
        && tail.iter().all(|b| b.is_ascii())
}
```

And this actually is as fast or almost as fast as the main impl in the PR... so long as your string's length is a multiple of …, which seems like a pretty common case. Anyway, my most recent push adds a bench for that, separates out benches for the different alignment issues that … A run of the benchmark results is here.
Hm. I'm going to think about this, since I feel like there's no reason we shouldn't be doing the same. I'm not fully sure I follow, though... That said, there really are a few things I probably should do to improve the implementation of `is_ascii` in the PR (it probably should unroll the inner loop once), but it also might be complex enough as-is, given that its complexity already caused minor controversy.
…_` benches, and clean up stray semicolon
I'm going to approve this. I think optimizing for performance with smaller strings here makes sense, and the algorithm as implemented here will do better with said smaller strings specifically because it just does less work. I also don't see a reason to block this on benchmarks that are super fuzzy anyway... @bors r+
📌 Commit a150dcc has been approved by |
…arth Rollup of 10 pull requests

Successful merges:

- rust-lang#72920 (Stabilize `transmute` in constants and statics but not const fn)
- rust-lang#73715 (debuginfo: Mangle tuples to be natvis friendly, typedef basic types)
- rust-lang#74066 (Optimize is_ascii for str and [u8].)
- rust-lang#74116 (Fix cross compilation of LLVM to aarch64 Windows targets)
- rust-lang#74167 (linker: illumos ld does not support --eh-frame-hdr)
- rust-lang#74168 (Add a help to use `in_band_lifetimes` in nightly)
- rust-lang#74197 (Reword incorrect `self` token suggestion)
- rust-lang#74213 (Minor refactor for rustc_resolve diagnostics match)
- rust-lang#74240 (Fix rust-lang#74081 and add the test case from rust-lang#74236)
- rust-lang#74241 (update miri)

Failed merges:

r? @ghost
```rust
if byte_pos == len {
    return true;
}
```
Is this necessary? What happens if we always check `last_word`? As in, making it branchless.
It's an optimization to avoid an extra (redundant) load. It's not necessary for correctness.
> It's an optimization to avoid an extra (redundant) load. It's not necessary for correctness.
Avoid an extra load? But what happens if you remove it, along with the debug assert?
Then we perform an extra load of the last word, which we've already checked. What's your point?
Yes, that's what I mean: perform the extra load of the last word to remove the branch. Or we could rework the logic a bit and remove this branch; I wonder whether it would be faster.
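For concreteness, the branchless shape being suggested looks roughly like the sketch below. This is a hedged illustration, not the PR's actual code; `tail_is_ascii` and its parameters are hypothetical:

```rust
use core::mem::size_of;

const NONASCII_MASK: usize = 0x80808080_80808080u64 as usize;

// Branchless tail handling: rather than
//     if byte_pos == len { return true; }
// always read the word ending at `len`, accepting that it may overlap
// bytes the aligned loop already checked.
//
// SAFETY: caller must guarantee `ptr..ptr + len` is a valid byte range
// with `len >= size_of::<usize>()`.
unsafe fn tail_is_ascii(ptr: *const u8, len: usize) -> bool {
    debug_assert!(len >= size_of::<usize>());
    let last = (ptr.add(len - size_of::<usize>()) as *const usize).read_unaligned();
    (last & NONASCII_MASK) == 0
}
```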
Yes, but that doesn't seem worth it to me at all. If you can show it's faster in benchmarks (of which there are many), I suggest you submit a PR.
Keep in mind that on some platforms a `read_unaligned` is comparatively expensive, so it's worth avoiding if possible.
Do you have a minimal testing crate to test this easily without having to build rust?
No, sorry.
Remove branch in optimized is_ascii

Performs slightly better on short or medium inputs by eliminating the last branch check on `byte_pos == len` and always checking the last word, as the remainder is always at most one `usize`. Benchmark: `libcore` is before, `libcore_new` is after. It improves medium and short by 1ns but regresses unaligned_tail by 2ns; either way, the case that avoids unaligned_tail only has a tiny 1/8 chance on a 64-bit machine. I don't think we should bet on that; the probability is worse than a dice roll.

```
test long::case00_libcore                     ... bench: 38 ns/iter (+/- 1) = 183947 MB/s
test long::case00_libcore_new                 ... bench: 38 ns/iter (+/- 1) = 183947 MB/s
test long::case01_iter_all                    ... bench: 227 ns/iter (+/- 6) = 30792 MB/s
test long::case02_align_to                    ... bench: 40 ns/iter (+/- 1) = 174750 MB/s
test long::case03_align_to_unrolled           ... bench: 19 ns/iter (+/- 1) = 367894 MB/s
test medium::case00_libcore                   ... bench: 5 ns/iter (+/- 0) = 6400 MB/s
test medium::case00_libcore_new               ... bench: 4 ns/iter (+/- 0) = 8000 MB/s
test medium::case01_iter_all                  ... bench: 20 ns/iter (+/- 1) = 1600 MB/s
test medium::case02_align_to                  ... bench: 6 ns/iter (+/- 0) = 5333 MB/s
test medium::case03_align_to_unrolled         ... bench: 5 ns/iter (+/- 0) = 6400 MB/s
test short::case00_libcore                    ... bench: 7 ns/iter (+/- 0) = 1000 MB/s
test short::case00_libcore_new                ... bench: 6 ns/iter (+/- 0) = 1166 MB/s
test short::case01_iter_all                   ... bench: 5 ns/iter (+/- 0) = 1400 MB/s
test short::case02_align_to                   ... bench: 5 ns/iter (+/- 0) = 1400 MB/s
test short::case03_align_to_unrolled          ... bench: 5 ns/iter (+/- 1) = 1400 MB/s
test unaligned_both::case00_libcore           ... bench: 4 ns/iter (+/- 0) = 7500 MB/s
test unaligned_both::case00_libcore_new       ... bench: 4 ns/iter (+/- 0) = 7500 MB/s
test unaligned_both::case01_iter_all          ... bench: 26 ns/iter (+/- 0) = 1153 MB/s
test unaligned_both::case02_align_to          ... bench: 13 ns/iter (+/- 2) = 2307 MB/s
test unaligned_both::case03_align_to_unrolled ... bench: 11 ns/iter (+/- 0) = 2727 MB/s
test unaligned_head::case00_libcore           ... bench: 5 ns/iter (+/- 0) = 6200 MB/s
test unaligned_head::case00_libcore_new       ... bench: 5 ns/iter (+/- 0) = 6200 MB/s
test unaligned_head::case01_iter_all          ... bench: 19 ns/iter (+/- 1) = 1631 MB/s
test unaligned_head::case02_align_to          ... bench: 10 ns/iter (+/- 0) = 3100 MB/s
test unaligned_head::case03_align_to_unrolled ... bench: 14 ns/iter (+/- 0) = 2214 MB/s
test unaligned_tail::case00_libcore           ... bench: 3 ns/iter (+/- 0) = 10333 MB/s
test unaligned_tail::case00_libcore_new       ... bench: 5 ns/iter (+/- 0) = 6200 MB/s
test unaligned_tail::case01_iter_all          ... bench: 19 ns/iter (+/- 0) = 1631 MB/s
test unaligned_tail::case02_align_to          ... bench: 10 ns/iter (+/- 0) = 3100 MB/s
test unaligned_tail::case03_align_to_unrolled ... bench: 13 ns/iter (+/- 0) = 2384 MB/s
```

Rough (unfair) maths on improvements for fun: 1ns * 7/8 - 2ns * 1/8 = 0.625ns

Inspired by fish's and zsh's clever trick of highlighting missing linefeeds (⏎), and by the branchless implementation of `binary_search` in Rust.

cc @thomcc rust-lang#74066

r? @nagisa
Optimize `core::str::Chars::count`

I wrote this a while ago after seeing this function as a bottleneck in a profile, but never got around to contributing it. I saw it again, and so here it is.

The implementation is fairly complex, but I tried to explain what's happening at both a high level (in the header comment for the file) and in line comments in the impl. Hopefully it's clear enough.

This implementation (`case00_cur_libcore` in the benchmarks below) is somewhat consistently around 4x to 5x faster than the old implementation (`case01_old_libcore` in the benchmarks below) for a wide variety of workloads, without regressing performance on any of the workload sizes I've tried.

I also improved the benchmarks for this code, so that they explicitly check text in different languages and of different sizes (err, the cross product of language x size). The results of the benchmarks are here:

<details>
<summary>Benchmark results</summary>
<pre>
test str::char_count::emoji_huge::case00_cur_libcore      ... bench: 20,216 ns/iter (+/- 3,673) = 17931 MB/s
test str::char_count::emoji_huge::case01_old_libcore      ... bench: 108,851 ns/iter (+/- 12,777) = 3330 MB/s
test str::char_count::emoji_huge::case02_iter_increment   ... bench: 329,502 ns/iter (+/- 4,163) = 1100 MB/s
test str::char_count::emoji_huge::case03_manual_char_len  ... bench: 223,333 ns/iter (+/- 14,167) = 1623 MB/s
test str::char_count::emoji_large::case00_cur_libcore     ... bench: 293 ns/iter (+/- 6) = 19331 MB/s
test str::char_count::emoji_large::case01_old_libcore     ... bench: 1,681 ns/iter (+/- 28) = 3369 MB/s
test str::char_count::emoji_large::case02_iter_increment  ... bench: 5,166 ns/iter (+/- 85) = 1096 MB/s
test str::char_count::emoji_large::case03_manual_char_len ... bench: 3,476 ns/iter (+/- 62) = 1629 MB/s
test str::char_count::emoji_medium::case00_cur_libcore    ... bench: 48 ns/iter (+/- 0) = 14750 MB/s
test str::char_count::emoji_medium::case01_old_libcore    ... bench: 217 ns/iter (+/- 4) = 3262 MB/s
test str::char_count::emoji_medium::case02_iter_increment ... bench: 642 ns/iter (+/- 7) = 1102 MB/s
test str::char_count::emoji_medium::case03_manual_char_len ... bench: 445 ns/iter (+/- 3) = 1591 MB/s
test str::char_count::emoji_small::case00_cur_libcore     ... bench: 18 ns/iter (+/- 0) = 3777 MB/s
test str::char_count::emoji_small::case01_old_libcore     ... bench: 23 ns/iter (+/- 0) = 2956 MB/s
test str::char_count::emoji_small::case02_iter_increment  ... bench: 66 ns/iter (+/- 2) = 1030 MB/s
test str::char_count::emoji_small::case03_manual_char_len ... bench: 29 ns/iter (+/- 1) = 2344 MB/s
test str::char_count::en_huge::case00_cur_libcore         ... bench: 25,909 ns/iter (+/- 39,260) = 13299 MB/s
test str::char_count::en_huge::case01_old_libcore         ... bench: 102,887 ns/iter (+/- 3,257) = 3349 MB/s
test str::char_count::en_huge::case02_iter_increment      ... bench: 166,370 ns/iter (+/- 12,439) = 2071 MB/s
test str::char_count::en_huge::case03_manual_char_len     ... bench: 166,332 ns/iter (+/- 4,262) = 2071 MB/s
test str::char_count::en_large::case00_cur_libcore        ... bench: 281 ns/iter (+/- 6) = 19160 MB/s
test str::char_count::en_large::case01_old_libcore        ... bench: 1,598 ns/iter (+/- 19) = 3369 MB/s
test str::char_count::en_large::case02_iter_increment     ... bench: 2,598 ns/iter (+/- 167) = 2072 MB/s
test str::char_count::en_large::case03_manual_char_len    ... bench: 2,578 ns/iter (+/- 55) = 2088 MB/s
test str::char_count::en_medium::case00_cur_libcore       ... bench: 44 ns/iter (+/- 1) = 15295 MB/s
test str::char_count::en_medium::case01_old_libcore       ... bench: 201 ns/iter (+/- 51) = 3348 MB/s
test str::char_count::en_medium::case02_iter_increment    ... bench: 322 ns/iter (+/- 40) = 2090 MB/s
test str::char_count::en_medium::case03_manual_char_len   ... bench: 319 ns/iter (+/- 5) = 2109 MB/s
test str::char_count::en_small::case00_cur_libcore        ... bench: 15 ns/iter (+/- 0) = 2333 MB/s
test str::char_count::en_small::case01_old_libcore        ... bench: 14 ns/iter (+/- 0) = 2500 MB/s
test str::char_count::en_small::case02_iter_increment     ... bench: 30 ns/iter (+/- 1) = 1166 MB/s
test str::char_count::en_small::case03_manual_char_len    ... bench: 30 ns/iter (+/- 1) = 1166 MB/s
test str::char_count::ru_huge::case00_cur_libcore         ... bench: 16,439 ns/iter (+/- 3,105) = 19777 MB/s
test str::char_count::ru_huge::case01_old_libcore         ... bench: 89,480 ns/iter (+/- 2,555) = 3633 MB/s
test str::char_count::ru_huge::case02_iter_increment      ... bench: 217,703 ns/iter (+/- 22,185) = 1493 MB/s
test str::char_count::ru_huge::case03_manual_char_len     ... bench: 157,330 ns/iter (+/- 19,188) = 2066 MB/s
test str::char_count::ru_large::case00_cur_libcore        ... bench: 243 ns/iter (+/- 6) = 20905 MB/s
test str::char_count::ru_large::case01_old_libcore        ... bench: 1,384 ns/iter (+/- 51) = 3670 MB/s
test str::char_count::ru_large::case02_iter_increment     ... bench: 3,381 ns/iter (+/- 543) = 1502 MB/s
test str::char_count::ru_large::case03_manual_char_len    ... bench: 2,423 ns/iter (+/- 429) = 2096 MB/s
test str::char_count::ru_medium::case00_cur_libcore       ... bench: 42 ns/iter (+/- 1) = 15119 MB/s
test str::char_count::ru_medium::case01_old_libcore       ... bench: 180 ns/iter (+/- 4) = 3527 MB/s
test str::char_count::ru_medium::case02_iter_increment    ... bench: 402 ns/iter (+/- 45) = 1579 MB/s
test str::char_count::ru_medium::case03_manual_char_len   ... bench: 280 ns/iter (+/- 29) = 2267 MB/s
test str::char_count::ru_small::case00_cur_libcore        ... bench: 12 ns/iter (+/- 0) = 2666 MB/s
test str::char_count::ru_small::case01_old_libcore        ... bench: 12 ns/iter (+/- 0) = 2666 MB/s
test str::char_count::ru_small::case02_iter_increment     ... bench: 19 ns/iter (+/- 0) = 1684 MB/s
test str::char_count::ru_small::case03_manual_char_len    ... bench: 14 ns/iter (+/- 1) = 2285 MB/s
test str::char_count::zh_huge::case00_cur_libcore         ... bench: 15,053 ns/iter (+/- 2,640) = 20067 MB/s
test str::char_count::zh_huge::case01_old_libcore         ... bench: 82,622 ns/iter (+/- 3,602) = 3656 MB/s
test str::char_count::zh_huge::case02_iter_increment      ... bench: 230,456 ns/iter (+/- 7,246) = 1310 MB/s
test str::char_count::zh_huge::case03_manual_char_len     ... bench: 220,595 ns/iter (+/- 11,624) = 1369 MB/s
test str::char_count::zh_large::case00_cur_libcore        ... bench: 227 ns/iter (+/- 65) = 20792 MB/s
test str::char_count::zh_large::case01_old_libcore        ... bench: 1,136 ns/iter (+/- 144) = 4154 MB/s
test str::char_count::zh_large::case02_iter_increment     ... bench: 3,147 ns/iter (+/- 253) = 1499 MB/s
test str::char_count::zh_large::case03_manual_char_len    ... bench: 2,993 ns/iter (+/- 400) = 1577 MB/s
test str::char_count::zh_medium::case00_cur_libcore       ... bench: 36 ns/iter (+/- 5) = 16388 MB/s
test str::char_count::zh_medium::case01_old_libcore       ... bench: 142 ns/iter (+/- 18) = 4154 MB/s
test str::char_count::zh_medium::case02_iter_increment    ... bench: 379 ns/iter (+/- 37) = 1556 MB/s
test str::char_count::zh_medium::case03_manual_char_len   ... bench: 364 ns/iter (+/- 51) = 1620 MB/s
test str::char_count::zh_small::case00_cur_libcore        ... bench: 11 ns/iter (+/- 1) = 3000 MB/s
test str::char_count::zh_small::case01_old_libcore        ... bench: 11 ns/iter (+/- 1) = 3000 MB/s
test str::char_count::zh_small::case02_iter_increment     ... bench: 20 ns/iter (+/- 3) = 1650 MB/s
</pre>
</details>

I also added fairly thorough tests for different sizes and alignments. This completes on my machine in 0.02s, which is surprising given how thorough they are, but they do seem to detect bugs in the implementation. (I haven't run the tests on a 32-bit machine since before I reworked the code a little, though, so... hopefully I'm not about to embarrass myself.)

This uses similar SWAR-style techniques to the `is_ascii` impl I contributed in rust-lang#74066, so I'm going to request review from the same person who reviewed that one. That said, I am not particularly picky, and might not have the correct syntax for requesting a review from someone (so it goes).

r? `@nagisa`
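For context, the counting identity such an implementation builds on: in UTF-8, the number of chars equals the number of non-continuation bytes (bytes not of the form `0b10xx_xxxx`). A scalar sketch of just that identity (the actual implementation layers word-at-a-time SWAR counting and unrolling on top, which this sketch omits):

```rust
/// Counts chars by counting UTF-8 non-continuation bytes.
/// Continuation bytes are 0x80..=0xBF, i.e. below -0x40 as i8.
fn char_count_scalar(s: &str) -> usize {
    s.bytes().filter(|&b| (b as i8) >= -0x40).count()
}

fn main() {
    assert_eq!(char_count_scalar("héllo"), 5);
    assert_eq!(char_count_scalar("héllo"), "héllo".chars().count());
}
```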
This optimizes the `is_ascii` function for `[u8]` and `str`. I've been surprised this wasn't done for a while, so I just did it.

Benchmarks comparing before/after look like:

(Taken on an x86_64 MacBook, 2.9 GHz Intel Core i9 with 6 cores.)

Where `is_ascii_slice_iter_all` is the old version, and `is_ascii_slice_libcore` is the new.

I tried to document the code well, so hopefully it's understandable. It has fairly exhaustive tests ensuring size/align doesn't get violated -- because `miri` doesn't really help a lot for this sort of code right now, I tried to `debug_assert` all the safety invariants I'm depending on. (Of course, none of them are required for correctness or soundness -- they just allow us to test that this sort of pointer manipulation is sound and such.)

Anyway, thanks. Let me know if you have questions/desired changes.
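As an illustration of the `debug_assert` style described above, guarding the invariants an unsafe word read relies on: the function and parameter names below are hypothetical, not the PR's actual code.

```rust
use core::mem::{align_of, size_of};

/// Illustrative only: assert the invariants an aligned `usize` read
/// inside an `is_ascii`-style loop depends on, then perform the read.
/// `ptr`, `byte_pos`, and `len` stand for the hypothetical loop locals.
///
/// SAFETY: caller must guarantee `ptr..ptr + len` is a valid byte
/// range and that `ptr + byte_pos` is aligned for `usize`.
unsafe fn read_word_checked(ptr: *const u8, byte_pos: usize, len: usize) -> usize {
    debug_assert!(byte_pos + size_of::<usize>() <= len, "read must stay in bounds");
    debug_assert_eq!(
        ptr.add(byte_pos) as usize % align_of::<usize>(),
        0,
        "read must be aligned"
    );
    (ptr.add(byte_pos) as *const usize).read()
}
```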