Optimize is_ascii for str and [u8]. #74066

thomcc · 2020-07-05T17:28:02Z

This optimizes the is_ascii function for [u8] and str. I've been surprised this wasn't done for a while, so I just did it.

Benchmarks comparing before/after look like:

test ascii::long_readonly::is_ascii_slice_iter_all              ... bench:         174 ns/iter (+/- 79) = 40172 MB/s
test ascii::long_readonly::is_ascii_slice_libcore               ... bench:          16 ns/iter (+/- 5) = 436875 MB/s
test ascii::medium_readonly::is_ascii_slice_iter_all            ... bench:          12 ns/iter (+/- 3) = 2666 MB/s
test ascii::medium_readonly::is_ascii_slice_libcore             ... bench:           2 ns/iter (+/- 0) = 16000 MB/s
test ascii::short_readonly::is_ascii_slice_iter_all             ... bench:           3 ns/iter (+/- 0) = 2333 MB/s
test ascii::short_readonly::is_ascii_slice_libcore              ... bench:           4 ns/iter (+/- 0) = 1750 MB/s

(Taken on a x86_64 macbook 2.9 GHz Intel Core i9 with 6 cores)

Where is_ascii_slice_iter_all is the old version, and is_ascii_slice_libcore is the new.

I tried to document the code well, so hopefully it's understandable. It has fairly exhaustive tests ensuring size/align doesn't get violated -- because miri doesn't really help a lot for this sort of code right now, I tried to debug_assert all the safety invariants I'm depending on. (Of course, none of them are required for correctness or soundness -- just allows us to test that this sort of pointer manipulation is sound and such).

Anyway, thanks. Let me know if you have questions/desired changes.

rust-highfive · 2020-07-05T17:28:05Z

r? @shepmaster

(rust_highfive has picked a reviewer for you, use r? to override)

jonas-schievink · 2020-07-05T17:42:49Z

is_ascii_slice_libcore ... bench: 16 ns/iter (+/- 5) = 436875 MB/s

That's over 400 GB/s, are you sure that's measuring the correct thing?

thomcc · 2020-07-05T17:50:06Z

Good point!

ecstatic-morse · 2020-07-05T17:52:15Z

You should have a look at slice::align_to.

thomcc · 2020-07-05T18:01:35Z

@ecstatic-morse: The algorithm I'm using is different (and more efficient on most machines) than what I'd get with slice::align_to.

@jonas-schievink: Not sure I know how to debug this really -- all the timings from that module seem dodgy. Well over 30GB/s for most of them, and those also contain to_vec() of a ~6000 byte slice.

test ascii::long::is_ascii_alphabetic                           ... bench:         186 ns/iter (+/- 17) = 37580 MB/s
test ascii::long::is_ascii_alphanumeric                         ... bench:         208 ns/iter (+/- 62) = 33605 MB/s
test ascii::long::is_ascii_control                              ... bench:         238 ns/iter (+/- 89) = 29369 MB/s
test ascii::long::is_ascii_digit                                ... bench:         301 ns/iter (+/- 199) = 23222 MB/s
test ascii::long::is_ascii_graphic                              ... bench:         252 ns/iter (+/- 242) = 27738 MB/s
test ascii::long::is_ascii_hexdigit                             ... bench:         213 ns/iter (+/- 87) = 32816 MB/s
test ascii::long::is_ascii_lowercase                            ... bench:         239 ns/iter (+/- 114) = 29246 MB/s
test ascii::long::is_ascii_punctuation                          ... bench:         197 ns/iter (+/- 103) = 35482 MB/s
test ascii::long::is_ascii_uppercase                            ... bench:         188 ns/iter (+/- 34) = 37180 MB/s
test ascii::long::is_ascii_whitespace                           ... bench:         207 ns/iter (+/- 195) = 33768 MB/s

ecstatic-morse · 2020-07-05T18:12:09Z

Simple code is good code, and people running ARM may also want to check whether their strings are ASCII-only. If your more complex version is meaningfully more efficient, then fine, but I suspect it's not.

thomcc · 2020-07-05T18:26:58Z

test ascii::long_readonly::is_ascii_slice_align_to              ... bench:          23 ns/iter (+/- 5) = 303913 MB/s
test ascii::long_readonly::is_ascii_slice_iter_all              ... bench:         161 ns/iter (+/- 24) = 43416 MB/s
test ascii::long_readonly::is_ascii_slice_libcore               ... bench:          16 ns/iter (+/- 4) = 436875 MB/s
test ascii::medium_readonly::is_ascii_slice_align_to            ... bench:           6 ns/iter (+/- 1) = 5333 MB/s
test ascii::medium_readonly::is_ascii_slice_iter_all            ... bench:          14 ns/iter (+/- 6) = 2285 MB/s
test ascii::medium_readonly::is_ascii_slice_libcore             ... bench:           3 ns/iter (+/- 3) = 10666 MB/s
test ascii::short_readonly::is_ascii_slice_align_to             ... bench:           8 ns/iter (+/- 2) = 875 MB/s
test ascii::short_readonly::is_ascii_slice_iter_all             ... bench:           3 ns/iter (+/- 3) = 2333 MB/s
test ascii::short_readonly::is_ascii_slice_libcore              ... bench:           4 ns/iter (+/- 5) = 1750 MB/s

for

    fn is_ascii_slice_align_to(bytes: &[u8]) {
        fn contains_nonascii(v: usize) -> bool {
            const NONASCII_MASK: usize = 0x80808080_80808080u64 as usize;
            (NONASCII_MASK & v) != 0
        }
        let (head, body, tail) = unsafe { bytes.align_to::<usize>() };
        head.iter().all(|b| b.is_ascii()) &&
        body.iter().all(|w| !contains_nonascii(*w)) &&
        tail.iter().all(|b| b.is_ascii())
    }

Of course with the standard caveat of "I spent far longer on my version than this version".

Also: On modern ARM (armv7+) read_unaligned is only slow if the address is actually unaligned, and even then it's really not that bad in my experience. Like, I don't have numbers for this but it seems around the same as just reading the bytes.

There are platforms where this is worse though, but I'd expect those to compile read_unaligned down to load each byte. Which is what the align_to version has to do (I'll admit, I find align_to a little annoying for this reason).

Anyway, I actually noticed is_ascii was bad because I was optimizing firefox for android's (the "preview") autocomplete search implementation, so I do care very much about ARM performance for this code.

ecstatic-morse · 2020-07-05T18:47:15Z

I would have added a fast path for when slice.len() < mem::size_of::<usize>() to the align_to version since your version has a similar check that makes it fast on the small benchmark. I will note that the margin of error for both versions overlaps, but I would call your version meaningfully more efficient on x86 at least. Thanks for doing the work.

shepmaster · 2020-07-05T19:33:24Z

Beyond my knowledge, let's try...

r? @sfackler

/cc @BurntSushi as I know your interest in these matters.

nagisa

Consider slice::align_to instead of align_offset? It also takes care of various other asserts you have in the rest of this function as well as removes the necessity for the "thoroughly" test case and significantly reduce the amount of unsafe in the code...

EDIT: wrote this before I saw the other comments...

The algorithm I'm using is different (and more efficient on most machines) than what I'd get with slice::align_to.

That likely means slice::align_to can be improved instead. I’ll experiment with it.

thomcc · 2020-07-05T21:23:58Z

Consider slice::align_to instead of align_offset? It also takes care of various other asserts you have in the rest of this function as well as removes the necessity for the "thoroughly" test case and significantly reduce the amount of unsafe in the code...

This is discussed in other comments above, it's measurably less performant (on my machine, anyway).

I suspect in practice it will do a little worse than it does on the benchmark I linked in #74066 (comment) (at least for the case that caused me to file this PR -- mostly medium-length strings). That is, the benchmarks don't really hit the cases where align_to is bad -- And I specifically wrote my function the way I did to avoid the cases where align_to is bad (concretely: to avoid a per-byte loop on the unaligned case).

To demonstrate this, if I modify the code* so that first and last bytes of the test string are skipped (to force unaligned on both ends), you see the following for medium (small is smaller than a usize, and long is too large for something like this to matter):

test ascii::medium_readonly::is_ascii_slice_align_to            ... bench:          12 ns/iter (+/- 4) = 2500 MB/s
test ascii::medium_readonly::is_ascii_slice_iter_all            ... bench:          10 ns/iter (+/- 2) = 3000 MB/s
test ascii::medium_readonly::is_ascii_slice_libcore             ... bench:           2 ns/iter (+/- 1) = 15000 MB/s

Note that previously medium would have been aligned to usize (came right out of an allocator), and was 32 bytes -- e.g. it would totally avoid the problem cases for align_to.

In practice a case like bit me before (it was with SIMD vectors, which exacerbates the issue, since they have longer tail/head sequences, but...). Which is... admittedly why hold a little bit of a grudge against align_to >_>

That said, I do feel like I'm getting a signal of "This is not worth the complexity", and I'd rather have an is_ascii that uses align_to than the current one, and I'm totally willing to change it if desired... I also have a few ideas on how to make it perform closer to my implementation, although IDK if there's any way to avoid bad performance in it's problem cases.

That said, IMO the stdlib is a better-than-average place to make a complexity/performance trade-off like this, so I'm going to probably wait for the reviewer to tell me to switch to align_to if desired.

* Uh, I'll have the is_ascii_slice_align_to bench up in a few minutes after sending this comment, but it's basically what's #74066 (comment) + the suggestion from #74066 (comment).

I probably won't add the hacky code I used to force misalignment (replacing vec[..] with vec[1..vec.len()-1]...) and generate the benchmark above, but I'll try to add a non-hacky benchmark that measures the misaligned case soon. I've added this in not-entirely-hacky way.

EDIT: (My comment was posted before I saw your response too)

That likely means slice::align_to can be improved instead. I’ll experiment with it.

@nagisa IMO it's kind of a problem with the API rather than something like a missed optimization (although, to be clear, align_to compiles weird -- LLVM aggressively often unrolls loops on the head/tail, e.g: https://godbolt.org/z/N7t96u).

In particular a big issue for me is that coming from &[u8] is common, but you often really don't want to be doing operations on each u8 in the lead/trail. In practice it's a little annoying since it feels like 80% of what I want, but the remaining 20%... If I'm reaching for it at all.

That said, I can't deny it's much more readable and maintainable than what I came up with.

nagisa · 2020-07-05T22:31:46Z

I extracted the implementations into a local crate and wrote a criterion-based benchmark… and tested the three implementations on both x86_64 and an aarch64. From what I can see, the proposed implementation is (noticeably) better than the align_to implementation at input sizes ~8 bytes to ~1kB, after which point the align_to begins overtaking the proposed implementation.

This makes sense to me given that align_to computes the slices ahead of time and the align_offset version checks its ability to stay in the fast path on every iteration.

So ultimately its probably a question of what kind of input we want to optimize for (and I think the medium length case is the most interesting one, favouring the currently proposed implementation).

That said, I also see extreme (>100%) run-to-run variance in some of the benchmark results, even though I did make an effort to avoid most of the well known pitfalls in benchmarking, so whatever data my benchmark is generating is gibberish anyway… This also makes it super difficult to just experiment with the minor changes to the code...

…rks into own file

thomcc · 2020-07-05T23:05:31Z

One idea I had was something like:

fn is_ascii_align_to_unrolled(bytes: &[u8]) -> bool {
    // Not really clear if this should be testing size_of [usize; 2] or size_of usize still...
    if bytes.len() < core::mem::size_of::<[usize; 2]>() {
        return bytes.iter().all(|b| b.is_ascii());
    }
    // SAFETY: transmuting a sequence of `u8` to `[usize; 2]` is always fine
    let (head, body, tail) = unsafe { bytes.align_to::<[usize; 2]>() };
    head.iter().all(|b| b.is_ascii())
        && body.iter().all(|w| !contains_nonascii(w[0] | w[1]))
        && tail.iter().all(|b| b.is_ascii())
}

And this actually is as fast or almost as fast as the main impl in the PR... So long as your string's length is a multiple of size_of::<usize>() (Actually, I guess it would be size_of::<[usize; 2]>(), since align_to's tail is based on the size of the type, not it's alignment...).

Which seems like a pretty common case.

Anyway my most recent push adds a bench for that, separates out benches for different alignment issues that align_to can have problem with (head vs tail vs both), and moves the benchmarks to a different file, which cleaned a lot of things up. I probably should have been a bit better about making it clear which cases I was trying to optimize (or avoid pessimizing) in my initial PR)

A run of the benchmark results is here.

This makes sense to me given that align_to computes the slices ahead of time and the align_offset version checks its ability to stay in the fast path on every iteration.

Hm. I'm going to think about this since I feel like there's no reason we shouldn't be doing the same. I'm not fully sure I follow though...

That said really there are a few things I probably should do to improve the implementation of is_ascii in the PR (it probably should unroll the inner loop once), but also it might be complex enough as-is given that it's complexity already caused minor controversy.

…_` benches, and clean up stray semicolon

nagisa · 2020-07-11T15:20:57Z

I’m going to approve this. I think optimizing for performance with smaller strings here makes sense and the algorithm as implemented here will do better with said smaller strings specifically because it just does less work.

I also don't see a reason to block this on benchmarks that are super fuzzy anyway...

@bors r+

bors · 2020-07-11T15:20:58Z

📌 Commit a150dcc has been approved by nagisa

@ghost

…arth Rollup of 10 pull requests Successful merges: - rust-lang#72920 (Stabilize `transmute` in constants and statics but not const fn) - rust-lang#73715 (debuginfo: Mangle tuples to be natvis friendly, typedef basic types) - rust-lang#74066 (Optimize is_ascii for str and [u8].) - rust-lang#74116 (Fix cross compilation of LLVM to aarch64 Windows targets) - rust-lang#74167 (linker: illumos ld does not support --eh-frame-hdr) - rust-lang#74168 (Add a help to use `in_band_lifetimes` in nightly) - rust-lang#74197 (Reword incorrect `self` token suggestion) - rust-lang#74213 (Minor refactor for rustc_resolve diagnostics match) - rust-lang#74240 (Fix rust-lang#74081 and add the test case from rust-lang#74236) - rust-lang#74241 (update miri) Failed merges: r? @ghost

pickfire · 2020-07-20T06:55:37Z

src/libcore/slice/mod.rs

+    if byte_pos == len {
+        return true;
+    }


Is this necessary? What happens if we always check last_word? As in making it branchless.

It's an optimization to avoid an extra (redundant) load. It's not necessary for correctness.

It's an optimization to avoid an extra (redundant) load. It's not necessary for correctness.

Avoid extra load? But what happens if you remove it? Along with the debug assert.

Then we perform an extra load of the last word, which we've already checked. What's your point?

Yes, that's what I mean, perform the extra load of the last word to remove the branch. Or we could rework the logic a bit and remove this branch, I wonder would it be faster.

Yes, that doesn't seem worth it to me at all. If you can show it's faster in benchmarks (which there are many), I suggest you submit a PR.

Keep in mind on some platforms a read_unaligned is comparatively expensive, so it's worth avoiding if possible.

Do you have a minimal testing crate to test this easily without having to build rust?

@thomcc

Remove branch in optimized is_ascii Performs slightly better in short or medium bytes by eliminating the last branch check on `byte_pos == len` and always check the last byte as it is always at most one `usize`. Benchmark, before `libcore`, after `libcore_new`. It improves medium and short by 1ns but regresses unaligned_tail by 2ns, either way we can get unaligned_tail have a tiny chance of 1/8 on a 64 bit machine. I don't think we should bet on that, the probability is worse than dice. ``` test long::case00_libcore ... bench: 38 ns/iter (+/- 1) = 183947 MB/s test long::case00_libcore_new ... bench: 38 ns/iter (+/- 1) = 183947 MB/s test long::case01_iter_all ... bench: 227 ns/iter (+/- 6) = 30792 MB/s test long::case02_align_to ... bench: 40 ns/iter (+/- 1) = 174750 MB/s test long::case03_align_to_unrolled ... bench: 19 ns/iter (+/- 1) = 367894 MB/s test medium::case00_libcore ... bench: 5 ns/iter (+/- 0) = 6400 MB/s test medium::case00_libcore_new ... bench: 4 ns/iter (+/- 0) = 8000 MB/s test medium::case01_iter_all ... bench: 20 ns/iter (+/- 1) = 1600 MB/s test medium::case02_align_to ... bench: 6 ns/iter (+/- 0) = 5333 MB/s test medium::case03_align_to_unrolled ... bench: 5 ns/iter (+/- 0) = 6400 MB/s test short::case00_libcore ... bench: 7 ns/iter (+/- 0) = 1000 MB/s test short::case00_libcore_new ... bench: 6 ns/iter (+/- 0) = 1166 MB/s test short::case01_iter_all ... bench: 5 ns/iter (+/- 0) = 1400 MB/s test short::case02_align_to ... bench: 5 ns/iter (+/- 0) = 1400 MB/s test short::case03_align_to_unrolled ... bench: 5 ns/iter (+/- 1) = 1400 MB/s test unaligned_both::case00_libcore ... bench: 4 ns/iter (+/- 0) = 7500 MB/s test unaligned_both::case00_libcore_new ... bench: 4 ns/iter (+/- 0) = 7500 MB/s test unaligned_both::case01_iter_all ... bench: 26 ns/iter (+/- 0) = 1153 MB/s test unaligned_both::case02_align_to ... bench: 13 ns/iter (+/- 2) = 2307 MB/s test unaligned_both::case03_align_to_unrolled ... bench: 11 ns/iter (+/- 0) = 2727 MB/s test unaligned_head::case00_libcore ... bench: 5 ns/iter (+/- 0) = 6200 MB/s test unaligned_head::case00_libcore_new ... bench: 5 ns/iter (+/- 0) = 6200 MB/s test unaligned_head::case01_iter_all ... bench: 19 ns/iter (+/- 1) = 1631 MB/s test unaligned_head::case02_align_to ... bench: 10 ns/iter (+/- 0) = 3100 MB/s test unaligned_head::case03_align_to_unrolled ... bench: 14 ns/iter (+/- 0) = 2214 MB/s test unaligned_tail::case00_libcore ... bench: 3 ns/iter (+/- 0) = 10333 MB/s test unaligned_tail::case00_libcore_new ... bench: 5 ns/iter (+/- 0) = 6200 MB/s test unaligned_tail::case01_iter_all ... bench: 19 ns/iter (+/- 0) = 1631 MB/s test unaligned_tail::case02_align_to ... bench: 10 ns/iter (+/- 0) = 3100 MB/s test unaligned_tail::case03_align_to_unrolled ... bench: 13 ns/iter (+/- 0) = 2384 MB/s ``` Rough (unfair) maths on improvements for fun: 1ns * 7/8 - 2ns * 1/8 = 0.625ns Inspired by fish and zsh clever trick to highlight missing linefeeds (⏎) and branchless implementation of binary_search in rust. cc @thomcc rust-lang#74066 r? @nagisa

Optimize `core::str::Chars::count` I wrote this a while ago after seeing this function as a bottleneck in a profile, but never got around to contributing it. I saw it again, and so here it is. The implementation is fairly complex, but I tried to explain what's happening at both a high level (in the header comment for the file), and in line comments in the impl. Hopefully it's clear enough. This implementation (`case00_cur_libcore` in the benchmarks below) is somewhat consistently around 4x to 5x faster than the old implementation (`case01_old_libcore` in the benchmarks below), for a wide variety of workloads, without regressing performance on any of the workload sizes I've tried. I also improved the benchmarks for this code, so that they explicitly check text in different languages and of different sizes (err, the cross product of language x size). The results of the benchmarks are here: <details> <summary>Benchmark results</summary> <pre> test str::char_count::emoji_huge::case00_cur_libcore ... bench: 20,216 ns/iter (+/- 3,673) = 17931 MB/s test str::char_count::emoji_huge::case01_old_libcore ... bench: 108,851 ns/iter (+/- 12,777) = 3330 MB/s test str::char_count::emoji_huge::case02_iter_increment ... bench: 329,502 ns/iter (+/- 4,163) = 1100 MB/s test str::char_count::emoji_huge::case03_manual_char_len ... bench: 223,333 ns/iter (+/- 14,167) = 1623 MB/s test str::char_count::emoji_large::case00_cur_libcore ... bench: 293 ns/iter (+/- 6) = 19331 MB/s test str::char_count::emoji_large::case01_old_libcore ... bench: 1,681 ns/iter (+/- 28) = 3369 MB/s test str::char_count::emoji_large::case02_iter_increment ... bench: 5,166 ns/iter (+/- 85) = 1096 MB/s test str::char_count::emoji_large::case03_manual_char_len ... bench: 3,476 ns/iter (+/- 62) = 1629 MB/s test str::char_count::emoji_medium::case00_cur_libcore ... bench: 48 ns/iter (+/- 0) = 14750 MB/s test str::char_count::emoji_medium::case01_old_libcore ... bench: 217 ns/iter (+/- 4) = 3262 MB/s test str::char_count::emoji_medium::case02_iter_increment ... bench: 642 ns/iter (+/- 7) = 1102 MB/s test str::char_count::emoji_medium::case03_manual_char_len ... bench: 445 ns/iter (+/- 3) = 1591 MB/s test str::char_count::emoji_small::case00_cur_libcore ... bench: 18 ns/iter (+/- 0) = 3777 MB/s test str::char_count::emoji_small::case01_old_libcore ... bench: 23 ns/iter (+/- 0) = 2956 MB/s test str::char_count::emoji_small::case02_iter_increment ... bench: 66 ns/iter (+/- 2) = 1030 MB/s test str::char_count::emoji_small::case03_manual_char_len ... bench: 29 ns/iter (+/- 1) = 2344 MB/s test str::char_count::en_huge::case00_cur_libcore ... bench: 25,909 ns/iter (+/- 39,260) = 13299 MB/s test str::char_count::en_huge::case01_old_libcore ... bench: 102,887 ns/iter (+/- 3,257) = 3349 MB/s test str::char_count::en_huge::case02_iter_increment ... bench: 166,370 ns/iter (+/- 12,439) = 2071 MB/s test str::char_count::en_huge::case03_manual_char_len ... bench: 166,332 ns/iter (+/- 4,262) = 2071 MB/s test str::char_count::en_large::case00_cur_libcore ... bench: 281 ns/iter (+/- 6) = 19160 MB/s test str::char_count::en_large::case01_old_libcore ... bench: 1,598 ns/iter (+/- 19) = 3369 MB/s test str::char_count::en_large::case02_iter_increment ... bench: 2,598 ns/iter (+/- 167) = 2072 MB/s test str::char_count::en_large::case03_manual_char_len ... bench: 2,578 ns/iter (+/- 55) = 2088 MB/s test str::char_count::en_medium::case00_cur_libcore ... bench: 44 ns/iter (+/- 1) = 15295 MB/s test str::char_count::en_medium::case01_old_libcore ... bench: 201 ns/iter (+/- 51) = 3348 MB/s test str::char_count::en_medium::case02_iter_increment ... bench: 322 ns/iter (+/- 40) = 2090 MB/s test str::char_count::en_medium::case03_manual_char_len ... bench: 319 ns/iter (+/- 5) = 2109 MB/s test str::char_count::en_small::case00_cur_libcore ... bench: 15 ns/iter (+/- 0) = 2333 MB/s test str::char_count::en_small::case01_old_libcore ... bench: 14 ns/iter (+/- 0) = 2500 MB/s test str::char_count::en_small::case02_iter_increment ... bench: 30 ns/iter (+/- 1) = 1166 MB/s test str::char_count::en_small::case03_manual_char_len ... bench: 30 ns/iter (+/- 1) = 1166 MB/s test str::char_count::ru_huge::case00_cur_libcore ... bench: 16,439 ns/iter (+/- 3,105) = 19777 MB/s test str::char_count::ru_huge::case01_old_libcore ... bench: 89,480 ns/iter (+/- 2,555) = 3633 MB/s test str::char_count::ru_huge::case02_iter_increment ... bench: 217,703 ns/iter (+/- 22,185) = 1493 MB/s test str::char_count::ru_huge::case03_manual_char_len ... bench: 157,330 ns/iter (+/- 19,188) = 2066 MB/s test str::char_count::ru_large::case00_cur_libcore ... bench: 243 ns/iter (+/- 6) = 20905 MB/s test str::char_count::ru_large::case01_old_libcore ... bench: 1,384 ns/iter (+/- 51) = 3670 MB/s test str::char_count::ru_large::case02_iter_increment ... bench: 3,381 ns/iter (+/- 543) = 1502 MB/s test str::char_count::ru_large::case03_manual_char_len ... bench: 2,423 ns/iter (+/- 429) = 2096 MB/s test str::char_count::ru_medium::case00_cur_libcore ... bench: 42 ns/iter (+/- 1) = 15119 MB/s test str::char_count::ru_medium::case01_old_libcore ... bench: 180 ns/iter (+/- 4) = 3527 MB/s test str::char_count::ru_medium::case02_iter_increment ... bench: 402 ns/iter (+/- 45) = 1579 MB/s test str::char_count::ru_medium::case03_manual_char_len ... bench: 280 ns/iter (+/- 29) = 2267 MB/s test str::char_count::ru_small::case00_cur_libcore ... bench: 12 ns/iter (+/- 0) = 2666 MB/s test str::char_count::ru_small::case01_old_libcore ... bench: 12 ns/iter (+/- 0) = 2666 MB/s test str::char_count::ru_small::case02_iter_increment ... bench: 19 ns/iter (+/- 0) = 1684 MB/s test str::char_count::ru_small::case03_manual_char_len ... bench: 14 ns/iter (+/- 1) = 2285 MB/s test str::char_count::zh_huge::case00_cur_libcore ... bench: 15,053 ns/iter (+/- 2,640) = 20067 MB/s test str::char_count::zh_huge::case01_old_libcore ... bench: 82,622 ns/iter (+/- 3,602) = 3656 MB/s test str::char_count::zh_huge::case02_iter_increment ... bench: 230,456 ns/iter (+/- 7,246) = 1310 MB/s test str::char_count::zh_huge::case03_manual_char_len ... bench: 220,595 ns/iter (+/- 11,624) = 1369 MB/s test str::char_count::zh_large::case00_cur_libcore ... bench: 227 ns/iter (+/- 65) = 20792 MB/s test str::char_count::zh_large::case01_old_libcore ... bench: 1,136 ns/iter (+/- 144) = 4154 MB/s test str::char_count::zh_large::case02_iter_increment ... bench: 3,147 ns/iter (+/- 253) = 1499 MB/s test str::char_count::zh_large::case03_manual_char_len ... bench: 2,993 ns/iter (+/- 400) = 1577 MB/s test str::char_count::zh_medium::case00_cur_libcore ... bench: 36 ns/iter (+/- 5) = 16388 MB/s test str::char_count::zh_medium::case01_old_libcore ... bench: 142 ns/iter (+/- 18) = 4154 MB/s test str::char_count::zh_medium::case02_iter_increment ... bench: 379 ns/iter (+/- 37) = 1556 MB/s test str::char_count::zh_medium::case03_manual_char_len ... bench: 364 ns/iter (+/- 51) = 1620 MB/s test str::char_count::zh_small::case00_cur_libcore ... bench: 11 ns/iter (+/- 1) = 3000 MB/s test str::char_count::zh_small::case01_old_libcore ... bench: 11 ns/iter (+/- 1) = 3000 MB/s test str::char_count::zh_small::case02_iter_increment ... bench: 20 ns/iter (+/- 3) = 1650 MB/s </pre> </details> I also added fairly thorough tests for different sizes and alignments. This completes on my machine in 0.02s, which is surprising given how thorough they are, but it seems to detect bugs in the implementation. (I haven't run the tests on a 32 bit machine yet since before I reworked the code a little though, so... hopefully I'm not about to embarrass myself). This uses similar SWAR-style techniques to the `is_ascii` impl I contributed in rust-lang#74066, so I'm going to request review from the same person who reviewed that one. That said am not particularly picky, and might not have the correct syntax for requesting a review from someone (so it goes). r? `@nagisa`

Thom Chiovoloni added 2 commits July 5, 2020 10:23

Optimize is_ascii for &str and &[u8]

980d8e1

Avoid vec! allocation in is_ascii_slice_* benches

63e2e2e

rust-highfive assigned shepmaster Jul 5, 2020

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jul 5, 2020

rust-highfive assigned sfackler and unassigned shepmaster Jul 5, 2020

nagisa reviewed Jul 5, 2020

View reviewed changes

Thom Chiovoloni added 2 commits July 5, 2020 14:33

Add benchmark for slice is_ascii using align_to

e1d4db6

Benchmark the unaligned case for is_ascii, and add missing SAFETY

13e380d

Add 'unrolled' is_ascii_align_to benchmark, and move is_ascii benchma…

dc4a644

…rks into own file

Remove pointless black_box call, add a comment about the `unaligned…

a150dcc

…_` benches, and clean up stray semicolon

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 11, 2020

Manishearth mentioned this pull request Jul 11, 2020

Rollup of 10 pull requests #74245

Merged

bors merged commit 1979fa8 into rust-lang:master Jul 12, 2020

jyn514 mentioned this pull request Jul 19, 2020

SIMD-enabled utf-8 validation #68455

Open

pickfire reviewed Jul 20, 2020

View reviewed changes

pickfire mentioned this pull request Jul 20, 2020

Remove branch in optimized is_ascii #74562

Merged

matthiaskrgr mentioned this pull request Nov 1, 2020

ICE: non utf8 str from miri: Utf8Error -Zdump-mir-all #78520

Closed

thomcc mentioned this pull request Oct 30, 2021

Optimize core::str::Chars::count #90414

Merged

cuviper added this to the 1.46 milestone May 2, 2024

okaneco mentioned this pull request Sep 23, 2024

Optimize is_ascii for str and [u8] further #130733

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize is_ascii for str and [u8]. #74066

Optimize is_ascii for str and [u8]. #74066

thomcc commented Jul 5, 2020 •

edited

Loading

rust-highfive commented Jul 5, 2020

jonas-schievink commented Jul 5, 2020

thomcc commented Jul 5, 2020

ecstatic-morse commented Jul 5, 2020

thomcc commented Jul 5, 2020

ecstatic-morse commented Jul 5, 2020

thomcc commented Jul 5, 2020

ecstatic-morse commented Jul 5, 2020

shepmaster commented Jul 5, 2020

nagisa left a comment •

edited

Loading

thomcc commented Jul 5, 2020 •

edited

Loading

nagisa commented Jul 5, 2020 •

edited

Loading

thomcc commented Jul 5, 2020 •

edited

Loading

nagisa commented Jul 11, 2020

bors commented Jul 11, 2020

pickfire Jul 20, 2020 •

edited

Loading

thomcc Jul 20, 2020

pickfire Jul 20, 2020 •

edited

Loading

thomcc Jul 20, 2020

pickfire Jul 20, 2020 •

edited

Loading

thomcc Jul 20, 2020

pickfire Jul 20, 2020

thomcc Jul 20, 2020

Optimize is_ascii for str and [u8]. #74066

Optimize is_ascii for str and [u8]. #74066

Conversation

thomcc commented Jul 5, 2020 • edited Loading

rust-highfive commented Jul 5, 2020

jonas-schievink commented Jul 5, 2020

thomcc commented Jul 5, 2020

ecstatic-morse commented Jul 5, 2020

thomcc commented Jul 5, 2020

ecstatic-morse commented Jul 5, 2020

thomcc commented Jul 5, 2020

ecstatic-morse commented Jul 5, 2020

shepmaster commented Jul 5, 2020

nagisa left a comment • edited Loading

Choose a reason for hiding this comment

thomcc commented Jul 5, 2020 • edited Loading

nagisa commented Jul 5, 2020 • edited Loading

thomcc commented Jul 5, 2020 • edited Loading

nagisa commented Jul 11, 2020

bors commented Jul 11, 2020

pickfire Jul 20, 2020 • edited Loading

Choose a reason for hiding this comment

thomcc Jul 20, 2020

Choose a reason for hiding this comment

pickfire Jul 20, 2020 • edited Loading

Choose a reason for hiding this comment

thomcc Jul 20, 2020

Choose a reason for hiding this comment

pickfire Jul 20, 2020 • edited Loading

Choose a reason for hiding this comment

thomcc Jul 20, 2020

Choose a reason for hiding this comment

pickfire Jul 20, 2020

Choose a reason for hiding this comment

thomcc Jul 20, 2020

Choose a reason for hiding this comment

thomcc commented Jul 5, 2020 •

edited

Loading

nagisa left a comment •

edited

Loading

thomcc commented Jul 5, 2020 •

edited

Loading

nagisa commented Jul 5, 2020 •

edited

Loading

thomcc commented Jul 5, 2020 •

edited

Loading

pickfire Jul 20, 2020 •

edited

Loading

pickfire Jul 20, 2020 •

edited

Loading

pickfire Jul 20, 2020 •

edited

Loading