Add fast path for ASCII in UTF-8 validation #30740

bluss · 2016-01-06T15:16:27Z

Add fast path for ASCII in UTF-8 validation

This speeds up the ASCII case (and long stretches of ASCII in otherwise
mixed UTF-8 data) when checking UTF-8 validity.

Benchmark results suggest that on purely ASCII input, we can improve
throughput (megabytes verified / second) by a factor of 13 to 14 (smallish input).
On XML and mostly English language input (en.wikipedia XML dump),
throughput improves by a factor of 7 (large input).

On mostly non-ASCII input, performance increases slightly or is the
same.

The UTF-8 validation is rewritten to use indexed access; since all
access is preceded by a (mandatory for validation) length check, bounds
checks are statically elided by LLVM and this formulation is in fact the best
for performance. A previous version had losses due to slice to iterator
conversions.
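To illustrate the indexed-access formulation described above, here is a minimal sketch of a validator in which every multi-byte read is preceded by one length check. It is not the code from this PR: the name `validate` and the `Result<(), usize>` return type are assumptions made for the example, and whether the bounds checks are actually elided depends on the compiler version and optimization level.

```rust
/// Returns Ok(()) if `v` is well-formed UTF-8, otherwise the index at which
/// the first invalid sequence starts. Illustrative sketch only.
fn validate(v: &[u8]) -> Result<(), usize> {
    let len = v.len();
    let mut index = 0;
    while index < len {
        let first = v[index];
        if first < 128 {
            index += 1;
            continue;
        }
        // Sequence width implied by the lead byte.
        let width = match first {
            0xC2..=0xDF => 2,
            0xE0..=0xEF => 3,
            0xF0..=0xF4 => 4,
            _ => return Err(index),
        };
        // The length check that validation needs anyway; after it, every
        // index below is known to be in bounds.
        if index + width > len {
            return Err(index);
        }
        // The second byte has lead-byte-specific ranges (this rejects
        // overlong forms, surrogates, and values above U+10FFFF).
        let second_ok = match (first, v[index + 1]) {
            (0xC2..=0xDF, 0x80..=0xBF)
            | (0xE0, 0xA0..=0xBF)
            | (0xE1..=0xEC, 0x80..=0xBF)
            | (0xED, 0x80..=0x9F)
            | (0xEE..=0xEF, 0x80..=0xBF)
            | (0xF0, 0x90..=0xBF)
            | (0xF1..=0xF3, 0x80..=0xBF)
            | (0xF4, 0x80..=0x8F) => true,
            _ => false,
        };
        if !second_ok {
            return Err(index);
        }
        // Any remaining bytes must be plain continuation bytes (10xxxxxx).
        for k in 2..width {
            if v[index + k] & 0xC0 != 0x80 {
                return Err(index);
            }
        }
        index += width;
    }
    Ok(())
}
```

For example, `validate(&[0xEF, 0xBF, 0xBF])` is accepted, while `validate(b"abc\xFF")` reports index 3.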

A large credit to Björn Steinbrink who improved this patch immensely,
writing this second version.

Benchmark results on x86-64 (Sandy Bridge) compiled with -C opt-level=3.

Old code is regular, this PR is called fast.

Datasets:

ascii is just ASCII (2.5 kB)
cyr is Cyrillic script with ASCII spaces (5 kB)
dewik10 is 10MB of a de.wikipedia XML dump
enwik8 is 100MB of an en.wikipedia XML dump
jawik10 is 10MB of a ja.wikipedia XML dump

test from_utf8_ascii_fast        ... bench:         140 ns/iter (+/- 4) = 18221 MB/s
test from_utf8_ascii_regular     ... bench:       1,932 ns/iter (+/- 19) = 1320 MB/s
test from_utf8_cyr_fast          ... bench:      10,025 ns/iter (+/- 245) = 511 MB/s
test from_utf8_cyr_regular       ... bench:      10,944 ns/iter (+/- 795) = 468 MB/s
test from_utf8_dewik10_fast      ... bench:   6,017,909 ns/iter (+/- 105,755) = 1740 MB/s
test from_utf8_dewik10_regular   ... bench:  11,669,493 ns/iter (+/- 264,045) = 891 MB/s
test from_utf8_enwik8_fast       ... bench:  14,085,692 ns/iter (+/- 1,643,316) = 7000 MB/s
test from_utf8_enwik8_regular    ... bench:  93,657,410 ns/iter (+/- 5,353,353) = 1000 MB/s
test from_utf8_jawik10_fast      ... bench:  29,154,073 ns/iter (+/- 4,659,534) = 340 MB/s
test from_utf8_jawik10_regular   ... bench:  29,112,917 ns/iter (+/- 2,475,123) = 340 MB/s

Co-authored-by: Björn Steinbrink <bsteinbr@gmail.com>
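As a sanity check on the factors quoted in the description, the ratios can be recomputed from the ns/iter column of the benchmark output. A small sketch follows; the helper names `speedup` and `reported_speedups` are made up for this example and are not part of the PR.

```rust
/// Speedup of `fast` over `regular`, computed from ns/iter timings.
fn speedup(regular_ns: f64, fast_ns: f64) -> f64 {
    regular_ns / fast_ns
}

/// (dataset, speedup) pairs using the ns/iter figures listed in the
/// benchmark output of the PR description.
fn reported_speedups() -> Vec<(&'static str, f64)> {
    vec![
        ("ascii", speedup(1_932.0, 140.0)),
        ("cyr", speedup(10_944.0, 10_025.0)),
        ("dewik10", speedup(11_669_493.0, 6_017_909.0)),
        ("enwik8", speedup(93_657_410.0, 14_085_692.0)),
        ("jawik10", speedup(29_112_917.0, 29_154_073.0)),
    ]
}
```

By this measure the pure ASCII case lands between 13 and 14, enwik8 comes out somewhat below 7 (the MB/s figures for enwik8 appear to be rounded), and jawik10 is within noise of 1.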

rust-highfive · 2016-01-06T15:16:37Z

r? @brson

(rust_highfive has picked a reviewer for you, use r? to override)

bluss · 2016-01-06T15:18:03Z

Benchmarks using long texts are here: https://gist.github.com/bluss/bf45e07e711238e22b7a

A 2-3% slowdown on Japanese and Cyrillic texts that are mostly non-ASCII. I don't have a problem championing that regression, given the speedup on UTF-8 validation for predominantly ASCII input. The example texts are pretty arbitrary, the Wikipedia texts a /little/ less so.

shepmaster · 2016-01-06T16:14:57Z

src/libcollectionstest/str.rs

@@ -468,6 +468,18 @@ fn test_is_utf8() {
    assert!(from_utf8(&[0xEF, 0xBF, 0xBF]).is_ok());
    assert!(from_utf8(&[0xF0, 0x90, 0x80, 0x80]).is_ok());
    assert!(from_utf8(&[0xF4, 0x8F, 0xBF, 0xBF]).is_ok());
+
+    // deny embedded in long stretches of ascii


I don't really know what this specific set of tests is doing.

I always have a bit of a sad when there are these giant "test everything" tests; my personal pref would be another test like is_utf8_is_not_tricked_by_non_ascii_in_long_stretches_of_ascii. No need to add test_, no need to have a comment, a failed test tells you what failed. 😸

Entirely reasonable, no reason to share test name there, no common setup or anything. Fixed to have its own test function.
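A standalone test along the lines suggested above might look like the following sketch. The test name is the one proposed in the review; the helper `ascii_with_byte_at`, the buffer length of 128, and the choice of bytes are assumptions for illustration and are not taken from the PR.

```rust
use std::str::from_utf8;

/// Hypothetical helper: `len` bytes of ASCII 'a' with `byte` written at `pos`.
fn ascii_with_byte_at(len: usize, pos: usize, byte: u8) -> Vec<u8> {
    let mut data = vec![b'a'; len];
    data[pos] = byte;
    data
}

#[test]
fn is_utf8_is_not_tricked_by_non_ascii_in_long_stretches_of_ascii() {
    for pos in 0..128 {
        // A lone continuation byte is invalid wherever it appears.
        assert!(from_utf8(&ascii_with_byte_at(128, pos, 0x80)).is_err());
        // 0xFF never occurs in well-formed UTF-8.
        assert!(from_utf8(&ascii_with_byte_at(128, pos, 0xFF)).is_err());
    }
    // Control: the untouched ASCII buffer is valid.
    assert!(from_utf8(&vec![b'a'; 128]).is_ok());
}
```

Trying every position matters here because a word-at-a-time fast path could, if buggy, miss a bad byte at one particular offset within a word.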

shepmaster · 2016-01-06T16:19:28Z

src/libcore/str/mod.rs

+    let ptr = v.as_ptr();
+
+    let mut offset = 0;
+    if len >= 2 * usize::BYTES {


Why the 2?

The loop is unrolled by 2 (reads 2 usize per lap).

Apologies, I wasn't very clear. I guess it's a two-part question:

Why unroll at all?

Why only unroll by 2?

It's a bit arbitrary. I've only tried 1, 2, and 4 and compared performance, and it's a trade-off. In the memchr code, where this is taken from, it's to fill a 16-byte register on x86-64, but that doesn't happen here.

Can you extract the 2 to a const with a descriptive name about unrolling? Since I don't see any hand-unrolling here, I am guessing that the if statement allows the compiler to do the unrolling according to the unrolling factor. This is not obvious to me. Can you also add a comment explaining?

Edit: Oh, are the duplicated contains_nonascii calls the loop unrolling?

Yes, and the two ptr.offset and deref per iteration.
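The unrolling discussed in this thread can be sketched in safe Rust as follows. This is an illustration, not the PR's code: it copies bytes into a `usize` rather than dereferencing a pointer, and the names `read_word`, `ascii_prefix_len`, and `NONASCII_MASK` are assumptions (only `contains_nonascii` and `UNROLL_BY` appear in the discussion).

```rust
use std::mem::size_of;

const USIZE_BYTES: usize = size_of::<usize>();
/// Number of words inspected per loop iteration.
const UNROLL_BY: usize = 2;
/// 0x80 repeated in every byte of a usize (0x8080...80).
const NONASCII_MASK: usize = usize::MAX / 0xFF * 0x80;

/// True if any byte of `word` has its high bit set, i.e. is not ASCII.
fn contains_nonascii(word: usize) -> bool {
    word & NONASCII_MASK != 0
}

/// Reads one native-endian word starting at byte offset `at`.
fn read_word(v: &[u8], at: usize) -> usize {
    let mut buf = [0u8; USIZE_BYTES];
    buf.copy_from_slice(&v[at..at + USIZE_BYTES]);
    usize::from_ne_bytes(buf)
}

/// Length of the longest all-ASCII prefix of `v`.
fn ascii_prefix_len(v: &[u8]) -> usize {
    let len = v.len();
    let stride = UNROLL_BY * USIZE_BYTES;
    let mut offset = 0;
    // Guard: the subtraction below is only valid when len >= stride.
    if len >= stride {
        while offset <= len - stride {
            // "Unrolled by 2": two reads and two checks per lap.
            let u = read_word(v, offset);
            let w = read_word(v, offset + USIZE_BYTES);
            if contains_nonascii(u) || contains_nonascii(w) {
                break;
            }
            offset += stride;
        }
    }
    // Step bytewise from the point where the wordwise loop stopped.
    while offset < len && v[offset] < 128 {
        offset += 1;
    }
    offset
}
```

The `if len >= stride` guard is what the factor of 2 in the reviewed condition corresponds to: each lap consumes two words, so the loop must not start unless at least two words are available.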

shepmaster · 2016-01-06T16:21:55Z

Pedantically, I'd say it should be ASCII (all caps) when in comments or prose as it's an acronym. Also non_ascii cause I'd normally write non-ASCII. All my comments are at your discretion to take or leave! 😇

ranma42 · 2016-01-06T18:41:00Z

src/libcore/str/mod.rs

+        }
+    }
+
+    // find the byte after the point the loop stopped


If the result of (x & 0x80808080_80808080) is non-zero, you can "immediately" find which byte it is using leading_zeros() / 8

Depends on endianness; it works fine with .trailing_zeros() on x86-64. It deserved to be tried for sure, but I couldn't make it be an improvement.

What LLVM compiles the current code into, beast that it is, is actually if contains_nonascii(u | v) { break; } which seems to make for a much simpler computation inside the loop, and a tight loop.

I'm not 100% happy with the code in find_nonascii, so any suggestion for improvement would be super welcome, feel free to take the code (from the benchmark link) and find something.
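For reference, the bit trick discussed in this thread can be sketched as below for a fixed 8-byte chunk. The function names are made up for the example. Which of `trailing_zeros` or `leading_zeros` applies depends on how the bytes were assembled into the word, which is why both variants are shown.

```rust
const NONASCII_MASK: u64 = 0x8080_8080_8080_8080;

/// Index of the first non-ASCII byte when the chunk is read little-endian:
/// byte 0 occupies the lowest bits, so the lowest set mask bit belongs to it.
fn first_nonascii_le(chunk: [u8; 8]) -> Option<usize> {
    let hits = u64::from_le_bytes(chunk) & NONASCII_MASK;
    if hits == 0 {
        None
    } else {
        Some(hits.trailing_zeros() as usize / 8)
    }
}

/// Same result when the chunk is read big-endian: byte 0 occupies the
/// highest bits, so leading zeros are counted instead.
fn first_nonascii_be(chunk: [u8; 8]) -> Option<usize> {
    let hits = u64::from_be_bytes(chunk) & NONASCII_MASK;
    if hits == 0 {
        None
    } else {
        Some(hits.leading_zeros() as usize / 8)
    }
}
```

Both functions return the same index for the same chunk. As noted above, this was tried and did not turn out to be faster in this PR's benchmarks.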

I downloaded the gist, but I am having some trouble in getting the datasets you used. Specifically, I assumed that enwik8 should be http://mattmahoney.net/dc/enwik8.zip and that the specific version of the Japanese wiki should not matter much, but I have no idea about big10.

yes, maybe you can just skip those datasets you don't have though? I could have provided everything better.

big10 is the dataset in http://vaskir.blogspot.ru/2015/09/regular-expressions-rust-vs-f.html

so it's the first 10MB of the unzipped file from https://drive.google.com/open?id=0B8HLQUKik9VtUWlOaHJPdG0xbnM

jawik10 is the first 10MB from the unzip of http://dumps.wikimedia.org/archive/2006/2006-07/jawiki/20061016/jawiki-20061016-pages-articles.xml.bz2

brson · 2016-01-12T02:04:39Z

Sweet wins. r=me but please do extract 2 to a more descriptive constant.

bluss · 2016-01-12T09:00:14Z

Ok, I'll look over if there's a neat way to write the unrolling factor

This speeds up the ASCII case (and long stretches of ASCII in otherwise mixed UTF-8 data) when checking UTF-8 validity.

Benchmark results suggest that on purely ASCII input, we can improve throughput (megabytes verified / second) by a factor of 13 to 14! On XML and mostly English language input (en.wikipedia XML dump), throughput increases by a factor of 7.

On mostly non-ASCII input, performance increases slightly or is the same.

The UTF-8 validation is rewritten to use indexed access; since all access is preceded by a (mandatory for validation) length check, bounds checks are statically elided by LLVM and this formulation is in fact the best for performance. A previous version had losses due to slice to iterator conversions.

A large credit to Björn Steinbrink who improved this patch immensely, writing this second version.

Benchmark results on x86-64 (Sandy Bridge) compiled with -C opt-level=3.

Old code is `regular`, this PR is called `fast`.

Datasets:

- `ascii` is just ASCII (2.5 kB)
- `cyr` is Cyrillic script with ASCII spaces (5 kB)
- `dewik10` is 10MB of a de.wikipedia XML dump
- `enwik8` is 100MB of an en.wikipedia XML dump
- `jawik10` is 10MB of a ja.wikipedia XML dump

```
test from_utf8_ascii_fast        ... bench:         140 ns/iter (+/- 4) = 18221 MB/s
test from_utf8_ascii_regular     ... bench:       1,932 ns/iter (+/- 19) = 1320 MB/s
test from_utf8_cyr_fast          ... bench:      10,025 ns/iter (+/- 245) = 511 MB/s
test from_utf8_cyr_regular       ... bench:      12,250 ns/iter (+/- 437) = 418 MB/s
test from_utf8_dewik10_fast      ... bench:   6,017,909 ns/iter (+/- 105,755) = 1740 MB/s
test from_utf8_dewik10_regular   ... bench:  11,669,493 ns/iter (+/- 264,045) = 891 MB/s
test from_utf8_enwik8_fast       ... bench:  14,085,692 ns/iter (+/- 1,643,316) = 7000 MB/s
test from_utf8_enwik8_regular    ... bench:  93,657,410 ns/iter (+/- 5,353,353) = 1000 MB/s
test from_utf8_jawik10_fast      ... bench:  29,154,073 ns/iter (+/- 4,659,534) = 340 MB/s
test from_utf8_jawik10_regular   ... bench:  29,112,917 ns/iter (+/- 2,475,123) = 340 MB/s
```

Co-authored-by: Björn Steinbrink <bsteinbr@gmail.com>

bluss · 2016-01-12T21:02:49Z

I received an improved version by @dotdash (with permission to incorporate, of course!) and it's an improvement you wouldn't believe.

No slowdown for non-ascii cases (cyrillic test case improves for some reason)
Less unsafe code
Even faster on the pure ascii case.

Updated PR description & benchmarks are in there

@brson I addressed loop unrolling only by adding another comment for it, don't see a nice way to factor it out to a constant

Gankra · 2016-01-12T21:05:14Z

Wow, awesome stuff!

bluss · 2016-01-12T22:06:42Z

Pushed a fix, there was a missing conditional, let's try this in travis. I can't measure any difference in perf.

Oh and the fix actually has a const UNROLL_BY because we needed to repeat that 2 yet another time.

shepmaster · 2016-01-13T22:01:35Z

As a bit of "real world" performance information, I pulled this down and used it for SXD.

Parsing a 16M XML file

Valgrind reported that str::from_utf8 took this much of the total run time:

Rust 1.5	This PR
5.47%	0.29%

And I measured a ~1.25% overall speedup in the program.

Parsing a 111M XML file

Rust 1.5	This PR
4.12%	0.22%

And I measured a ~1.1% overall speedup in the program.

Thanks for the awesome performance gains!

dotdash · 2016-01-13T22:17:38Z

Thanks a lot @shepmaster! Always encouraging to get that kind of feedback! 😻 And thanks to @bluss for getting this started, I've been completely blind to the masking quick check when I initially looked into this a few weeks ago! 🍻

bluss · 2016-01-13T22:22:39Z

@shepmaster Awesome to see some numbers! I'm guessing your data files are almost purely ASCII (as a lot of the data in the world is).

@brson This is ready for re-review. It's the same algorithm, indexed access though, and the fast skip ahead loop is simpler, because it's only attempted at aligned locations. The main loop will progress to an aligned location quickly anyway, if the input is mostly ascii.
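The "only attempted at aligned locations" idea can be sketched like this. It is a simplified illustration that only skips ASCII and does not decode multi-byte sequences; `skip_ascii`, `word_at`, `ASCII_BLOCK`, and `NONASCII_MASK` are names assumed for the example, and it uses safe byte copies instead of the pointer reads a real implementation would likely use.

```rust
use std::mem::size_of;

const USIZE_BYTES: usize = size_of::<usize>();
/// Bytes consumed per lap of the fast loop (two words).
const ASCII_BLOCK: usize = 2 * USIZE_BYTES;
/// 0x80 repeated in every byte of a usize.
const NONASCII_MASK: usize = usize::MAX / 0xFF * 0x80;

/// Reads one native-endian word starting at byte offset `at`.
fn word_at(v: &[u8], at: usize) -> usize {
    let mut buf = [0u8; USIZE_BYTES];
    buf.copy_from_slice(&v[at..at + USIZE_BYTES]);
    usize::from_ne_bytes(buf)
}

/// Starting at `index`, returns the index of the first non-ASCII byte,
/// or `v.len()` if the rest of the slice is ASCII.
fn skip_ascii(v: &[u8], mut index: usize) -> usize {
    let len = v.len();
    let base = v.as_ptr() as usize;
    while index < len && v[index] < 128 {
        // The fast loop is only tried when the current byte sits on a word
        // boundary; otherwise advance one byte and check again next time.
        let aligned = base.wrapping_add(index) % USIZE_BYTES == 0;
        // `len >= ASCII_BLOCK` guards the subtraction in the loop condition
        // for short input.
        if aligned && len >= ASCII_BLOCK {
            while index <= len - ASCII_BLOCK {
                let both = word_at(v, index) | word_at(v, index + USIZE_BYTES);
                if both & NONASCII_MASK != 0 {
                    break;
                }
                index += ASCII_BLOCK;
            }
            // Step bytewise from the point where the wordwise loop stopped.
            while index < len && v[index] < 128 {
                index += 1;
            }
        } else {
            index += 1;
        }
    }
    index
}
```

On mostly ASCII input the bytewise branch runs at most `USIZE_BYTES - 1` times before an aligned position is reached, after which the wordwise loop takes over.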

shepmaster · 2016-01-13T23:03:59Z

I'm guessing your data files are almost purely ASCII

Ah, yes, I meant to mention that. They indeed are pure-ASCII.

shepmaster · 2016-01-13T23:09:05Z

src/libcore/str/mod.rs

+                    }
+                }
+                // step from the point where the wordwise loop stopped
+                while offset < len && v[offset] < 128 {


Reading through this, I thought at first that 128 was another number relating to byte widths, then realized it is the ASCII cutoff value. Since this is also used above (first >= 128), perhaps another constant could be in order?

hm, I don't think it's needed

We need to guard that `len` is large enough for the fast skip loop.

bluss · 2016-01-14T14:00:49Z

I updated the second commit to use a constant for 2 * usize::BYTES instead, to follow shepmaster's suggestion roughly.

brson · 2016-01-16T01:14:31Z

@bors r+

bors · 2016-01-16T01:14:32Z

📌 Commit cadcd70 has been approved by brson

bors · 2016-01-16T01:18:49Z

⌛ Testing commit cadcd70 with merge e7e4ecc...


bors · 2016-01-16T03:11:38Z

☀️ Test successful - auto-linux-32-nopt-t, auto-linux-32-opt, auto-linux-64-debug-opt, auto-linux-64-nopt-t, auto-linux-64-opt, auto-linux-64-x-android-t, auto-linux-cross-opt, auto-linux-musl-64-opt, auto-mac-32-opt, auto-mac-64-nopt-t, auto-mac-64-opt, auto-win-gnu-32-nopt-t, auto-win-gnu-32-opt, auto-win-gnu-64-nopt-t, auto-win-gnu-64-opt, auto-win-msvc-32-opt, auto-win-msvc-64-opt

bluss · 2016-01-16T10:54:27Z

awesome. Thanks @brson and everyone.

rust-highfive assigned brson Jan 6, 2016

shepmaster reviewed Jan 6, 2016

bluss force-pushed the ascii-is-the-best branch from 7ddcac2 to 6fd108d January 6, 2016 16:19

shepmaster reviewed Jan 6, 2016

ranma42 reviewed Jan 6, 2016

bluss force-pushed the ascii-is-the-best branch from 6fd108d to 7037ef7 January 6, 2016 20:32

brson added the relnotes label Jan 12, 2016

bluss force-pushed the ascii-is-the-best branch from 7037ef7 to 11e3de3 January 12, 2016 20:58

bluss changed the title ~~Add fast path for ascii in UTF-8 validation~~ Add fast path for ASCII in UTF-8 validation Jan 12, 2016

shepmaster reviewed Jan 13, 2016

UTF-8 validation: Add missing if conditional for short input

cadcd70

We need to guard that `len` is large enough for the fast skip loop.

bluss force-pushed the ascii-is-the-best branch from 4cc87ee to cadcd70 January 14, 2016 14:00

bors merged commit cadcd70 into rust-lang:master Jan 16, 2016

bluss deleted the ascii-is-the-best branch January 16, 2016 10:53

SimonSapin mentioned this pull request Jan 21, 2016

Tracking issue for char encoding methods #27784

Closed

TimNN mentioned this pull request Jan 26, 2016

Add fast path for ASCII TimNN/char-slice#1

Open

nbaksalyar mentioned this pull request Feb 3, 2016

rustc build core dump on OpenBSD #31363

Closed

SimonSapin mentioned this pull request Apr 7, 2016

Tracking issue for UTF-16 decoding iterators #27830

Closed

mbrubeck added a commit to mbrubeck/rust that referenced this pull request Jun 29, 2016

SIMD-optimized is_ascii based on rust-lang#30740

32b1d10

bluss mentioned this pull request Nov 21, 2016

Add iterator constructors and accessors from/to raw pointers #37921

Closed

SimonSapin mentioned this pull request Mar 4, 2017

Tracking issue for Read::chars #27802

Closed

SimonSapin mentioned this pull request Mar 17, 2018

Tracking issue: UTF-8 decoder in libcore #33906

Closed

nagisa mentioned this pull request Mar 26, 2019

UTF-8 parsing with state machine #59399

Closed
