
rustc_metadata: more safely read/write the index positions. #59887

Merged: 1 commit merged into rust-lang:master on Apr 27, 2019

Conversation

@eddyb (Member) commented Apr 11, 2019

This is a small part of a larger refactor that I want to benchmark independently.
The final code would be even cleaner than this, so this is sort of a "worst case" test.

@rust-highfive (Collaborator) commented:

r? @varkor

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties.) on Apr 11, 2019
@eddyb (Member, Author) commented Apr 11, 2019

@bors (Contributor) commented Apr 11, 2019

⌛ Trying commit 6156326580aa679e050e20bb974b02660bb12286 with merge f31b67a27f06fb7d6f40349a8e8df8ec54ce9982...

@bors (Contributor) commented Apr 11, 2019

☀️ Try build successful - checks-travis
Build commit: f31b67a27f06fb7d6f40349a8e8df8ec54ce9982

@eddyb (Member, Author) commented Apr 11, 2019

@rust-timer build f31b67a27f06fb7d6f40349a8e8df8ec54ce9982

@rust-timer (Collaborator) commented:

Success: Queued f31b67a27f06fb7d6f40349a8e8df8ec54ce9982 with parent 8509127, comparison URL.

@rust-timer (Collaborator) commented:

Finished benchmarking try commit f31b67a27f06fb7d6f40349a8e8df8ec54ce9982

@nnethercote (Contributor) commented:

A small but clear improvement: the best instruction count reduction is 3%, and many more are in the 0-1% range.

@eddyb (Member, Author) commented Apr 12, 2019

I'm sorry, what? I was expecting a regression, at most!
Maybe that's all noise and the codegen is the same (which is what I wanted).

@eddyb (Member, Author) commented Apr 12, 2019

cc @alexcrichton @rust-lang/wg-compiler-performance
Does anyone think it's possible unaligned accesses could be slower than reading/writing the bytes and using {from,to}_le_bytes? I would expect that they would be the same, or the latter would be slower.
But the profiling results suggest the latter is faster?!
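For reference, a minimal sketch of the two strategies being compared (illustrative helpers, not the PR's actual code):

use std::ptr;

// Strategy 1: one (possibly unaligned) 4-byte load, then a byte-order fixup.
fn read_via_unaligned_load(b: &[u8], i: usize) -> u32 {
    let bytes = &b[i * 4..][..4];
    u32::from_le(unsafe { ptr::read_unaligned(bytes.as_ptr() as *const u32) })
}

// Strategy 2: copy the four bytes out, then assemble them with from_le_bytes.
fn read_via_bytes(b: &[u8], i: usize) -> u32 {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(&b[i * 4..][..4]);
    u32::from_le_bytes(buf)
}

On x86 one would expect both to compile down to a single load, which is what makes the measured difference surprising.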

@Zoxc (Contributor) commented Apr 14, 2019

This seems to be a slight performance regression.

@nnethercote (Contributor) commented:

> This seems to be a slight performance regression.

How so? Pre-landing measurements indicated it was a slight improvement.

@Zoxc (Contributor) commented Apr 14, 2019

Instruction count doesn't matter =P

bors added a commit that referenced this pull request Apr 14, 2019
 rustc_metadata: replace Entry table with one table for each of its fields (AoS -> SoA).

*Based on top of #59887*

In #59789 (comment) I noticed that for many cross-crate queries (e.g. `predicates_of(def_id)`), we were deserializing the `rustc_metadata::schema::Entry` for `def_id` *only* to read one field (i.e. `predicates`).

But there are several such queries, and `Entry` is not particularly small (in terms of number of fields; the encoding itself is quite compact), so there is a large (and unnecessary) constant factor.

This PR replaces the (random-access) array¹ of `Entry` structures ("AoS") with many separate arrays¹, one for each field that used to be in `Entry` ("SoA"), resulting in the ability to read individual fields separately, with negligible time overhead (in theory) and some size overhead (as these arrays are not sparse).

In a way, the new approach is closer to incremental on-disk caches, which store each query's cached results separately, but it would take significantly more work to unify the two.

For stage1 `libcore`'s metadata blob, the size overhead is `8.44%`, and I have another commit (not initially included in this PR because I want to do perf runs with both) that brings it down to `5.88%`.

¹(in the source, these arrays are called "tables", but perhaps they could use a better name)
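To illustrate the shape change with a sketch (field names are made up, and the real tables store encoded positions rather than plain integers):

// Before ("AoS"): one random-access table of whole entries; reading any
// single field meant decoding the entire Entry.
struct Entry {
    span: u32,
    predicates: u32,
    // ...many more fields
}
type EntryTable = Vec<Entry>;

// After ("SoA"): one table per former Entry field, so a query like
// predicates_of(def_id) only touches the `predicates` table.
struct Tables {
    span: Vec<u32>,
    predicates: Vec<u32>,
    // ...one table per field
}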
@eddyb (Member, Author) commented Apr 14, 2019

Ah yeah, cycles:u is a much better comparison (in that it shows the presumed regression).
I guess the perf default link is just suboptimal? (heh)

Now I suppose I need to do some experiments to see if I can get LLVM to generate the same code for both (maybe try ptr::read_unaligned?)

@petrochenkov (Contributor) commented Apr 14, 2019

cc https://internals.rust-lang.org/t/what-is-perf-rust-lang-org-measuring-and-why-is-instructions-u-the-default/9815/1

@nnethercote (Contributor) commented:

To summarize my opinion on the metrics:

  • Wall time is the ultimate metric of interest (which is good), but it has high variance (which is bad).
  • Instruction counts are a proxy for wall time (which is bad), but they have low variance (which is good).
  • Cycle counts are a proxy for wall time (which is bad), and they have high variance (which is bad).

@eddyb (Member, Author) commented Apr 16, 2019

@nnethercote Instruction counts are cycles divided by cycles-per-instruction (CPI); CPI may be high-variance, but it is also significant to performance, and it can vary with the changes we make.
Without measuring the cost of memory accesses (not that hard, via cache misses) and the differences between cheap and expensive instructions (this seems tricky, but @anp tells me there may be model-specific perf counters that could help), instruction counts only work as a proxy for "queries executed" (which we should measure with -Z self-profile!) and other algorithmic efficiencies (i.e. skipping some work).
It certainly ain't relevant here! If this were optimized correctly, then we shouldn't see any difference.

> Wall time is the ultimate metric of interest

Only if you actually put in the effort to make it reproducible (realtime kernel, disable ASLR, pin the thread of interest to one core with nothing else on it, etc.).

EDIT: oops, I repeated most of what was said on the irlo thread; I only read it afterwards.
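(Concretely, with made-up numbers: cycles = instructions × CPI, so a build executing 20 billion instructions at a CPI of 0.5 takes 10 billion cycles; a change that leaves the instruction count untouched but degrades CPI to 0.6 costs 12 billion cycles, a 20% regression that instruction counts alone would never show.)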

@eddyb (Member, Author) commented Apr 16, 2019

(moved diff comment here so it doesn't get lost)

So, looking at the LLVM IR, this slicing is the culprit, I think.
That is, the b[i * 4..] that performs the indexing, plus the b[..4] in read_le_u32 / write_le_u32, result in two bounds checks, and the indexing is done with a shl %i, 2 followed by an i8 GEPi, instead of an i32 GEPi.

At least it's not reading the bytes one by one from memory; that's what I was really worried about.
I'll try to come up with some arrangement which optimizes better.

EDIT:
Turns out that using [u8; 4] optimizes very well, to pretty much the same thing as Unaligned<u32>, and has fewer safety concerns (despite still using unsafe to cast between slice element types).
The only issue is that #59953 replaces the u32 with a generic type, and [u8; Self::BYTE_LEN] still doesn't work (something something lazy normalization).
However, I should be able to do this instead, with a macro:

{
    const N: usize = $len;
    assert_eq!(N, Self::BYTE_LEN);
    let b: &[u8] = $bytes;
    // Reinterpret the byte slice as a slice of N-byte arrays.
    unsafe { std::slice::from_raw_parts(b.as_ptr() as *const [u8; N], b.len() / N) }
}
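Once the bytes are viewed as [u8; 4] words, a read becomes a single index plus from_le_bytes; a hypothetical sketch of the consuming side (not the PR's exact code):

fn read_le_u32(words: &[[u8; 4]], i: usize) -> u32 {
    // One bounds check from the indexing; from_le_bytes should lower to a
    // single alignment-oblivious 4-byte load.
    u32::from_le_bytes(words[i])
}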

@eddyb changed the title from "rustc_metadata: safely read/write the index positions." to "rustc_metadata: more safely read/write the index positions." on Apr 16, 2019
@eddyb (Member, Author) commented Apr 16, 2019

I've updated the implementation to bring back some unsafe code, but Unaligned is not needed anymore (as [u8; 4] is used instead of Unaligned<u32>), and the converted slices are only used once (by the code that created them using unsafe). Once const generics / lazy normalization make their way into Rust, we can (hopefully) abstract this away.

@bors try

@bors (Contributor) commented Apr 16, 2019

⌛ Trying commit f51e6c7 with merge 5c37296...

bors added a commit that referenced this pull request Apr 16, 2019
rustc_metadata: more safely read/write the index positions.

This is a small part of a larger refactor that I want to benchmark independently.
The final code would be even cleaner than this, so this is sort of a "worst case" test.
@bors (Contributor) commented Apr 16, 2019

☀️ Try build successful - checks-travis
Build commit: 5c37296

@eddyb (Member, Author) commented Apr 16, 2019

@rust-timer build 5c37296

@rust-timer (Collaborator) commented:

Success: Queued 5c37296 with parent a7cef0b, comparison URL.

@rust-timer (Collaborator) commented:

Finished benchmarking try commit 5c37296

@eddyb (Member, Author) commented Apr 16, 2019

I think this is now within noise? I mean, the generated code should be very close to what it was before (other than unaligned writes instead of aligned ones).

@nnethercote (Contributor) commented:

Looks to me like the perf effects are negligible.

@eddyb (Member, Author) commented Apr 17, 2019

r? @michaelwoerister or @alexcrichton

bors added a commit that referenced this pull request Apr 17, 2019
 rustc_metadata: replace Entry table with one table for each of its fields (AoS -> SoA).

*Based on top of #59887*

In #59789 (comment) I noticed that for many cross-crate queries (e.g. `predicates_of(def_id)`), we were deserializing the `rustc_metadata::schema::Entry` for `def_id` *only* to read one field (i.e. `predicates`).

But there are several such queries, and `Entry` is not particularly small (in terms of number of fields; the encoding itself is quite compact), so there is a large (and unnecessary) constant factor.

This PR replaces the (random-access) array¹ of `Entry` structures ("AoS") with many separate arrays¹, one for each field that used to be in `Entry` ("SoA"), resulting in the ability to read individual fields separately, with negligible time overhead (in theory) and some size overhead (as these arrays are not sparse).

In a way, the new approach is closer to incremental on-disk caches, which store each query's cached results separately, but it would take significantly more work to unify the two.

For stage1 `libcore`'s metadata blob, the size overhead is `8.44%`, and I have another commit (~~not initially included because I want to do perf runs with both~~ **EDIT**: added it now) that brings it down to `5.88%`.

¹(in the source, these arrays are called "tables", but perhaps they could use a better name)
@michaelwoerister (Member) commented:

@Zoxc, this looks like the kind of thing you like. Would you mind doing the review?

@Zoxc (Contributor) commented Apr 17, 2019

@eddyb Do you intend to land this PR separately from #59953, or was it just for perf?

const BYTE_LEN: usize = $byte_len;
fn read_from_bytes_at(b: &[u8], i: usize) -> Self {
    const BYTE_LEN: usize = $byte_len;
    // HACK(eddyb) ideally this would be done with fully safe code,
A Contributor commented on this diff:

Probably could split the unsafe code into a function which casts [u8] to [[u8; N]], just to make it a bit clearer what is going on.

@eddyb (Member, Author) replied:

I... would do that, if I could write such a function generically.
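For reference, with const generics (stabilized much later, in Rust 1.51) such a helper can be written generically; a hypothetical sketch:

/// Reinterpret a byte slice as a slice of N-byte arrays, keeping only whole
/// chunks. (Hypothetical: this was not expressible when this PR was written.)
fn as_arrays<const N: usize>(b: &[u8]) -> &[[u8; N]] {
    assert!(N > 0);
    // SAFETY: [u8; N] has the same alignment as u8, and only b.len() / N
    // complete chunks are exposed.
    unsafe { std::slice::from_raw_parts(b.as_ptr() as *const [u8; N], b.len() / N) }
}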

@eddyb (Member, Author) commented Apr 17, 2019

@Zoxc mostly for perf; if #59953 is approved soon I can just close this one.

@michaelwoerister (Member) commented:

> ... if #59953 is approved soon I can just close this one.

I'm not sure if I'll get to reviewing that this week (#57967 is rather big and has priority, I'd say). So feel free to approve this PR standalone.

@michaelwoerister (Member) commented:

I think we should land this separately for better bisectability.
r? @Zoxc

@Zoxc (Contributor) commented Apr 27, 2019

@bors r+

@bors (Contributor) commented Apr 27, 2019

📌 Commit f51e6c7 has been approved by Zoxc

@bors added the S-waiting-on-bors label (Status: Waiting on bors to run and complete tests. Bors will change the label on completion.) and removed the S-waiting-on-review label on Apr 27, 2019
@bors (Contributor) commented Apr 27, 2019

⌛ Testing commit f51e6c7 with merge d4a32d5...

bors added a commit that referenced this pull request Apr 27, 2019
rustc_metadata: more safely read/write the index positions.

This is a small part of a larger refactor that I want to benchmark independently.
The final code would be even cleaner than this, so this is sort of a "worst case" test.
@bors (Contributor) commented Apr 27, 2019

☀️ Test successful - checks-travis, status-appveyor
Approved by: Zoxc
Pushing d4a32d5 to master...

@bors added the merged-by-bors label (This PR was explicitly merged by bors.) on Apr 27, 2019
@bors merged commit f51e6c7 into rust-lang:master on Apr 27, 2019
@eddyb deleted the safer-metadata branch on April 30, 2019