
rustc_metadata: more safely read/write the index positions. #59887

Merged: 1 commit merged into rust-lang:master on Apr 27, 2019

Conversation

@eddyb (Member) commented Apr 11, 2019

This is a small part of a larger refactor that I want to benchmark independently.
The final code would be even cleaner than this, so this is sort of a "worst case" test.

@rust-highfive (Collaborator) commented:

r? @varkor

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties.) on Apr 11, 2019
@eddyb (Member, Author) commented Apr 11, 2019

@bors (Contributor) commented Apr 11, 2019

⌛ Trying commit 6156326580aa679e050e20bb974b02660bb12286 with merge f31b67a27f06fb7d6f40349a8e8df8ec54ce9982...

@bors (Contributor) commented Apr 11, 2019

☀️ Try build successful - checks-travis
Build commit: f31b67a27f06fb7d6f40349a8e8df8ec54ce9982

@eddyb (Member, Author) commented Apr 11, 2019

@rust-timer build f31b67a27f06fb7d6f40349a8e8df8ec54ce9982

@rust-timer (Collaborator) commented:

Success: Queued f31b67a27f06fb7d6f40349a8e8df8ec54ce9982 with parent 8509127, comparison URL.

@rust-timer (Collaborator) commented:

Finished benchmarking try commit f31b67a27f06fb7d6f40349a8e8df8ec54ce9982

@nnethercote (Contributor) commented:

A small but clear improvement: the best instruction count reduction is 3%, and many more are in the 0-1% range.

@eddyb (Member, Author) commented Apr 12, 2019

I'm sorry, what? I was expecting a regression, at most!
Maybe that's all noise and the codegen is the same (which is what I wanted).

@eddyb (Member, Author) commented Apr 12, 2019

cc @alexcrichton @rust-lang/wg-compiler-performance
Does anyone think it's possible unaligned accesses could be slower than reading/writing the bytes and using {from,to}_le_bytes? I would expect that they would be the same, or the latter would be slower.
But the profiling results suggest the latter is faster?!
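For reference, a minimal sketch of the two strategies being compared (illustrative helpers, not the PR's actual code):

use std::ptr;

// Strategy 1: one (possibly unaligned) 4-byte load, then a byte-order fixup.
fn read_via_unaligned_load(b: &[u8], i: usize) -> u32 {
    let bytes = &b[i * 4..][..4];
    u32::from_le(unsafe { ptr::read_unaligned(bytes.as_ptr() as *const u32) })
}

// Strategy 2: copy the four bytes out, then assemble them with from_le_bytes.
fn read_via_bytes(b: &[u8], i: usize) -> u32 {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(&b[i * 4..][..4]);
    u32::from_le_bytes(buf)
}

On x86 one would expect both to compile down to a single load, which is what makes the measured difference surprising.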

@Zoxc (Contributor) commented Apr 14, 2019

This seems to be a slight performance regression.

@nnethercote (Contributor) commented:

> This seems to be a slight performance regression.

How so? Pre-landing measurements indicated it was a slight improvement.

@Zoxc (Contributor) commented Apr 14, 2019

Instruction count doesn't matter =P

bors added a commit that referenced this pull request Apr 14, 2019
 rustc_metadata: replace Entry table with one table for each of its fields (AoS -> SoA).

*Based on top of #59887*

In #59789 (comment) I noticed that for many cross-crate queries (e.g. `predicates_of(def_id)`), we were deserializing the `rustc_metadata::schema::Entry` for `def_id` *only* to read one field (i.e. `predicates`).

But there are several such queries, and `Entry` is not particularly small (in terms of number of fields; the encoding itself is quite compact), so there is a large (and unnecessary) constant factor.

This PR replaces the (random-access) array¹ of `Entry` structures ("AoS") with many separate arrays¹, one for each field that used to be in `Entry` ("SoA"), resulting in the ability to read individual fields separately, with negligible time overhead (in theory) and some size overhead (as these arrays are not sparse).

In a way, the new approach is closer to incremental on-disk caches, which store each query's cached results separately, but it would take significantly more work to unify the two.

For stage1 `libcore`'s metadata blob, the size overhead is `8.44%`, and I have another commit (not initially included in this PR because I want to do perf runs with both) that brings it down to `5.88%`.

¹(in the source, these arrays are called "tables", but perhaps they could use a better name)
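To illustrate the shape change with a sketch (field names are made up, and the real tables store encoded positions rather than plain integers):

// Before ("AoS"): one random-access table of whole entries; reading any
// single field meant decoding the entire Entry.
struct Entry {
    span: u32,
    predicates: u32,
    // ...many more fields
}
type EntryTable = Vec<Entry>;

// After ("SoA"): one table per former Entry field, so a query like
// predicates_of(def_id) only touches the `predicates` table.
struct Tables {
    span: Vec<u32>,
    predicates: Vec<u32>,
    // ...one table per field
}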
@eddyb (Member, Author) commented Apr 14, 2019

Ah yeah, cycles:u is a much better comparison (in that it shows the presumed regression).
I guess the perf default link is just suboptimal? (heh)

Now I suppose I need to do some experiments to see if I can get LLVM to generate the same code for both (maybe try ptr::read_unaligned?)

@petrochenkov (Contributor) commented Apr 14, 2019

cc https://internals.rust-lang.org/t/what-is-perf-rust-lang-org-measuring-and-why-is-instructions-u-the-default/9815/1

@nnethercote (Contributor) commented:

To summarize my opinion on the metrics:

  • Wall time is the ultimate metric of interest (which is good), but it has high variance (which is bad).
  • Instruction counts are a proxy for wall time (which is bad), but they have low variance (which is good).
  • Cycle counts are a proxy for wall time (which is bad), and they have high variance (which is bad).

@eddyb (Member, Author) commented Apr 16, 2019

@nnethercote Instruction counts are cycles divided by cycles-per-instruction (CPI); CPI may be high-variance, but it is also significant to performance, and it can vary with the changes we make.
Without measuring the cost of memory accesses (not that hard, via cache misses) and the differences between cheap and expensive instructions (this seems tricky, but @anp tells me there may be model-specific perf counters that could help), instruction counts only work as a proxy for "queries executed" (which we should measure with -Z self-profile!) and other algorithmic efficiencies (i.e. skipping some work).
It certainly ain't relevant here! If this were optimized correctly, then we shouldn't see any difference.

> Wall time is the ultimate metric of interest

Only if you actually put in the effort to make it reproducible (realtime kernel, disable ASLR, pin the thread of interest to one core with nothing else on it, etc.).

EDIT: oops, I repeated most of what was said on the irlo thread; I only read it afterwards.
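(Concretely, with made-up numbers: cycles = instructions × CPI, so a build executing 20 billion instructions at a CPI of 0.5 takes 10 billion cycles; a change that leaves the instruction count untouched but degrades CPI to 0.6 costs 12 billion cycles, a 20% regression that instruction counts alone would never show.)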

@eddyb (Member, Author) commented Apr 16, 2019

(moved diff comment here so it doesn't get lost)

So, looking at the LLVM IR, this slicing is the culprit, I think.
That is, the b[i * 4..] that performs the indexing, plus the b[..4] in read_le_u32 / write_le_u32, result in two bounds checks, and the indexing is done with a shl %i, 2 followed by an i8 GEPi, instead of an i32 GEPi.

At least it's not reading the bytes one by one from memory; that's what I was really worried about.
I'll try to come up with some arrangement which optimizes better.

EDIT:
Turns out that using [u8; 4] optimizes very well, to pretty much the same thing as Unaligned<u32>, and has fewer safety concerns (despite still using unsafe to cast between slice element types).
The only issue is that #59953 replaces the u32 with a generic type, and [u8; Self::BYTE_LEN] still doesn't work (something something lazy normalization).
However, I should be able to do this instead, with a macro:

{
    const N: usize = $len;
    assert_eq!(N, Self::BYTE_LEN);
    let b: &[u8] = $bytes;
    // Reinterpret the byte slice as a slice of N-byte arrays.
    unsafe { std::slice::from_raw_parts(b.as_ptr() as *const [u8; N], b.len() / N) }
}
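Once the bytes are viewed as [u8; 4] words, a read becomes a single index plus from_le_bytes; a hypothetical sketch of the consuming side (not the PR's exact code):

fn read_le_u32(words: &[[u8; 4]], i: usize) -> u32 {
    // One bounds check from the indexing; from_le_bytes should lower to a
    // single alignment-oblivious 4-byte load.
    u32::from_le_bytes(words[i])
}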

@eddyb changed the title from "rustc_metadata: safely read/write the index positions." to "rustc_metadata: more safely read/write the index positions." on Apr 16, 2019
@eddyb (Member, Author) commented Apr 16, 2019

I've updated the implementation to bring back some unsafe code, but Unaligned is not needed anymore (as [u8; 4] is used instead of Unaligned<u32>), and the converted slices are only used once (by the code that created them using unsafe). Once const generics / lazy normalization make their way into Rust, we can (hopefully) abstract this away.

@bors try

@bors (Contributor) commented Apr 16, 2019

⌛ Trying commit f51e6c7 with merge 5c37296...

bors added a commit that referenced this pull request Apr 16, 2019
rustc_metadata: more safely read/write the index positions.

This is a small part of a larger refactor that I want to benchmark independently.
The final code would be even cleaner than this, so this is sort of a "worst case" test.
@bors (Contributor) commented Apr 16, 2019

☀️ Try build successful - checks-travis
Build commit: 5c37296

@eddyb (Member, Author) commented Apr 16, 2019

@rust-timer build 5c37296

@rust-timer (Collaborator) commented:

Success: Queued 5c37296 with parent a7cef0b, comparison URL.

@rust-timer (Collaborator) commented:

Finished benchmarking try commit 5c37296

@eddyb (Member, Author) commented Apr 16, 2019

I think this is now within noise? I mean, the generated code should be very close to what it was before (other than unaligned writes instead of aligned ones).

@nnethercote (Contributor) commented:

Looks to me like the perf effects are negligible.

@eddyb (Member, Author) commented Apr 17, 2019

r? @michaelwoerister or @alexcrichton

bors added a commit that referenced this pull request Apr 17, 2019
 rustc_metadata: replace Entry table with one table for each of its fields (AoS -> SoA).

*Based on top of #59887*

In #59789 (comment) I noticed that for many cross-crate queries (e.g. `predicates_of(def_id)`), we were deserializing the `rustc_metadata::schema::Entry` for `def_id` *only* to read one field (i.e. `predicates`).

But there are several such queries, and `Entry` is not particularly small (in terms of number of fields; the encoding itself is quite compact), so there is a large (and unnecessary) constant factor.

This PR replaces the (random-access) array¹ of `Entry` structures ("AoS") with many separate arrays¹, one for each field that used to be in `Entry` ("SoA"), resulting in the ability to read individual fields separately, with negligible time overhead (in theory) and some size overhead (as these arrays are not sparse).

In a way, the new approach is closer to incremental on-disk caches, which store each query's cached results separately, but it would take significantly more work to unify the two.

For stage1 `libcore`'s metadata blob, the size overhead is `8.44%`, and I have another commit (~~not initially included because I want to do perf runs with both~~ **EDIT**: added it now) that brings it down to `5.88%`.

¹(in the source, these arrays are called "tables", but perhaps they could use a better name)
@michaelwoerister (Member) commented:

@Zoxc, this looks like the kind of thing you like. Would you mind doing the review?

@Zoxc (Contributor) commented Apr 17, 2019

@eddyb Do you intend to land this PR separately from #59953, or was it just for perf?

const BYTE_LEN: usize = $byte_len;
fn read_from_bytes_at(b: &[u8], i: usize) -> Self {
    const BYTE_LEN: usize = $byte_len;
    // HACK(eddyb) ideally this would be done with fully safe code,
A Contributor commented on this diff:

Probably could split the unsafe code into a function which casts [u8] to [[u8; N]], just to make it a bit clearer what is going on.

@eddyb (Member, Author) replied:

I... would do that, if I could write such a function generically.
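For reference, with const generics (stabilized much later, in Rust 1.51) such a helper can be written generically; a hypothetical sketch:

/// Reinterpret a byte slice as a slice of N-byte arrays, keeping only whole
/// chunks. (Hypothetical: this was not expressible when this PR was written.)
fn as_arrays<const N: usize>(b: &[u8]) -> &[[u8; N]] {
    assert!(N > 0);
    // SAFETY: [u8; N] has the same alignment as u8, and only b.len() / N
    // complete chunks are exposed.
    unsafe { std::slice::from_raw_parts(b.as_ptr() as *const [u8; N], b.len() / N) }
}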

@eddyb (Member, Author) commented Apr 17, 2019

@Zoxc mostly for perf; if #59953 is approved soon I can just close this one.

@michaelwoerister (Member) commented:

> ... if #59953 is approved soon I can just close this one.

I'm not sure if I'll get to reviewing that this week (#57967 is rather big and has priority, I'd say). So feel free to approve this PR standalone.

@michaelwoerister (Member) commented:

I think we should land this separately for better bisectability.
r? @Zoxc

@Zoxc (Contributor) commented Apr 27, 2019

@bors r+

@bors (Contributor) commented Apr 27, 2019

📌 Commit f51e6c7 has been approved by Zoxc

@bors added the S-waiting-on-bors label (Status: Waiting on bors to run and complete tests. Bors will change the label on completion.) and removed the S-waiting-on-review label on Apr 27, 2019
@bors (Contributor) commented Apr 27, 2019

⌛ Testing commit f51e6c7 with merge d4a32d5...

bors added a commit that referenced this pull request Apr 27, 2019
rustc_metadata: more safely read/write the index positions.

This is a small part of a larger refactor that I want to benchmark independently.
The final code would be even cleaner than this, so this is sort of a "worst case" test.
@bors (Contributor) commented Apr 27, 2019

☀️ Test successful - checks-travis, status-appveyor
Approved by: Zoxc
Pushing d4a32d5 to master...

@bors added the merged-by-bors label (This PR was explicitly merged by bors.) on Apr 27, 2019
@bors merged commit f51e6c7 into rust-lang:master on Apr 27, 2019
@eddyb deleted the safer-metadata branch on April 30, 2019