Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update VecDeque implementation to use head+len instead of head+tail #102991

Merged
merged 7 commits into from
Nov 28, 2022

Conversation

Sp00ph
Copy link
Member

@Sp00ph Sp00ph commented Oct 12, 2022

(See #99805)

This changes alloc::collections::VecDeque's internal representation from using head and tail indices to using a head index and a length field. It has a few advantages over the current design:

  • It allows the buffer to be of length 0, which means the VecDeque::new new longer has to allocate and could be changed to a const fn
  • It allows the VecDeque to fill the buffer completely, unlike the old implementation, which always had to leave a free space
  • It removes the restriction for the size to be a power of two, allowing it to properly shrink_to_fit, unlike the old VecDeque
  • The above points also combine to allow the Vec<T> -> VecDeque<T> conversion to be very cheap and guaranteed O(1). I mention this in the From<Vec<T>> impl, but it's not a strong guarantee just yet, as that would likely need some form of API change proposal.

All the tests seem to pass for the new VecDeque, with some slight adjustments.

r? @scottmcm

@rust-highfive
Copy link
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @scottmcm (or someone else) soon.

Please see the contribution instructions for more information.

@rustbot rustbot added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Oct 12, 2022
@rustbot
Copy link
Collaborator

rustbot commented Oct 12, 2022

Hey! It looks like you've submitted a new PR for the library teams!

If this PR contains changes to any rust-lang/rust public library APIs then please comment with @rustbot label +T-libs-api -T-libs to tag it appropriately. If this PR contains changes to any unstable APIs please edit the PR description to add a link to the relevant API Change Proposal or create one if you haven't already. If you're unsure where your change falls no worries, just leave it as is and the reviewer will take a look and make a decision to forward on if necessary.

Examples of T-libs-api changes:

  • Stabilizing library features
  • Introducing insta-stable changes such as new implementations of existing stable traits on existing stable types
  • Introducing new or changing existing unstable library APIs (excluding permanently unstable features / features without a tracking issue)
  • Changing public documentation in ways that create new stability guarantees
  • Changing observable runtime behavior of library APIs

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 12, 2022
@Sp00ph
Copy link
Member Author

Sp00ph commented Oct 12, 2022

I'm trying to run the x.py test tidy --bless command but it's taking forever

@Sp00ph
Copy link
Member Author

Sp00ph commented Oct 12, 2022

Is there any better way to run the tidy tool just for alloc, since I didn't change any other code?

@Sp00ph
Copy link
Member Author

Sp00ph commented Oct 13, 2022

Ok this error is really weird, I got it locally too but didn't really think too much of it as I also got some weird errors due to winapi functions failing. I don't really know what this error is about and it doesn't give any stacktrace to help identify the issue unfortunately.

@Sp00ph
Copy link
Member Author

Sp00ph commented Oct 13, 2022

Also, for some reason it's telling me that collectiontests is somehow causing a stack buffer overrun, and I'd be quite surprised if that came from my VecDeque implementation.

@rust-log-analyzer

This comment has been minimized.

@Kobzol
Copy link
Contributor

Kobzol commented Oct 13, 2022

When I was trying to rewrite VecDeque recently, the tests were segfaulting all the time 😄 The problem is that the VecDeque itself is used for the test infrastructure. It would be nice if we could use stage0 stdlib to run the tests, but test stage1 code.

@Sp00ph
Copy link
Member Author

Sp00ph commented Oct 13, 2022

Yea I read that somewhere that VecDeque is used in the testing framework, but the weird thing is that all the tests in the first test suite pass just fine, and also i don't see how my code could cause a stack buffer overrun. The only part i can think of where it reads memory from the stack directly would be the From<[T; N]> impl, and from what I can tell that shouldn't be able to overrun the buffer.

@the8472
Copy link
Member

the8472 commented Oct 13, 2022

You can enable debug asserts for the standard library in config.toml and add them to your implementation, that might help uncovering bugs.

but the weird thing is that all the tests in the first test suite pass just fine

We don't have 100% test coverage, so it might be some edge-case

@Sp00ph
Copy link
Member Author

Sp00ph commented Oct 13, 2022

I think I finally got it to work, sorry for all that.

@Sp00ph
Copy link
Member Author

Sp00ph commented Oct 13, 2022

I'm not familiar with the format of these .stderr files, what do I need to do to make it not fail there?

@ChrisDenton
Copy link
Member

See https://rustc-dev-guide.rust-lang.org/tests/running.html#editing-and-updating-the-reference-files

...you can pass --bless to the test subcommand. E.g. if some tests in src/test/ui are failing, you can run

./x.py test src/test/ui --bless

@Sp00ph
Copy link
Member Author

Sp00ph commented Oct 13, 2022

Oh boy, doing that does an LLVM build for me, which my laptop doesn't have the resources for.

Edit: nevermind, figured out you can just set download-ci-llvm=true

@the8472
Copy link
Member

the8472 commented Oct 13, 2022

set download-ci-llvm = "if-available" in config.toml

@Sp00ph
Copy link
Member Author

Sp00ph commented Oct 13, 2022

Ugh, this is kinda frustrating. Problem is, I only have access to a pretty slow laptop, and so everything takes forever. Running just the ui tests took like 30 minutes, not including the compile time.

@Rageking8
Copy link
Contributor

Ugh, this is kinda frustrating. Problem is, I only have access to a pretty slow laptop, and so everything takes forever. Running just the ui tests took like 30 minutes, not including the compile time.

Not sure about the current circumstance of this PR, but you can try to only run previously failing test locally or try to get the CI to help you run the test (since its much faster, but still takes 10+ mins to finishing the first run of test and 40+ mins to complete the CI).

@Sp00ph
Copy link
Member Author

Sp00ph commented Oct 13, 2022

Phew, finally got the checks passing. I'm very sorry for all the hassle, this is my first time trying to contribute to the stdlib and it probably shows.

@Kobzol
Copy link
Contributor

Kobzol commented Oct 13, 2022

That's fine, there will be many more failures (those are not even all the tests that have to pass 😁 ). Could you share the results of stdlib (or any other) microbenchmarks of the old and the new VecDeque on your machine?

@Sp00ph
Copy link
Member Author

Sp00ph commented Oct 13, 2022

Running x.py bench --stage 0 library/alloc --test-args vec_deque gives the following results for the old implementation:

test collections::vec_deque::tests::bench_pop_back_100       ... bench:         259 ns/iter (+/- 4)
test collections::vec_deque::tests::bench_pop_front_100      ... bench:         252 ns/iter (+/- 67)
test collections::vec_deque::tests::bench_push_back_100      ... bench:         288 ns/iter (+/- 5)
test collections::vec_deque::tests::bench_push_front_100     ... bench:         291 ns/iter (+/- 388)
test collections::vec_deque::tests::bench_retain_half_10000  ... bench:     574,685 ns/iter (+/- 38,841)
test collections::vec_deque::tests::bench_retain_odd_10000   ... bench:     570,027 ns/iter (+/- 799,114)
test collections::vec_deque::tests::bench_retain_whole_10000 ... bench:     997,678 ns/iter (+/- 508,766)

test vec_deque::bench_extend_bytes                       ... bench:          20 ns/iter (+/- 35)
test vec_deque::bench_extend_chained_bytes               ... bench:       1,619 ns/iter (+/- 45)
test vec_deque::bench_extend_chained_trustedlen          ... bench:       1,676 ns/iter (+/- 58)
test vec_deque::bench_extend_trustedlen                  ... bench:       1,064 ns/iter (+/- 9)
test vec_deque::bench_extend_vec                         ... bench:          99 ns/iter (+/- 155)
test vec_deque::bench_from_array_1000                    ... bench:         489 ns/iter (+/- 246)
test vec_deque::bench_grow_1025                          ... bench:       4,082 ns/iter (+/- 48)
test vec_deque::bench_iter_1000                          ... bench:         638 ns/iter (+/- 1,029)
test vec_deque::bench_mut_iter_1000                      ... bench:         637 ns/iter (+/- 6)
test vec_deque::bench_new                                ... bench:          61 ns/iter (+/- 102)
test vec_deque::bench_try_fold                           ... bench:         375 ns/iter (+/- 651)

custom-bench vec_deque_append 249800 ns/iter

And these are the results for the new implementation:

test collections::vec_deque::tests::bench_pop_back_100       ... bench:          89 ns/iter (+/- 28)
test collections::vec_deque::tests::bench_pop_front_100      ... bench:         169 ns/iter (+/- 0)
test collections::vec_deque::tests::bench_push_back_100      ... bench:         208 ns/iter (+/- 2)
test collections::vec_deque::tests::bench_push_front_100     ... bench:         258 ns/iter (+/- 313)
test collections::vec_deque::tests::bench_retain_half_10000  ... bench:     491,300 ns/iter (+/- 201,210)
test collections::vec_deque::tests::bench_retain_odd_10000   ... bench:     569,320 ns/iter (+/- 170,957)
test collections::vec_deque::tests::bench_retain_whole_10000 ... bench:     596,310 ns/iter (+/- 98,938)

test vec_deque::bench_extend_bytes                       ... bench:       1,055 ns/iter (+/- 1,368)
test vec_deque::bench_extend_chained_bytes               ... bench:       1,481 ns/iter (+/- 1,934)
test vec_deque::bench_extend_chained_trustedlen          ... bench:       1,069 ns/iter (+/- 16)
test vec_deque::bench_extend_trustedlen                  ... bench:       1,053 ns/iter (+/- 15)
test vec_deque::bench_extend_vec                         ... bench:          91 ns/iter (+/- 176)
test vec_deque::bench_from_array_1000                    ... bench:         489 ns/iter (+/- 17)
test vec_deque::bench_grow_1025                          ... bench:       4,024 ns/iter (+/- 5,947)
test vec_deque::bench_iter_1000                          ... bench:         370 ns/iter (+/- 4)
test vec_deque::bench_mut_iter_1000                      ... bench:         828 ns/iter (+/- 1,236)
test vec_deque::bench_new                                ... bench:           2 ns/iter (+/- 0)
test vec_deque::bench_try_fold                           ... bench:         531 ns/iter (+/- 4)

custom-bench vec_deque_append 253700 ns/iter

Some of these seem a bit weird to me, like extend_bytes only taking 20ns in the old implementation and iter_mut being almost 3x slower than iter in the new implementation, despite their code being largely identical. I guess that's just due to background noise on my laptop though, as I haven't done any benchmarking specific setup other than closing unnecessary applications. Over all, I don't think there's any unacceptable performance regressions, and new being practically instant is a nice bonus.

@the8472
Copy link
Member

the8472 commented Oct 13, 2022

Some of these seem a bit weird to me

Std benches are finnicky because they measure wall-time. Especially if you're on a laptop. This pending update to the std dev guide might help

like extend_bytes only taking 20ns in the old implementation

It's copying 512 bytes into a pre-allocated vecdeque, that's 8 cache-lines worth of data and fits into the L1 cache. 20ns seems like the right order of magnitude, especially considering that a processor can execute a handful of instruction every nanosecond. 1000ns does not.

@Kobzol
Copy link
Contributor

Kobzol commented Oct 13, 2022

Results from my laptop (Zen 3):

master
collections::vec_deque::tests::bench_pop_back_100       ... bench:         187 ns/iter (+/- 5)
test collections::vec_deque::tests::bench_pop_front_100      ... bench:         165 ns/iter (+/- 2)
test collections::vec_deque::tests::bench_push_back_100      ... bench:         165 ns/iter (+/- 6)
test collections::vec_deque::tests::bench_push_front_100     ... bench:         165 ns/iter (+/- 2)
test collections::vec_deque::tests::bench_retain_half_10000  ... bench:     381,092 ns/iter (+/- 5,772)
test collections::vec_deque::tests::bench_retain_odd_10000   ... bench:     383,285 ns/iter (+/- 10,536)
test collections::vec_deque::tests::bench_retain_whole_10000 ... bench:     355,320 ns/iter (+/- 7,691)

test vec_deque::bench_extend_bytes                       ... bench:          10 ns/iter (+/- 0)
test vec_deque::bench_extend_chained_bytes               ... bench:       1,661 ns/iter (+/- 59)
test vec_deque::bench_extend_chained_trustedlen          ... bench:       1,617 ns/iter (+/- 58)
test vec_deque::bench_extend_trustedlen                  ... bench:       1,238 ns/iter (+/- 51)
test vec_deque::bench_extend_vec                         ... bench:          26 ns/iter (+/- 1)
test vec_deque::bench_from_array_1000                    ... bench:         192 ns/iter (+/- 7)
test vec_deque::bench_grow_1025                          ... bench:       1,986 ns/iter (+/- 90)
test vec_deque::bench_iter_1000                          ... bench:         465 ns/iter (+/- 17)
test vec_deque::bench_mut_iter_1000                      ... bench:         467 ns/iter (+/- 10)
test vec_deque::bench_new                                ... bench:          15 ns/iter (+/- 0)
test vec_deque::bench_try_fold                           ... bench:          34 ns/iter (+/- 0)

custom-bench vec_deque_append 311562 ns/iter

This PR
collections::vec_deque::tests::bench_pop_back_100       ... bench:         148 ns/iter (+/- 1)
test collections::vec_deque::tests::bench_pop_front_100      ... bench:         146 ns/iter (+/- 1)
test collections::vec_deque::tests::bench_push_back_100      ... bench:         169 ns/iter (+/- 4)
test collections::vec_deque::tests::bench_push_front_100     ... bench:         194 ns/iter (+/- 4)
test collections::vec_deque::tests::bench_retain_half_10000  ... bench:     255,548 ns/iter (+/- 5,597)
test collections::vec_deque::tests::bench_retain_odd_10000   ... bench:     257,342 ns/iter (+/- 8,091)
test collections::vec_deque::tests::bench_retain_whole_10000 ... bench:     231,784 ns/iter (+/- 6,616)

test vec_deque::bench_extend_bytes                       ... bench:         695 ns/iter (+/- 16)
test vec_deque::bench_extend_chained_bytes               ... bench:         813 ns/iter (+/- 32)
test vec_deque::bench_extend_chained_trustedlen          ... bench:         735 ns/iter (+/- 30)
test vec_deque::bench_extend_trustedlen                  ... bench:          15 ns/iter (+/- 0)
test vec_deque::bench_extend_vec                         ... bench:          27 ns/iter (+/- 0)
test vec_deque::bench_from_array_1000                    ... bench:         195 ns/iter (+/- 5)
test vec_deque::bench_grow_1025                          ... bench:       2,074 ns/iter (+/- 418)
test vec_deque::bench_iter_1000                          ... bench:         328 ns/iter (+/- 18)
test vec_deque::bench_mut_iter_1000                      ... bench:         327 ns/iter (+/- 5)
test vec_deque::bench_new                                ... bench:           2 ns/iter (+/- 0)
test vec_deque::bench_try_fold                           ... bench:          34 ns/iter (+/- 2)

custom-bench vec_deque_append 144851 ns/iter

(I try to reduce variance by turning off hyper-threading, turbo-boost etc., but it's still quite noisy, as usually). It looks really good! 🚀

I should have tried this simple representation when I was trying to do what you have done, I tried the "unmasked indices" approach from https://www.snellman.net/blog/archive/2016-12-13-ring-buffers/, but it was super complicated.

@Sp00ph
Copy link
Member Author

Sp00ph commented Oct 13, 2022

like extend_bytes only taking 20ns in the old implementation

It's copying 512 bytes into a pre-allocated vecdeque, that's 8 cache-lines worth of data and fits into the L1 cache. 20ns seems like the right order of magnitude, especially considering that a processor can execute a handful of instruction every nanosecond. 1000ns does not.

Seems like I messed up something then, if I had to guess it's a worse SpecExtend impl, I can look into it later.

@Sp00ph
Copy link
Member Author

Sp00ph commented Oct 13, 2022

Alrighty, the new numbers on my laptop are looking like this:

test collections::vec_deque::tests::bench_pop_back_100       ... bench:          89 ns/iter (+/- 108)
test collections::vec_deque::tests::bench_pop_front_100      ... bench:         169 ns/iter (+/- 136)
test collections::vec_deque::tests::bench_push_back_100      ... bench:         207 ns/iter (+/- 3)
test collections::vec_deque::tests::bench_push_front_100     ... bench:         253 ns/iter (+/- 278)
test collections::vec_deque::tests::bench_retain_half_10000  ... bench:     433,255 ns/iter (+/- 4,701)
test collections::vec_deque::tests::bench_retain_odd_10000   ... bench:     436,163 ns/iter (+/- 398,544)
test collections::vec_deque::tests::bench_retain_whole_10000 ... bench:     276,428 ns/iter (+/- 321,348)

test vec_deque::bench_extend_bytes                       ... bench:          17 ns/iter (+/- 23)
test vec_deque::bench_extend_chained_bytes               ... bench:       1,068 ns/iter (+/- 14)
test vec_deque::bench_extend_chained_trustedlen          ... bench:       1,069 ns/iter (+/- 16)
test vec_deque::bench_extend_trustedlen                  ... bench:       1,292 ns/iter (+/- 1,971)
test vec_deque::bench_extend_vec                         ... bench:         128 ns/iter (+/- 169)
test vec_deque::bench_from_array_1000                    ... bench:         539 ns/iter (+/- 846)
test vec_deque::bench_grow_1025                          ... bench:       3,947 ns/iter (+/- 89)
test vec_deque::bench_iter_1000                          ... bench:         426 ns/iter (+/- 626)
test vec_deque::bench_mut_iter_1000                      ... bench:         788 ns/iter (+/- 993)
test vec_deque::bench_new                                ... bench:           2 ns/iter (+/- 0)
test vec_deque::bench_try_fold                           ... bench:         607 ns/iter (+/- 194)

custom-bench vec_deque_append 238050 ns/iter

@Sp00ph
Copy link
Member Author

Sp00ph commented Oct 14, 2022

Alright, at this point I'm fairly confident in the correctness of the code, I just did a final thorough pass and looked over all of it. I'd still like to try and fuzz it against my original deque rewrite, which I had already fuzzed against the old VecDeque. Is there any way to link my locally compiled std into that program?

@the8472
Copy link
Member

the8472 commented Oct 14, 2022

It would be good if you could clean up the git history (with git rebase -i). Commit messages such as "final changes hopefully" are not very helpful and can be merged into the commit that introduced the thing that needs to be fixed.

Is there any way to link my locally compiled std into that program?

./x build library/std --stage 1 will build the compiler and then use that to build std, which can then be added as rustup toolchain

@rustbot rustbot added the T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. label Nov 27, 2022
@Sp00ph
Copy link
Member Author

Sp00ph commented Nov 27, 2022

The test just failed because a VecDeque had a smaller capacity than was previously possible, to be safe I'll just run the tests on CI now.

@rust-log-analyzer

This comment has been minimized.

@Sp00ph Sp00ph force-pushed the master branch 2 times, most recently from 352e5c6 to b1c3c63 Compare November 27, 2022 23:27
@Sp00ph
Copy link
Member Author

Sp00ph commented Nov 27, 2022

The tests all pass on windows now. I removed the commit that changed the CI, and I guess @scottmcm it's time for bors round 3.

@scottmcm
Copy link
Member

@bors r+

@bors
Copy link
Contributor

bors commented Nov 28, 2022

📌 Commit b1c3c63 has been approved by scottmcm

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 28, 2022
@bors
Copy link
Contributor

bors commented Nov 28, 2022

⌛ Testing commit b1c3c63 with merge 69df0f2...

@bors
Copy link
Contributor

bors commented Nov 28, 2022

☀️ Test successful - checks-actions
Approved by: scottmcm
Pushing 69df0f2 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Nov 28, 2022
@bors bors merged commit 69df0f2 into rust-lang:master Nov 28, 2022
@rustbot rustbot added this to the 1.67.0 milestone Nov 28, 2022
@rust-timer
Copy link
Collaborator

Finished benchmarking commit (69df0f2): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression
cc @rust-lang/wg-compiler-performance

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
0.8% [0.2%, 1.4%] 4
Improvements ✅
(primary)
-0.3% [-0.5%, -0.2%] 3
Improvements ✅
(secondary)
-0.3% [-0.5%, -0.2%] 3
All ❌✅ (primary) -0.3% [-0.5%, -0.2%] 3

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
2.4% [2.4%, 2.4%] 1
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-4.0% [-7.8%, -0.3%] 3
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) -2.4% [-7.8%, 2.4%] 4

Cycles

This benchmark run did not return any relevant results for this metric.

@rustbot rustbot added the perf-regression Performance regression. label Nov 29, 2022
@Sp00ph
Copy link
Member Author

Sp00ph commented Nov 29, 2022

Uhh what should I say

@scottmcm
Copy link
Member

@Sp00ph I'd just leave it for perf triage to look at and see if they have particular concerns.

I think it's probably fine -- no primary regressions, a couple primary improvements, and mixed (albeit slightly negative overall) secondary results. For a fairly substantial change, I think it's good enough. We can always do more changes later should particular things be pointed out.

@nnethercote
Copy link
Contributor

Yeah, the number of perf changes is very small and it's a mix of small improvements and regressions. Not a concern.

@rustbot label: +perf-regression-triaged

@rustbot rustbot added the perf-regression-triaged The performance regression has been triaged. label Nov 29, 2022
@joboet joboet mentioned this pull request Dec 2, 2022
Aaron1011 pushed a commit to Aaron1011/rust that referenced this pull request Jan 6, 2023
Update VecDeque implementation to use head+len instead of head+tail

(See rust-lang#99805)

This changes `alloc::collections::VecDeque`'s internal representation from using head and tail indices to using a head index and a length field. It has a few advantages over the current design:
* It allows the buffer to be of length 0, which means the `VecDeque::new` new longer has to allocate and could be changed to a `const fn`
* It allows the `VecDeque` to fill the buffer completely, unlike the old implementation, which always had to leave a free space
* It removes the restriction for the size to be a power of two, allowing it to properly `shrink_to_fit`, unlike the old `VecDeque`
* The above points also combine to allow the `Vec<T> -> VecDeque<T>` conversion to be very cheap and guaranteed O(1). I mention this in the `From<Vec<T>>` impl, but it's not a strong guarantee just yet, as that would likely need some form of API change proposal.

All the tests seem to pass for the new `VecDeque`, with some slight adjustments.

r? `@scottmcm`
jonhoo added a commit to jonhoo/atone that referenced this pull request May 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testsuite Area: The testsuite used to check the correctness of rustc merged-by-bors This PR was explicitly merged by bors. perf-regression Performance regression. perf-regression-triaged The performance regression has been triaged. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.