Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use triomphe::Arc by default. #88

Merged
merged 2 commits into from
Nov 5, 2023
Merged

Use triomphe::Arc by default. #88

merged 2 commits into from
Nov 5, 2023

Conversation

michaelsproul
Copy link
Contributor

@michaelsproul michaelsproul commented Oct 31, 2023

Change the default Arc backend to triomphe, which results in speedups of up to 40% (per #85).

The default backend is changed to ArcTK, while retaining the ability for the user to swap out to ArcK if they prefer.

This PR does not currently include the ability to completely compile-out triomphe. If that were desired, we would have to expose a feature called triomphe which controls the backend, and causes std::sync::Arc to be used when the feature is disabled.

Note: this PR is a reopening of #86 against the new default branch main.

@michaelsproul
Copy link
Contributor Author

Some more benchmarks, this time on x86_64 Linux: https://gist.github.com/michaelsproul/4462d18ce7075479b03517349ffd06e2

Unfortunately it looks like we can't just Pareto-improve by switching to triomphe. There are some substantial regressions in some of the x86_64 benchmarks, especially for RedBlackTreeMap. I made a spreadsheet of results here: https://docs.google.com/spreadsheets/d/1WsuOhA2DoKUCfoEE-1OAFPjMMUgo4jFpiZx8RVbrpH0/edit

One strategy would be to switch the backend only where the gains are universal, i.e. for List, Queue and HashTrieMap. Vector could go either way, and RedBlackTreeMap is a hard no IMO unless we gate by target-arch (or OS). I'm not entirely sure whether it's the ARM CPU or the OS making more of a difference, as I know the macOS allocator is substantially better with fragmentation than GNU malloc. In production I tend to use jemalloc, and will try to re-run the Linux benchmarks with jemalloc to see if this makes a difference.

My benchmarking process was:

git checkout master
cargo bench -- sync # baseline
git checkout triomphe
cargo bench -- sync # comparison

If you have any different hardware @orium it might also be good to see the results from that.

@michaelsproul
Copy link
Contributor Author

CI is failing during vendoring. Blocked on an archery release.

@orium
Copy link
Owner

orium commented Oct 31, 2023

Blocked on an archery release.

I've just released archery v1.1.0.

@michaelsproul
Copy link
Contributor Author

Updated! Thanks!

What do you think of our options re: architectures & defaults?

Copy link

codecov bot commented Nov 1, 2023

Codecov Report

Merging #88 (c933de8) into main (d3faad5) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main      #88   +/-   ##
=======================================
  Coverage   96.29%   96.29%           
=======================================
  Files          11       11           
  Lines        1645     1645           
=======================================
  Hits         1584     1584           
  Misses         61       61           
Files Coverage Δ
src/list/mod.rs 95.41% <ø> (ø)
src/map/hash_trie_map/mod.rs 96.06% <ø> (ø)
src/map/red_black_tree_map/mod.rs 95.80% <ø> (ø)
src/queue/mod.rs 100.00% <ø> (ø)
src/set/hash_trie_set/mod.rs 89.47% <ø> (ø)
src/set/red_black_tree_set/mod.rs 96.11% <ø> (ø)
src/stack/mod.rs 100.00% <ø> (ø)
src/vector/mod.rs 97.61% <ø> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@orium
Copy link
Owner

orium commented Nov 1, 2023

I've run the benchmark in an intel x86_64 and I saw more regressions than in the spreadsheet you shared, but I ran it while doing other light work on my laptop so the results are not very reliable. I'll run the benchmarks properly soon and decide based on both our results.

@orium
Copy link
Owner

orium commented Nov 2, 2023

Results in an intel x86_64:

rpds hash trie map sync insert
                        change: [+2.6368% +3.4711% +4.3543%]
rpds hash trie map sync insert mut
                        change: [-2.0413% -0.8094% +0.2234%]
rpds hash trie map sync remove
                        change: [-2.0775% -1.0785% -0.0544%]
rpds hash trie map sync remove mut
                        change: [-7.8858% -6.7220% -5.4148%]
rpds hash trie map sync get
                        change: [-3.9807% -3.1931% -2.3541%]
rpds hash trie map sync iterate
                        change: [-1.1821% -0.3057% +0.5844%]
rpds list sync push front
                        change: [-20.446% -19.655% -18.806%]
rpds list sync push front mut
                        change: [-15.251% -14.533% -13.671%]
rpds list sync drop first
                        change: [-21.615% -20.416% -19.504%]
rpds list sync drop first mut
                        change: [-2.4113% -1.7169% -0.8925%]
rpds list sync reverse  
                        change: [-17.576% -17.022% -16.375%]
rpds list sync reverse mut
                        change: [-59.669% -59.168% -58.692%]
rpds list sync iterate  
                        change: [+2.7504% +3.4314% +4.0881%]
rpds queue sync enqueue 
                        change: [-15.407% -14.566% -13.534%]
rpds queue sync enqueue mut
                        change: [-18.247% -17.734% -17.248%]
rpds queue sync dequeue 
                        change: [-19.536% -18.902% -18.271%]
rpds queue sync dequeue mut
                        change: [-9.4082% -8.8096% -8.1704%]
rpds queue sync iterate 
                        change: [+3.1755% +3.9384% +4.7430%]
rpds red black tree map sync insert
                        change: [+1.4412% +2.4130% +3.4448%]
rpds red black tree map sync insert mut
                        change: [-21.915% -21.323% -20.706%]
rpds red black tree map sync remove
                        change: [-3.2492% -2.5512% -1.8362%]
rpds red black tree map sync remove mut
                        change: [-40.328% -39.836% -39.391%]
rpds red black tree map sync get
                        change: [-8.0047% -7.3228% -6.5182%]
rpds red black tree map sync iterate
                        change: [+1.4399% +2.2009% +3.0057%]
rpds vector sync push back
                        change: [+1.6571% +2.5189% +3.4616%]
rpds vector sync push back mut
                        change: [-6.4764% -5.5797% -4.6442%]
rpds vector sync drop last
                        change: [-0.6938% +0.2731% +1.1257%]
rpds vector sync drop last mut
                        change: [-32.543% -31.923% -31.408%]
rpds vector sync get    
                        change: [+1.5769% +3.2076% +4.3876%]
rpds vector sync iterate
                        change: [+3.7501% +4.7765% +5.8097%]

Results in arm (raspberry pi):

rpds hash trie map sync insert
                        change: [-4.7598% -4.1822% -3.5100%]
rpds hash trie map sync insert mut
                        change: [-17.672% -17.230% -16.825%]
rpds hash trie map sync remove
                        change: [-4.1593% -3.8020% -3.4559%]
rpds hash trie map sync remove mut
                        change: [-25.591% -25.305% -24.996%]
rpds hash trie map sync get
                        change: [-5.3317% -5.0032% -4.6927%]
rpds hash trie map sync iterate
                        change: [+22.415% +23.323% +24.237%]
rpds list sync push front
                        change: [-23.504% -23.140% -22.801%]
rpds list sync push front mut
                        change: [-24.341% -23.925% -23.566%]
rpds list sync drop first
                        change: [-19.712% -19.550% -19.398%]
rpds list sync drop first mut
                        change: [-20.398% -20.123% -19.885%]
rpds list sync reverse  
                        change: [-13.789% -13.756% -13.721%]
rpds list sync reverse mut
                        change: [-65.247% -65.097% -64.933%]
rpds list sync iterate  
                        change: [-46.215% -45.697% -45.193%]
rpds queue sync enqueue 
                        change: [-18.596% -18.243% -17.910%]
rpds queue sync enqueue mut
                        change: [-22.085% -21.689% -21.352%]
rpds queue sync dequeue 
                        change: [-16.732% -16.385% -16.059%]
rpds queue sync dequeue mut
                        change: [-26.592% -26.450% -26.302%]
rpds queue sync iterate 
                        change: [-27.569% -27.106% -26.634%]
rpds red black tree map sync insert
                        change: [-7.8592% -7.7074% -7.5590%]
rpds red black tree map sync insert mut
                        change: [-42.397% -42.258% -42.116%]
rpds red black tree map sync remove
                        change: [-6.2934% -6.1735% -6.0456%]
rpds red black tree map sync remove mut
                        change: [-47.404% -47.297% -47.186%]
rpds red black tree map sync get
                        change: [-16.655% -16.193% -15.749%]
rpds red black tree map sync iterate
                        change: [-30.892% -30.479% -30.042%]
rpds vector sync push back
                        change: [-1.2102% -1.0596% -0.9017%]
rpds vector sync push back mut
                        change: [-33.433% -33.304% -33.169%]
rpds vector sync drop last
                        change: [-4.2658% -4.1403% -4.0246%]
rpds vector sync drop last mut
                        change: [-41.569% -41.369% -41.185%]
rpds vector sync get    
                        change: [+0.1514% +0.2058% +0.2704%]
rpds vector sync iterate
                        change: [+26.207% +26.376% +26.529%]

In my results:

  1. In intel x86_64 I see mostly improvements and the regressions are relatively small.
  2. In arm there are significant regressions on iteration of vector and hash trie map.

In your results:

  1. In amd x86_64 there are significant regressions in red black tree insert/remove the non-mutable method (which is know to be slower than the _mut() methods).
  2. In arm M1 it is an improvement across the board.

My conclusion is still a win overall, even if there are some performance regressions. I think using triomphe by default for all data structures is probably the best tradeoff based on this benchmarks. What do you think?

I also reported this to triomphe because maybe there are optimizations in std::sync::Arc that triomphe is missing (Manishearth/triomphe#74).

@orium orium merged commit 98045ce into orium:main Nov 5, 2023
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants