-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce minimalistic ttl eviction for Recycler/PinnedVec #15139
Conversation
use std::sync::atomic::{AtomicUsize, Ordering}; | ||
use std::sync::{Arc, Mutex, Weak}; | ||
use std::time::Instant; | ||
use std::{collections::VecDeque, sync::atomic::AtomicBool}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hehe, rustfmt became lazy. ;)
Codecov Report
@@ Coverage Diff @@
## master #15139 +/- ##
=========================================
- Coverage 79.5% 79.5% -0.1%
=========================================
Files 402 402
Lines 102297 102342 +45
=========================================
+ Hits 81416 81449 +33
- Misses 20881 20893 +12 |
Nice, this seems pretty good, any perf change with the vec-deque and extra checks? |
perf/src/recycler.rs
Outdated
.gc | ||
.lock() | ||
.expect("recycler lock in pb fn allocate"); | ||
if let Some((oldest_time, _old_item)) = gc.front() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought about maybe checking the length of gc
here. Maybe the recycler is just not used in a while and we are popping an extra one where we only have a handful of entries (1-5).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, nice addition! Also, I noticed we shouldn't pop and drop the last one. We can just recycle it.. ;) I chose 10 for no particular strong reason. I just felt expiring with 5 items sounded too aggressive.... just that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, sounds good.
e44daf5
to
6425e4a
Compare
} | ||
|
||
bencher.iter(move || { | ||
recycler.recycle_for_test(recycler.allocate("me")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this seems pretty good, any perf change with the vec-deque and extra checks?
In summary, sadly this increased twice of fast path of recycle->allocate iteration around 70ns to 140ns. But, I think this shouldn't affect the overall cluster perf. (I'm running gcp system-performance-test to be sure).
In detail, simply changing to VecDeque increases around 10ns and calling Instance::now() twice increases around 30ns * 2 by recycle and allocate.
base:
test bench_recycler ... bench: 73 ns/iter (+/- 22)
only changed Vec => VecDeque collection type:
test bench_recycler ... bench: 85 ns/iter (+/- 7)
this pr:
test bench_recycler ... bench: 144 ns/iter (+/- 19)
I think these numbers are reasonable. VecDec is basically Vec and it only incurs some small additional offset calculation so the order should be 10ns-20ns.
As for Instance::now, I think this is decent value as well, considering it's basically vdso call, recalling my vague perf number for it. xD
Also, I also think VecDeque
is faster than BTreeMap
or BinaryHeap
implementation, although, I haven't tested.
This tested on my laptop and I know there is jitter. If you're curious, I can run more rigorous experiment in more clean room enviroment. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok not bad, thanks.
As for the actual observation of leaks are fixed. I'm still running tested validator.. This takes a while. |
let (_, mut expired) = gc.pop_front().unwrap(); | ||
// unref-ing recycler here is crucial to prevent | ||
// the expired from being recycled again via Drop, | ||
// lading to dead lock! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my test validator was bitten by this. T_T
rerunning it....
well, test validator survived running against mainnet-beta for a day without deadlocks. So, I think I fixed the bug. Now testing running testnet via heaptrack to see leaks are really solved as a last check. |
well it turned out, this fix doesn't fix the leak...... the gossip-receiver code doesn't recycle PinnedVecs at all actually. I'm trying to fix it as well. |
3 similar comments
well it turned out, this fix doesn't fix the leak...... the gossip-receiver code doesn't recycle PinnedVecs at all actually. I'm trying to fix it as well. |
well it turned out, this fix doesn't fix the leak...... the gossip-receiver code doesn't recycle PinnedVecs at all actually. I'm trying to fix it as well. |
well it turned out, this fix doesn't fix the leak...... the gossip-receiver code doesn't recycle PinnedVecs at all actually. I'm trying to fix it as well. |
@ryoqun the reason this may not be recycling is because https://github.com/solana-labs/solana/pull/15139/files#diff-798271cb43dca47253c06d8ab6888d8bf4fb385fc8e74cc2cd70ac1b3b664ebfR93 checks the oldest element, but if I have
As long as I have |
@carllin Your understanding is mostly correct. Note that such My current theory is that |
Also, this bug looks like a classic buffer bloat. ;) |
use test::Bencher; | ||
|
||
#[bench] | ||
fn bench_recycler(bencher: &mut Bencher) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this
// expire if less than 10. | ||
if gc.len() > MAX_INVENTORY_COUNT_WITHOUT_EXPIRATION { | ||
if let Some((oldest_time, _old_item)) = gc.front() { | ||
if oldest_time.elapsed().as_secs() >= EXPIRATION_TTL_SECONDS { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like this
closing in favor of #15320. :) |
Problem
PinnedVec tends to leak indefinitely due to some past burst.
#14366 (comment)
Summary of Changes
Amortized very-lightweight (1 vdso call for writes, 1 vdso + 2
O(1)
VecDeque ops + 1T::drop()
for reads) eviction at each read from Recycler only when there has been enough capacity for the last hour.