adds a shrink policy to the recycler without an allocation limit #16457
Conversation
Force-pushed from 4f9d397 to a7d5277.
Codecov Report
@@            Coverage Diff             @@
##           master   #16457     +/-   ##
==========================================
- Coverage    80.0%    80.0%     -0.1%
==========================================
  Files         413      413
  Lines      112129   111962      -167
==========================================
- Hits        89787    89644      -143
+ Misses      22342    22318       -24
Force-pushed from a7d5277 to 7bd8b45.
Force-pushed from 7bd8b45 to aca5a27.
First commit is just a revert (because the current shrink policy is too tightly tied to the allocation limit).
tagging @ryoqun as well, since he also looked into the DoS bug 😃
Force-pushed from 33319fc to 39da365.
use test::Bencher;

#[bench]
fn bench_recycler(bencher: &mut Bencher) {
Heh, this was stolen from @ryoqun's work: #15320 (comment) and integrated into my PR.
Can we keep this around and also gather some before/after perf benchmarks for this? Something similar to #15320 (comment)
Added back the benchmark.
On this code:
test bench_recycler ... bench: 77 ns/iter (+/- 4)
On the master code:
test bench_recycler ... bench: 107 ns/iter (+/- 6)
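For readers without the diff open, here is a rough sketch of what a benchmark along these lines might look like. The import paths, Recycler::default()/allocate() calls, and buffer names are assumptions for illustration, not necessarily the exact code in this PR:

#![feature(test)]
extern crate test;

use solana_perf::{packet::PacketsRecycler, recycler::Recycler};
use test::Bencher;

#[bench]
fn bench_recycler(bencher: &mut Bencher) {
    // Exact API is assumed; the point is to measure the recycle/reuse path.
    let recycler: PacketsRecycler = Recycler::default();
    // Warm up once so the measured path is buffer reuse rather than a fresh
    // allocation plus a potentially costly CUDA pinning call.
    drop(recycler.allocate("bench_recycler_warmup"));
    bencher.iter(|| {
        // Allocate a pinned buffer from the recycler; dropping it at the end
        // of the closure hands it back for reuse on the next iteration.
        let _packets = recycler.allocate("bench_recycler");
    });
}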
}

#[test]
fn test_recycler_shrink() {
Can we add an equivalent test for the shrinking path that demonstrates a burst of transactions, and then the shrink policy deallocating unused objects over time?
added test coverage
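For context, here is a rough sketch of the kind of burst-then-shrink test being discussed. The import paths, Recycler API calls, and constants are assumptions for illustration, not the test that actually landed:

use solana_perf::{packet::PacketsRecycler, recycler::Recycler};

#[test]
fn test_recycler_shrink() {
    // Hypothetical sketch; exact API and constants are assumptions.
    let recycler: PacketsRecycler = Recycler::default();
    // Simulate a burst: allocate a pile of buffers at once, then drop them
    // all so they end up cached inside the recycler.
    let burst: Vec<_> = (0..4096).map(|_| recycler.allocate("burst")).collect();
    drop(burst);
    // A long stretch of low, steady activity should then drive the moving
    // average down and let the shrink policy discard most cached buffers.
    for _ in 0..100_000 {
        let _packets = recycler.allocate("steady_state");
    }
    // A real test would assert on the recycler's cached-buffer count here,
    // via whatever stats the implementation exposes.
}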
    .size_factor
    .load(Ordering::Acquire)
    .saturating_mul(RECYCLER_SHRINK_WINDOW_SUB_ONE)
    .saturating_add(RECYCLER_SHRINK_WINDOW_HALF)
    .checked_div(RECYCLER_SHRINK_WINDOW)
    .unwrap()
    .saturating_add(gc.len()),
    Ordering::Release,
For the sake of statistics noobs like me 👶, could we add a definition of the exponential moving average, plus some comments on which constants are used and the intuition for picking them?
From my understanding, if the value at iteration t is V_t, then the exponential average A_t at iteration t is computed, for some coefficient x, as:
A_t = x * V_t + (1 - x) * A_{t-1}
From my naive understanding, I don't quite fully grasp the current implementation (I'm probably missing something 😃): https://github.com/solana-labs/solana/pull/16457/files#diff-798271cb43dca47253c06d8ab6888d8bf4fb385fc8e74cc2cd70ac1b3b664ebfR97-R106
It seems to me that in the above code, if gc.len() remains constant then size_factor blows up in size quickly, because we're adding .saturating_add(gc.len()) every time.
For instance, if gc.len() == RECYCLER_SHRINK_WINDOW remains constant, then this calculation of size_factor would increase by many multiples every iteration.
If n is the window length, a is the moving average, and x is the new observation (here gc.len()), then the update equation is:
a <- a * (n - 1) / n + x / n
Here, I wanted to avoid floating-point math, so if you define b = n * a, then
b <- b * (n - 1) / n + x
and you can make the remaining division round (instead of truncate) by adding n / 2 to the numerator.
Effectively, b would be an exponential moving estimate of the "sum" of x over the window, as opposed to the "average". The size_factor comment also says that:
// Shrink window times the exponential moving average size of gc.
size_factor: AtomicUsize,
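To make the "does not blow up" behavior concrete, here is a small standalone sketch (not the PR's code; n and c are arbitrary illustrative constants) that iterates the integer update above with a constant observation and shows it settles near n * c rather than growing without bound:

fn main() {
    let n: usize = 1024; // window length (illustrative)
    let c: usize = 50;   // constant observation, i.e. a steady gc.len()
    let mut b: usize = 0;
    for _ in 0..20_000 {
        // b <- round(b * (n - 1) / n) + c, with rounding done by adding n / 2
        // to the numerator before the integer division.
        b = (b * (n - 1) + n / 2) / n + c;
    }
    // The fixed point of the continuous update is b = n * c; the integer
    // version settles within roughly n / 2 of it instead of blowing up.
    println!("b = {}, n * c = {}", b, n * c);
    assert!(b <= n * c && n * c - b <= n);
}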
Ah I see, thanks for the explanation, that was super helpful for my understanding.
Could we add this explanation as a comment in the code?
added comments
Force-pushed from 39da365 to 99d12b8.
This reverts commit c2e8814.
solana-labs#15320 added an allocation limit to the recycler, which has been the source of a number of bugs. For example, the code below panics by simply cloning packets:

const RECYCLER_LIMIT: usize = 8;
let recycler = PacketsRecycler::new_with_limit("", RECYCLER_LIMIT as u32);
let packets = Packets::new_with_recycler(recycler.clone(), 1).unwrap();
for _ in 0..RECYCLER_LIMIT {
    let _ = packets.clone();
}
Packets::new_with_recycler(recycler.clone(), 1);

The implementation also fails to account for instances where objects are consumed. Having the allocation limit in the recycler also seems out of place, as higher level code has better context to impose allocation limits (e.g. by using bounded channels to rate-limit), whereas the recycler would be simpler and more efficient if it just did the recycling.
This commit:
* Reverts solana-labs#15320
* Adds a shrink policy to the recycler without an allocation limit.
Force-pushed from 99d12b8 to b3c5e6c.
if gc.len() > RECYCLER_SHRINK_SIZE
    && self.size_factor.load(Ordering::Acquire) >= SIZE_FACTOR_AFTER_SHRINK
Hi! I'm called. :) The exponential stuff looks really cool. I'm still grokking the code... I'm weak at math, to be honest...
That being said, I think this is maybe susceptible to an eclipse attack? So, I think we need some high-water-mark check to force shrinking if gc.len() is too large? If I read the code correctly, there is no upper bound for gc.len()?
Consider this: the victim's network traffic shaping is partially under the control of a malicious actor like network intermediaries or a turbine neighbor (PinnedVec/Recycler is used on that codepath?). (I know this is a bit too theoretical...)
Attack: the malicious actor throttles packets to the victim with intermittent spikes, so that size_factor stays unintendedly below SIZE_FACTOR_AFTER_SHRINK even though gc.len() is swollen from processing the spike.
Also, I'm still wondering whether the new impl is as safe as the older shrinker-and-limiter one. My gut feeling is that a centralized hard limit at the lower layer is safer, while I know it contained bugs that frustrated @behzadnouri due to my lack of review at #15320... I'm still trying to convince myself that this cleaner impl is actually ok. :)
Anyway, let's have a happy discussion. The more the merrier.
The goal is to pull rate-limiting and the allocation limit out of the recycler into higher level code, which has more context. So:
- The recycler will just do the recycling and shrink its internal buffer when necessary (i.e. this pull request); which
  - Allows simpler code, as the recycler will not bother with tracking the number of objects on the fly.
  - Avoids several bugs in the current implementation, as described in the pull-request description.
and then follow up with:
- Use things like bounded channels to enforce rate limits (a rough sketch follows below).
- Use things like load-shedding to limit memory usage, as for example in gossip:
  https://github.com/solana-labs/solana/blob/364af3a3e/core/src/cluster_info.rs#L2811-L2822
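For illustration, here is a minimal sketch of what such a bounded-channel rate limit could look like at a higher layer. crossbeam_channel is assumed; the capacity, types, and helper function are hypothetical, not code from this PR:

use crossbeam_channel::{bounded, Sender, TrySendError};

// Hypothetical forwarding helper: if the downstream consumer falls behind and
// the bounded channel fills up, the batch is dropped (load shedding) instead
// of letting buffers pile up inside the recycler.
fn forward_batch<T>(batch: T, sender: &Sender<T>) -> bool {
    match sender.try_send(batch) {
        Ok(()) => true,
        Err(TrySendError::Full(_dropped)) | Err(TrySendError::Disconnected(_dropped)) => {
            // A real implementation would bump a "dropped packets" metric here.
            false
        }
    }
}

fn main() {
    // The capacity is an illustrative rate/memory bound, not a value from the PR.
    let (sender, receiver) = bounded::<Vec<u8>>(1024);
    forward_batch(vec![0u8; 8], &sender);
    let _ = receiver.try_recv();
}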
I do like your exponential shrinker impl, in fact; it's just beautiful.
That being said, let's focus on whether or not we should rate-limit at the lower layer. :)
I see the merit of your idea of moving these concerns (rate limits, allocation limits) to the higher layer. But that would require every individual higher-level component to be equipped with rate limits and allocation limits, maybe?
That sounds tedious to me. Also, I'm thinking we ideally want to introduce some flow control somewhere (preferably at the common lower layer) in the future (@leoluk I'm not forgetting you... #12410). So, I'm slightly inclined to make the lower layer smarter, not dumber.
Similar in spirit to the general data-sanitization strategy, I'd rather have these limits at the lower layer if the lack of higher-level context doesn't introduce too much trouble, and I think that's the case to some extent for now.
@behzadnouri @carllin Could you share your opinions on this after hearing me out? Thanks in advance. :)
Firstly, solana-validator is kind of an (insanely fast) streaming processor at the highest level, as far as this packet-processing aspect goes. It's constantly fed with saturated 1G or 10G incoming bandwidth, and those bytes are untrusted, so they must be cryptographically verified. So we want to feed the opaque blobs/bytes to the GPU for that end as smoothly as possible before any further processing.
What I want to say here is that:
- The lower layer adequately knows the upper bound of the input rate. It can even be aware of how the bandwidth is shared between other incoming packet sources.
- The higher layer does create higher-level structs from bytes, but it won't create more than what's coming from the wire (unless we have some data-amplification bug). That means I think the higher layer isn't necessarily smarter than the lower layer. (But the higher layer must adjust its structs' lifetimes so as not to pressure memory, by pruning older data in their definition.)
- We don't care about outgoing packets as far as rate-limiting goes. In other words, the higher layer doesn't create PinnedVecs out of nowhere internally at an undetermined rate; PinnedVec is always tied to the incoming network link bandwidth.
- We need some tighter control over the memory region for GPU pinning.
And here are the PinnedVec and Recycler toys for that end, which try to reuse the same buffer over and over for better data locality, to avoid allocation cost, and to reduce calls to the potentially costly pinning API for Cuda.
From this high-level design philosophy, I think it's actually a bad coding pattern to clone() or consume PinnedVecs. Doing so isn't efficient, and it also creates another risk of potential memory bloat in hidden buffers. But the current code allows that, confusing its motto. Also, its implementation actually had bugs, as @behzadnouri has pointed out with this PR.
So, here's my draft patch to remove these PinnedVec blob escapes, fixing the bugs from a different angle: https://github.com/ryoqun/solana/tree/revert-alloc-limit-no-clone
In that patch, I'm making sure PinnedVec is something that should only be referenced, not moved or cloned. That way, Recycler can have a clearer understanding of the current allocation situation, for possibly smarter decisions in future code (if we decide to do so).
I wanted to remove the .clone()s altogether to begin with at #15320 (comment), but blame me, no time... ;)
Also, note that the patch is a draft: it only compiles with cargo clippy --bin solana-validator, lol. I'm just showing that my idea is possible with the current codebase.
So fun to have these engineering discussions among the team. :)
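As an aside, here is a minimal sketch of the "borrow, don't move or clone" idea. It is hypothetical, not code from the linked branch; the import path, field names, and function are assumptions:

use solana_perf::packet::Packets; // module path assumed for this era of the codebase

// Consumers take the recycled buffer by reference; ownership stays with the
// caller, so the buffer is returned to the Recycler exactly once on drop and
// no clone() escapes its bookkeeping.
fn mark_packets_in_place(packets: &mut Packets) {
    for packet in packets.packets.iter_mut() {
        packet.meta.discard = false;
    }
}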
Also this is a handy diff view for my branch: https://github.com/solana-labs/solana/compare/99d12b8...ryoqun:revert-alloc-limit-no-clone?expand=1
@sakridge By the way, does the above comment make sense to you, as the oldest member who wrote the relevant code? That's what I sensed only from reading the code, without any briefing. lol
Also, as for coalescing packets for the banking stage, it might make sense to create thin structs like BankPacket[s] or TxPacket[s] (bad naming) to follow the higher-layer-specific struct design.
Anyway, this is my 2 cents.
I disagree.
I think the recycler is better off just doing the recycling, and leaving the rate-limiting, allocation limits, load shedding, etc. to higher level code, which has more context and can do so explicitly.
Putting the allocation limit in the recycler is a fundamentally wrong design to me, leading to very hard-to-debug bugs like the ones which wasted several days of my time.
And please note that you were not aware of these bugs. I was the one finding them, and there may be even more corner cases around.
@ryoqun Yeah, I think your comment makes sense in terms of the reasons for PinnedVec and Recycler and how they are designed to keep allocations around for reuse.
I would say that at the time Rust had no support for custom allocators for Vec, so the semantics around PinnedVec are a bit different from normal Rust semantics. One has to be careful to consume the memory in the right way to get it recycled and mapped/unmapped correctly, unlike a normal Vec with fungible memory space.
I agree with @behzadnouri that these rate checks probably should be at a higher level, since the recycler may not have the information to do shedding and rate limiting for all the protocols that may use it. The best pruning/rate-limiting algorithms may be protocol-specific in terms of which packets you want to filter and prioritize, so the recycler is not the right place, even if it is convenient.
@behzadnouri Thanks for the additional points. Okay, point taken. Let's remove the limit from the recycler.
If I regard the recycler as something like a system-provided malloc (which basically should never fail), I can understand it's an odd place to add a limit, as you say. Previously, I was thinking of it more like an IO scheduler / network-shaping queue, or something backed by a shared common resource (with limited throughput).
Anyway, our goal here is just to build a very robust and fast stream processor under a completely byzantine environment, as fast as possible. I'm looking forward to seeing you implement proper limiters on the respective higher-level protocols in follow-up PRs. :)
@sakridge Thanks for chiming in! Glad to hear my understanding isn't far off, at least. :)
Looks good, thanks for taking the change!
Problem
#15320 added an allocation limit to the recycler, which has been the source of a
number of bugs. For example, the code below panics by simply cloning packets:
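// (Reproduction quoted from the commit message above.)
const RECYCLER_LIMIT: usize = 8;
let recycler = PacketsRecycler::new_with_limit("", RECYCLER_LIMIT as u32);
let packets = Packets::new_with_recycler(recycler.clone(), 1).unwrap();
for _ in 0..RECYCLER_LIMIT {
    let _ = packets.clone();
}
Packets::new_with_recycler(recycler.clone(), 1);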
The implementation also fails to account for instances where objects are
consumed. Having the allocation limit in the recycler also seems out of place,
as higher level code has better context to impose allocation limits (e.g. by
using bounded channels to rate-limit), whereas the recycler would be simpler
and more efficient if it just did the recycling.
Summary of Changes
This commit:
- Reverts #15320 and its total_allocated_count tracking, which has been bug-prone.
- Adds a shrink policy to the recycler without an allocation limit.
Will follow up by adding rate-limits to the streamer using bounded channels,
or perhaps similar load-shedding as in gossip:
https://github.com/solana-labs/solana/blob/364af3a3e/core/src/cluster_info.rs#L2811-L2822