
adds a shrink policy to the recycler without an allocation limit #16457

Merged · 2 commits · Apr 18, 2021

Conversation

behzadnouri
Contributor

@behzadnouri behzadnouri commented Apr 9, 2021

Problem

#15320 added an allocation limit to the recycler, which has been the source of a
number of bugs. For example, the code below panics simply by cloning packets:

const RECYCLER_LIMIT: usize = 8;
let recycler = PacketsRecycler::new_with_limit("", RECYCLER_LIMIT as u32);
let packets = Packets::new_with_recycler(recycler.clone(), 1).unwrap();
for _ in 0..RECYCLER_LIMIT {
    let _ = packets.clone();
}
Packets::new_with_recycler(recycler.clone(), 1);

The implementation also fails to account for instances where objects are
consumed. Having the allocation limit in the recycler also seems out of place,
as higher level code has better context to impose allocation limits (e.g. by
using bounded channels to rate-limit), whereas the recycler would be simpler
and more efficient if it just did the recycling.

Summary of Changes

This commit:

  • Reverts Add limit and shrink policy for recycler #15320. The shrink policy there is tightly tied to the allocation limit and total_allocated_count which has been bug-prone.
  • Adds a shrink policy to the recycler without an allocation limit, based on exponential moving average of number of garbage collected objects.

Will follow up by adding rate-limits to streamer using bounded channels,
or perhaps similar load-shedding as in gossip:
https://github.com/solana-labs/solana/blob/364af3a3e/core/src/cluster_info.rs#L2811-L2822
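For illustration, the bounded-channel rate-limiting mentioned above can be sketched with `std::sync::mpsc::sync_channel`; the `Packet` type, the channel capacity, and the drop counting here are hypothetical, not code from the PR or the streamer:

```rust
// Sketch of load-shedding with a bounded channel: when the consumer falls
// behind, try_send fails fast and the packet is dropped instead of growing
// an unbounded queue. `Packet` and CHANNEL_CAPACITY are illustrative.
use std::sync::mpsc::{sync_channel, TrySendError};

#[allow(dead_code)]
struct Packet(u64);

fn main() {
    const CHANNEL_CAPACITY: usize = 4;
    let (sender, receiver) = sync_channel::<Packet>(CHANNEL_CAPACITY);
    let mut dropped = 0;
    for i in 0..10 {
        // Nothing drains the channel here, so only the first
        // CHANNEL_CAPACITY sends succeed; the rest are shed.
        if let Err(TrySendError::Full(_)) = sender.try_send(Packet(i)) {
            dropped += 1;
        }
    }
    assert_eq!(dropped, 6);
    assert_eq!(receiver.try_iter().count(), CHANNEL_CAPACITY);
    println!("dropped {} of 10 packets", dropped);
}
```

This keeps the allocation bound explicit in the higher-level code, which is the design direction the PR description argues for.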

@codecov

codecov bot commented Apr 9, 2021

Codecov Report

Merging #16457 (b3c5e6c) into master (74f5837) will decrease coverage by 0.0%.
The diff coverage is 82.7%.

@@            Coverage Diff            @@
##           master   #16457     +/-   ##
=========================================
- Coverage    80.0%    80.0%   -0.1%     
=========================================
  Files         413      413             
  Lines      112129   111962    -167     
=========================================
- Hits        89787    89644    -143     
+ Misses      22342    22318     -24     

@behzadnouri behzadnouri marked this pull request as ready for review April 9, 2021 20:02
@behzadnouri
Contributor Author

behzadnouri commented Apr 9, 2021

First commit is just a revert (because the current shrink policy is too tightly tied to the allocation limit).
The 2nd commit:
b3c5e6c
is the actual added code; it implements the shrink policy on top of the original recycler code and is pretty small and easy to review.

@behzadnouri behzadnouri requested review from carllin and sakridge April 9, 2021 22:04
@carllin
Contributor

carllin commented Apr 9, 2021

tagging @ryoqun as well, since he also looked into the DOS bug 😃

@carllin carllin requested a review from ryoqun April 9, 2021 23:11
@behzadnouri behzadnouri force-pushed the revert-alloc-limit branch 2 times, most recently from 33319fc to 39da365 Compare April 12, 2021 16:07
use test::Bencher;

#[bench]
fn bench_recycler(bencher: &mut Bencher) {
Contributor

@carllin carllin Apr 13, 2021

Heh, this was stolen from @ryoqun's work: #15320 (comment) and integrated into my PR.

Can we keep this around and also gather some before/after perf benchmarks for this? Something similar to #15320 (comment)

Contributor Author

Added back the benchmark.
On this code:

test bench_recycler ... bench:          77 ns/iter (+/- 4)

On the master code:

test bench_recycler ... bench:         107 ns/iter (+/- 6)

}

#[test]
fn test_recycler_shrink() {
Contributor

@carllin carllin Apr 13, 2021

Can we add an equivalent test for the shrinking path that demonstrates a burst of transactions, and then the shrinking path deallocating unused objects over time?

Contributor Author

Added test coverage.
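As a rough illustration of the behavior such a test exercises, here is a standalone simulation of the `size_factor` update; the update rule mirrors the one in this PR's diff, but the window value and gc sizes are made up:

```rust
// Simulate size_factor over a burst of large gc sizes followed by a quiet
// period. The window value here is illustrative, not the merged constant.
const RECYCLER_SHRINK_WINDOW: usize = 16;

fn update_size_factor(size_factor: usize, gc_len: usize) -> usize {
    // Multiply by (window - 1), add window / 2 to round the division,
    // divide by window, then add the new observation gc_len.
    size_factor
        .saturating_mul(RECYCLER_SHRINK_WINDOW - 1)
        .saturating_add(RECYCLER_SHRINK_WINDOW / 2)
        .checked_div(RECYCLER_SHRINK_WINDOW)
        .unwrap()
        .saturating_add(gc_len)
}

fn main() {
    let mut size_factor = 0usize;
    // Burst: many unused objects accumulate in gc each iteration.
    for _ in 0..100 {
        size_factor = update_size_factor(size_factor, 1000);
    }
    let after_burst = size_factor;
    // Quiet period: only a handful of objects recycled per iteration.
    for _ in 0..200 {
        size_factor = update_size_factor(size_factor, 10);
    }
    // The estimate rises toward window * 1000 during the burst, then
    // decays toward window * 10, signalling the oversized gc can shrink.
    assert!(after_burst > 15_000);
    assert!(size_factor < 200);
    println!("after burst: {}, after quiet: {}", after_burst, size_factor);
}
```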

Comment on lines +99 to +115
.size_factor
.load(Ordering::Acquire)
.saturating_mul(RECYCLER_SHRINK_WINDOW_SUB_ONE)
.saturating_add(RECYCLER_SHRINK_WINDOW_HALF)
.checked_div(RECYCLER_SHRINK_WINDOW)
.unwrap()
.saturating_add(gc.len()),
Ordering::Release,
Contributor

@carllin carllin Apr 13, 2021

For the sake of statistics noobs like me 👶, could we add a definition for the exponential moving average, and then some comments on which constants are used and the intuition for picking them?

From my understanding, if the value at iteration t is V_t, then the exponential moving average A at iteration t is computed, for some coefficient x, as:

A_t = x * V_t + (1 - x) * A_{t - 1}

From my naive understanding, I don't quite fully grasp the current implementation (I'm probably missing something 😃): https://github.com/solana-labs/solana/pull/16457/files#diff-798271cb43dca47253c06d8ab6888d8bf4fb385fc8e74cc2cd70ac1b3b664ebfR97-R106

It seems to me that in the above code if gc.len() remains constant then the size_factor blows up in size quickly because we're adding .saturating_add(gc.len()) every time.

For instance if gc.len() == RECYCLER_SHRINK_WINDOW remains constant then this current calculation of size_factor would increase by many multiples every iteration.

Contributor Author

If n is the window length, a is the moving average, and x is the new observation (here gc.len()), then the update equation is:

a <- a * (n - 1) / n + x / n

Here, I wanted to avoid floating-point math, so, if you define b = n * a, then

b <- b * (n - 1) / n + x

and you can make the remaining division round (instead of truncate) by adding n / 2 to the numerator.

Effectively b would be an exponential moving estimate of the "sum" of x over the window as opposed to the "average".

The size_factor comment also says that:

// Shrink window times the exponential moving average size of gc.  
size_factor: AtomicUsize,
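The integer update above can be sanity-checked numerically. In this hedged sketch (the window and the constant observation are made-up values, and the real code uses saturating/checked arithmetic), b settles within the integer-rounding band around n * x:

```rust
// Check the scaled update b <- b * (n - 1) / n + x, with n / 2 added to
// the numerator so the division rounds. In exact arithmetic b converges
// to n * x; with integer rounding it settles within about n / 2 of it.
fn update(b: usize, x: usize, n: usize) -> usize {
    (b * (n - 1) + n / 2) / n + x
}

fn main() {
    const N: usize = 16; // illustrative window
    const X: usize = 4; // constant observation (think gc.len())
    let mut b = 0usize;
    for _ in 0..200 {
        b = update(b, X, N);
    }
    // b tracks n times the moving average, i.e. the "sum" over the window.
    let target = (N * X) as i64;
    assert!((b as i64 - target).abs() <= (N / 2) as i64);
    println!("b = {}, n * x = {}", b, target);
}
```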

Contributor

Ah I see, thanks for the explanation, that was super helpful for my understanding.

Could we add this explanation as a comment in the code?

Contributor Author

added comments

@behzadnouri behzadnouri requested a review from carllin April 13, 2021 15:06
Comment on lines +145 to +146
if gc.len() > RECYCLER_SHRINK_SIZE
&& self.size_factor.load(Ordering::Acquire) >= SIZE_FACTOR_AFTER_SHRINK
Member

Hi! I'm called. :) The exponential stuff looks really cool. I'm still grokking the code... I'm weak at math, to be honest...

That being said, I think this is maybe susceptible to an eclipse attack? So, I think we need some high-water-mark check to force shrinking if gc.len() is too large? If I read the code correctly, there is no upper bound for gc.len()?

Consider this: the victim's network traffic shaping is partially under the control of a malicious actor, like network intermediaries or a turbine neighbor (PinnedVec/Recycler is used on that codepath?). (I know this is a bit too theoretical...)

Attack: the malicious actor throttles packets with intermittent spikes to the victim, so that size_factor is unintendedly below SIZE_FACTOR_AFTER_SHRINK although gc.len() is swollen from processing the spike.

Also, I'm still wondering whether the new impl is as safe as the older shrinker-and-limiter one. My gut feeling is that a centralized hard limit at the lower layer is safer, while I know it contained bugs that frustrated @behzadnouri due to my lack of review at #15320... I'm still trying to convince myself this cleaner impl is actually ok. :)

Anyway, let's have a happy discussion. The more the merrier.

Contributor Author

The goal is to pull rate-limiting and the allocation limit out of the recycler into higher-level code, which has more context. So,

  • The recycler will just do the recycling, shrinking its internal buffer when necessary (i.e. this pull request); which
    • Allows simpler code, as the recycler will not bother with tracking the number of objects on the fly.
    • Avoids several bugs in the current implementation, as described in the pull-request description.

and then follow up with:

Member

@behzadnouri

I like your exponential shrinker impl, in fact. It's just beautiful.

That being said, let's focus on whether or not we should rate-limit at the lower layer. :)

I feel the merit of your idea of moving these concerns (rate limit, allocation limit) to the higher layer. But that would mean all of the individual higher-level components must be equipped with rate limits and allocation limits, maybe?

That sounds tedious to me. Also, I'm thinking ideally we want to introduce some flow control somewhere (preferably, at the common lower layer) in the future (@leoluk I'm not forgetting you... #12410). So, I'm slightly inclined to make the lower layer smarter, not dumber.

Similar in spirit to the general data sanitization strategy, I'd rather have these limits at the lower layer if the lack of higher-level context doesn't introduce too much trouble. I think the answer is yes, to some extent, for now.

@behzadnouri @carllin Could you share your opinion on this after listening to me? Thanks in advance. :)

Firstly, solana-validator is kind of an (insanely fast) streaming processor at the highest level, in this packet-processing aspect. It's constantly fed with saturated 1G or 10G incoming bandwidth. And those bytes are untrusted, so they must be cryptographically verified. So, we want to feed the opaque blobs/bytes to the GPU to that end as smoothly as possible before any further processing.

What I want to say here is that:

  1. The lower layer adequately knows the upper bound of the input rate. It can even be aware of the nature of the bandwidth shared between other incoming packet sources.
  2. The higher layer does create higher-level structs from bytes, but won't create more than what's coming from the wire (unless we have some data amplification bug). That means I think the higher layer isn't necessarily smarter than the lower layer. (But the higher layers must adjust their structs' lifetimes not to pressure mem, by pruning older data in their definitions.)
  3. We don't care about outgoing packets as far as rate-limiting goes. In other words, the higher layer doesn't create PinnedVecs out of nowhere internally at an undetermined rate; PinnedVec is always tied to the incoming network link bandwidth.
  4. We need some tighter control over the mem region for GPU pinning.

And here are the PinnedVec and Recycler toys to that end, which try to reuse the same buffer over and over for better data locality, to avoid allocation cost, and to reduce potentially costly pinning API calls for CUDA.

From this high-level design philosophy, I think it's actually a bad coding pattern to clone() or consume PinnedVecs. Doing so isn't efficient, and it also creates another risk of potential mem bloat in hidden buffers. But the current code allows that, confusing its motto. Also, its implementation actually had bugs, as @behzadnouri has pointed out with this pr.

So, here's my draft patch to remove these PinnedVec blob escapes, fixing the bugs from another, different angle: https://github.com/ryoqun/solana/tree/revert-alloc-limit-no-clone

In the patch, I'm making sure PinnedVec is something which should only be referenced, not moved or cloned. So, the Recycler can have a clearer understanding of the current allocation situation, for possibly smarter decisions in future code (if we decide to do so).

I wanted to remove the .clone()s altogether to begin with at #15320 (comment), but blame me, no time... ;)
Also, note that the patch is a draft: it only compiles with cargo clippy --bin solana-validator. lol. I'm just showing that my idea is possible with the current codebase.

So fun to have these engineering discussions among the team. :)

Member

@sakridge By the way, does the above comment make sense to you, as the oldest member, who wrote the relevant code? That's what I sensed only from reading the code, without any briefing. lol

Also, as for coalescing packets for the banking stage, it might make sense to create a thin struct like BankPacket[s] or TxPacket[s] (bad naming) to follow the higher-layer-specific structs design.

Anyway, this is my 2 cents.

Contributor Author

I disagree.

I think the recycler is better off just doing the recycling, and leaving the rate-limiting, allocation limits, load shedding, etc. to higher-level code which has more context and can do so explicitly.

Putting the allocation limit in the recycler is a fundamentally wrong design to me, leading to very hard-to-debug bugs like the ones which cost me several days.

And please note that you were not aware of these bugs. I was the one finding these bugs, and there may be even more corner cases around.

Member

@sakridge sakridge Apr 15, 2021

@ryoqun Yea, I think your comment makes sense in terms of the reasons for PinnedVec and Recycler and how they are designed to keep allocations around for re-use.

I would say at the time Rust had no support for custom allocators for Vec, so the semantics around PinnedVec are a bit different from normal Rust semantics. One has to be careful to consume the memory in the right way to get it recycled and mapped/unmapped correctly, unlike a normal Vec with fungible memory space.

I agree with @behzadnouri that these rate checks probably should be at a higher level, since the recycler may not have the information for shedding and rate limiting for all the protocols that may use it. The best pruning/rate-limiting algorithms may be protocol specific in terms of what packets you want to filter and prioritize, so the recycler is not the right place, even if it is convenient.

Member

@behzadnouri thanks for the additional points. Okay, point taken. Let's remove the limit from the recycler.

If I regard the recycler like a system-provided malloc (which basically should never fail), I can understand that it's an odd place to add a limit, as you say. Previously, I was thinking of it like an io scheduler / network-shaping queue, or something backed by a shared common resource (with limited throughput).

Anyway, our goal here is just to build a very robust and fast stream processor under a completely byzantine environment, as fast as possible. I'm looking forward to seeing you implement proper limiters on the respective higher-level protocols in follow-up prs. :)

@sakridge Thanks for chiming in! Glad to hear my understanding isn't far off at least. :)

Contributor

@carllin carllin left a comment

Looks good, thanks for taking the change!
