
adds a shrink policy to the recycler without an allocation limit #16457

Merged · 2 commits · Apr 18, 2021

Conversation

behzadnouri
Contributor

@behzadnouri behzadnouri commented Apr 9, 2021

Problem

#15320 added an allocation limit to the recycler, which has been the source of a
number of bugs. For example, the code below panics simply by cloning packets:

const RECYCLER_LIMIT: usize = 8;
let recycler = PacketsRecycler::new_with_limit("", RECYCLER_LIMIT as u32);
let packets = Packets::new_with_recycler(recycler.clone(), 1).unwrap();
for _ in 0..RECYCLER_LIMIT {
    let _ = packets.clone();
}
Packets::new_with_recycler(recycler.clone(), 1);

The implementation also fails to account for instances where objects are
consumed. Having the allocation limit in the recycler also seems out of place,
as higher level code has better context to impose allocation limits (e.g. by
using bounded channels to rate-limit), whereas the recycler would be simpler
and more efficient if it just did the recycling.

Summary of Changes

This commit:

  • Reverts Add limit and shrink policy for recycler #15320. The shrink policy there is tightly tied to the allocation limit and total_allocated_count which has been bug-prone.
  • Adds a shrink policy to the recycler without an allocation limit, based on exponential moving average of number of garbage collected objects.

Will follow up by adding rate-limits to streamer using bounded channels,
or perhaps similar load-shedding as in gossip:
https://github.com/solana-labs/solana/blob/364af3a3e/core/src/cluster_info.rs#L2811-L2822
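For illustration, the bounded-channel rate-limiting mentioned above can be sketched with `std::sync::mpsc::sync_channel`; the `Packet` type, the channel capacity, and the drop counting here are hypothetical, not code from the PR or the streamer:

```rust
// Sketch of load-shedding with a bounded channel: when the consumer falls
// behind, try_send fails fast and the packet is dropped instead of growing
// an unbounded queue. `Packet` and CHANNEL_CAPACITY are illustrative.
use std::sync::mpsc::{sync_channel, TrySendError};

#[allow(dead_code)]
struct Packet(u64);

fn main() {
    const CHANNEL_CAPACITY: usize = 4;
    let (sender, receiver) = sync_channel::<Packet>(CHANNEL_CAPACITY);
    let mut dropped = 0;
    for i in 0..10 {
        // Nothing drains the channel here, so only the first
        // CHANNEL_CAPACITY sends succeed; the rest are shed.
        if let Err(TrySendError::Full(_)) = sender.try_send(Packet(i)) {
            dropped += 1;
        }
    }
    assert_eq!(dropped, 6);
    assert_eq!(receiver.try_iter().count(), CHANNEL_CAPACITY);
    println!("dropped {} of 10 packets", dropped);
}
```

This keeps the allocation bound explicit in the higher-level code, which is the design direction the PR description argues for.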

@codecov

codecov bot commented Apr 9, 2021

Codecov Report

Merging #16457 (b3c5e6c) into master (74f5837) will decrease coverage by 0.0%.
The diff coverage is 82.7%.

@@            Coverage Diff            @@
##           master   #16457     +/-   ##
=========================================
- Coverage    80.0%    80.0%   -0.1%     
=========================================
  Files         413      413             
  Lines      112129   111962    -167     
=========================================
- Hits        89787    89644    -143     
+ Misses      22342    22318     -24     

@behzadnouri behzadnouri marked this pull request as ready for review April 9, 2021 20:02
@behzadnouri
Contributor Author

behzadnouri commented Apr 9, 2021

First commit is just a revert (because the current shrink policy is too tightly tied to the allocation limit).
The 2nd commit:
b3c5e6c
is the actual added code; it implements the shrink policy on top of the original recycler code and is pretty small and easy to review.

@behzadnouri behzadnouri requested review from carllin and sakridge April 9, 2021 22:04
@carllin
Contributor

carllin commented Apr 9, 2021

tagging @ryoqun as well, since he also looked into the DOS bug 😃

@carllin carllin requested a review from ryoqun April 9, 2021 23:11
@behzadnouri behzadnouri force-pushed the revert-alloc-limit branch 2 times, most recently from 33319fc to 39da365 Compare April 12, 2021 16:07
use test::Bencher;

#[bench]
fn bench_recycler(bencher: &mut Bencher) {
Contributor

@carllin carllin Apr 13, 2021

Heh, this was stolen from @ryoqun's work: #15320 (comment) and integrated into my PR.

Can we keep this around and also gather some before/after perf benchmarks for this? Something similar to #15320 (comment)

Contributor Author

Added back the benchmark.
On this code:

test bench_recycler ... bench:          77 ns/iter (+/- 4)

On the master code:

test bench_recycler ... bench:         107 ns/iter (+/- 6)

}

#[test]
fn test_recycler_shrink() {
Contributor

@carllin carllin Apr 13, 2021

Can we add an equivalent test for the shrinking path that demonstrates a burst of transactions, and then the shrinking path deallocating unused objects over time?

Contributor Author

Added test coverage.
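As a rough illustration of the behavior such a test exercises, here is a standalone simulation of the `size_factor` update; the update rule mirrors the one in this PR's diff, but the window value and gc sizes are made up:

```rust
// Simulate size_factor over a burst of large gc sizes followed by a quiet
// period. The window value here is illustrative, not the merged constant.
const RECYCLER_SHRINK_WINDOW: usize = 16;

fn update_size_factor(size_factor: usize, gc_len: usize) -> usize {
    // Multiply by (window - 1), add window / 2 to round the division,
    // divide by window, then add the new observation gc_len.
    size_factor
        .saturating_mul(RECYCLER_SHRINK_WINDOW - 1)
        .saturating_add(RECYCLER_SHRINK_WINDOW / 2)
        .checked_div(RECYCLER_SHRINK_WINDOW)
        .unwrap()
        .saturating_add(gc_len)
}

fn main() {
    let mut size_factor = 0usize;
    // Burst: many unused objects accumulate in gc each iteration.
    for _ in 0..100 {
        size_factor = update_size_factor(size_factor, 1000);
    }
    let after_burst = size_factor;
    // Quiet period: only a handful of objects recycled per iteration.
    for _ in 0..200 {
        size_factor = update_size_factor(size_factor, 10);
    }
    // The estimate rises toward window * 1000 during the burst, then
    // decays toward window * 10, signalling the oversized gc can shrink.
    assert!(after_burst > 15_000);
    assert!(size_factor < 200);
    println!("after burst: {}, after quiet: {}", after_burst, size_factor);
}
```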

Comment on lines +99 to +115
.size_factor
.load(Ordering::Acquire)
.saturating_mul(RECYCLER_SHRINK_WINDOW_SUB_ONE)
.saturating_add(RECYCLER_SHRINK_WINDOW_HALF)
.checked_div(RECYCLER_SHRINK_WINDOW)
.unwrap()
.saturating_add(gc.len()),
Ordering::Release,
Contributor

@carllin carllin Apr 13, 2021

For the sake of statistics noobs like me 👶, could we add a definition for the exponential moving average, and then some comments on which constants are used and the intuition for picking them?

From my understanding, if the value at iteration t is V_t, then the exponential moving average A at iteration t is computed, for some coefficient x, as:

A_t = x * V_t + (1 - x) * A_{t - 1}

From my naive understanding, I don't quite fully grasp the current implementation (I'm probably missing something 😃): https://github.com/solana-labs/solana/pull/16457/files#diff-798271cb43dca47253c06d8ab6888d8bf4fb385fc8e74cc2cd70ac1b3b664ebfR97-R106

It seems to me that in the above code if gc.len() remains constant then the size_factor blows up in size quickly because we're adding .saturating_add(gc.len()) every time.

For instance if gc.len() == RECYCLER_SHRINK_WINDOW remains constant then this current calculation of size_factor would increase by many multiples every iteration.

Contributor Author

If n is the window length, a is the moving average, and x is the new observation (here gc.len()), then the update equation is:

a <- a * (n - 1) / n + x / n

Here, I wanted to avoid floating-point math, so, if you define b = n * a, then

b <- b * (n - 1) / n + x

and you can make the remaining division round (instead of truncate) by adding n / 2 to the numerator.

Effectively b would be an exponential moving estimate of the "sum" of x over the window as opposed to the "average".

The size_factor comment also says that:

// Shrink window times the exponential moving average size of gc.  
size_factor: AtomicUsize,
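The integer update above can be sanity-checked numerically. In this hedged sketch (the window and the constant observation are made-up values, and the real code uses saturating/checked arithmetic), b settles within the integer-rounding band around n * x:

```rust
// Check the scaled update b <- b * (n - 1) / n + x, with n / 2 added to
// the numerator so the division rounds. In exact arithmetic b converges
// to n * x; with integer rounding it settles within about n / 2 of it.
fn update(b: usize, x: usize, n: usize) -> usize {
    (b * (n - 1) + n / 2) / n + x
}

fn main() {
    const N: usize = 16; // illustrative window
    const X: usize = 4; // constant observation (think gc.len())
    let mut b = 0usize;
    for _ in 0..200 {
        b = update(b, X, N);
    }
    // b tracks n times the moving average, i.e. the "sum" over the window.
    let target = (N * X) as i64;
    assert!((b as i64 - target).abs() <= (N / 2) as i64);
    println!("b = {}, n * x = {}", b, target);
}
```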

Contributor

Ah I see, thanks for the explanation, that was super helpful for my understanding.

Could we add this explanation as a comment in the code?

Contributor Author

added comments

@behzadnouri behzadnouri requested a review from carllin April 13, 2021 15:06
Comment on lines +145 to +146
if gc.len() > RECYCLER_SHRINK_SIZE
&& self.size_factor.load(Ordering::Acquire) >= SIZE_FACTOR_AFTER_SHRINK
Member

Hi! I'm called. :) The exponential stuff looks really cool. I'm still grokking the code... I'm weak at math, to be honest...

That being said, I think this is maybe susceptible to an eclipse attack? So, I think we need some high-water-mark check to force shrinking if gc.len() is too large? If I read the code correctly, there is no upper bound for gc.len()?

Consider this: the victim's network traffic shaping is partially under the control of a malicious actor, like network intermediaries or a turbine neighbor (PinnedVec/Recycler is used on that codepath?). (I know this is a bit too theoretical...)

Attack: the malicious actor throttles packets with intermittent spikes to the victim, so that size_factor is unintendedly below SIZE_FACTOR_AFTER_SHRINK although gc.len() is swollen from processing the spike.

Also, I'm still wondering whether the new impl is as safe as the older shrinker-and-limiter one. My gut feeling is that a centralized hard limit at the lower layer is safer, while I know it contained bugs that frustrated @behzadnouri due to my lack of review at #15320... I'm still trying to convince myself this cleaner impl is actually ok. :)

Anyway, let's have a happy discussion. The more the merrier.

Contributor Author

The goal is to pull rate-limiting and the allocation limit out of the recycler into higher-level code, which has more context. So,

  • The recycler will just do the recycling, shrinking its internal buffer when necessary (i.e. this pull request); which
    • Allows simpler code, as the recycler will not bother with tracking the number of objects on the fly.
    • Avoids several bugs in the current implementation, as described in the pull-request description.

and then follow up with:

Member

@behzadnouri

I like your exponential shrinker impl, in fact. It's just beautiful.

That being said, let's focus on whether or not we should rate-limit at the lower layer. :)

I feel the merit of your idea of moving these concerns (rate limit, allocation limit) to the higher layer. But that would mean all of the individual higher-level components must be equipped with rate limits and allocation limits, maybe?

That sounds tedious to me. Also, I'm thinking ideally we want to introduce some flow control somewhere (preferably, at the common lower layer) in the future (@leoluk I'm not forgetting you... #12410). So, I'm slightly inclined to make the lower layer smarter, not dumber.

Similar in spirit to the general data sanitization strategy, I'd rather have these limits at the lower layer if the lack of higher-level context doesn't introduce too much trouble. I think the answer is yes, to some extent, for now.

@behzadnouri @carllin Could you share your opinion on this after listening to me? Thanks in advance. :)

Firstly, solana-validator is kind of an (insanely fast) streaming processor at the highest level, in this packet-processing aspect. It's constantly fed with saturated 1G or 10G incoming bandwidth. And those bytes are untrusted, so they must be cryptographically verified. So, we want to feed the opaque blobs/bytes to the GPU to that end as smoothly as possible before any further processing.

What I want to say here is that:

  1. The lower layer adequately knows the upper bound of the input rate. It can even be aware of the nature of the bandwidth shared between other incoming packet sources.
  2. The higher layer does create higher-level structs from bytes, but won't create more than what's coming from the wire (unless we have some data amplification bug). That means I think the higher layer isn't necessarily smarter than the lower layer. (But the higher layers must adjust their structs' lifetimes not to pressure mem, by pruning older data in their definitions.)
  3. We don't care about outgoing packets as far as rate-limiting goes. In other words, the higher layer doesn't create PinnedVecs out of nowhere internally at an undetermined rate; PinnedVec is always tied to the incoming network link bandwidth.
  4. We need some tighter control over the mem region for GPU pinning.

And here are the PinnedVec and Recycler toys to that end, which try to reuse the same buffer over and over for better data locality, to avoid allocation cost, and to reduce potentially costly pinning API calls for CUDA.

From this high-level design philosophy, I think it's actually a bad coding pattern to clone() or consume PinnedVecs. Doing so isn't efficient, and it also creates another risk of potential mem bloat in hidden buffers. But the current code allows that, confusing its motto. Also, its implementation actually had bugs, as @behzadnouri has pointed out with this pr.

So, here's my draft patch to remove these PinnedVec blob escapes, fixing the bugs from another, different angle: https://github.com/ryoqun/solana/tree/revert-alloc-limit-no-clone

In the patch, I'm making sure PinnedVec is something which should only be referenced, not moved or cloned. So, the Recycler can have a clearer understanding of the current allocation situation, for possibly smarter decisions in future code (if we decide to do so).

I wanted to remove the .clone()s altogether to begin with at #15320 (comment), but blame me, no time... ;)
Also, note that the patch is a draft: it only compiles with cargo clippy --bin solana-validator. lol. I'm just showing that my idea is possible with the current codebase.

So fun to have these engineering discussions among the team. :)

Member

@sakridge By the way, does the above comment make sense to you, as the oldest member, who wrote the relevant code? That's what I sensed only from reading the code, without any briefing. lol

Also, as for coalescing packets for the banking stage, it might make sense to create a thin struct like BankPacket[s] or TxPacket[s] (bad naming) to follow the higher-layer-specific structs design.

Anyway, this is my 2 cents.

Contributor Author

I disagree.

I think the recycler is better off just doing the recycling, and leaving the rate-limiting, allocation limits, load shedding, etc. to higher-level code which has more context and can do so explicitly.

Putting the allocation limit in the recycler is a fundamentally wrong design to me, leading to very hard-to-debug bugs like the ones which cost me several days.

And please note that you were not aware of these bugs. I was the one finding these bugs, and there may be even more corner cases around.

Member

@sakridge sakridge Apr 15, 2021

@ryoqun Yea, I think your comment makes sense in terms of the reasons for PinnedVec and Recycler and how they are designed to keep allocations around for re-use.

I would say at the time Rust had no support for custom allocators for Vec, so the semantics around PinnedVec are a bit different from normal Rust semantics. One has to be careful to consume the memory in the right way to get it recycled and mapped/unmapped correctly, unlike a normal Vec with fungible memory space.

I agree with @behzadnouri that these rate checks probably should be at a higher level, since the recycler may not have the information for shedding and rate limiting for all the protocols that may use it. The best pruning/rate-limiting algorithms may be protocol specific in terms of what packets you want to filter and prioritize, so the recycler is not the right place, even if it is convenient.

Member

@behzadnouri thanks for the additional points. Okay, point taken. Let's remove the limit from the recycler.

If I regard the recycler like a system-provided malloc (which basically should never fail), I can understand that it's an odd place to add a limit, as you say. Previously, I was thinking of it like an io scheduler / network-shaping queue, or something backed by a shared common resource (with limited throughput).

Anyway, our goal here is just to build a very robust and fast stream processor under a completely byzantine environment, as fast as possible. I'm looking forward to seeing you implement proper limiters on the respective higher-level protocols in follow-up prs. :)

@sakridge Thanks for chiming in! Glad to hear my understanding isn't far off at least. :)

Contributor

@carllin carllin left a comment

Looks good, thanks for taking the change!
