[http] Adaptive timeouts for idle HTTP client connections waiting for a request #11427
High/low memory thresholds and connection counts sound like something that should be done through the overload manager. There is already a resource monitor for watching memory usage, and it shouldn't be too bad to add one for the number of open connections. We could then wire that into an action that modifies either all or a subset of connection timeouts. |
@akonradi yup that's the plan. Our goal is to thread the overload manager through more places and use it for active/passive adaptive thresholds. cc @tonya11en |
Has there been any work on adding a resource monitor for connection counts? If not I'd be happy to take that on. |
There has, let's sync up on this briefly offline. |
One thing to think about is how to implement adaptive timeouts. The current practice of using independent timers to track request timeouts makes reducing timeouts under memory pressure challenging. You can't rely on the timeout value being known at the time you want to schedule the timer. A potential solution may involve use of a timer list for operations that are waiting for timeouts with the same parameters, plus logic on how to interpret the scaled/adjusted timeouts. |
I've been thinking about this for a while and, after a couple of discombobulated PRs, I've come up with a way to implement adaptive timeouts fairly efficiently. TL;DR: have the thread-local Overload Manager state hand out timers that can be scaled after the fact, and use a new range trigger and "scale" action to reduce those timeouts when load increases. Read on for the gory details.

Range timer

The existing Overload Manager uses Timer objects to track connection timeouts. Since the Timer interface isn't especially amenable to scaling, my plan is to introduce a new RangeTimer class that requires the caller to specify a minimum and maximum timeout value.

Scaled timer manager

This is a new object that acts as a factory for RangeTimers, which can be scaled in response to load. I imagine an API that looks something like this:

```cpp
class ScaledTimerManager {
public:
  ScaledTimerManager(Dispatcher& dispatcher);

  /**
   * Creates a new range timer whose actual timeout values depend on the current scale factor.
   */
  RangeTimerPtr createRangeTimer(TimerCb);

  /**
   * Updates the scale factor used for new and previously created range timers.
   * @param scale_factor must be in the range [0, 1].
   */
  void setScaleFactor(float scale_factor);
};
```

An efficient implementation of this class is discussed further down.

'Scaling' state for overload manager actions

The existing Overload Manager deals with actions that are either on or off. Since timeout scaling works best when done gradually in response to load, I'd like to rename 'inactive' to scaling and 'active' to saturated. The result is a thin wrapper around a float that is clamped to the range [0, 1], where [0, 1) = scaling and 1 = saturated. The currently implemented actions can simply check for saturation instead of for the 'active' state.

Overload Manager Range trigger

The existing threshold trigger will continue to produce binary (0 or 1) action states; alongside it, I'd like to add a new range trigger:

```proto
message RangeTrigger {
  // If the resource pressure is greater than this value, the trigger will enter the
  // :ref:`scaling <arch_overview_overload_manager-triggers-state>` state.
  double min_value = 1 [(validate.rules).double = {lte: 1.0 gte: 0.0}];

  // If the resource pressure is greater than this value, the trigger will enter saturation.
  double max_value = 2 [(validate.rules).double = {lte: 1.0 gte: 0.0}];
}
```

The range trigger will produce action states that scale linearly between 0 (inactive) and 1 (saturated) as resource pressure varies between min_value and max_value. When the (still-to-be-defined) "scale connection timeouts" overload action is enabled, the ThreadLocalOverloadStateImpl object will update its ScaledTimerManager with the new scale value as it changes, thus reducing timeouts. |
Tell me more about how the scaling of timers in a bucket would work. Let's say that we are operating in the bucket with timers that have timeouts in (120 sec, 240 sec]. At time N+121 we process an event to scale timers down to 50% and run expired timers. Computing the scaled deadline using an efficient algorithm like expiring timers at N+121+bucket_upper_range*50% (also known as N+241) would result in all 3 of the timers above triggering. Is that acceptable? If not, how do we make the case above more correct while also remaining efficient? |
Throwing out another alternative; apologies if this is old ground. To simplify, could we just have two clocks, effectively: one that is regular real time for regular timers, and another that is scaled. We might have a minimum/maximum scaling factor for the clock, but all timers on the scaled clock are effectively in some EDF queue and maintain relative order.

Let's say you want to maintain min bounds in this world. You could provide this by rescheduling if an event happens too early for its min bound. When you rescale the clock, each pending timer might need a correctional rescheduling if it happens before its min bound in real time. This might appear to be O(n), but I'd argue it's actually O(1) with amortized analysis (you pay at most one reschedule per timer event). If you reduce the scaling factor for the variable-speed clock, then you don't pay any extra cost. If you increase it, each time you do this you could potentially force a reschedule. So the complexity might be O(m), where m is the number of times you reschedule within the max deadline of the set of timers. I'm thinking m should be small, since you probably don't want to be continuously churning anyway.

Addition: borrowing from Alex's idea above of treating the portions of time before and after the min bound differently, what if we place a ScaledTimer first in the real-time queue, up to its min bound. Once that is hit, it gets transferred to the variable-time queue. The max bound is enforced by maintaining an event for the ScaledTimer on the real-time queue. The ScaledTimer fires in variable mode once either its max-bound timer expires or the scaled one does. This has no rescheduling cost, and comes at the expense of maintaining possibly up to two timer events for every real timer. |
@antoniovicente That's consistent with what I'm proposing. Since timers are only added to a bucket once their min time has elapsed, they can be legally triggered anywhere between 0 and their requested max value. The actual triggering sequence is determined only by the time at which the timer is added to a bucket and by which bucket it gets placed in. Bucket intervals should be thought of as [min, max), where all timers placed in them will trigger within that interval.

So for your example (assuming a [120, 241) bucket): at N+120, the first timer would trigger. At N+121 the scaling factor changes to 50%, and the second timer triggers because its scaled deadline is N + 59 + (.5 * 120) = N + 119. The third timer would then trigger at N + 88 + 60 = N + 148, in another 27 seconds. |
@htuch I had been assuming that scaled time was always at least as fast as regular time so I don't think we need the sentinel timer. I'm not familiar with EDF scheduling but it looks like the implementation in Envoy (thanks for the pointer) picks the next task in O(log n) time. Maybe this is a good time to ask what exactly the time complexities should be for this implementation? I've gone through a couple iterations of design that range from O(n log n) rescaling to O(1). The buckets in this proposal allow for all operations to be O(1) at the cost of imprecision (a 5-10 second timer can trigger after 6 seconds, even if the scaling factor is 1). I could imagine a std::set or heap-based implementation that does all operations in O(log n) but is much more correct with respect to scaling the timeouts in the requested range. |
I think that the case you're describing ends up rounding the specified timeout up to the top of the range in the bucket. I think that does work better since at least it ensures that timers don't fire much earlier than intended. They do fire later than expected. We should not do an O(n) recomputation of deadlines in each of the buckets when the scaling factor changes. |
I'd rather fire the timers early, but within their [min, max] window, than much later, since with the proposed bucketing a timer scheduled for [min, max] could end up firing almost at the top of its bucket, well past max.

Right, O(n) definitely seems bad. O(log n) seems acceptable though? |
I don't know if early or late is preferable. That said, it makes some sense to me to have timers trigger at the right value when not in overload, which may argue for an implementation similar to the one in my example above, which does result in timers triggering early when under partial overload. Use of a smaller scaling factor per bucket may be helpful as a way to reduce the delta between the time when timers should trigger and when they do. Use of 1.4 instead of 2 would provide finer granularity at the expense of up to 2x as many buckets.

Can we do better than log n? Earlier I mentioned something about tracking which buckets have had items added to them recently. Focusing on the buckets with higher add rates, and presumably the most items in them, provides the most benefit. Another possible avenue to consider is to try to reduce the total number of distinct timeout values and thus avoid the need for bucketing. |
Question: in the initial implementation, do we even need to modify existing timers? I'm wondering if scaling only new timers is good enough, which would give time for old timers to age out. The system wouldn't be nearly as responsive, but it would be much simpler. |
Use of 120 as the deadline for all of these 3 timers despite the requested values being higher seems very strange. |
Yeah if we only instantiate buckets when they're used, and empty buckets are mostly free, then an exponential growth factor of 1.4 isn't that bad, especially if there are very few distinct [min, max] values.
Better for which operations? The ones I'm looking at are enabling a timer, disabling a timer, changing the scale factor, and executing a callback.
That's a fundamental property of the bucketing strategy. It relies on quantization to be efficient. And while strange, it doesn't violate the proposed RangeTimer contract: the timers are being triggered somewhere in the requested range, just earlier than they could have been.

@mattklein123 Antonio and I discussed this a bit, and I think the opinion is that when dealing with connection timeouts that are on the scale of minutes, reducing timeouts for future timers won't necessarily do much to actually address overload.

@htuch I thought a bit more about it, and I think my suggestion of using an ordered container for tracking timers by their current trigger time (taking the scaling factor into account) isn't workable, since two timers' trigger times can be reordered by a change in scaling factor. I think you're onto something with thinking about this in terms of a separate scaled time point/clock, and will think about that some more. |
I've been talking with a couple of y'all offline about how to implement the ScaledTimerManager. I wanted to circle back and see if the rest of the proposal seemed reasonable though, specifically the augmentation of the Overload Manager's action state and addition of a range trigger. + @eziskind who designed the OM and provided feedback on #11697 |
I've written an alternate implementation based on the suggestion from Harvey about scaling time itself. You can see this approach here. It uses the same ScaledTimerManager API as the bucketing approach. I'd appreciate feedback about the larger proposal outlined above (ignoring the Bonus section) regarding the Overload Manager, which is unaffected by the choice of scaled timer implementations. |
The proposal in #11427 (comment) looks good from my perspective. The only piece of feedback I have is about the "'Scaling' state for overload manager actions" section: you may want to talk about OverloadActionState::isSaturated(), which is the new way to determine whether it is time to stop accepting or take other saturation actions. Boolean actions only care whether the state is saturated or not saturated. Consider removing "inactive" from the list of possible states.

The "Bonus: how to implement ScaledTimerManager" section may be out of date, since you have been exploring several different options for how to implement scaled timers. I made some high-level comments on the variable-speed clock prototype mentioned in the previous comment. |
That's a good idea. I guess there's really no value in having a separate "inactive" state since existing actions will treat the "inactive" and "scaling" states identically. I've updated the proposal above. |
Thanks, I've sent #12352 for the OM changes, which can be made independent of the scaled timers. |
/assign @akonradi |
@akonradi would it be possible to write a short doc on the recommended option, and the alternatives considered? This thread is getting pretty long and I'm having a hard time keeping track. Thank you! |
I've written a short design doc, available here that discusses scaled timers, the recommended option, and alternatives. Please take a look, and feel free to leave feedback here or there. |
Add a ScaledRangeTimerManager class that produces RangeTimers that can be adjusted after the fact by changing the scaling factor on the manager. This class implements the dynamic queues approach suggested in envoyproxy#11427#issuecomment-691154144. The code is tested but not yet integrated anywhere. Signed-off-by: Alex Konradi <akonradi@google.com>
Now that #13129 is almost in, it's time to talk about how scaled timers should be configured. Here are some options:
I lean towards 3.a since it avoids adding a bunch of extra config fields, keeps the overload config together, and doesn't make any existing fields or values redundant (like 3.c would). |
@akonradi I like 3.a or 3.b. 3.c is redundant somewhat and 1/2 adds a lot of complexity for a feature that is likely not used by most users and other xDS clients won't want to implement. CC @envoyproxy/api-shepherds |
I prefer 3.a over 3.b since it's easier to eyeball a percent and say "that looks reasonable" without the context of what the max is. |
Yeah I have a strong preference for (3). I like 3a if I had to pick one, though it seems like we could eventually have 3a and 3b, etc. if we wanted. |
I have some worry about there being a big separation between the place where the original timeout and the threshold/min are configured. The most worrying timeouts would be the per-cluster ones, where it may be appropriate to use different thresholds by cluster. I guess that max(min_override, threshold * full_timeout) would probably work in general. That said, the first two timeouts that we want to apply scaling logic to are the timeout waiting for the SSL handshake and the timeout waiting for HTTP/1.1 request headers. Use of approach 3a or 3b for those seems totally fine. |
I agree about config discoverability concerns. My feeling is let's try 3a/3b for these 2 cases and see how it looks/feels and then we can go from there? |
Another thing to consider: the two timeouts mentioned above don't exist yet. #13341 implements the first of the two. We have the option to introduce a timeout config message with either a max and an optional min, or a timeout and an optional scale. 3a/3b still seems fine since it allows us to make forward progress. |
Yeah good point. It would be fine to introduce a new config wrapper for this type of timeout, though then we might end up in a situation in which some have the new message and some use 3a/3b and we would end up with deprecations. I could go either way. |
I'm good with a timeout message. I'm fine also with limited deprecation if the number of sites is small, i.e. I can count it on one hand. I wouldn't advise an API-wide replacement with new message and deprecation as that is a lot of churn for all places we have timeouts. It's important to keep in mind that most users will never use this feature and all known xDS clients other than Envoy (i.e. gRPC, but others are coming) are unlikely to support this. |
I talked with alex about this offline and understand his proposal better now. I think he's going for option 3a/3b. |
I've added a prototype that supports options 3a & 3b here. |
@akonradi can you enumerate the set of timer types? I'm wondering how large/small it is? |
I was thinking
I know Antonio has also mentioned an additional timer for (and I'm badly paraphrasing) the TLS handshake, but I don't know that we'd want to scale that during overload. |
OK, so two of them are new and can be done with a dedicated type, the other two wouldn't be so bad to deprecate/migrate, but I'm OK with 3a. |
We want a timeout that covers from the accept of a new SSL connection until the SSL handshake completes. Yes, we should scale that timeout during overload. |
IIRC #13475 added scaling for this timeout
Not sure what plans we have for this case. I know a timeout exists, do we want to provide a way to scale this timeout? I think we could go either way here.
Do we have plans to scale this timeout? If we expect this to be configured to a small value most of the time, I think it would be fine to omit scaling.
Issue #11426 covers the SSL connection case. Based on the replies to the questions above, we should decide if there's work remaining on this issue, or it is time to close it and declare victory. |
Scaling the stream idle timeout should be pretty trivial, and should help with resource attacks since otherwise the connection idle timeout is useless for preventing resource attacks. I don't have a strong opinion about the header processing timeout since it's supposed to be small. |
From discussion on #14290: Should we consider scaling the drain timeout for HTTP connections? From Antonio:
|
HTTP1 downstream connections are either waiting for the first byte of the request headers, parsing request headers or handling the active request. We benefit from relatively long timeout for connections with no active request as it allows for effective prefetching and connection reuse. We do expect a relatively short time between when the first and last byte of the request headers arrive which could be covered by a shorter timeout.
Similar logic could also be applied to H2 connections with 0 active requests.
The timeouts for idle connections with no active requests and for connections parsing headers could be reduced based on memory pressure, to increase resiliency to memory exhaustion attacks and improve the proxy's ability to accept legitimate connections/requests. Config strawman: min/max timeouts and high/low memory thresholds at which the timeouts apply, or high/low connection counts at which the timeouts apply.