-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
overload: scale selected timers in response to load #13475
overload: scale selected timers in response to load #13475
Conversation
This will allow mocking for users of the timer manager. Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Add the API protos and implementation framework for scaling timers based on resoure pressure with a scaled trigger. Signed-off-by: Alex Konradi <akonradi@google.com>
Allow Envoy to drop HTTP downstreams with idle connections in response to overload conditions. Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
@@ -39,13 +40,23 @@ class OverloadActionState { | |||
*/ | |||
using OverloadActionCb = std::function<void(OverloadActionState)>; | |||
|
|||
enum class OverloadTimerType { | |||
// Timers created with this type will never be scaled. This should only be used for testing. | |||
UnscaledRealTimer, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused enum value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this in an intermediate commit for testing. It could be removed but is convenient for testing without a real timer type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add ForTest to the end of this enum name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
ScaledTimerMinimum minimum = | ||
minimum_it != timer_minimums_.end() ? minimum_it->second : ScaledTimerMinimum(1.0); | ||
return std::make_unique<FixedMinimumScaledTimer>( | ||
scaled_timer_manager_->createTimer(std::move(callback)), minimum); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this method do when there is no configured minimum? I think it should result in a timer that has minimum equal to max, or result in creation of a regular timer instead of a scaled one in order to preserve behavior before we introduced scaled timers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default minimum is ScaledTimerMinimum(1.0) which makes min = max. It seemed more straightforward to avoid special casing, though the behavior of the returned range timer is degenerate and it triggers exactly when its wrapper TimerPtr triggers.
auto minimum_it = timer_minimums_.find(timer_type); | ||
ScaledTimerMinimum minimum = | ||
minimum_it != timer_minimums_.end() ? minimum_it->second : ScaledTimerMinimum(1.0); | ||
return std::make_unique<FixedMinimumScaledTimer>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the implementation of FixedMinimumScaledTimer and ScaledTimerMinimum be moved to scaled timer manager, and have scaled timer manager return instances of TimerPtr interface instead?
I think this may allow us to avoid having an unique_ptr scaled timer inside FixedMinimumScaled which would result in reduction in allocations and memory usage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no objections, though in that case we'd want to get rid of the RangeTimer interface as well. Yeah I'm not a big fan of my existing implementation where we wrap a TimerPtr with a RangeTimerPtr and then again in a TimerPtr. I'm a little unclear on how that would work though: would ScaledTimerManager::createTimer take either a minimum duration or scaling factor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's start by moving this wrapper to scaled timer manager
Have ScaledTimerManager::createTimer take ScaledTimerMinimum as an input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split out as #13524
- name: "envoy.resource_monitors.fake_resource1" | ||
scaled: | ||
scaling_threshold: 0.9 | ||
saturation_threshold: 0.9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be significant whitespace change here due to the change in syntax used to represent the protos. Is this necessary? Consider doing this change in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessary in this PR; I'll break it up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Syntax change is now in #13518
test/mocks/event/mocks.h
Outdated
MOCK_METHOD(bool, enabled, (), (override)); | ||
|
||
const ScopeTrackedObject* scope_{}; | ||
bool enabled_{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scope_ and enabled_ seem unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, overzealous copy-paste. I'll remove them in the next commit.
This will allow mocking for users of the timer manager. Signed-off-by: Alex Konradi <akonradi@google.com>
Add a method to ScaledRangeTimerManager to return a TimerPtr that wraps a RangeTimerImpl. While the wrapper could be implemented externally by wrapping a RangeTimerPtr, wrapping the impl class is more efficient since it requires less indirection and fewer heap allocations. Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
message ScaleTimersOverloadActionConfig { | ||
enum TimerType { | ||
// Adjusts the idle timer for downstream HTTP connections that takes effect when there are no active streams. | ||
HTTP_DOWNSTREAM_CONNECTION_IDLE = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth mentioning this override in route_components.proto near:
google.protobuf.Duration idle_timeout = 24;
For discoverability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, add here a reference to the timeout parameter that is affected by this min override.
Explicitly provide the template parameters to absl::visit since using deduction causes Windows builds to fail. It appears that there is an internal instantiation of std::variant_size, which doesn't work with subclasses of absl::variant. Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
@@ -89,6 +116,9 @@ message OverloadAction { | |||
// DNS to ensure uniqueness. | |||
string name = 1 [(validate.rules).string = {min_len: 1}]; | |||
|
|||
// Configuration for the action being instantiated. | |||
google.protobuf.Any typed_config = 3; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Consider ordering by field id, since there's no obvious reason to put typed_config next to the name parameter.
virtual RangeTimerPtr createTimer(TimerCb callback) PURE; | ||
|
||
/** | ||
* Sets the scale factor for all timers created through this manager. The value should be between |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change "should" to "must"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
/** | ||
* Sets the scale factor for all timers created through this manager. The value should be between | ||
* 0 and 1, inclusive. The scale factor affects the amount of time timers spend in their target |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.0 and 1.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -39,13 +40,23 @@ class OverloadActionState { | |||
*/ | |||
using OverloadActionCb = std::function<void(OverloadActionState)>; | |||
|
|||
enum class OverloadTimerType { | |||
// Timers created with this type will never be scaled. This should only be used for testing. | |||
UnscaledRealTimer, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add ForTest to the end of this enum name?
* of their range by the owner of the manager object. Users of this class can call createTimer() to | ||
* receive a new RangeTimer object that they can then enable or disable at will (but only on the | ||
* same dispatcher), and setScaleFactor() to change the scaling factor. The current scale factor is | ||
* applied to all timers, including those that are created later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current wording makes it unclear if updates to scale apply to timers that are already created.
Consider: Updates to the current scale factor are applied to all timers, including those created in the past.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -80,6 +82,31 @@ message Trigger { | |||
} | |||
} | |||
|
|||
message ScaleTimersOverloadActionConfig { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no mentions of this config corresponding to the "envoy.overload_actions.reduce_timeouts" action.
Should envoy.overload_actions.reduce_timeouts be mentioned in this config message comment?
Should there be some doc updates that mention how to configure this action?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
// A set of timer scaling rules to be applied. | ||
repeated ScaleTimer timer_scale_factors = 1 [(validate.rules).repeated = {min_items: 1}]; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add release notes describing this new feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -99,6 +100,30 @@ class NamedOverloadActionSymbolTable { | |||
std::vector<std::string> names_; | |||
}; | |||
|
|||
class ScaledTimerMinimum { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class and method comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obsolete.
|
||
std::chrono::milliseconds compute(std::chrono::milliseconds ms) const; | ||
|
||
struct ScaleFactor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these 2 structs can be private now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obsolete.
scale_timer.has_min_timeout() | ||
? ScaledTimerMinimum(std::chrono::milliseconds( | ||
DurationUtil::durationToMilliseconds(scale_timer.min_timeout()))) | ||
: ScaledTimerMinimum(scale_timer.min_scale().value() / 100); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/ 100.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done for clarity, though I think type promotion makes this functionally identical.
…rface Signed-off-by: Alex Konradi <akonradi@google.com>
…rface Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
// Sets the minimum duration as a percentage of the maximum value. | ||
type.v3.Percent min_scale = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this might be a way to shoot yourself in the foot as it's not as intuitive and easy to compare as absolute time -- unless you end up doing the calculation yourself in which case maybe just use the min_timeout to set the absolute duration.
If this gets set too low, I could see the aggressive timeouts causing cascading failures i.e. user requests times out so they end up refreshing the page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, though a lot of the overload manger actions let you shoot yourself in the foot. Having a scale factor allows this to apply meaningfully to timers that can have different values across listeners.
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
…imeouts Signed-off-by: Alex Konradi <akonradi@google.com>
: action_symbol_table_(action_symbol_table), timer_minimums_(timer_minimums), | ||
actions_(action_symbol_table.size(), OverloadActionState(0)), | ||
scaled_timer_action_(action_symbol_table.lookup(OverloadActionNames::get().ReduceTimeouts)), | ||
scaled_timer_manager_(std::move(scaled_timer_manager_)) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scaled_timer_manager_(std::move(scaled_timer_manager)) {}
Also fix formatting Signed-off-by: Alex Konradi <akonradi@google.com>
The admin handler won't ever call the method to create a scaled timer since it won't have the timeout set. Signed-off-by: Alex Konradi <akonradi@google.com>
…imeouts Signed-off-by: Alex Konradi <akonradi@google.com>
The macos check looks to be the same GOAWAY issue seen elsewhere. Would merging master help, or should I just kick off a retest? |
/retest |
Retrying Azure Pipelines: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very cool. Just some API/doc comments from a review.
/wait
@@ -977,6 +977,10 @@ message RouteAction { | |||
// fires, the stream is terminated with a 408 Request Timeout error code if no | |||
// upstream response header has been received, otherwise a stream reset | |||
// occurs. | |||
// | |||
// If the overload action "envoy.overload_actions.reduce_timeouts" is configured, this timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you ref link "overload action" to the overload manager docs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, done.
// Typed configuration for the "envoy.overload_actions.reduce_timeouts" action. | ||
message ScaleTimersOverloadActionConfig { | ||
enum TimerType { | ||
UNSPECIFIED = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the reason this is specified to force the user to explicitly configure the timeout they want? I think that's OK but can you add a small comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, done. I would have liked to just start at 1 but the compiler (or maybe lint checker?) complains.
oneof overload_adjust { | ||
option (validate.required) = true; | ||
|
||
// Sets the minimum duration as an absolute value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From reading the docs, it's not immediately clear to me how min_timeout interacts with the scaling. Can you flesh this out a bit more to better describe how everything fits together in terms of scaling? An example would probably help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Elaborated in the docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks the example is very helpful. Can you potentially ref link to the docs somehow from here just so that people can go back and forth if they are confused what this does?
/wait
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, done.
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks LGTM can you merge main?
/wait
* master: (83 commits) tls: Typesafe tls slots (envoyproxy#13789) docs(example): Correct URL for caching example page (envoyproxy#13810) [fuzz] Made health check fuzz more efficient (envoyproxy#13747) rtds: properly scope rtds stats (envoyproxy#13764) http: fixing a bug with IPv6 hosts (envoyproxy#13798) connection: Remember transport socket read resumption requests and replay them when re-enabling read. (envoyproxy#13772) network: adding some accessors for ALPN work. (envoyproxy#13785) docs: added a step about how to handle platform specific extensions (envoyproxy#13759) Fix identation in ip transparency code snippet (envoyproxy#13743) wasm: enable WAVM's stack unwinding feature (envoyproxy#13792) log: set route name for direct response (envoyproxy#13683) Use nghttp2 as external dependsncy in protocol_constraints_lib (envoyproxy#13763) [Windows] Update windows dev docs (envoyproxy#13741) cel: patch thread safety issue (envoyproxy#13739) Windows: Fix ssl_socket_test (envoyproxy#13264) apple dns: add fake api test suite (envoyproxy#13780) overload: scale selected timers in response to load (envoyproxy#13475) examples: Add dynamic configuration (control plane) sandbox (envoyproxy#13746) Removed exception in getResponseStatus() (envoyproxy#13314) network: add timeout for transport connect (envoyproxy#13610) ... Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Commit Message: Scale timers in response to overload state
Additional Description:
Add a new overload action and use the new typed_config field to adjust timers based on resource pressure.
Risk Level: low - everything is disabled by default
Testing: ran unit and integration tests
Docs Changes: documented new overload action
Release Notes: reference new overload action
Towards #11427