Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

overload: scale selected timers in response to load #13475

Merged
merged 29 commits into from
Oct 28, 2020

Conversation

akonradi
Copy link
Contributor

@akonradi akonradi commented Oct 9, 2020

Commit Message: Scale timers in response to overload state
Additional Description:
Add a new overload action and use the new typed_config field to adjust timers based on resource pressure.
Risk Level: low - everything is disabled by default
Testing: ran unit and integration tests
Docs Changes: documented new overload action
Release Notes: reference new overload action

Towards #11427

This will allow mocking for users of the timer manager.

Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Add the API protos and implementation framework for scaling timers based
on resoure pressure with a scaled trigger.

Signed-off-by: Alex Konradi <akonradi@google.com>
Allow Envoy to drop HTTP downstreams with idle connections in response
to overload conditions.

Signed-off-by: Alex Konradi <akonradi@google.com>
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #13475 was opened by akonradi.

see: more, trace.

@akonradi akonradi marked this pull request as draft October 9, 2020 21:28
Signed-off-by: Alex Konradi <akonradi@google.com>
@@ -39,13 +40,23 @@ class OverloadActionState {
*/
using OverloadActionCb = std::function<void(OverloadActionState)>;

enum class OverloadTimerType {
// Timers created with this type will never be scaled. This should only be used for testing.
UnscaledRealTimer,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused enum value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this in an intermediate commit for testing. It could be removed but is convenient for testing without a real timer type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add ForTest to the end of this enum name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

ScaledTimerMinimum minimum =
minimum_it != timer_minimums_.end() ? minimum_it->second : ScaledTimerMinimum(1.0);
return std::make_unique<FixedMinimumScaledTimer>(
scaled_timer_manager_->createTimer(std::move(callback)), minimum);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this method do when there is no configured minimum? I think it should result in a timer that has minimum equal to max, or result in creation of a regular timer instead of a scaled one in order to preserve behavior before we introduced scaled timers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default minimum is ScaledTimerMinimum(1.0) which makes min = max. It seemed more straightforward to avoid special casing, though the behavior of the returned range timer is degenerate and it triggers exactly when its wrapper TimerPtr triggers.

auto minimum_it = timer_minimums_.find(timer_type);
ScaledTimerMinimum minimum =
minimum_it != timer_minimums_.end() ? minimum_it->second : ScaledTimerMinimum(1.0);
return std::make_unique<FixedMinimumScaledTimer>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the implementation of FixedMinimumScaledTimer and ScaledTimerMinimum be moved to scaled timer manager, and have scaled timer manager return instances of TimerPtr interface instead?

I think this may allow us to avoid having an unique_ptr scaled timer inside FixedMinimumScaled which would result in reduction in allocations and memory usage

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no objections, though in that case we'd want to get rid of the RangeTimer interface as well. Yeah I'm not a big fan of my existing implementation where we wrap a TimerPtr with a RangeTimerPtr and then again in a TimerPtr. I'm a little unclear on how that would work though: would ScaledTimerManager::createTimer take either a minimum duration or scaling factor?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's start by moving this wrapper to scaled timer manager

Have ScaledTimerManager::createTimer take ScaledTimerMinimum as an input.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split out as #13524

- name: "envoy.resource_monitors.fake_resource1"
scaled:
scaling_threshold: 0.9
saturation_threshold: 0.9
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be significant whitespace change here due to the change in syntax used to represent the protos. Is this necessary? Consider doing this change in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary in this PR; I'll break it up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Syntax change is now in #13518

MOCK_METHOD(bool, enabled, (), (override));

const ScopeTrackedObject* scope_{};
bool enabled_{};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scope_ and enabled_ seem unused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, overzealous copy-paste. I'll remove them in the next commit.

This will allow mocking for users of the timer manager.

Signed-off-by: Alex Konradi <akonradi@google.com>
Add a method to ScaledRangeTimerManager to return a TimerPtr that wraps
a RangeTimerImpl. While the wrapper could be implemented externally by
wrapping a RangeTimerPtr, wrapping the impl class is more efficient
since it requires less indirection and fewer heap allocations.

Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
message ScaleTimersOverloadActionConfig {
enum TimerType {
// Adjusts the idle timer for downstream HTTP connections that takes effect when there are no active streams.
HTTP_DOWNSTREAM_CONNECTION_IDLE = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth mentioning this override in route_components.proto near:
google.protobuf.Duration idle_timeout = 24;

For discoverability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, add here a reference to the timeout parameter that is affected by this min override.

Explicitly provide the template parameters to absl::visit since
using deduction causes Windows builds to fail. It appears that there is
an internal instantiation of std::variant_size, which doesn't work with
subclasses of absl::variant.

Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
@@ -89,6 +116,9 @@ message OverloadAction {
// DNS to ensure uniqueness.
string name = 1 [(validate.rules).string = {min_len: 1}];

// Configuration for the action being instantiated.
google.protobuf.Any typed_config = 3;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Consider ordering by field id, since there's no obvious reason to put typed_config next to the name parameter.

virtual RangeTimerPtr createTimer(TimerCb callback) PURE;

/**
* Sets the scale factor for all timers created through this manager. The value should be between
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change "should" to "must"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


/**
* Sets the scale factor for all timers created through this manager. The value should be between
* 0 and 1, inclusive. The scale factor affects the amount of time timers spend in their target
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.0 and 1.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -39,13 +40,23 @@ class OverloadActionState {
*/
using OverloadActionCb = std::function<void(OverloadActionState)>;

enum class OverloadTimerType {
// Timers created with this type will never be scaled. This should only be used for testing.
UnscaledRealTimer,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add ForTest to the end of this enum name?

* of their range by the owner of the manager object. Users of this class can call createTimer() to
* receive a new RangeTimer object that they can then enable or disable at will (but only on the
* same dispatcher), and setScaleFactor() to change the scaling factor. The current scale factor is
* applied to all timers, including those that are created later.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current wording makes it unclear if updates to scale apply to timers that are already created.

Consider: Updates to the current scale factor are applied to all timers, including those created in the past.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -80,6 +82,31 @@ message Trigger {
}
}

message ScaleTimersOverloadActionConfig {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no mentions of this config corresponding to the "envoy.overload_actions.reduce_timeouts" action.

Should envoy.overload_actions.reduce_timeouts be mentioned in this config message comment?
Should there be some doc updates that mention how to configure this action?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


// A set of timer scaling rules to be applied.
repeated ScaleTimer timer_scale_factors = 1 [(validate.rules).repeated = {min_items: 1}];
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add release notes describing this new feature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -99,6 +100,30 @@ class NamedOverloadActionSymbolTable {
std::vector<std::string> names_;
};

class ScaledTimerMinimum {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

class and method comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obsolete.


std::chrono::milliseconds compute(std::chrono::milliseconds ms) const;

struct ScaleFactor {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these 2 structs can be private now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obsolete.

scale_timer.has_min_timeout()
? ScaledTimerMinimum(std::chrono::milliseconds(
DurationUtil::durationToMilliseconds(scale_timer.min_timeout())))
: ScaledTimerMinimum(scale_timer.min_scale().value() / 100);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ 100.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done for clarity, though I think type promotion makes this functionally identical.

…rface

Signed-off-by: Alex Konradi <akonradi@google.com>
…rface

Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
@akonradi akonradi marked this pull request as ready for review October 20, 2020 17:50
Comment on lines +105 to +106
// Sets the minimum duration as a percentage of the maximum value.
type.v3.Percent min_scale = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this might be a way to shoot yourself in the foot as it's not as intuitive and easy to compare as absolute time -- unless you end up doing the calculation yourself in which case maybe just use the min_timeout to set the absolute duration.

If this gets set too low, I could see the aggressive timeouts causing cascading failures i.e. user requests times out so they end up refreshing the page.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, though a lot of the overload manger actions let you shoot yourself in the foot. Having a scale factor allows this to apply meaningfully to timers that can have different values across listeners.

Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
…imeouts

Signed-off-by: Alex Konradi <akonradi@google.com>
antoniovicente
antoniovicente previously approved these changes Oct 23, 2020
: action_symbol_table_(action_symbol_table), timer_minimums_(timer_minimums),
actions_(action_symbol_table.size(), OverloadActionState(0)),
scaled_timer_action_(action_symbol_table.lookup(OverloadActionNames::get().ReduceTimeouts)),
scaled_timer_manager_(std::move(scaled_timer_manager_)) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scaled_timer_manager_(std::move(scaled_timer_manager)) {}

Also fix formatting

Signed-off-by: Alex Konradi <akonradi@google.com>
antoniovicente
antoniovicente previously approved these changes Oct 23, 2020
The admin handler won't ever call the method to create a scaled timer
since it won't have the timeout set.

Signed-off-by: Alex Konradi <akonradi@google.com>
…imeouts

Signed-off-by: Alex Konradi <akonradi@google.com>
antoniovicente
antoniovicente previously approved these changes Oct 23, 2020
@mattklein123 mattklein123 self-assigned this Oct 24, 2020
@akonradi
Copy link
Contributor Author

The macos check looks to be the same GOAWAY issue seen elsewhere. Would merging master help, or should I just kick off a retest?

@akonradi
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #13475 (comment) was created by @akonradi.

see: more, trace.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very cool. Just some API/doc comments from a review.

/wait

@@ -977,6 +977,10 @@ message RouteAction {
// fires, the stream is terminated with a 408 Request Timeout error code if no
// upstream response header has been received, otherwise a stream reset
// occurs.
//
// If the overload action "envoy.overload_actions.reduce_timeouts" is configured, this timeout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you ref link "overload action" to the overload manager docs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, done.

// Typed configuration for the "envoy.overload_actions.reduce_timeouts" action.
message ScaleTimersOverloadActionConfig {
enum TimerType {
UNSPECIFIED = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the reason this is specified to force the user to explicitly configure the timeout they want? I think that's OK but can you add a small comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, done. I would have liked to just start at 1 but the compiler (or maybe lint checker?) complains.

oneof overload_adjust {
option (validate.required) = true;

// Sets the minimum duration as an absolute value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From reading the docs, it's not immediately clear to me how min_timeout interacts with the scaling. Can you flesh this out a bit more to better describe how everything fits together in terms of scaling? An example would probably help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elaborated in the docs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks the example is very helpful. Can you potentially ref link to the docs somehow from here just so that people can go back and forth if they are confused what this does?

/wait

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, done.

Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks LGTM can you merge main?

/wait

@mattklein123 mattklein123 merged commit a68755d into envoyproxy:master Oct 28, 2020
@akonradi akonradi deleted the overload-scaled-timeouts branch October 29, 2020 16:32
mpuncel added a commit to mpuncel/envoy that referenced this pull request Oct 30, 2020
* master: (83 commits)
  tls: Typesafe tls slots (envoyproxy#13789)
  docs(example): Correct URL for caching example page (envoyproxy#13810)
  [fuzz] Made health check fuzz more efficient (envoyproxy#13747)
  rtds: properly scope rtds stats (envoyproxy#13764)
  http: fixing a bug with IPv6 hosts (envoyproxy#13798)
  connection: Remember transport socket read resumption requests and replay them when re-enabling read. (envoyproxy#13772)
  network: adding some accessors for ALPN work. (envoyproxy#13785)
  docs: added a step about how to handle platform specific extensions (envoyproxy#13759)
  Fix identation in ip transparency code snippet (envoyproxy#13743)
  wasm: enable WAVM's stack unwinding feature (envoyproxy#13792)
  log: set route name for direct response (envoyproxy#13683)
  Use nghttp2 as external dependsncy in protocol_constraints_lib (envoyproxy#13763)
  [Windows] Update windows dev docs (envoyproxy#13741)
  cel: patch thread safety issue (envoyproxy#13739)
  Windows: Fix ssl_socket_test (envoyproxy#13264)
  apple dns: add fake api test suite (envoyproxy#13780)
  overload: scale selected timers in response to load (envoyproxy#13475)
  examples: Add dynamic configuration (control plane) sandbox (envoyproxy#13746)
  Removed exception in getResponseStatus() (envoyproxy#13314)
  network: add timeout for transport connect (envoyproxy#13610)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants