Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: add support for range trigger overload actions #12352

Merged
merged 18 commits into from
Aug 12, 2020

Conversation

akonradi
Copy link
Contributor

@akonradi akonradi commented Jul 29, 2020

Commit Message:
Change the Overload Manager API to extend the overload action state with non-binary values. This will allow future overload actions to take effect in response to increasing load, instead of to the existing inactive/active binary values. This PR also adds a range field to the overload trigger config, though no actions currently trigger on the 'scaling' state.

Additional Description:
The existing behavior is preserved by replacing all places where OverloadActionState::Active was being used with OverloadActionState::saturated(). This is a minimal re-hashing of #11697 for ##11427.

Risk Level: medium
Testing: ran unit and integration tests
Docs Changes: none
Release Notes: add "scaling" trigger for OverloadManager actions

Extend the existing overload action states with a 'scaling' state with a
value in the range [0, 1]. The scaling state is not used yet, but will
be used to support actions that can take partial effect even if resource
pressure is below saturation.

Signed-off-by: Alex Konradi <akonradi@google.com>
Add a Range trigger that will produce overload action states that are in
between the inactive and saturated states. This will enable the creation
of actions that take more effect as load increases.

Signed-off-by: Alex Konradi <akonradi@google.com>
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #12352 was opened by akonradi.

see: more, trace.

Signed-off-by: Alex Konradi <akonradi@google.com>
@akonradi
Copy link
Contributor Author

/cc @eziskind @antoniovicente

Rename the gauge to match the documentation

Signed-off-by: Alex Konradi <akonradi@google.com>
Both stats are present for all actions.

Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Correct test for stat name change in prior commit

Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Comment on lines 43 to 44
saturated, the resource pressure is at or above the critical point; drastic action should be taken
scaling, the resource pressure is below the critical point; action may be taken
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also have an "inactive" state so we can distinguish when envoy is not at all in overload so no action will be taken vs "scaling" when envoy is in partial overload so some scaled action will be taken? You refer to such an "inactive" state in the OverloadAction proto below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it from code after talking to @antoniovicente. I think the expectation is that actions that respond to active/inactive can just check for "active or not" and actions that look at the scaling value can check for "value() == 0"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, then you should probably remove the reference to the "inactive" state in the proto comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last argument for the "inactive" state - it might not be so useful in the code that implements the overload actions but it would be useful in logging and monitoring to distinguish between "inactive" and "scaling". During normal operation I think we'd expect all overload actions to be in "inactive" state and it would be useful to know when we transition out of that state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I keep going back and forth. It feels awkward in the documentation to say 0 is still scaling. WDYT @eziskind of doing this in a follow-on PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, a follow-on PR is fine with me.

eziskind
eziskind previously approved these changes Jul 31, 2020
Copy link
Contributor

@eziskind eziskind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking pretty good. Some nits below.

:widths: 1, 2

saturated, the resource pressure is at or above the critical point; drastic action should be taken
scaling, the resource pressure is below the critical point; action may be taken
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this description does not mention the role that the min threshold plays. I'm having some trouble with terminology here, I keep thinking about the linear region of amplifier circuits.

include/envoy/server/overload_manager.h Outdated Show resolved Hide resolved
source/server/overload_manager_impl.cc Show resolved Hide resolved
source/server/overload_manager_impl.cc Outdated Show resolved Hide resolved
if (old_state.isSaturated() != state.isSaturated()) {
ENVOY_LOG(info, "Overload action {} became {}", action,
(state.isSaturated() ? "saturated" : "scaling"));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is some logging appropriate for scaling state transitions?

Possibly LOG_EVERY_N_SEC when scaling, and/or LOG when entering and exiting scaling mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a good idea. Does Envoy have a LOG_EVERY_N_SEC macro?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't seem like it. At least not yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be at debug level I think.

test/server/overload_manager_impl_test.cc Outdated Show resolved Hide resolved
Signed-off-by: Alex Konradi <akonradi@google.com>
…ined-states

Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Explicitly document the meaning of the scaling and saturated ranges in
the overload manager overview, and list the two types of triggers
available and the meaning of each.

Signed-off-by: Alex Konradi <akonradi@google.com>
antoniovicente
antoniovicente previously approved these changes Aug 5, 2020
source/server/overload_manager_impl.cc Outdated Show resolved Hide resolved
is_active ? OverloadActionState::Active : OverloadActionState::Inactive;
ENVOY_LOG(info, "Overload action {} became {}", action,
is_active ? "active" : "inactive");
const auto state = action_it->second.getState();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is additional work needed to optimize update propagation? It seems that each action is propagated independently to workers. updateResourcePressure returns true a lot more often than before. Is there some potential for aggregation?

How often does updateResourcePressure run anyway? Should the code below wait for propagation to complete before returning as a way to ensure that at most 1 update is in progress?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I think it should be straightforward to aggregate a bunch of <action, state> pairs in this loop and then send them to workers with a single call to runOnAllThreads. Maybe best to do this in a follow-on PR?

Re: how often this is called - our best practices doc for use as an edge proxy has an example config where we set this to 250ms (https://www.envoyproxy.io/docs/envoy/latest/configuration/best_practices/edge)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think there are a couple optimizations we can apply here. Right now we post an event to each worker for each resource update, but we can do a better job of coalescing those. updateResourcePressure returns true more often only if the proxy is configured with a range trigger, which none should be yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to address this an Elisha's other comment about the 'inactive' state in a follow-on PR.

Signed-off-by: Alex Konradi <akonradi@google.com>
@akonradi
Copy link
Contributor Author

akonradi commented Aug 6, 2020

@snowp Can I get a pass on this from you or another maintainer? Also, who should I talk to for an API review?

Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
@akonradi
Copy link
Contributor Author

@mattklein123 @snowp this should be ready for another round

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API/docs LGTM with one small doc request. Very nice! @antoniovicente any further code comments? I think @snowp is out this week so happy to merge if you are satisfied otherwise.

/wait

* - :ref:`threshold <envoy_v3_api_msg_config.overload.v3.ThresholdTrigger>`
- Sets the action state to 1 (= *saturated*) when the resource pressure is above a threshold, and to 0 otherwise.
* - :ref:`scaled <envoy_v3_api_msg_config.overload.v3.ScaledTrigger>`
- Sets the action state to 0 when the resource pressure is below the min threshold,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace min/max with the new names and for bonus points ref link them

Gotta get the bonus points.

Signed-off-by: Alex Konradi <akonradi@google.com>
mattklein123
mattklein123 previously approved these changes Aug 10, 2020
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just a few minor comments.
/wait

saturated_threshold_(config.saturation_threshold()),
state_(OverloadActionState::inactive()) {
if (scaling_threshold_ >= saturated_threshold_) {
throw EnvoyException("min_value must be less than max_value");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can you adjust this exception message to match the field names? I.e. scaling/saturated_threshold.

bool isSaturated() const { return action_value_ == 1; }

private:
float action_value_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because then OverloadActionState isn't assignable.

if (old_state.isSaturated() != state.isSaturated()) {
ENVOY_LOG(info, "Overload action {} became {}", action,
(state.isSaturated() ? "saturated" : "scaling"));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be at debug level I think.

"min_value must be less than max_value.*");
}

TEST_F(OverloadManagerImplTest, ScaledTriggerMaxEqualsMin) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add one liner // above these tests explaining the intuition of what is being tested?

@repokitteh-read-only
Copy link

🙀 Error while processing event:

evaluation error
error: function _github_call error: DELETE https://api.github.com/repos/envoyproxy/envoy/issues/12352/labels/api: 404 Label does not exist []:
 Traceback (most recent call last):
  bootstrap:143: in <toplevel>
  bootstrap:137: in _main
  bootstrap:64: in _call
  bootstrap:15: in _call1
  github.com/envoyproxy/envoy/ci/repokitteh/modules/ownerscheck.star:209: in _pr_review
  github.com/envoyproxy/envoy/ci/repokitteh/modules/ownerscheck.star:146: in _reconcile
  github:240: in issue_unlabel
  github:134: in call
Error: function _github_call error: DELETE https://api.github.com/repos/envoyproxy/envoy/issues/12352/labels/api: 404 Label does not exist []
🐱

Caused by: a #12352 (review) was submitted by @htuch.

see: more, trace.

Signed-off-by: Alex Konradi <akonradi@google.com>
@mattklein123
Copy link
Member

LGTM assigning over to @htuch to verify his comments and merge. Thank you!

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants