-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
outlier: tooling for success rate ejection #618
Conversation
include/envoy/upstream/upstream.h
Outdated
* -1 means that the host did not have enough request volume to calculate success rate | ||
* or the cluster did not have enough hosts to run through success rate outlier ejection. | ||
*/ | ||
virtual double successRate() const PURE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both of these should come directly out of the outlier detector sink via here: https://github.com/lyft/envoy/blob/ce8eb8634f0c6aa24485e17fbd563ecc18d73982/include/envoy/upstream/host_description.h#L51
Don't add to host object.
docs/operations/admin.rst
Outdated
@@ -35,6 +35,7 @@ modify different aspects of the server. | |||
weight, Integer, Load balancing weight (1-100) | |||
zone, String, Service zone | |||
canary, Boolean, Whether the host is a canary | |||
success_rate, Double, Request success rate (0-100). -1 if there was not enough request volume to calculate it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you link to the info about "not enough request volume"?
@@ -65,8 +78,9 @@ class EventLogger { | |||
* Log an ejection event. | |||
* @param host supplies the host that generated the event. | |||
* @param type supplies the type of the event. | |||
* @param enforced is true if the ejection took place, false if only logging took place. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
run on sentence.
@@ -266,6 +271,9 @@ double Utility::successRateEjectionThreshold( | |||
variance /= valid_success_rate_hosts.size(); | |||
double stdev = std::sqrt(variance); | |||
|
|||
// Set the Detector's member variable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: full stop.
@@ -48,6 +48,19 @@ class DetectorHostSink { | |||
* @return the last time this host was unejected, if the host has been unejected previously. | |||
*/ | |||
virtual const Optional<SystemTime>& lastUnejectionTime() PURE; | |||
|
|||
/** | |||
* @return the success rate of the host in the last calculated interval, in the range -1-100. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1-100 is a bit misleading, as it's [0;100] with -1 as a separate value. I like the way you cover this in docs, maybe do the same here.
docs/operations/admin.rst
Outdated
@@ -35,6 +35,9 @@ modify different aspects of the server. | |||
weight, Integer, Load balancing weight (1-100) | |||
zone, String, Service zone | |||
canary, Boolean, Whether the host is a canary | |||
success_rate, Double, "Request success rate (0-100). -1 if there was not enough |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Link to some docs here (or write it here) that describe what interval this is over.
* Set the success rate of the host. | ||
* @param new_success_rate the new success rate calculated for the host in the last interval. | ||
*/ | ||
virtual void successRate(double new_success_rate) PURE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't need to be in public interface.
@@ -241,7 +245,8 @@ void DetectorImpl::onConsecutive5xxWorker(HostSharedPtr host) { | |||
const double Utility::SUCCESS_RATE_STDEV_FACTOR = 1.9; | |||
|
|||
double Utility::successRateEjectionThreshold( | |||
double success_rate_sum, const std::vector<HostSuccessRatePair>& valid_success_rate_hosts) { | |||
double success_rate_sum, const std::vector<HostSuccessRatePair>& valid_success_rate_hosts, | |||
double& success_rate_average) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of passing an out param, can you return a struct. It's easier to grok what is going on.
@@ -330,7 +348,8 @@ void DetectorImpl::runCallbacks(HostSharedPtr host) { | |||
} | |||
} | |||
|
|||
void EventLoggerImpl::logEject(HostDescriptionConstSharedPtr host, EjectionType type) { | |||
void EventLoggerImpl::logEject(HostDescriptionConstSharedPtr host, EjectionType type, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be useful to log some of the SR stats here also to see the SR, the cutoff, etc.?
include/envoy/upstream/upstream.h
Outdated
@@ -300,6 +301,8 @@ class Cluster : public virtual HostSet { | |||
*/ | |||
virtual ClusterInfoConstSharedPtr info() const PURE; | |||
|
|||
virtual Outlier::DetectorSharedPtr outlierDetector() const PURE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment, this should return a pointer to a const detector.
@@ -246,7 +255,8 @@ class Utility { | |||
*/ | |||
static double | |||
successRateEjectionThreshold(double success_rate_sum, | |||
const std::vector<HostSuccessRatePair>& valid_success_rate_hosts); | |||
const std::vector<HostSuccessRatePair>& valid_success_rate_hosts, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update comment
source/server/http/admin.cc
Outdated
Upstream::Outlier::DetectorSharedPtr outlier_detector, | ||
Buffer::Instance& response) { | ||
if (outlier_detector) { | ||
response.add(fmt::format("{}::outlier::success_rate_average::{}", cluster_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put all these into docs?
docs/intro/arch_overview/outlier.rst
Outdated
If ``action`` is ``eject``, species the type of ejection that took place. Currently this can | ||
only be ``5xx``. | ||
If ``action`` is ``eject``, specifies the type of ejection that took place. Currently this can be: | ||
``5xx``, and ``SuccessRate``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No comma after 5xx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought you were a supporter of the oxford comma ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, when there are 3 or more things, not 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Comma still needs to be deleted)
docs/intro/arch_overview/outlier.rst
Outdated
|
||
host_success_rate | ||
If ``action`` is ``eject``, and ``type`` is ``SuccessRate``, specifies the host's success rate | ||
at the time of the ejection event on a ``0-100`` range. A ``-1`` value means that a success |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1 can only happen if type is 5xx, right? This is kind of confusing. Can we just not write out host_succes_rate in the case of 5xx? Same for all the ones below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1 can also happen if the type is success rate, i.e if there is no enough request volume. However, you are right it would never happen in the logs. Because, if a host is ejected it is because there was enough data to do that. -1 would happen in the admin /clusters
endpoint, not here.
zone, String, Service zone | ||
canary, Boolean, Whether the host is a canary | ||
success_rate, Double, "Request success rate (0-100). -1 if there was not enough | ||
:ref:`request volume<config_cluster_manager_cluster_outlier_detection_success_rate_request_volume>` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you actually look at these docs rendered? This is almost definitely not correct and will look broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I rendered them and they look correct
include/envoy/upstream/upstream.h
Outdated
* @return a shared pointer to the cluster's outlier detector. If an outlier detector has not been | ||
* installed, returns a nullptr. | ||
*/ | ||
virtual const Outlier::DetectorSharedPtr outlierDetector() const PURE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return const shared pointer has no meaning. You need to return a pointer to a const Detector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see this is a const pointer not a pointer to a const object. What we want is to return a pointer to a const object that way we can't change the object through the pointer.
host->cluster().name(), host->address()->asString(), | ||
typeToString(type), host->outlierDetector().numEjections(), enforced, | ||
host->outlierDetector().successRate(), detector.successRateAverage(), | ||
detector.successRateEjectionThreshold())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add break here for next person that comes along.
@@ -237,14 +247,21 @@ class EventLoggerImpl : public EventLogger { | |||
*/ | |||
class Utility { | |||
public: | |||
struct EjectionPair { | |||
EjectionPair(double success_rate_average, double ejection_threshold) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need constructor here. You can just do aggregate init ( {...} )
/** | ||
* This function returns the success rate threshold for success rate outlier detection. If a | ||
* host's success rate is under this threshold the host is an outlier. | ||
* @param success_rate_sum is the sum of the data in the success_rate_data vector. | ||
* @param success_rate_data is the vector containing the individual success rate data points. | ||
* @param valid_success_rate_hosts is the vector containing the individual success rate data | ||
* points. | ||
* @return the success rate threshold. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment wrong
Looks good other than comma still in docs. Do you have 100% coverage? |
363fc10
to
3825f62
Compare
docs/intro/arch_overview/outlier.rst
Outdated
"host_success_rate": "...", | ||
"cluster_success_rate_average": "...", | ||
"cluster_success_rate_ejection_threshold": "..." | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: new line
docs/intro/arch_overview/outlier.rst
Outdated
If ``action`` is ``eject``, species the type of ejection that took place. Currently this can | ||
only be ``5xx``. | ||
If ``action`` is ``eject``, specifies the type of ejection that took place. Currently this can be: | ||
``5xx`` and ``SuccessRate``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second sentence doesn't make sense. Can you rewrite it? Also colon after be isn't needed.
docs/intro/arch_overview/outlier.rst
Outdated
} | ||
|
||
time | ||
The time that the event took place. | ||
|
||
secs_since_last_action | ||
The time in seconds since the last action (either an ejection or unejection) | ||
took place. This time will be -1 for the first ejection given there is no | ||
took place. This time will be ``-1`` for the first ejection given there is no |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change this time will be
to the value will be
docs/intro/arch_overview/outlier.rst
Outdated
reason and then re-added). | ||
|
||
enforced | ||
If ``action`` is ``eject``, specifies if the ejection was actually performed (``true``), or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this sentence is too long and confusing. Please break it up.
@@ -95,19 +96,22 @@ class DetectorHostSinkImpl : public DetectorHostSink { | |||
: detector_(detector), host_(host) { | |||
// Point the success_rate_accumulator_bucket_ pointer to a bucket. | |||
updateCurrentSuccessRateBucket(); | |||
success_rate_ = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to initializer list
/** | ||
* This function returns the success rate threshold for success rate outlier detection. If a | ||
* host's success rate is under this threshold the host is an outlier. | ||
* This function returns the an EjectionPair for success rate outlier detection. The pair contains |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra the
after returns
* host's success rate is under this threshold the host is an outlier. | ||
* This function returns the an EjectionPair for success rate outlier detection. The pair contains | ||
* the average success rate of all valid hosts in the cluster, and the ejection threshold. | ||
* If a host's success rate is under this threshold the host is an outlier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit full stop at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comma after the first sentence
* @param success_rate_data is the vector containing the individual success rate data points. | ||
* @return the success rate threshold. | ||
* @param valid_success_rate_hosts is the vector containing the individual success rate data | ||
* points. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit formating.
|
||
// Need to update the writer bucket to keep the data valid. | ||
host.second->updateCurrentSuccessRateBucket(); | ||
// Refresh host success rate stat for the /clusters endpoint. If there is a new valid value it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comma after the first sentence in the if sentence.
* This function returns the success rate threshold for success rate outlier detection. If a | ||
* host's success rate is under this threshold the host is an outlier. | ||
* This function returns the an EjectionPair for success rate outlier detection. The pair contains | ||
* the average success rate of all valid hosts in the cluster, and the ejection threshold. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no comma needed before the and
@@ -277,6 +281,10 @@ void DetectorImpl::processSuccessRateEjections() { | |||
std::vector<HostSuccessRatePair> valid_success_rate_hosts; | |||
double success_rate_sum = 0; | |||
|
|||
// Reset the Detector's success rate mean and stdev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit full stop
@@ -212,6 +219,8 @@ class DetectorImpl : public Detector, public std::enable_shared_from_this<Detect | |||
std::list<ChangeStateCb> callbacks_; | |||
std::unordered_map<HostSharedPtr, DetectorHostSinkImpl*> host_sinks_; | |||
EventLoggerSharedPtr event_logger_; | |||
double success_rate_average_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need default values for these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
including success_rate_ejection_threshold_
virtual void logUneject(HostDescriptionConstSharedPtr host) PURE; | ||
}; | ||
|
||
typedef std::shared_ptr<EventLogger> EventLoggerSharedPtr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: new line after type def
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good after few new line nits
* Integrate with new mixer control library. * Update to use splitted control lib. * remove quota config files. * Update per comment
Signed-off-by: Mandar U Jog <mjog@google.com>
Fixes envoyproxy/envoy-mobile#617, which resulted in duplicate symbols from protobuf being present in the final Envoy Mobile binary. The [root cause](bazelbuild/bazel#10484) is being [patched in Bazel](bazelbuild/bazel#10493). In the meantime, we can revert these two commits in protobuf ([first](protocolbuffers/protobuf@7b28278), [second](protocolbuffers/protobuf@a03d332)) and apply the diff as a patch until we're able to update Bazel with the fix. More context on the protobuf issue is available [here](protocolbuffers/protobuf#7046). Notes: - This should be reverted after updating Bazel with the fix - The version of protobuf specified in the `http_archive` should be in lockstep with upstream Envoy's version - Upstream Envoy's `@envoy//bazel:protobuf.patch` patch is also included here Signed-off-by: Michael Rebello <me@michaelrebello.com> Signed-off-by: JP Simard <jp@jpsim.com>
Fixes envoyproxy/envoy-mobile#617, which resulted in duplicate symbols from protobuf being present in the final Envoy Mobile binary. The [root cause](bazelbuild/bazel#10484) is being [patched in Bazel](bazelbuild/bazel#10493). In the meantime, we can revert these two commits in protobuf ([first](protocolbuffers/protobuf@7b28278), [second](protocolbuffers/protobuf@a03d332)) and apply the diff as a patch until we're able to update Bazel with the fix. More context on the protobuf issue is available [here](protocolbuffers/protobuf#7046). Notes: - This should be reverted after updating Bazel with the fix - The version of protobuf specified in the `http_archive` should be in lockstep with upstream Envoy's version - Upstream Envoy's `@envoy//bazel:protobuf.patch` patch is also included here Signed-off-by: Michael Rebello <me@michaelrebello.com> Signed-off-by: JP Simard <jp@jpsim.com>
No description provided.