-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
outlier: tooling for success rate ejection #618
Changes from 8 commits
41f5a06
ce8eb86
edaf832
6ddc858
99df9b4
1f3ff02
5953641
24cedc8
84a8f97
b1e5845
3825f62
7d3f240
08f1fcd
a401824
0bd0487
c103237
eaedc71
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -76,15 +76,20 @@ being ejected and for what reasons. The log uses a JSON format with one object p | |
"upstream_url": "...", | ||
"action": "...", | ||
"type": "...", | ||
"num_ejections": "..." | ||
"num_ejections": "...", | ||
"enforced": "...", | ||
"host_success_rate": "...", | ||
"cluster_success_rate_average": "...", | ||
"cluster_success_rate_ejection_threshold": "..." | ||
|
||
} | ||
|
||
time | ||
The time that the event took place. | ||
|
||
secs_since_last_action | ||
The time in seconds since the last action (either an ejection or unejection) | ||
took place. This time will be -1 for the first ejection given there is no | ||
took place. This time will be ``-1`` for the first ejection given there is no | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. change |
||
action before the first ejection. | ||
|
||
cluster | ||
|
@@ -98,12 +103,39 @@ action | |
brought back into service. | ||
|
||
type | ||
If ``action`` is ``eject``, species the type of ejection that took place. Currently this can | ||
only be ``5xx``. | ||
If ``action`` is ``eject``, specifies the type of ejection that took place. Currently this can be: | ||
``5xx``, and ``SuccessRate``. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No comma after 5xx There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I thought you were a supporter of the oxford comma ;) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, when there are 3 or more things, not 2. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (Comma still needs to be deleted) |
||
|
||
num_ejections | ||
The number of times the host has been ejected (local to that Envoy and gets reset if the host | ||
gets removed from the upstream cluster for any reason and then re-added). | ||
If ``action`` is ``eject``, specifies the number of times the host has been ejected | ||
(local to that Envoy and gets reset if the host gets removed from the upstream cluster for any | ||
reason and then re-added). | ||
|
||
enforced | ||
If ``action`` is ``eject``, specifies if the ejection was actually performed (``true``), or | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this sentence is too long and confusing. Please break it up. |
||
if the action was only logged but the host was not ejected (``false``). | ||
|
||
host_success_rate | ||
If ``action`` is ``eject``, and ``type`` is ``SuccessRate``, specifies the host's success rate | ||
at the time of the ejection event on a ``0-100`` range. A ``-1`` value means that a success | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. -1 can only happen if type is 5xx, right? This is kind of confusing. Can we just not write out host_succes_rate in the case of 5xx? Same for all the ones below. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. -1 can also happen if the type is success rate, i.e if there is no enough request volume. However, you are right it would never happen in the logs. Because, if a host is ejected it is because there was enough data to do that. -1 would happen in the admin |
||
rate was not calculated for the host in the last | ||
:ref:`interval<config_cluster_manager_cluster_outlier_detection_interval_ms>`. | ||
|
||
.. _arch_overview_outlier_detection_ejection_event_logging_cluster_success_rate_average: | ||
|
||
cluster_success_rate_average | ||
If ``action`` is ``eject``, and ``type`` is ``SuccessRate``, specifies the average success | ||
rate of the hosts in the cluster at the time of the ejection event on a ``0-100`` range. | ||
A ``-1`` value means that an average success rate was not calculated in the last | ||
:ref:`interval<config_cluster_manager_cluster_outlier_detection_interval_ms>`. | ||
|
||
.. _arch_overview_outlier_detection_ejection_event_logging_cluster_success_rate_ejection_threshold: | ||
|
||
cluster_success_rate_ejection_threshold | ||
If ``action`` is ``eject``, and ``type`` is ``SuccessRate``, specifies success rate ejection | ||
threshold at the time of the ejection event. A ``-1`` value means that an ejection threshold | ||
was not calculated in the last | ||
:ref:`interval<config_cluster_manager_cluster_outlier_detection_interval_ms>`. | ||
|
||
Configuration reference | ||
----------------------- | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,22 +19,34 @@ modify different aspects of the server. | |
|
||
List out all configured :ref:`cluster manager <arch_overview_cluster_manager>` clusters. This | ||
information includes all discovered upstream hosts in each cluster along with per host statistics. | ||
This is useful for debugging service discovery issues. The per host statistics include: | ||
|
||
.. csv-table:: | ||
:header: Name, Type, Description | ||
:widths: 1, 1, 2 | ||
|
||
cx_total, Counter, Total connections | ||
cx_active, Gauge, Total active connections | ||
cx_connect_fail, Counter, Total connection failures | ||
rq_total, Counter, Total requests | ||
rq_timeout, Counter, Total timed out requests | ||
rq_active, Gauge, Total active requests | ||
healthy, String, The health status of the host. See below | ||
weight, Integer, Load balancing weight (1-100) | ||
zone, String, Service zone | ||
canary, Boolean, Whether the host is a canary | ||
This is useful for debugging service discovery issues. | ||
|
||
Cluster wide information | ||
- :ref:`circuit breakers<config_cluster_manager_cluster_circuit_breakers>` settings for all priority settings. | ||
|
||
- Information about :ref:`outlier detection<arch_overview_outlier_detection>` if a detector is installed. Currently | ||
:ref:`success rate average<arch_overview_outlier_detection_ejection_event_logging_cluster_success_rate_average>`, | ||
and :ref:`ejection threshold<arch_overview_outlier_detection_ejection_event_logging_cluster_success_rate_ejection_threshold>` are presented. | ||
|
||
Per host statistics | ||
.. csv-table:: | ||
:header: Name, Type, Description | ||
:widths: 1, 1, 2 | ||
|
||
cx_total, Counter, Total connections | ||
cx_active, Gauge, Total active connections | ||
cx_connect_fail, Counter, Total connection failures | ||
rq_total, Counter, Total requests | ||
rq_timeout, Counter, Total timed out requests | ||
rq_active, Gauge, Total active requests | ||
healthy, String, The health status of the host. See below | ||
weight, Integer, Load balancing weight (1-100) | ||
zone, String, Service zone | ||
canary, Boolean, Whether the host is a canary | ||
success_rate, Double, "Request success rate (0-100). -1 if there was not enough | ||
:ref:`request volume<config_cluster_manager_cluster_outlier_detection_success_rate_request_volume>` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Did you actually look at these docs rendered? This is almost definitely not correct and will look broken. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I rendered them and they look correct |
||
in the :ref:`interval<config_cluster_manager_cluster_outlier_detection_interval_ms>` | ||
to calculate it" | ||
|
||
Host health status | ||
A host is either healthy or unhealthy because of one or more different failing health states. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -48,34 +48,16 @@ class DetectorHostSink { | |
* @return the last time this host was unejected, if the host has been unejected previously. | ||
*/ | ||
virtual const Optional<SystemTime>& lastUnejectionTime() PURE; | ||
}; | ||
|
||
typedef std::unique_ptr<DetectorHostSink> DetectorHostSinkPtr; | ||
|
||
enum class EjectionType { Consecutive5xx, SuccessRate }; | ||
|
||
/** | ||
* Sink for outlier detection event logs. | ||
*/ | ||
class EventLogger { | ||
public: | ||
virtual ~EventLogger() {} | ||
|
||
/** | ||
* Log an ejection event. | ||
* @param host supplies the host that generated the event. | ||
* @param type supplies the type of the event. | ||
*/ | ||
virtual void logEject(HostDescriptionConstSharedPtr host, EjectionType type) PURE; | ||
|
||
/** | ||
* Log an unejection event. | ||
* @param host supplies the host that generated the event. | ||
* @return the success rate of the host in the last calculated interval, in the range 0-100. | ||
* -1 means that the host did not have enough request volume to calculate success rate | ||
* or the cluster did not have enough hosts to run through success rate outlier ejection. | ||
*/ | ||
virtual void logUneject(HostDescriptionConstSharedPtr host) PURE; | ||
virtual double successRate() const PURE; | ||
}; | ||
|
||
typedef std::shared_ptr<EventLogger> EventLoggerSharedPtr; | ||
typedef std::unique_ptr<DetectorHostSink> DetectorHostSinkPtr; | ||
|
||
/** | ||
* Interface for an outlier detection engine. Uses per host data to determine which hosts in a | ||
|
@@ -95,9 +77,52 @@ class Detector { | |
* changes state (either ejected or brought back in) due to outlier status. | ||
*/ | ||
virtual void addChangedStateCb(ChangeStateCb cb) PURE; | ||
|
||
/** | ||
* Returns the average success rate of the hosts in the Detector for the last aggregation | ||
* interval. | ||
* @return the average success rate, or -1 if there were not enough hosts with enough request | ||
* volume to proceed with success rate based outlier ejection. | ||
*/ | ||
virtual double successRateAverage() PURE; | ||
|
||
/** | ||
* Returns the success rate threshold used in the last interval. The threshold is used to eject | ||
* hosts based on their success rate. | ||
* @return the threshold, or -1 if there were not enough hosts with enough request volume to | ||
* proceed with success rate based outlier ejection. | ||
*/ | ||
virtual double successRateEjectionThreshold() PURE; | ||
}; | ||
|
||
typedef std::shared_ptr<Detector> DetectorSharedPtr; | ||
|
||
enum class EjectionType { Consecutive5xx, SuccessRate }; | ||
|
||
/** | ||
* Sink for outlier detection event logs. | ||
*/ | ||
class EventLogger { | ||
public: | ||
virtual ~EventLogger() {} | ||
|
||
/** | ||
* Log an ejection event. | ||
* @param host supplies the host that generated the event. | ||
* @param detector supplies the detector that is doing the ejection. | ||
* @param type supplies the type of the event. | ||
* @param enforced is true if the ejection took place; false, if only logging took place. | ||
*/ | ||
virtual void logEject(HostDescriptionConstSharedPtr host, Detector& detector, EjectionType type, | ||
bool enforced) PURE; | ||
|
||
/** | ||
* Log an unejection event. | ||
* @param host supplies the host that generated the event. | ||
*/ | ||
virtual void logUneject(HostDescriptionConstSharedPtr host) PURE; | ||
}; | ||
|
||
typedef std::shared_ptr<EventLogger> EventLoggerSharedPtr; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: new line after type def |
||
} // Outlier | ||
} // Upstream |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,6 +4,7 @@ | |
#include "envoy/network/connection.h" | ||
#include "envoy/ssl/context.h" | ||
#include "envoy/upstream/load_balancer_type.h" | ||
#include "envoy/upstream/outlier_detection.h" | ||
#include "envoy/upstream/resource_manager.h" | ||
|
||
namespace Upstream { | ||
|
@@ -300,6 +301,12 @@ class Cluster : public virtual HostSet { | |
*/ | ||
virtual ClusterInfoConstSharedPtr info() const PURE; | ||
|
||
/** | ||
* @return a shared pointer to the cluster's outlier detector. If an outlier detector has not been | ||
* installed, returns a nullptr. | ||
*/ | ||
virtual const Outlier::DetectorSharedPtr outlierDetector() const PURE; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. return const shared pointer has no meaning. You need to return a pointer to a const Detector. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah I see this is a const pointer not a pointer to a const object. What we want is to return a pointer to a const object that way we can't change the object through the pointer. |
||
|
||
/** | ||
* Initialize the cluster. This will be called either immediately at creation or after all primary | ||
* clusters have been initialized (determined via initializePhase()). | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -185,7 +185,11 @@ void DetectorImpl::ejectHost(HostSharedPtr host, EjectionType type) { | |
runCallbacks(host); | ||
|
||
if (event_logger_) { | ||
event_logger_->logEject(host, type); | ||
event_logger_->logEject(host, *this, type, true); | ||
} | ||
} else { | ||
if (event_logger_) { | ||
event_logger_->logEject(host, *this, type, false); | ||
} | ||
} | ||
} else { | ||
|
@@ -240,7 +244,7 @@ void DetectorImpl::onConsecutive5xxWorker(HostSharedPtr host) { | |
// outliers. | ||
const double Utility::SUCCESS_RATE_STDEV_FACTOR = 1.9; | ||
|
||
double Utility::successRateEjectionThreshold( | ||
Utility::EjectionPair Utility::successRateEjectionThreshold( | ||
double success_rate_sum, const std::vector<HostSuccessRatePair>& valid_success_rate_hosts) { | ||
// This function is using mean and standard deviation as statistical measures for outlier | ||
// detection. First the mean is calculated by dividing the sum of success rate data over the | ||
|
@@ -266,7 +270,7 @@ double Utility::successRateEjectionThreshold( | |
variance /= valid_success_rate_hosts.size(); | ||
double stdev = std::sqrt(variance); | ||
|
||
return mean - (SUCCESS_RATE_STDEV_FACTOR * stdev); | ||
return EjectionPair(mean, (mean - (SUCCESS_RATE_STDEV_FACTOR * stdev))); | ||
} | ||
|
||
void DetectorImpl::processSuccessRateEjections() { | ||
|
@@ -277,6 +281,10 @@ void DetectorImpl::processSuccessRateEjections() { | |
std::vector<HostSuccessRatePair> valid_success_rate_hosts; | ||
double success_rate_sum = 0; | ||
|
||
// Reset the Detector's success rate mean and stdev | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit full stop |
||
success_rate_average_ = -1; | ||
success_rate_ejection_threshold_ = -1; | ||
|
||
// Exit early if there are not enough hosts. | ||
if (host_sinks_.size() < success_rate_minimum_hosts) { | ||
return; | ||
|
@@ -286,7 +294,6 @@ void DetectorImpl::processSuccessRateEjections() { | |
valid_success_rate_hosts.reserve(host_sinks_.size()); | ||
|
||
for (const auto& host : host_sinks_) { | ||
host.second->updateCurrentSuccessRateBucket(); | ||
// Don't do work if the host is already ejected. | ||
if (!host.first->healthFlagGet(Host::HealthFlag::FAILED_OUTLIER_CHECK)) { | ||
Optional<double> host_success_rate = | ||
|
@@ -296,15 +303,18 @@ void DetectorImpl::processSuccessRateEjections() { | |
valid_success_rate_hosts.emplace_back( | ||
HostSuccessRatePair(host.first, host_success_rate.value())); | ||
success_rate_sum += host_success_rate.value(); | ||
host.second->successRate(host_success_rate.value()); | ||
} | ||
} | ||
} | ||
|
||
if (valid_success_rate_hosts.size() >= success_rate_minimum_hosts) { | ||
double ejection_threshold = | ||
Utility::EjectionPair ejection_pair = | ||
Utility::successRateEjectionThreshold(success_rate_sum, valid_success_rate_hosts); | ||
success_rate_average_ = ejection_pair.success_rate_average_; | ||
success_rate_ejection_threshold_ = ejection_pair.ejection_threshold_; | ||
for (const auto& host_success_rate_pair : valid_success_rate_hosts) { | ||
if (host_success_rate_pair.success_rate_ < ejection_threshold) { | ||
if (host_success_rate_pair.success_rate_ < success_rate_ejection_threshold_) { | ||
stats_.ejections_success_rate_.inc(); | ||
ejectHost(host_success_rate_pair.host_, EjectionType::SuccessRate); | ||
} | ||
|
@@ -317,6 +327,12 @@ void DetectorImpl::onIntervalTimer() { | |
|
||
for (auto host : host_sinks_) { | ||
checkHostForUneject(host.first, host.second, now); | ||
|
||
// Need to update the writer bucket to keep the data valid. | ||
host.second->updateCurrentSuccessRateBucket(); | ||
// Refresh host success rate stat for the /clusters endpoint. If there is a new valid value it | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. comma after the first sentence in the if sentence. |
||
// will get updated in processSuccessRateEjections(). | ||
host.second->successRate(-1); | ||
} | ||
|
||
processSuccessRateEjections(); | ||
|
@@ -330,25 +346,54 @@ void DetectorImpl::runCallbacks(HostSharedPtr host) { | |
} | ||
} | ||
|
||
void EventLoggerImpl::logEject(HostDescriptionConstSharedPtr host, EjectionType type) { | ||
void EventLoggerImpl::logEject(HostDescriptionConstSharedPtr host, Detector& detector, | ||
EjectionType type, bool enforced) { | ||
// TODO(mattklein123): Log friendly host name (e.g., instance ID or DNS name). | ||
// clang-format off | ||
static const std::string json = | ||
static const std::string json_5xx = | ||
std::string("{{") + | ||
"\"time\": \"{}\", " + | ||
"\"secs_since_last_action\": \"{}\", " + | ||
"\"cluster\": \"{}\", " + | ||
"\"upstream_url\": \"{}\", " + | ||
"\"action\": \"eject\", " + | ||
"\"type\": \"{}\", " + | ||
"\"num_ejections\": {}" + | ||
"\"num_ejections\": \"{}\", " + | ||
"\"enforced\": \"{}\"" + | ||
"}}\n"; | ||
|
||
static const std::string json_success_rate = | ||
std::string("{{") + | ||
"\"time\": \"{}\", " + | ||
"\"secs_since_last_action\": \"{}\", " + | ||
"\"cluster\": \"{}\", " + | ||
"\"upstream_url\": \"{}\", " + | ||
"\"action\": \"eject\", " + | ||
"\"type\": \"{}\", " + | ||
"\"num_ejections\": \"{}\", " + | ||
"\"enforced\": \"{}\", " + | ||
"\"host_success_rate\": \"{}\", " + | ||
"\"cluster_average_success_rate\": \"{}\", " + | ||
"\"cluster_success_rate_ejection_threshold\": \"{}\"" + | ||
"}}\n"; | ||
// clang-format on | ||
SystemTime now = time_source_.currentSystemTime(); | ||
file_->write(fmt::format(json, AccessLogDateTimeFormatter::fromTime(now), | ||
secsSinceLastAction(host->outlierDetector().lastUnejectionTime(), now), | ||
host->cluster().name(), host->address()->asString(), typeToString(type), | ||
host->outlierDetector().numEjections())); | ||
|
||
switch (type) { | ||
case EjectionType::Consecutive5xx: | ||
file_->write(fmt::format(json_5xx, AccessLogDateTimeFormatter::fromTime(now), | ||
secsSinceLastAction(host->outlierDetector().lastUnejectionTime(), now), | ||
host->cluster().name(), host->address()->asString(), | ||
typeToString(type), host->outlierDetector().numEjections(), enforced)); | ||
break; | ||
case EjectionType::SuccessRate: | ||
file_->write(fmt::format(json_success_rate, AccessLogDateTimeFormatter::fromTime(now), | ||
secsSinceLastAction(host->outlierDetector().lastUnejectionTime(), now), | ||
host->cluster().name(), host->address()->asString(), | ||
typeToString(type), host->outlierDetector().numEjections(), enforced, | ||
host->outlierDetector().successRate(), detector.successRateAverage(), | ||
detector.successRateEjectionThreshold())); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. add break here for next person that comes along. |
||
} | ||
} | ||
|
||
void EventLoggerImpl::logUneject(HostDescriptionConstSharedPtr host) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: new line