Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

outlier: tooling for success rate ejection #618

Merged
merged 17 commits into from
Mar 31, 2017
44 changes: 38 additions & 6 deletions docs/intro/arch_overview/outlier.rst
Original file line number Diff line number Diff line change
Expand Up @@ -76,15 +76,20 @@ being ejected and for what reasons. The log uses a JSON format with one object p
"upstream_url": "...",
"action": "...",
"type": "...",
"num_ejections": "..."
"num_ejections": "...",
"enforced": "...",
"host_success_rate": "...",
"cluster_success_rate_average": "...",
"cluster_success_rate_ejection_threshold": "..."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: new line

}

time
The time that the event took place.

secs_since_last_action
The time in seconds since the last action (either an ejection or unejection)
took place. This time will be -1 for the first ejection given there is no
took place. This time will be ``-1`` for the first ejection given there is no
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change this time will be to the value will be

action before the first ejection.

cluster
Expand All @@ -98,12 +103,39 @@ action
brought back into service.

type
If ``action`` is ``eject``, species the type of ejection that took place. Currently this can
only be ``5xx``.
If ``action`` is ``eject``, specifies the type of ejection that took place. Currently this can be:
``5xx``, and ``SuccessRate``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No comma after 5xx

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought you were a supporter of the oxford comma ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, when there are 3 or more things, not 2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Comma still needs to be deleted)


num_ejections
The number of times the host has been ejected (local to that Envoy and gets reset if the host
gets removed from the upstream cluster for any reason and then re-added).
If ``action`` is ``eject``, specifies the number of times the host has been ejected
(local to that Envoy and gets reset if the host gets removed from the upstream cluster for any
reason and then re-added).

enforced
If ``action`` is ``eject``, specifies if the ejection was actually performed (``true``), or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sentence is too long and confusing. Please break it up.

if the action was only logged but the host was not ejected (``false``).

host_success_rate
If ``action`` is ``eject``, and ``type`` is ``SuccessRate``, specifies the host's success rate
at the time of the ejection event on a ``0-100`` range. A ``-1`` value means that a success
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 can only happen if type is 5xx, right? This is kind of confusing. Can we just not write out host_succes_rate in the case of 5xx? Same for all the ones below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 can also happen if the type is success rate, i.e if there is no enough request volume. However, you are right it would never happen in the logs. Because, if a host is ejected it is because there was enough data to do that. -1 would happen in the admin /clusters endpoint, not here.

rate was not calculated for the host in the last
:ref:`interval<config_cluster_manager_cluster_outlier_detection_interval_ms>`.

.. _arch_overview_outlier_detection_ejection_event_logging_cluster_success_rate_average:

cluster_success_rate_average
If ``action`` is ``eject``, and ``type`` is ``SuccessRate``, specifies the average success
rate of the hosts in the cluster at the time of the ejection event on a ``0-100`` range.
A ``-1`` value means that an average success rate was not calculated in the last
:ref:`interval<config_cluster_manager_cluster_outlier_detection_interval_ms>`.

.. _arch_overview_outlier_detection_ejection_event_logging_cluster_success_rate_ejection_threshold:

cluster_success_rate_ejection_threshold
If ``action`` is ``eject``, and ``type`` is ``SuccessRate``, specifies success rate ejection
threshold at the time of the ejection event. A ``-1`` value means that an ejection threshold
was not calculated in the last
:ref:`interval<config_cluster_manager_cluster_outlier_detection_interval_ms>`.

Configuration reference
-----------------------
Expand Down
44 changes: 28 additions & 16 deletions docs/operations/admin.rst
Original file line number Diff line number Diff line change
Expand Up @@ -19,22 +19,34 @@ modify different aspects of the server.

List out all configured :ref:`cluster manager <arch_overview_cluster_manager>` clusters. This
information includes all discovered upstream hosts in each cluster along with per host statistics.
This is useful for debugging service discovery issues. The per host statistics include:

.. csv-table::
:header: Name, Type, Description
:widths: 1, 1, 2

cx_total, Counter, Total connections
cx_active, Gauge, Total active connections
cx_connect_fail, Counter, Total connection failures
rq_total, Counter, Total requests
rq_timeout, Counter, Total timed out requests
rq_active, Gauge, Total active requests
healthy, String, The health status of the host. See below
weight, Integer, Load balancing weight (1-100)
zone, String, Service zone
canary, Boolean, Whether the host is a canary
This is useful for debugging service discovery issues.

Cluster wide information
- :ref:`circuit breakers<config_cluster_manager_cluster_circuit_breakers>` settings for all priority settings.

- Information about :ref:`outlier detection<arch_overview_outlier_detection>` if a detector is installed. Currently
:ref:`success rate average<arch_overview_outlier_detection_ejection_event_logging_cluster_success_rate_average>`,
and :ref:`ejection threshold<arch_overview_outlier_detection_ejection_event_logging_cluster_success_rate_ejection_threshold>` are presented.

Per host statistics
.. csv-table::
:header: Name, Type, Description
:widths: 1, 1, 2

cx_total, Counter, Total connections
cx_active, Gauge, Total active connections
cx_connect_fail, Counter, Total connection failures
rq_total, Counter, Total requests
rq_timeout, Counter, Total timed out requests
rq_active, Gauge, Total active requests
healthy, String, The health status of the host. See below
weight, Integer, Load balancing weight (1-100)
zone, String, Service zone
canary, Boolean, Whether the host is a canary
success_rate, Double, "Request success rate (0-100). -1 if there was not enough
:ref:`request volume<config_cluster_manager_cluster_outlier_detection_success_rate_request_volume>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you actually look at these docs rendered? This is almost definitely not correct and will look broken.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I rendered them and they look correct

in the :ref:`interval<config_cluster_manager_cluster_outlier_detection_interval_ms>`
to calculate it"

Host health status
A host is either healthy or unhealthy because of one or more different failing health states.
Expand Down
71 changes: 48 additions & 23 deletions include/envoy/upstream/outlier_detection.h
Original file line number Diff line number Diff line change
Expand Up @@ -48,34 +48,16 @@ class DetectorHostSink {
* @return the last time this host was unejected, if the host has been unejected previously.
*/
virtual const Optional<SystemTime>& lastUnejectionTime() PURE;
};

typedef std::unique_ptr<DetectorHostSink> DetectorHostSinkPtr;

enum class EjectionType { Consecutive5xx, SuccessRate };

/**
* Sink for outlier detection event logs.
*/
class EventLogger {
public:
virtual ~EventLogger() {}

/**
* Log an ejection event.
* @param host supplies the host that generated the event.
* @param type supplies the type of the event.
*/
virtual void logEject(HostDescriptionConstSharedPtr host, EjectionType type) PURE;

/**
* Log an unejection event.
* @param host supplies the host that generated the event.
* @return the success rate of the host in the last calculated interval, in the range 0-100.
* -1 means that the host did not have enough request volume to calculate success rate
* or the cluster did not have enough hosts to run through success rate outlier ejection.
*/
virtual void logUneject(HostDescriptionConstSharedPtr host) PURE;
virtual double successRate() const PURE;
};

typedef std::shared_ptr<EventLogger> EventLoggerSharedPtr;
typedef std::unique_ptr<DetectorHostSink> DetectorHostSinkPtr;

/**
* Interface for an outlier detection engine. Uses per host data to determine which hosts in a
Expand All @@ -95,9 +77,52 @@ class Detector {
* changes state (either ejected or brought back in) due to outlier status.
*/
virtual void addChangedStateCb(ChangeStateCb cb) PURE;

/**
* Returns the average success rate of the hosts in the Detector for the last aggregation
* interval.
* @return the average success rate, or -1 if there were not enough hosts with enough request
* volume to proceed with success rate based outlier ejection.
*/
virtual double successRateAverage() PURE;

/**
* Returns the success rate threshold used in the last interval. The threshold is used to eject
* hosts based on their success rate.
* @return the threshold, or -1 if there were not enough hosts with enough request volume to
* proceed with success rate based outlier ejection.
*/
virtual double successRateEjectionThreshold() PURE;
};

typedef std::shared_ptr<Detector> DetectorSharedPtr;

enum class EjectionType { Consecutive5xx, SuccessRate };

/**
* Sink for outlier detection event logs.
*/
class EventLogger {
public:
virtual ~EventLogger() {}

/**
* Log an ejection event.
* @param host supplies the host that generated the event.
* @param detector supplies the detector that is doing the ejection.
* @param type supplies the type of the event.
* @param enforced is true if the ejection took place; false, if only logging took place.
*/
virtual void logEject(HostDescriptionConstSharedPtr host, Detector& detector, EjectionType type,
bool enforced) PURE;

/**
* Log an unejection event.
* @param host supplies the host that generated the event.
*/
virtual void logUneject(HostDescriptionConstSharedPtr host) PURE;
};

typedef std::shared_ptr<EventLogger> EventLoggerSharedPtr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: new line after type def

} // Outlier
} // Upstream
7 changes: 7 additions & 0 deletions include/envoy/upstream/upstream.h
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
#include "envoy/network/connection.h"
#include "envoy/ssl/context.h"
#include "envoy/upstream/load_balancer_type.h"
#include "envoy/upstream/outlier_detection.h"
#include "envoy/upstream/resource_manager.h"

namespace Upstream {
Expand Down Expand Up @@ -300,6 +301,12 @@ class Cluster : public virtual HostSet {
*/
virtual ClusterInfoConstSharedPtr info() const PURE;

/**
* @return a shared pointer to the cluster's outlier detector. If an outlier detector has not been
* installed, returns a nullptr.
*/
virtual const Outlier::DetectorSharedPtr outlierDetector() const PURE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return const shared pointer has no meaning. You need to return a pointer to a const Detector.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see this is a const pointer not a pointer to a const object. What we want is to return a pointer to a const object that way we can't change the object through the pointer.


/**
* Initialize the cluster. This will be called either immediately at creation or after all primary
* clusters have been initialized (determined via initializePhase()).
Expand Down
71 changes: 58 additions & 13 deletions source/common/upstream/outlier_detection_impl.cc
Original file line number Diff line number Diff line change
Expand Up @@ -185,7 +185,11 @@ void DetectorImpl::ejectHost(HostSharedPtr host, EjectionType type) {
runCallbacks(host);

if (event_logger_) {
event_logger_->logEject(host, type);
event_logger_->logEject(host, *this, type, true);
}
} else {
if (event_logger_) {
event_logger_->logEject(host, *this, type, false);
}
}
} else {
Expand Down Expand Up @@ -240,7 +244,7 @@ void DetectorImpl::onConsecutive5xxWorker(HostSharedPtr host) {
// outliers.
const double Utility::SUCCESS_RATE_STDEV_FACTOR = 1.9;

double Utility::successRateEjectionThreshold(
Utility::EjectionPair Utility::successRateEjectionThreshold(
double success_rate_sum, const std::vector<HostSuccessRatePair>& valid_success_rate_hosts) {
// This function is using mean and standard deviation as statistical measures for outlier
// detection. First the mean is calculated by dividing the sum of success rate data over the
Expand All @@ -266,7 +270,7 @@ double Utility::successRateEjectionThreshold(
variance /= valid_success_rate_hosts.size();
double stdev = std::sqrt(variance);

return mean - (SUCCESS_RATE_STDEV_FACTOR * stdev);
return EjectionPair(mean, (mean - (SUCCESS_RATE_STDEV_FACTOR * stdev)));
}

void DetectorImpl::processSuccessRateEjections() {
Expand All @@ -277,6 +281,10 @@ void DetectorImpl::processSuccessRateEjections() {
std::vector<HostSuccessRatePair> valid_success_rate_hosts;
double success_rate_sum = 0;

// Reset the Detector's success rate mean and stdev
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit full stop

success_rate_average_ = -1;
success_rate_ejection_threshold_ = -1;

// Exit early if there are not enough hosts.
if (host_sinks_.size() < success_rate_minimum_hosts) {
return;
Expand All @@ -286,7 +294,6 @@ void DetectorImpl::processSuccessRateEjections() {
valid_success_rate_hosts.reserve(host_sinks_.size());

for (const auto& host : host_sinks_) {
host.second->updateCurrentSuccessRateBucket();
// Don't do work if the host is already ejected.
if (!host.first->healthFlagGet(Host::HealthFlag::FAILED_OUTLIER_CHECK)) {
Optional<double> host_success_rate =
Expand All @@ -296,15 +303,18 @@ void DetectorImpl::processSuccessRateEjections() {
valid_success_rate_hosts.emplace_back(
HostSuccessRatePair(host.first, host_success_rate.value()));
success_rate_sum += host_success_rate.value();
host.second->successRate(host_success_rate.value());
}
}
}

if (valid_success_rate_hosts.size() >= success_rate_minimum_hosts) {
double ejection_threshold =
Utility::EjectionPair ejection_pair =
Utility::successRateEjectionThreshold(success_rate_sum, valid_success_rate_hosts);
success_rate_average_ = ejection_pair.success_rate_average_;
success_rate_ejection_threshold_ = ejection_pair.ejection_threshold_;
for (const auto& host_success_rate_pair : valid_success_rate_hosts) {
if (host_success_rate_pair.success_rate_ < ejection_threshold) {
if (host_success_rate_pair.success_rate_ < success_rate_ejection_threshold_) {
stats_.ejections_success_rate_.inc();
ejectHost(host_success_rate_pair.host_, EjectionType::SuccessRate);
}
Expand All @@ -317,6 +327,12 @@ void DetectorImpl::onIntervalTimer() {

for (auto host : host_sinks_) {
checkHostForUneject(host.first, host.second, now);

// Need to update the writer bucket to keep the data valid.
host.second->updateCurrentSuccessRateBucket();
// Refresh host success rate stat for the /clusters endpoint. If there is a new valid value it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comma after the first sentence in the if sentence.

// will get updated in processSuccessRateEjections().
host.second->successRate(-1);
}

processSuccessRateEjections();
Expand All @@ -330,25 +346,54 @@ void DetectorImpl::runCallbacks(HostSharedPtr host) {
}
}

void EventLoggerImpl::logEject(HostDescriptionConstSharedPtr host, EjectionType type) {
void EventLoggerImpl::logEject(HostDescriptionConstSharedPtr host, Detector& detector,
EjectionType type, bool enforced) {
// TODO(mattklein123): Log friendly host name (e.g., instance ID or DNS name).
// clang-format off
static const std::string json =
static const std::string json_5xx =
std::string("{{") +
"\"time\": \"{}\", " +
"\"secs_since_last_action\": \"{}\", " +
"\"cluster\": \"{}\", " +
"\"upstream_url\": \"{}\", " +
"\"action\": \"eject\", " +
"\"type\": \"{}\", " +
"\"num_ejections\": {}" +
"\"num_ejections\": \"{}\", " +
"\"enforced\": \"{}\"" +
"}}\n";

static const std::string json_success_rate =
std::string("{{") +
"\"time\": \"{}\", " +
"\"secs_since_last_action\": \"{}\", " +
"\"cluster\": \"{}\", " +
"\"upstream_url\": \"{}\", " +
"\"action\": \"eject\", " +
"\"type\": \"{}\", " +
"\"num_ejections\": \"{}\", " +
"\"enforced\": \"{}\", " +
"\"host_success_rate\": \"{}\", " +
"\"cluster_average_success_rate\": \"{}\", " +
"\"cluster_success_rate_ejection_threshold\": \"{}\"" +
"}}\n";
// clang-format on
SystemTime now = time_source_.currentSystemTime();
file_->write(fmt::format(json, AccessLogDateTimeFormatter::fromTime(now),
secsSinceLastAction(host->outlierDetector().lastUnejectionTime(), now),
host->cluster().name(), host->address()->asString(), typeToString(type),
host->outlierDetector().numEjections()));

switch (type) {
case EjectionType::Consecutive5xx:
file_->write(fmt::format(json_5xx, AccessLogDateTimeFormatter::fromTime(now),
secsSinceLastAction(host->outlierDetector().lastUnejectionTime(), now),
host->cluster().name(), host->address()->asString(),
typeToString(type), host->outlierDetector().numEjections(), enforced));
break;
case EjectionType::SuccessRate:
file_->write(fmt::format(json_success_rate, AccessLogDateTimeFormatter::fromTime(now),
secsSinceLastAction(host->outlierDetector().lastUnejectionTime(), now),
host->cluster().name(), host->address()->asString(),
typeToString(type), host->outlierDetector().numEjections(), enforced,
host->outlierDetector().successRate(), detector.successRateAverage(),
detector.successRateEjectionThreshold()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add break here for next person that comes along.

}
}

void EventLoggerImpl::logUneject(HostDescriptionConstSharedPtr host) {
Expand Down
Loading