Move 429 results from Error to Debug #46

bboreham · 2017-07-17T18:41:22Z

429 indicates an over-limit condition which will have been logged by the component that detected it, so we don't need to log it again on the calling side.

Example:

{"log":"time=\"2017-07-15T21:26:35Z\" level=warning msg=\"gRPC /cortex.Ingester/Push (rpc error: code = Code(429) desc = per-metric series limit exceeded) 2.412325ms\" \n",[...]{"name":"ingester"}
{"log":"WARN: 2017/07/17 12:22:03.256129 POST /api/prom/push (429) 11.665082ms\n",[...],"container_name":"authfe"}

rade · 2017-07-18T06:11:51Z

I worry this will make problem investigation harder. Right now, when our monitoring shows a certain error rate for a component, say n Hz, the logs for that component have a corresponding number of error entries (n per second).

So in the example, the error currently shows up in the metrics of the ingester, distributor and authfe. And there will be corresponding error messages in the logs of all three. Whereas with the change, the metrics will be the same, but the error log messages will only show up in the ingester.

We usually investigate errors "front to back": we'd typically first look at the error rate in authfe, look at the logs to figure out what it is, and what, if any, component it originates from, then look at the error rate of that component, then look at its logs, etc, until we get to the bottom.

Only logging errors at the bottom component will break this method of investigation.

middleware/logging.go

@@ -32,7 +32,7 @@ func (l Log) Wrap(next http.Handler) http.Handler {
 		}
 		i := &interceptor{ResponseWriter: w, statusCode: http.StatusOK}
 		next.ServeHTTP(i, r)
-		if 100 <= i.statusCode && i.statusCode < 400 {
+		if 100 <= i.statusCode && (i.statusCode < 400 || i.statusCode == http.StatusTooManyRequests) {
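
For reference, here is a minimal, self-contained sketch of the check this change is aiming for: 429 is treated like a success for logging purposes, while every other 4xx and 5xx still logs at the higher level. The names `isQuiet`, `wrap`, `levelFor` and the `logf` callback are hypothetical stand-ins for illustration, as are the level strings "debug" and "warn"; none of this is the actual code in middleware/logging.go.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// interceptor records the status code written by the wrapped handler.
type interceptor struct {
	http.ResponseWriter
	statusCode int
}

func (i *interceptor) WriteHeader(code int) {
	i.statusCode = code
	i.ResponseWriter.WriteHeader(code)
}

// isQuiet reports whether a response should be logged at debug level:
// any 1xx-3xx status, plus exactly 429 (Too Many Requests).
func isQuiet(statusCode int) bool {
	return 100 <= statusCode &&
		(statusCode < 400 || statusCode == http.StatusTooManyRequests)
}

// wrap is a hypothetical stand-in for Log.Wrap. It passes the chosen
// level and a message to logf instead of using a real logger.
func wrap(next http.Handler, logf func(level, msg string)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		i := &interceptor{ResponseWriter: w, statusCode: http.StatusOK}
		next.ServeHTTP(i, r)
		level := "warn"
		if isQuiet(i.statusCode) {
			level = "debug"
		}
		logf(level, fmt.Sprintf("%s %s (%d)", r.Method, r.URL.Path, i.statusCode))
	})
}

// levelFor sends one request through wrap to a handler that replies
// with the given status, and returns the log level that was chosen.
func levelFor(status int) string {
	var got string
	h := wrap(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
	}), func(level, _ string) { got = level })
	req := httptest.NewRequest("POST", "/api/prom/push", nil)
	h.ServeHTTP(httptest.NewRecorder(), req)
	return got
}

func main() {
	for _, s := range []int{200, 404, 429, 500} {
		fmt.Println(s, levelFor(s))
	}
}
```

With this check, 200 and 429 come out at debug while 404 and 500 stay at the higher level. Note that a condition of the form `statusCode < 400 || statusCode < 429` would instead quieten every status from 400 to 428 and leave 429 itself loud.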


jml · 2017-07-18T12:44:27Z

Relates to https://github.com/weaveworks/service-conf/issues/968

jml · 2017-07-18T12:46:03Z

@rade A valid concern. However, note that you can still employ a front-to-back approach by looking at the metrics, chasing down which service the 429s come from.

bboreham · 2017-12-17T17:57:03Z

This PR was obviated by #59, which moved all 400-level errors to debug.

Move 429 results from Error to Debug

cc9de3e

jml reviewed Jul 18, 2017

bboreham closed this Dec 17, 2017