grpc success/failure stat should count more than 0 as success #1125

richunger · 2017-06-15T21:58:15Z

https://github.com/lyft/envoy/blob/3e62653863125c07fc78fcc4bd967a10f026114b/source/common/grpc/common.cc#L34

We use this success stat to calculate our services' success rate. In the REST world, I've generally seen success rate calculated as 2xx+3xx+4xx/total_count. What envoy does now for grpc seems like the equivalent of just having 2xx in the numerator. For example, FailedPrecondition should not count as a failure in calculating SR, right?

mattklein123 · 2017-06-15T22:00:48Z

I think @fengli79 and @lizan have some thoughts on future work on the gRPC stats. Perhaps they have some thoughts here.

fengli79 · 2017-06-15T22:48:36Z

It depends on how you define the "success".
gRPC always use 200 as response code, and its status is encoded in grpc-status header, a "0" value means the rpc is success and others are failure.
In that sense, the stats is correct.

However, if you want to count FailedPrecondition as success when calculating SR, you have to do that by yourself. It's not feasible, as today, we only record either success or failure.

I'm going to create a PR to add additional stats for each gRPC status code, thus you can use them calculate the SR which meets your business logic.

mattklein123 · 2017-06-16T15:19:18Z

@richunger yeah it's a little unclear to me how FailedPrecondition is not an app error here so it does seem to me like the stats are correct for gross success/fail. In any case, as @fengli79 says we will soon have per code stats like we do for HTTP so can do custom stuff as needed.

richunger · 2017-06-16T15:30:27Z

Yeah, per-code stats seem like the best path overall.
IMHO FailedPrecondition is a client error (the moral equivalent of an HTTP 400), and not a failure of the service being measured.

mattklein123 · 2017-06-16T18:24:58Z

@richunger I agree with you that from app perspective, it's an error. From sever perspective, it's not. Let's get the per code stats in place then we can figure out how to change our internal alarms to be better.

mattklein123 · 2017-06-16T22:42:33Z

@richunger calling this fixed via #1125. Will get deployed next week. We can discuss alarm/graph changes internally.

Description: Hardcoding these isn't ideal, but it works for now. Down the road, we should map dynamically at runtime. Risk Level: Low Testing: Local with app filters Signed-off-by: Mike Schore <mike.schore@gmail.com> Signed-off-by: JP Simard <jp@jpsim.com>

mattklein123 added this to the 1.4.0 milestone Jun 16, 2017

fengli79 mentioned this issue Jun 16, 2017

Adds additional stats per gRPC status. #1129

Merged

mattklein123 closed this as completed Jun 16, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

grpc success/failure stat should count more than 0 as success #1125

grpc success/failure stat should count more than 0 as success #1125

richunger commented Jun 15, 2017

mattklein123 commented Jun 15, 2017

fengli79 commented Jun 15, 2017

mattklein123 commented Jun 16, 2017

richunger commented Jun 16, 2017

mattklein123 commented Jun 16, 2017

mattklein123 commented Jun 16, 2017

grpc success/failure stat should count more than 0 as success #1125

grpc success/failure stat should count more than 0 as success #1125

Comments

richunger commented Jun 15, 2017

mattklein123 commented Jun 15, 2017

fengli79 commented Jun 15, 2017

mattklein123 commented Jun 16, 2017

richunger commented Jun 16, 2017

mattklein123 commented Jun 16, 2017

mattklein123 commented Jun 16, 2017