Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grpc success/failure stat should count more than 0 as success #1125

Closed
richunger opened this issue Jun 15, 2017 · 6 comments
Closed

grpc success/failure stat should count more than 0 as success #1125

richunger opened this issue Jun 15, 2017 · 6 comments
Milestone

Comments

@richunger
Copy link

https://github.com/lyft/envoy/blob/3e62653863125c07fc78fcc4bd967a10f026114b/source/common/grpc/common.cc#L34

We use this success stat to calculate our services' success rate. In the REST world, I've generally seen success rate calculated as 2xx+3xx+4xx/total_count. What envoy does now for grpc seems like the equivalent of just having 2xx in the numerator. For example, FailedPrecondition should not count as a failure in calculating SR, right?

@mattklein123
Copy link
Member

I think @fengli79 and @lizan have some thoughts on future work on the gRPC stats. Perhaps they have some thoughts here.

@fengli79
Copy link
Contributor

It depends on how you define the "success".
gRPC always use 200 as response code, and its status is encoded in grpc-status header, a "0" value means the rpc is success and others are failure.
In that sense, the stats is correct.

However, if you want to count FailedPrecondition as success when calculating SR, you have to do that by yourself. It's not feasible, as today, we only record either success or failure.

I'm going to create a PR to add additional stats for each gRPC status code, thus you can use them calculate the SR which meets your business logic.

@mattklein123
Copy link
Member

@richunger yeah it's a little unclear to me how FailedPrecondition is not an app error here so it does seem to me like the stats are correct for gross success/fail. In any case, as @fengli79 says we will soon have per code stats like we do for HTTP so can do custom stuff as needed.

@richunger
Copy link
Author

Yeah, per-code stats seem like the best path overall.
IMHO FailedPrecondition is a client error (the moral equivalent of an HTTP 400), and not a failure of the service being measured.

@mattklein123
Copy link
Member

@richunger I agree with you that from app perspective, it's an error. From sever perspective, it's not. Let's get the per code stats in place then we can figure out how to change our internal alarms to be better.

@mattklein123
Copy link
Member

@richunger calling this fixed via #1125. Will get deployed next week. We can discuss alarm/graph changes internally.

jpsim pushed a commit that referenced this issue Nov 28, 2022
Description: Hardcoding these isn't ideal, but it works for now. Down the road, we should map dynamically at runtime.
Risk Level: Low
Testing: Local with app filters

Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit that referenced this issue Nov 29, 2022
Description: Hardcoding these isn't ideal, but it works for now. Down the road, we should map dynamically at runtime.
Risk Level: Low
Testing: Local with app filters

Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: JP Simard <jp@jpsim.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants