Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry mechanism does not work if proxy server returns 503 on gRPC call #6780

Closed
OrangeFlag opened this issue Oct 10, 2024 · 7 comments
Closed
Labels
Bug Something isn't working

Comments

@OrangeFlag
Copy link

Describe the bug
The retry mechanism in OkHttpGrpcSender fails to retry when Envoy returns a 503 HTTP status code for gRPC requests. The gRPC call fails without retries being triggered, which should be handled according to the provided retry policy. This issue arises because of the logic at this line of code, where retries are not attempted if the HTTP response is not successful.

Steps to reproduce
To make the issue easier to reproduce, you can use WireMock to simulate the proxy server behavior:

  1. Set up WireMock to return a 503 status code for certain gRPC requests.
    • Install and configure WireMock to mock the gRPC service.
    • Create a rule in WireMock to respond with a 503 Service Unavailable for requests.
  2. Configure the OpenTelemetry SDK to use OkHttpGrpcSender for exporting spans.
  3. Run the application, making sure requests go through the WireMock server.
  4. Observe that retries are not triggered when WireMock returns the 503 status code, and the request fails immediately.

Or use envoy for more production like behaviour:

  1. Set up Envoy as a proxy server in front of a opentelemetry collector.
  2. Configure Envoy to return a 503 Service Unavailable when the backend is unavailable or as part of error injection for testing.
  3. Use grpc exporter in OpenTelemetry (v1.41.0) to send gRPC requests through the proxy server.
  4. Observe that retries are not triggered even though a retry policy is configured, and the gRPC request fails immediately after receiving the 503 HTTP status code.

What did you expect to see?
When the proxy server (Envoy) returns a 503 Service Unavailable, the retry mechanism should trigger based on the provided RetryPolicy and automatically retry the gRPC call.

What did you see instead?
No retries were attempted. The following log is observed:
Failed to export spans. Server is UNAVAILABLE. Make sure your collector is running and reachable from this network. Full error message: no healthy upstream.
There are no logs indicating retries were made.

What version and what artifacts are you using?
Artifacts: opentelemetry-sdk, opentelemetry-exporter-sender-okhttp
Version: 1.41.0

Environment
OS: Ubuntu Jammy (22.04 LTS)
Runtime: openjdk 17

Additional context
The issue seems to originate from the following code in OkHttpGrpcSender:

if (!response.isSuccessful()) {
      return false;
}

This logic prevents retries from being attempted when the response code is anything other than 200. However, non-200 HTTP status codes like 503 should be treated as retryable. It may be beneficial to use RetryUtil.retryableHttpResponseCodes along with response.isSuccessful() to determine whether a request should be retried based on transient HTTP error codes.

@OrangeFlag OrangeFlag added the Bug Something isn't working label Oct 10, 2024
@jack-berg
Copy link
Member

My understanding is that for a server to be grpc "compliant", it needs to always return gRPC status codes even in error situations. So a load balancer that returns 503 would need to be adjusted to return a grpc-status header containing the grpc status code.

This is part of the reason why the http/protobuf variant of OTLP is now preferred over gRPC. If you can't ensure that all the components in your stack always return valid gRPC responses, consider switching to OTLP http/protobuf, which works better with the standard types of responses expected from load balancers and other off the shelf components.

@OrangeFlag
Copy link
Author

it needs to always return gRPC status codes even in error situations.

It returns the grpc status, but the retry code ignores it

@OrangeFlag
Copy link
Author

Maybe it's possible to simplify and not check http code in grpc retrier?

@jack-berg
Copy link
Member

Maybe it's possible to simplify and not check http code in grpc retrier?

Seems reasonable. Though testing may be challenging. Care to submit a PR?

Code in question is here.

response.isSuccessful() returns false unless the HTTP status code is 200. According to the grpc spec valid grpc responses have a status of 200, but

Implementations should expect broken deployments to send non-200 HTTP status codes in responses

I think its reasonable for us to interpret a valid grpc-status header even if the HTTP response status code is non-200.

@JiwonKKang
Copy link
Contributor

Hello,

Would it be alright if I submitted a PR for this issue ? I’d love to contribute, and I want to ensure it’s okay to proceed.

Thank you very much!

@jack-berg
Copy link
Member

PRs are welcome @JiwonKKang. Thanks!

@jack-berg
Copy link
Member

Resolved in #6829.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants