
Retry Policy #7

Closed
jamsajones opened this issue Nov 28, 2018 · 6 comments

@jamsajones

A Retry Policy in App Mesh enables clients to protect themselves from intermittent network failures, or intermittent server-side failures. A Retry Policy is an immutable entity in App Mesh that allows users to specify the conditions under which a retry is attempted, including HTTP status codes that will trigger a retry. A Retry Policy also has parameters specifying how many times to retry, and the timeout to use per retry.

Once a Retry Policy is created, it can be attached to one or more Virtual Nodes as part of the backends. Each backend in a Virtual Node can have its own retry policy.

@coultn changed the title from "Implement Retries" to "Retry Policy" on Feb 12, 2019
@abby-fuller transferred this issue from aws/aws-app-mesh-examples on Mar 27, 2019
@ivitjuk

ivitjuk commented Apr 17, 2019

Summary

We would like to propose, and request feedback on, the new Retry Policy API. The main change is the addition of a retryPolicy field inside the existing Route Action spec.

By adding the retryPolicy field, route owners will be able to define:

  • Allowed time per retry in milliseconds
  • Maximum number of allowed retries
  • Set of events to retry on

This change approximately corresponds to the Envoy Retry Policy API. The most notable difference is in the way retryable events are specified. We diverge slightly from Envoy's approach and classify the events according to the layer at which they occur: tcp, http or grpc. The retry policy schema below demonstrates that. In the schema, together with the list of App Mesh event names, we also provide their mappings to the Envoy retry events.

The most interesting event in the schema is the HTTP code expansion field. If a field such as "1xx" is added to the list of http events to retry on, "xx" will be expanded to the full list of valid IANA HTTP codes in that class.
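To make the expansion concrete, here is a minimal sketch of how a pattern such as "5xx" could be expanded. It is not the App Mesh implementation: the function name `expand_status_class` is hypothetical, and Python's `http.HTTPStatus` enum is used as a stand-in for the IANA registry, so the exact list depends on the Python version.

```python
from http import HTTPStatus
from typing import List


def expand_status_class(pattern: str) -> List[int]:
    """Expand a class pattern such as "5xx" into the known codes of that class.

    Assumption: http.HTTPStatus approximates the IANA status code registry.
    """
    if len(pattern) != 3 or pattern[0] not in "12345" or pattern[1:] != "xx":
        raise ValueError(f"not a status class pattern: {pattern!r}")
    low = int(pattern[0]) * 100
    # Keep every registered code in the range [low, low + 99].
    return sorted(s.value for s in HTTPStatus if low <= s.value < low + 100)


if __name__ == "__main__":
    print(expand_status_class("5xx"))
```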

Retry Policy Schema

"retryPolicy": {
    "perRetryTimeoutMilis": <number>,
    "maxRetries": <number>,
    "retryOn": {        
        // AppMesh Event              Envoy Translation 
        "tcp": [
            "connection-error"       // retry_on: connect-failure
        ],
        "http": [
            "server-error",          // retriable_status_codes: [ "500", "501", "505", "506", "507", "508", "509", "510", "511" ]
            "gateway-error",         // retriable_status_codes: [ "502", "503", "504" ]
            "client-error" ,         // retriable_status_codes: [ "409" ]
            "stream-error",          // retry_on: refused-stream (h2)
            "<1xx|2xx|3xx|4xx|5xx>", // retriable_status_codes: "xx" is expanded to valid IANA HTTP codes
        ],
        "grpc": [
            "cancelled",             // retry_on: cancelled (gRPC code 1)
            "deadline-exceeded",     // retry_on: deadline-exceeded (gRPC code 4)
            "internal",              // retry_on: internal (gRPC code 13)
            "resource-exhausted",    // retry_on: resource-exhausted (gRPC code 8)
            "unavailable"            // retry_on: unavailable (gRPC code 14)
        ]
    }
}
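The translation column in the schema can be read as a lookup table. The sketch below encodes only the mappings listed in the comments above and leaves out the "Nxx" expansion. The names `translate`, `RETRY_ON` and `STATUS_CODES` are hypothetical, and this is an illustration of the mapping, not the code App Mesh uses to configure Envoy.

```python
from typing import Dict, List, Tuple

# (layer, App Mesh event) -> Envoy retry_on condition, as listed in the schema.
RETRY_ON = {
    ("tcp", "connection-error"): "connect-failure",
    ("http", "stream-error"): "refused-stream",
    ("grpc", "cancelled"): "cancelled",
    ("grpc", "deadline-exceeded"): "deadline-exceeded",
    ("grpc", "internal"): "internal",
    ("grpc", "resource-exhausted"): "resource-exhausted",
    ("grpc", "unavailable"): "unavailable",
}

# App Mesh http event -> Envoy retriable_status_codes, as listed in the schema.
STATUS_CODES = {
    "server-error": [500, 501, 505, 506, 507, 508, 509, 510, 511],
    "gateway-error": [502, 503, 504],
    "client-error": [409],
}


def translate(retry_on: Dict[str, List[str]]) -> Tuple[List[str], List[int]]:
    """Return (Envoy retry_on conditions, retriable status codes)."""
    conditions: List[str] = []
    codes: List[int] = []
    for layer, events in retry_on.items():
        for event in events:
            if layer == "http" and event in STATUS_CODES:
                codes.extend(STATUS_CODES[event])
            elif (layer, event) in RETRY_ON:
                conditions.append(RETRY_ON[(layer, event)])
            else:
                raise ValueError(f"unknown event {event!r} for layer {layer!r}")
    return conditions, sorted(set(codes))
```

For example, `translate({"tcp": ["connection-error"], "http": ["gateway-error"]})` yields the `connect-failure` condition together with the codes 502, 503 and 504.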

Example

Below we provide a full example of what a route definition would look like with a retry policy included. The retry policy below would perform up to 3 retries, each taking no more than 1000 ms. The events that would be retried are: TCP connection failure, and HTTP codes 500, 501, 505, 506, 507, 508, 509, 510 and 511.

$ cat route.json

{
  "meshName": "simple-app",
  "routeName": "simple-route",
  "spec": {
    "httpRoute": {
      "action": {
        "weightedTargets": [
          {
            "virtualNode": "service-v1",
            "weight": 90
          },
          {
            "virtualNode": "service-v2",
            "weight": 10
          }
        ],
        "match": {
           "prefix": "/"
        },
        "retryPolicy":{
           "perRetryTimeoutMilis": 1000,
           "maxRetries": 3,
           "retryOn": {        
                "tcp": [
                    "connection-error"
                ],
                "http": [
                     "server-error"
                ]
            }
        }       
      }
    }
  },
  "virtualRouterName": "service-router"
}

$ aws appmesh create-route --cli-input-json file://route.json
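As a rough illustration of what "up to 3 retries" means for a single request, the sketch below simulates the attempt count only. It assumes one initial attempt followed by at most `max_retries` further attempts, does not model the per-retry timeout, and is not how Envoy implements retries. The function name `call_with_retries` and the `send` callback are hypothetical.

```python
from typing import Callable, Set


def call_with_retries(send: Callable[[], int], max_retries: int,
                      retry_codes: Set[int]) -> int:
    """Call send() once, then retry while the status is retriable.

    send() returns an HTTP status code. At most max_retries retries are
    made, so send() is called at most 1 + max_retries times.
    """
    status = send()
    for _ in range(max_retries):
        if status not in retry_codes:
            break
        status = send()
    return status


if __name__ == "__main__":
    # A backend that fails twice with 500 and then succeeds.
    responses = iter([500, 500, 200])
    server_error = {500, 501, 505, 506, 507, 508, 509, 510, 511}
    print(call_with_retries(lambda: next(responses), 3, server_error))
```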

@shubharao

This feature is now launched in our preview channel. Please try it and let us know what you think. Documentation: https://docs.aws.amazon.com/app-mesh/latest/userguide/route-retry-policy.html
Example: https://github.com/aws/aws-app-mesh-examples/tree/master/blogs/http-retry-policy

@ewbankkit

@shubharao I've noticed that the API has the ability to specify PerRetryTimeout as an s or ms Value but always returns the Value in ms (as @ivitjuk's original spec seems to suggest). Is this the final behavior of the API?
I'm building Terraform support for this feature and it looks like the best way is to expose an attribute per_retry_timeout_millis with a default value of 15000 rather than expose a complex attribute per_retry_timeout with unit and value sub-attributes.
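For illustration, collapsing a unit/value duration into a single millisecond attribute could look like the sketch below. It assumes the duration is a mapping with `unit` ("s" or "ms") and `value` keys, as described above. The helper name `duration_to_millis` is hypothetical and is not part of the App Mesh API or the Terraform provider.

```python
from typing import Dict, Union

# Milliseconds per supported unit (assumption: only "s" and "ms" are accepted).
_MS_PER_UNIT = {"ms": 1, "s": 1000}


def duration_to_millis(duration: Dict[str, Union[str, int]]) -> int:
    """Normalize {"unit": ..., "value": ...} to a millisecond count."""
    unit = duration["unit"]
    if unit not in _MS_PER_UNIT:
        raise ValueError(f"unsupported unit: {unit!r}")
    return int(duration["value"]) * _MS_PER_UNIT[unit]


if __name__ == "__main__":
    print(duration_to_millis({"unit": "s", "value": 15}))
```

Normalizing like this loses the unit the caller originally supplied, which is the round-trip problem raised in this comment.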

@bigdefect
Contributor

@ewbankkit Thanks for reporting this, I'll put up a bug issue. The final behavior will be to correctly round trip the input value; we're working on the fix. I'd recommend implementing the Duration type as we're looking to continue using it. A potential workaround for Terraform for now would be to only support the millisecond unit, until we fix the round trip.

Are you implementing the preview api features into your standard app mesh models, or are you planning to have separate support for the preview channel? Given that preview apis are subject to change, that could cause breaking changes.

@ewbankkit

ewbankkit commented Aug 6, 2019

@efe-selcuk Thanks for the response.
Right now I'm making the changes in a branch in my fork, with the expectation that once the feature is released I'll cherry-pick the relevant commits over.
Terraform doesn't currently support the idea of a preview channel. Even supporting preview services whose APIs are in the public SDK is problematic: their resources may get incorporated into the main provider, and we'd like to have more relaxed backwards-compatibility guarantees for those resources while the service is in preview.
There has been some discussion - hashicorp/terraform-provider-aws#7659 (comment) hashicorp/terraform-provider-aws#8035 - around a possible preview/beta provider like there is for GCP.

@shubharao

Closing this as retry policies for HTTP have shipped! https://aws.amazon.com/about-aws/whats-new/2019/09/aws-app-mesh-now-supports-retry-policies/

@shubharao added the "Roadmap: Accepted" label (We are planning on doing this work.) on Sep 27, 2019
@shubharao self-assigned this on Sep 27, 2019
@shubharao removed the "Roadmap: Accepted" label (We are planning on doing this work.) on Sep 30, 2019