
Add exponential back-off to retryStrategy #1782

Merged (6 commits) on Dec 5, 2019

Conversation

simster7 (Member)

Closes: #700. Supports duration, factor, maxDuration.

Example:

# This example demonstrates the use of retry back-offs
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-backoff-
spec:
  entrypoint: retry-backoff
  templates:
  - name: retry-backoff
    retryStrategy:
      limit: 10
      backoff:
        duration: 1       # Default unit is seconds. Could also be a Duration, e.g.: "2m", "6h", "1d"
        factor: 2
        maxDuration: "1m" # Default unit is seconds. Could also be a Duration, e.g.: "2m", "6h", "1d"
    container:
      image: python:alpine3.6
      command: ["python", -c]
      # fail with a 66% probability
      args: ["import random; import sys; exit_code = random.choice([1, 1]); sys.exit(exit_code)"]

When waiting to retry (see parent node message):

Name:                retry-container-69ggh
Namespace:           argo
ServiceAccount:      default
Status:              Running
Created:             Tue Nov 19 12:44:05 -0800 (30 seconds ago)
Started:             Tue Nov 19 12:44:05 -0800 (30 seconds ago)
Duration:            30 seconds

STEP                                             PODNAME                           DURATION  MESSAGE
 ● retry-container-69ggh (retry-container)                                                   Retrying in 12 seconds
 ├-✖ retry-container-69ggh(0) (retry-container)  retry-container-69ggh-4109076362  3s        failed with exit code 1
 ├-✖ retry-container-69ggh(1) (retry-container)  retry-container-69ggh-283735303   4s        failed with exit code 1
 ├-✖ retry-container-69ggh(2) (retry-container)  retry-container-69ggh-1894828012  3s        failed with exit code 1
 └-✖ retry-container-69ggh(3) (retry-container)  retry-container-69ggh-216919017   3s        failed with exit code 1

When the max duration limit is exceeded:

Name:                retry-container-69ggh
Namespace:           argo
ServiceAccount:      default
Status:              Failed
Message:             Max duration limit exceeded
Created:             Tue Nov 19 12:44:05 -0800 (1 minute ago)
Started:             Tue Nov 19 12:44:05 -0800 (1 minute ago)
Finished:            Tue Nov 19 12:45:08 -0800 (now)
Duration:            1 minute 3 seconds

STEP                                             PODNAME                           DURATION  MESSAGE
 ✖ retry-container-69ggh (retry-container)                                                   Max duration limit exceeded
 ├-✖ retry-container-69ggh(0) (retry-container)  retry-container-69ggh-4109076362  3s        failed with exit code 1
 ├-✖ retry-container-69ggh(1) (retry-container)  retry-container-69ggh-283735303   4s        failed with exit code 1
 ├-✖ retry-container-69ggh(2) (retry-container)  retry-container-69ggh-1894828012  3s        failed with exit code 1
 ├-✖ retry-container-69ggh(3) (retry-container)  retry-container-69ggh-216919017   3s        failed with exit code 1
 └-✖ retry-container-69ggh(4) (retry-container)  retry-container-69ggh-2363615606  3s        failed with exit code 1
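
For reference, with these settings the wait between attempts grows geometrically: roughly the base duration multiplied by factor after each failure, capped by maxDuration. The following standalone Go sketch illustrates that schedule for the settings above; it is only an illustration of the described semantics, not the controller's actual code:

package main

import (
	"fmt"
	"time"
)

// nextDelay sketches the back-off schedule described above:
// delay = duration * factor^attempt, capped at maxDuration.
func nextDelay(duration time.Duration, factor float64, maxDuration time.Duration, attempt int) time.Duration {
	delay := duration
	for i := 0; i < attempt; i++ {
		delay = time.Duration(float64(delay) * factor)
	}
	if maxDuration > 0 && delay > maxDuration {
		delay = maxDuration
	}
	return delay
}

func main() {
	// Settings from the example: duration: 1 (second), factor: 2, maxDuration: "1m"
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Printf("after failure %d: wait %s\n", attempt, nextDelay(time.Second, 2, time.Minute, attempt))
	}
	// Prints waits of 1s, 2s, 4s, 8s, 16s; later attempts would be capped at 1m0s.
}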

// If we have not yet reached the waiting deadline, keep the node in its
// current phase and surface a "Retrying in ..." message.
if time.Now().Before(waitingDeadline) {
	retryMessage := fmt.Sprintf("Retrying in %s", humanize.Duration(time.Until(waitingDeadline)))
	return woc.markNodePhase(node.Name, node.Phase, retryMessage), false, nil
}
simster7 (Member, Author) commented on Nov 19, 2019:

A bit of an interesting trade-off here: since we are changing the parentNode message, the workflow will be immediately re-queued. This makes re-queuing the Workflow with a duration (using woc.requeue(time.Until(waitingDeadline))) moot, and it introduces some "polling" behavior in which the workflow is operated on once a second, mostly just to update the "Retrying in ..." message.

If we decide that this polling behavior is not acceptable, we could make the "Retrying in ..." message static (e.g. "Retrying at [TIME]") and then re-queue with a duration as explained above.
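
A rough standalone sketch of the two message styles being compared (the woc.requeue call is the one mentioned above; everything else here is illustrative and not controller code):

package main

import (
	"fmt"
	"time"
)

func main() {
	waitingDeadline := time.Now().Add(12 * time.Second)

	// Current behavior: the message changes every second, so each update
	// re-queues the workflow immediately ("polling" roughly once a second).
	dynamicMsg := fmt.Sprintf("Retrying in %s", time.Until(waitingDeadline).Round(time.Second))

	// Alternative discussed above: a static message that never changes, which
	// would let the controller re-queue once with a delay, e.g.
	// woc.requeue(time.Until(waitingDeadline)).
	staticMsg := fmt.Sprintf("Retrying at %s", waitingDeadline.Format(time.Kitchen))

	fmt.Println(dynamicMsg) // e.g. "Retrying in 12s"
	fmt.Println(staticMsg)  // e.g. "Retrying at 12:44PM"
}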

limit: 10
backoff:
  duration: 1 # Default unit is seconds. Could also be a Duration, e.g.: "2m", "6h", "1d"
  factor: 2
A contributor commented:

Sorry for the bikeshedding, but should this not be called "base" instead, given that it is, in fact, the base of the exponential function?
Also, good work :-)

simster7 (Member, Author) replied:

Duration is actually what K8s uses :) I think that's because factor is technically optional; if it is left at 0, the back-off simply applies the same duration wait each time.
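
If that reading is right (this only illustrates the reply above and is not verified against the controller code), leaving factor unset would make every retry wait the same base duration:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical illustration: with factor left at 0 (unset), the wait never
	// grows; every retry uses the same base duration. (Compare the exponential
	// schedule sketched earlier, where factor: 2 doubles the wait each time.)
	duration := time.Second
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Printf("after failure %d: wait %s\n", attempt, duration)
	}
}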

simster7 added this to the v2.4.3 milestone on Dec 4, 2019

sarabala1979 (Member) commented:

Can you resolve the conflicts?


Successfully merging this pull request may close these issues.

retryStrategy needs exponential backoff controls