
Add exponential back-off to retryStrategy #1782

Merged (6 commits) on Dec 5, 2019

Conversation

simster7 (Member)

Closes: #700. Supports duration, factor, maxDuration.

Example:

# This example demonstrates the use of retry back-offs
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-backoff-
spec:
  entrypoint: retry-backoff
  templates:
  - name: retry-backoff
    retryStrategy:
      limit: 10
      backoff:
        duration: 1       # Default unit is seconds. Could also be a Duration, e.g.: "2m", "6h", "1d"
        factor: 2
        maxDuration: "1m" # Default unit is seconds. Could also be a Duration, e.g.: "2m", "6h", "1d"
    container:
      image: python:alpine3.6
      command: ["python", -c]
      # fail with a 66% probability
      args: ["import random; import sys; exit_code = random.choice([1, 1]); sys.exit(exit_code)"]

When waiting to retry (see parent node message):

Name:                retry-container-69ggh
Namespace:           argo
ServiceAccount:      default
Status:              Running
Created:             Tue Nov 19 12:44:05 -0800 (30 seconds ago)
Started:             Tue Nov 19 12:44:05 -0800 (30 seconds ago)
Duration:            30 seconds

STEP                                             PODNAME                           DURATION  MESSAGE
 ● retry-container-69ggh (retry-container)                                                   Retrying in 12 seconds
 ├-✖ retry-container-69ggh(0) (retry-container)  retry-container-69ggh-4109076362  3s        failed with exit code 1
 ├-✖ retry-container-69ggh(1) (retry-container)  retry-container-69ggh-283735303   4s        failed with exit code 1
 ├-✖ retry-container-69ggh(2) (retry-container)  retry-container-69ggh-1894828012  3s        failed with exit code 1
 └-✖ retry-container-69ggh(3) (retry-container)  retry-container-69ggh-216919017   3s        failed with exit code 1

When the max duration limit is exceeded:

Name:                retry-container-69ggh
Namespace:           argo
ServiceAccount:      default
Status:              Failed
Message:             Max duration limit exceeded
Created:             Tue Nov 19 12:44:05 -0800 (1 minute ago)
Started:             Tue Nov 19 12:44:05 -0800 (1 minute ago)
Finished:            Tue Nov 19 12:45:08 -0800 (now)
Duration:            1 minute 3 seconds

STEP                                             PODNAME                           DURATION  MESSAGE
 ✖ retry-container-69ggh (retry-container)                                                   Max duration limit exceeded
 ├-✖ retry-container-69ggh(0) (retry-container)  retry-container-69ggh-4109076362  3s        failed with exit code 1
 ├-✖ retry-container-69ggh(1) (retry-container)  retry-container-69ggh-283735303   4s        failed with exit code 1
 ├-✖ retry-container-69ggh(2) (retry-container)  retry-container-69ggh-1894828012  3s        failed with exit code 1
 ├-✖ retry-container-69ggh(3) (retry-container)  retry-container-69ggh-216919017   3s        failed with exit code 1
 └-✖ retry-container-69ggh(4) (retry-container)  retry-container-69ggh-2363615606  3s        failed with exit code 1
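
For reference, with these settings the wait between attempts grows geometrically: roughly the base duration multiplied by factor after each failure, capped by maxDuration. The following standalone Go sketch illustrates that schedule for the settings above; it is only an illustration of the described semantics, not the controller's actual code:

package main

import (
	"fmt"
	"time"
)

// nextDelay sketches the back-off schedule described above:
// delay = duration * factor^attempt, capped at maxDuration.
func nextDelay(duration time.Duration, factor float64, maxDuration time.Duration, attempt int) time.Duration {
	delay := duration
	for i := 0; i < attempt; i++ {
		delay = time.Duration(float64(delay) * factor)
	}
	if maxDuration > 0 && delay > maxDuration {
		delay = maxDuration
	}
	return delay
}

func main() {
	// Settings from the example: duration: 1 (second), factor: 2, maxDuration: "1m"
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Printf("after failure %d: wait %s\n", attempt, nextDelay(time.Second, 2, time.Minute, attempt))
	}
	// Prints waits of 1s, 2s, 4s, 8s, 16s; later attempts would be capped at 1m0s.
}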

// If we have not yet reached the waiting deadline, keep the node in its
// current phase and surface a "Retrying in ..." message.
if time.Now().Before(waitingDeadline) {
	retryMessage := fmt.Sprintf("Retrying in %s", humanize.Duration(time.Until(waitingDeadline)))
	return woc.markNodePhase(node.Name, node.Phase, retryMessage), false, nil
}
simster7 (Member, Author) commented on Nov 19, 2019:

A bit of an interesting trade-off here: since we are changing the parentNode message, the workflow will be immediately re-queued. This makes re-queuing the Workflow with a duration (using woc.requeue(time.Until(waitingDeadline))) moot, and it introduces some "polling" behavior in which the workflow is operated on once a second, mostly just to update the "Retrying in ..." message.

If we decide that this polling behavior is not acceptable, we could make the "Retrying in ..." message static (e.g. "Retrying at [TIME]") and then re-queue with a duration as explained above.
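
A rough standalone sketch of the two message styles being compared (the woc.requeue call is the one mentioned above; everything else here is illustrative and not controller code):

package main

import (
	"fmt"
	"time"
)

func main() {
	waitingDeadline := time.Now().Add(12 * time.Second)

	// Current behavior: the message changes every second, so each update
	// re-queues the workflow immediately ("polling" roughly once a second).
	dynamicMsg := fmt.Sprintf("Retrying in %s", time.Until(waitingDeadline).Round(time.Second))

	// Alternative discussed above: a static message that never changes, which
	// would let the controller re-queue once with a delay, e.g.
	// woc.requeue(time.Until(waitingDeadline)).
	staticMsg := fmt.Sprintf("Retrying at %s", waitingDeadline.Format(time.Kitchen))

	fmt.Println(dynamicMsg) // e.g. "Retrying in 12s"
	fmt.Println(staticMsg)  // e.g. "Retrying at 12:44PM"
}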

limit: 10
backoff:
  duration: 1 # Default unit is seconds. Could also be a Duration, e.g.: "2m", "6h", "1d"
  factor: 2
A contributor commented:

Sorry for the bikeshedding, but should this not be called "base" instead, given that it is, in fact, the base of the exponential function?
Also, good work :-)

simster7 (Member, Author) replied:

Duration is actually what K8s uses :) I think that's because factor is technically optional; if it is left at 0, the back-off simply applies the same duration wait each time.
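
If that reading is right (this only illustrates the reply above and is not verified against the controller code), leaving factor unset would make every retry wait the same base duration:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical illustration: with factor left at 0 (unset), the wait never
	// grows; every retry uses the same base duration. (Compare the exponential
	// schedule sketched earlier, where factor: 2 doubles the wait each time.)
	duration := time.Second
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Printf("after failure %d: wait %s\n", attempt, duration)
	}
}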

simster7 added this to the v2.4.3 milestone on Dec 4, 2019

sarabala1979 (Member) commented:

Can you resolve the conflicts?


Successfully merging this pull request may close these issues.

retryStrategy needs exponential backoff controls