
Fix the workflow state inconsistency when using steps with retry limit #1141

Closed
wants to merge 1 commit

Conversation

xianlubird (Member)

I have a YAML like the one below:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello

  templates:
  - name: hello-hello-hello
    steps:
    - - name: hello1            #hello1 is run before the following steps
        template: whalesay
    - - name: hello2a           #double dash => run after previous step
        template: retry-container
        withItems: [1, 2, 3]

  - name: whalesay
    container:
      image: python:alpine3.6
      command: ["python", -c]
      args: ["import time; time.sleep(3)"]

  - name: retry-container
    retryStrategy:
      limit: 2
    container:
      image: python:alpine3.6
      command: ["python", -c]
      args: ["import random; import sys; exit_code = random.choice([1, 1]); sys.exit(exit_code)"]

Every time I run argo submit test.yaml, all the pods end up failed and completed, but the workflow status stays Running forever. This is a bug in steps with retry.
Here is the result:

Name:                steps-k65m7
Namespace:           default
ServiceAccount:      default
Status:              Running
Created:             Thu Dec 27 20:26:17 +0800 (19 seconds ago)
Started:             Thu Dec 27 20:26:17 +0800 (19 seconds ago)
Duration:            19 seconds

STEP                      PODNAME                 DURATION  MESSAGE
 ● steps-k65m7
 ├---✔ hello1             steps-k65m7-3995602950  4s
 └-·-✖ hello2a(0:1)                                         No more retries left
   | ├-✖ hello2a(0:1)(0)  steps-k65m7-3038008680  2s        failed with exit code 1
   | ├-✖ hello2a(0:1)(1)  steps-k65m7-1091657781  2s        failed with exit code 1
   | └-✖ hello2a(0:1)(2)  steps-k65m7-420405926   2s        failed with exit code 1
   ├-● hello2a(1:2)
   | ├-✖ hello2a(1:2)(0)  steps-k65m7-96146110    3s        failed with exit code 1
   | ├-✖ hello2a(1:2)(1)  steps-k65m7-565772347   3s        failed with exit code 1
   | └-● hello2a(1:2)(2)  steps-k65m7-2176865056  4s
   └-✖ hello2a(2:3)                                         No more retries left
     ├-✖ hello2a(2:3)(0)  steps-k65m7-3542037624  1s        failed with exit code 1
     ├-✖ hello2a(2:3)(1)  steps-k65m7-1595686725  4s        failed with exit code 1
     └-✖ hello2a(2:3)(2)  steps-k65m7-1998202486  4s        failed with exit code 1

You can see that hello2a(1:2) has no "No more retries left" message, and the workflow stays Running forever.
My fix turns it into the correct state. The new result looks like this:

Name:                steps-k65m7
Namespace:           default
ServiceAccount:      default
Status:              Failed
Message:             child 'steps-k65m7-372465681' failed
Created:             Thu Dec 27 20:26:17 +0800 (19 seconds ago)
Started:             Thu Dec 27 20:26:17 +0800 (19 seconds ago)
Finished:            Thu Dec 27 20:26:36 +0800 (now)
Duration:            19 seconds

STEP                      PODNAME                 DURATION  MESSAGE
 ✖ steps-k65m7                                              child 'steps-k65m7-372465681' failed
 ├---✔ hello1             steps-k65m7-3995602950  4s
 └-·-✖ hello2a(0:1)                                         No more retries left
   | ├-✖ hello2a(0:1)(0)  steps-k65m7-3038008680  2s        failed with exit code 1
   | ├-✖ hello2a(0:1)(1)  steps-k65m7-1091657781  2s        failed with exit code 1
   | └-✖ hello2a(0:1)(2)  steps-k65m7-420405926   2s        failed with exit code 1
   ├-✖ hello2a(1:2)                                         No more retries left
   | ├-✖ hello2a(1:2)(0)  steps-k65m7-96146110    3s        failed with exit code 1
   | ├-✖ hello2a(1:2)(1)  steps-k65m7-565772347   3s        failed with exit code 1
   | └-✖ hello2a(1:2)(2)  steps-k65m7-2176865056  2s        failed with exit code 1
   └-✖ hello2a(2:3)                                         No more retries left
     ├-✖ hello2a(2:3)(0)  steps-k65m7-3542037624  1s        failed with exit code 1
     ├-✖ hello2a(2:3)(1)  steps-k65m7-1595686725  4s        failed with exit code 1
     └-✖ hello2a(2:3)(2)  steps-k65m7-1998202486  4s        failed with exit code 1

@jessesuen WDYT?

jessesuen (Member) commented Jan 3, 2019

@xianlubird it looks like this fix doesn't quite work, and the bug may simply be intermittent or timing-related.

The use of withItems does not seem to affect the behavior. For example, this workflow can also reproduce the problem, and it still does so even after applying your fix:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello

  templates:
  - name: hello-hello-hello
    steps:
    - - name: hello1            #hello1 is run before the following steps
        template: whalesay
    - - name: hello2a           #double dash => run after previous step
        template: retry-container
      - name: hello2b           #double dash => run after previous step
        template: retry-container
      - name: hello2c           #double dash => run after previous step
        template: retry-container

  - name: whalesay
    container:
      image: python:alpine3.6
      command: ["python", -c]
      args: ["import time; time.sleep(3)"]

  - name: retry-container
    retryStrategy:
      limit: 2
    container:
      image: python:alpine3.6
      command: ["python", -c]
      args: ["import random; import sys; exit_code = random.choice([1, 1]); sys.exit(exit_code)"]

In one run, this resulted in:

$ argo get steps-kffhp
Name:                steps-kffhp
Namespace:           default
ServiceAccount:      default
Status:              Running
Created:             Fri Dec 28 10:36:56 -0800 (5 days ago)
Started:             Wed Jan 02 16:38:57 -0800 (2 minutes ago)
Duration:            2 minutes 45 seconds

STEP                 PODNAME                 DURATION  MESSAGE
 ● steps-kffhp
 ├---✔ hello1        steps-kffhp-1029671323  5d
 └-·-✖ hello2a                                         No more retries left
   | ├-✖ hello2a(0)  steps-kffhp-2335944551  5d        failed with exit code 1
   | ├-✖ hello2a(1)  steps-kffhp-4013853546  5d        failed with exit code 1
   | └-✖ hello2a(2)  steps-kffhp-121696201   5d        failed with exit code 1
   ├-✖ hello2b                                         No more retries left
   | ├-✖ hello2b(0)  steps-kffhp-1732123524  5d        failed with exit code 1
   | ├-✖ hello2b(1)  steps-kffhp-54214529    5d        failed with exit code 1
   | └-✖ hello2b(2)  steps-kffhp-188288386   5d        failed with exit code 1
   └-● hello2c
     ├-✖ hello2c(0)  steps-kffhp-61636745    5d        failed with exit code 1
     ├-✖ hello2c(1)  steps-kffhp-3886977804  5d        failed with exit code 1
     └-✖ hello2c(2)  steps-kffhp-2275885095  5d        failed with exit code 1

jessesuen (Member)

I pinpointed the issue. There is an optimization being made when executing a step group, which we shouldn't be doing:

https://github.com/argoproj/argo/blob/master/workflow/controller/steps.go#L206

The issue is that the break statement should not be happening; we should continue iterating the loop so that we can re-evaluate the retry node.
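
As a rough illustration, here is a minimal, self-contained Go sketch of the loop shape being described. It is not the actual steps.go code: the childNode type, its fields, and assessStepGroup are made-up names. Breaking out at the first unfinished child skips the remaining retry nodes, while continuing the loop lets each one be re-assessed.

package main

import "fmt"

// Hypothetical, simplified stand-in for a step group's children; the real
// controller types carry much more state.
type childNode struct {
    name       string
    podsDone   bool // all retry attempts (pods) have finished
    reassessed bool // retry node was re-evaluated on this pass
}

// assessStepGroup walks the step group's children. With earlyBreak=true it
// mimics the optimization described above: the first still-running child ends
// the loop, so a later retry node whose pods have all finished is never
// re-evaluated and the workflow appears to run forever. With earlyBreak=false
// the loop keeps going and every retry node gets re-assessed.
func assessStepGroup(children []*childNode, earlyBreak bool) {
    for _, child := range children {
        if !child.podsDone {
            if earlyBreak {
                break // skips the remaining children
            }
            continue // keep iterating; re-evaluate the rest
        }
        child.reassessed = true // e.g. mark "No more retries left" / Failed
    }
}

func main() {
    children := []*childNode{
        {name: "step-a", podsDone: true},
        {name: "step-b", podsDone: false}, // still running
        {name: "step-c", podsDone: true},  // never re-assessed when the loop breaks early
    }
    assessStepGroup(children, true)
    for _, c := range children {
        fmt.Printf("%-6s reassessed=%v\n", c.name, c.reassessed)
    }
}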

Here is a simplified workflow which can reliably reproduce the problem:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: failed-retries-
spec:
  entrypoint: failed-retries

  templates:
  - name: failed-retries
    steps:
    - - name: fail
        template: fail
      - name: delayed-fail
        template: delayed-fail

  - name: fail
    retryStrategy:
      limit: 1
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["exit 1"]

  - name: delayed-fail
    retryStrategy:
      limit: 1
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 1; exit 1"]

I created a PR which will fix it:
#1148

jessesuen (Member)

Closing, as the fix was implemented in #1148.
