
Fix the workflow state inconsistency when using steps with retry limit #1141

Closed
wants to merge 1 commit

Conversation

xianlubird (Member)

I have a YAML like the one below:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello

  templates:
  - name: hello-hello-hello
    steps:
    - - name: hello1            #hello1 is run before the following steps
        template: whalesay
    - - name: hello2a           #double dash => run after previous step
        template: retry-container
        withItems: [1, 2, 3]

  - name: whalesay
    container:
      image: python:alpine3.6
      command: ["python", -c]
      args: ["import time; time.sleep(3)"]

  - name: retry-container
    retryStrategy:
      limit: 2
    container:
      image: python:alpine3.6
      command: ["python", -c]
      args: ["import random; import sys; exit_code = random.choice([1, 1]); sys.exit(exit_code)"]

Every time I run argo submit test.yaml, all the pods end up failed and completed, but the workflow status stays Running forever. This is a bug in steps with retry.
Here is the result:

Name:                steps-k65m7
Namespace:           default
ServiceAccount:      default
Status:              Running
Created:             Thu Dec 27 20:26:17 +0800 (19 seconds ago)
Started:             Thu Dec 27 20:26:17 +0800 (19 seconds ago)
Duration:            19 seconds

STEP                      PODNAME                 DURATION  MESSAGE
 ● steps-k65m7
 ├---✔ hello1             steps-k65m7-3995602950  4s
 └-·-✖ hello2a(0:1)                                         No more retries left
   | ├-✖ hello2a(0:1)(0)  steps-k65m7-3038008680  2s        failed with exit code 1
   | ├-✖ hello2a(0:1)(1)  steps-k65m7-1091657781  2s        failed with exit code 1
   | └-✖ hello2a(0:1)(2)  steps-k65m7-420405926   2s        failed with exit code 1
   ├-● hello2a(1:2)
   | ├-✖ hello2a(1:2)(0)  steps-k65m7-96146110    3s        failed with exit code 1
   | ├-✖ hello2a(1:2)(1)  steps-k65m7-565772347   3s        failed with exit code 1
   | └-● hello2a(1:2)(2)  steps-k65m7-2176865056  4s
   └-✖ hello2a(2:3)                                         No more retries left
     ├-✖ hello2a(2:3)(0)  steps-k65m7-3542037624  1s        failed with exit code 1
     ├-✖ hello2a(2:3)(1)  steps-k65m7-1595686725  4s        failed with exit code 1
     └-✖ hello2a(2:3)(2)  steps-k65m7-1998202486  4s        failed with exit code 1

You can see that hello2a(1:2) has no "No more retries left" message, and the workflow stays Running forever.
My fix turns it into the correct state. The new result looks like this:

Name:                steps-k65m7
Namespace:           default
ServiceAccount:      default
Status:              Failed
Message:             child 'steps-k65m7-372465681' failed
Created:             Thu Dec 27 20:26:17 +0800 (19 seconds ago)
Started:             Thu Dec 27 20:26:17 +0800 (19 seconds ago)
Finished:            Thu Dec 27 20:26:36 +0800 (now)
Duration:            19 seconds

STEP                      PODNAME                 DURATION  MESSAGE
 ✖ steps-k65m7                                              child 'steps-k65m7-372465681' failed
 ├---✔ hello1             steps-k65m7-3995602950  4s
 └-·-✖ hello2a(0:1)                                         No more retries left
   | ├-✖ hello2a(0:1)(0)  steps-k65m7-3038008680  2s        failed with exit code 1
   | ├-✖ hello2a(0:1)(1)  steps-k65m7-1091657781  2s        failed with exit code 1
   | └-✖ hello2a(0:1)(2)  steps-k65m7-420405926   2s        failed with exit code 1
   ├-✖ hello2a(1:2)                                         No more retries left
   | ├-✖ hello2a(1:2)(0)  steps-k65m7-96146110    3s        failed with exit code 1
   | ├-✖ hello2a(1:2)(1)  steps-k65m7-565772347   3s        failed with exit code 1
   | └-✖ hello2a(1:2)(2)  steps-k65m7-2176865056  2s        failed with exit code 1
   └-✖ hello2a(2:3)                                         No more retries left
     ├-✖ hello2a(2:3)(0)  steps-k65m7-3542037624  1s        failed with exit code 1
     ├-✖ hello2a(2:3)(1)  steps-k65m7-1595686725  4s        failed with exit code 1
     └-✖ hello2a(2:3)(2)  steps-k65m7-1998202486  4s        failed with exit code 1

@jessesuen WDYT?

jessesuen (Member) commented Jan 3, 2019

@xianlubird it looks like this fix doesn't quite work, and the bug may simply be intermittent or timing-related.

The use of withItems does not seem to affect the behavior. For example, this workflow can also reproduce the problem, and it still does so even after applying your fix:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello

  templates:
  - name: hello-hello-hello
    steps:
    - - name: hello1            #hello1 is run before the following steps
        template: whalesay
    - - name: hello2a           #double dash => run after previous step
        template: retry-container
      - name: hello2b           #double dash => run after previous step
        template: retry-container
      - name: hello2c           #double dash => run after previous step
        template: retry-container

  - name: whalesay
    container:
      image: python:alpine3.6
      command: ["python", -c]
      args: ["import time; time.sleep(3)"]

  - name: retry-container
    retryStrategy:
      limit: 2
    container:
      image: python:alpine3.6
      command: ["python", -c]
      args: ["import random; import sys; exit_code = random.choice([1, 1]); sys.exit(exit_code)"]

In one run, this resulted in:

$ argo get steps-kffhp
Name:                steps-kffhp
Namespace:           default
ServiceAccount:      default
Status:              Running
Created:             Fri Dec 28 10:36:56 -0800 (5 days ago)
Started:             Wed Jan 02 16:38:57 -0800 (2 minutes ago)
Duration:            2 minutes 45 seconds

STEP                 PODNAME                 DURATION  MESSAGE
 ● steps-kffhp
 ├---✔ hello1        steps-kffhp-1029671323  5d
 └-·-✖ hello2a                                         No more retries left
   | ├-✖ hello2a(0)  steps-kffhp-2335944551  5d        failed with exit code 1
   | ├-✖ hello2a(1)  steps-kffhp-4013853546  5d        failed with exit code 1
   | └-✖ hello2a(2)  steps-kffhp-121696201   5d        failed with exit code 1
   ├-✖ hello2b                                         No more retries left
   | ├-✖ hello2b(0)  steps-kffhp-1732123524  5d        failed with exit code 1
   | ├-✖ hello2b(1)  steps-kffhp-54214529    5d        failed with exit code 1
   | └-✖ hello2b(2)  steps-kffhp-188288386   5d        failed with exit code 1
   └-● hello2c
     ├-✖ hello2c(0)  steps-kffhp-61636745    5d        failed with exit code 1
     ├-✖ hello2c(1)  steps-kffhp-3886977804  5d        failed with exit code 1
     └-✖ hello2c(2)  steps-kffhp-2275885095  5d        failed with exit code 1

jessesuen (Member)

I pinpointed the issue. There is an optimization being made when executing a step group, which we shouldn't be doing:

https://github.com/argoproj/argo/blob/master/workflow/controller/steps.go#L206

The issue is that the break statement should not be happening; we should continue iterating the loop so that we can re-evaluate the retry node.
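
As a rough illustration, here is a minimal, self-contained Go sketch of the loop shape being described. It is not the actual steps.go code: the childNode type, its fields, and assessStepGroup are made-up names. Breaking out at the first unfinished child skips the remaining retry nodes, while continuing the loop lets each one be re-assessed.

package main

import "fmt"

// Hypothetical, simplified stand-in for a step group's children; the real
// controller types carry much more state.
type childNode struct {
    name       string
    podsDone   bool // all retry attempts (pods) have finished
    reassessed bool // retry node was re-evaluated on this pass
}

// assessStepGroup walks the step group's children. With earlyBreak=true it
// mimics the optimization described above: the first still-running child ends
// the loop, so a later retry node whose pods have all finished is never
// re-evaluated and the workflow appears to run forever. With earlyBreak=false
// the loop keeps going and every retry node gets re-assessed.
func assessStepGroup(children []*childNode, earlyBreak bool) {
    for _, child := range children {
        if !child.podsDone {
            if earlyBreak {
                break // skips the remaining children
            }
            continue // keep iterating; re-evaluate the rest
        }
        child.reassessed = true // e.g. mark "No more retries left" / Failed
    }
}

func main() {
    children := []*childNode{
        {name: "step-a", podsDone: true},
        {name: "step-b", podsDone: false}, // still running
        {name: "step-c", podsDone: true},  // never re-assessed when the loop breaks early
    }
    assessStepGroup(children, true)
    for _, c := range children {
        fmt.Printf("%-6s reassessed=%v\n", c.name, c.reassessed)
    }
}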

Here is a simplified workflow which can reliably reproduce the problem:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: failed-retries-
spec:
  entrypoint: failed-retries

  templates:
  - name: failed-retries
    steps:
    - - name: fail
        template: fail
      - name: delayed-fail
        template: delayed-fail

  - name: fail
    retryStrategy:
      limit: 1
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["exit 1"]

  - name: delayed-fail
    retryStrategy:
      limit: 1
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 1; exit 1"]

I created a PR which will fix it:
#1148

jessesuen (Member)

Closing, as the fix was implemented in #1148.
