
fix: quick fail after pod termination #1865

Merged - 5 commits merged from issue1832 into argoproj:master on Dec 20, 2019

Conversation

@whynowy (Member) commented Dec 17, 2019

Fixes: #1832

When the k8s node where the pod runs has an issue, the pod goes into a "terminating" state - which is actually the "Running" phase but with a DeletionTimestamp set - and it gets stuck there. This fix quick-fails the affected workflow step when that situation is detected (see the sketch below for the general idea).
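For reference, here is a minimal sketch of the kind of check this adds. It is illustrative only - the helper name and message are made up for this sketch; the actual change lives in workflow/controller/operator.go (see the Codecov report further down).

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// quickFailReason reports whether a pod is stuck "terminating": still in the
// Running phase, but with DeletionTimestamp already set, so on a broken node
// it will never reach a terminal phase on its own.
// (Hypothetical helper, not the real controller code.)
func quickFailReason(pod *corev1.Pod) (string, bool) {
	if pod.Status.Phase == corev1.PodRunning && pod.DeletionTimestamp != nil {
		return "pod deleted during operation", true
	}
	return "", false
}

func main() {
	// Simulate a pod on a failed node: Running phase plus a DeletionTimestamp.
	now := metav1.Now()
	pod := &corev1.Pod{}
	pod.Status.Phase = corev1.PodRunning
	pod.DeletionTimestamp = &now

	if msg, failed := quickFailReason(pod); failed {
		fmt.Println("quick-failing workflow step:", msg)
	}
}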

@whynowy (Member, Author) commented Dec 18, 2019

Addressing @jessesuen's concern - what happens when activeDeadlineSeconds is set in the spec: will the pod go through the termination process when the deadline hits?

activeDeadlineSeconds can be set in the spec in two places: at the workflow level or at the template level.
See the following example:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sleep-
spec:
  ##### activeDeadlineSeconds: 30
  entrypoint: sleep-n-sec
  arguments:
    parameters:
    - name: seconds
      value: 60
  templates:
  - name: sleep-n-sec
    #####   activeDeadlineSeconds: 30
    inputs:
      parameters:
      - name: seconds
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo sleeping for {{inputs.parameters.seconds}} seconds; sleep {{inputs.parameters.seconds}}; echo done"]

a.) If it's set at the workflow level, then when the deadline is reached the main container is killed by the wait container via docker kill --signal TERM xxxxx, and the pod goes to Error status (Failed phase). It does not go through the termination process, so DeletionTimestamp stays nil and the workflow goes to Failed.

b.) If activeDeadlineSeconds is set at the template level, it is translated into the pod spec (see the sketch below). When the deadline is reached, the pod is set to DeadlineExceeded status (still the Failed phase) and likewise does not go through the termination process, so the workflow also goes to Failed.
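As a rough illustration of case b.) - assumed types from k8s.io/api, not the controller's actual pod-construction code - the template-level deadline simply ends up on the generated pod spec, where the kubelet enforces it in place without ever deleting the pod:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// A template-level activeDeadlineSeconds: 30 is copied onto the pod spec.
	// The kubelet then fails the pod with reason DeadlineExceeded; the pod is
	// not deleted, so ObjectMeta.DeletionTimestamp stays nil.
	deadline := int64(30)
	pod := corev1.Pod{
		Spec: corev1.PodSpec{
			ActiveDeadlineSeconds: &deadline,
			Containers: []corev1.Container{{
				Name:    "main",
				Image:   "alpine:latest",
				Command: []string{"sh", "-c"},
				Args:    []string{"echo sleeping for 60 seconds; sleep 60; echo done"},
			}},
		},
	}
	fmt.Println("pod activeDeadlineSeconds:", *pod.Spec.ActiveDeadlineSeconds)
}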

In summary, this change has no impact on existing features.

@alexec (Contributor) commented Dec 18, 2019

@jessesuen do you want to review this?

@alexec self-assigned this Dec 19, 2019
@alexec (Contributor) commented Dec 19, 2019

I've taken the liberty of syncing you with master so you have the new test infra.

Would you like to add an e2e test for this?

codecov bot commented Dec 19, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@cd3bd23).
The diff coverage is 38.09%.


@@            Coverage Diff            @@
##             master    #1865   +/-   ##
=========================================
  Coverage          ?   11.14%           
=========================================
  Files             ?       35           
  Lines             ?    23536           
  Branches          ?        0           
=========================================
  Hits              ?     2624           
  Misses            ?    20576           
  Partials          ?      336
Impacted Files Coverage Δ
workflow/controller/operator.go 56.4% <38.09%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cd3bd23...11473f9.

@whynowy (Member, Author) commented Dec 20, 2019

@alexec - e2e test added, could you please review it again?

@alexec (Contributor) left a comment

LGTM

Workflow("@expectedfailures/pod-termination-failure.yaml").
When().
SubmitWorkflow().
WaitForWorkflow(120 * time.Second).
@alexec (Contributor) commented on the lines above:

minor - change to 60s

@alexec merged commit ce78227 into argoproj:master Dec 20, 2019
@salanki (Contributor) commented Dec 20, 2019

Thanks to everyone involved!

@whynowy deleted the issue1832 branch December 20, 2019 05:11

Successfully merging this pull request may close these issues:

Workflow steps get stuck in Running state on Kubernetes Node Failure