
fix: quick fail after pod termination #1865

Merged - 5 commits merged from issue1832 into argoproj:master on Dec 20, 2019

Conversation

@whynowy (Member) commented Dec 17, 2019

Fixes: #1832

When the k8s node where the pod runs has an issue, the pod goes into a "terminating" state - which is actually the "Running" phase but with a DeletionTimestamp set - and it gets stuck there. This fix quick-fails the affected workflow step when that situation is detected (see the sketch below for the general idea).
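For reference, here is a minimal sketch of the kind of check this adds. It is illustrative only - the helper name and message are made up for this sketch; the actual change lives in workflow/controller/operator.go (see the Codecov report further down).

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// quickFailReason reports whether a pod is stuck "terminating": still in the
// Running phase, but with DeletionTimestamp already set, so on a broken node
// it will never reach a terminal phase on its own.
// (Hypothetical helper, not the real controller code.)
func quickFailReason(pod *corev1.Pod) (string, bool) {
	if pod.Status.Phase == corev1.PodRunning && pod.DeletionTimestamp != nil {
		return "pod deleted during operation", true
	}
	return "", false
}

func main() {
	// Simulate a pod on a failed node: Running phase plus a DeletionTimestamp.
	now := metav1.Now()
	pod := &corev1.Pod{}
	pod.Status.Phase = corev1.PodRunning
	pod.DeletionTimestamp = &now

	if msg, failed := quickFailReason(pod); failed {
		fmt.Println("quick-failing workflow step:", msg)
	}
}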

@whynowy (Member, Author) commented Dec 18, 2019

Addressing @jessesuen's concern - what happens when activeDeadlineSeconds is set in the spec: will the pod go through the termination process when the deadline hits?

activeDeadlineSeconds can be set in the spec in two places: at the workflow level or at the template level.
See the following example:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sleep-
spec:
  ##### activeDeadlineSeconds: 30
  entrypoint: sleep-n-sec
  arguments:
    parameters:
    - name: seconds
      value: 60
  templates:
  - name: sleep-n-sec
    #####   activeDeadlineSeconds: 30
    inputs:
      parameters:
      - name: seconds
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo sleeping for {{inputs.parameters.seconds}} seconds; sleep {{inputs.parameters.seconds}}; echo done"]

a.) If it's set at the workflow level, then when the deadline is reached the main container is killed by the wait container via docker kill --signal TERM xxxxx, and the pod goes to Error status (Failed phase). It does not go through the termination process, so DeletionTimestamp stays nil and the workflow goes to Failed.

b.) If activeDeadlineSeconds is set at the template level, it is translated into the pod spec (see the sketch below). When the deadline is reached, the pod is set to DeadlineExceeded status (still the Failed phase) and likewise does not go through the termination process, so the workflow also goes to Failed.
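As a rough illustration of case b.) - assumed types from k8s.io/api, not the controller's actual pod-construction code - the template-level deadline simply ends up on the generated pod spec, where the kubelet enforces it in place without ever deleting the pod:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// A template-level activeDeadlineSeconds: 30 is copied onto the pod spec.
	// The kubelet then fails the pod with reason DeadlineExceeded; the pod is
	// not deleted, so ObjectMeta.DeletionTimestamp stays nil.
	deadline := int64(30)
	pod := corev1.Pod{
		Spec: corev1.PodSpec{
			ActiveDeadlineSeconds: &deadline,
			Containers: []corev1.Container{{
				Name:    "main",
				Image:   "alpine:latest",
				Command: []string{"sh", "-c"},
				Args:    []string{"echo sleeping for 60 seconds; sleep 60; echo done"},
			}},
		},
	}
	fmt.Println("pod activeDeadlineSeconds:", *pod.Spec.ActiveDeadlineSeconds)
}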

In summary, this change has no impact on existing features.

@alexec (Contributor) commented Dec 18, 2019

@jessesuen do you want to review this?

@alexec self-assigned this Dec 19, 2019
@alexec (Contributor) commented Dec 19, 2019

I've taken the liberty of syncing you with master so you have the new test infra.

Would you like to add an e2e test for this?

codecov bot commented Dec 19, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@cd3bd23).
The diff coverage is 38.09%.


@@            Coverage Diff            @@
##             master    #1865   +/-   ##
=========================================
  Coverage          ?   11.14%           
=========================================
  Files             ?       35           
  Lines             ?    23536           
  Branches          ?        0           
=========================================
  Hits              ?     2624           
  Misses            ?    20576           
  Partials          ?      336
Impacted Files Coverage Δ
workflow/controller/operator.go 56.4% <38.09%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cd3bd23...11473f9.

@whynowy (Member, Author) commented Dec 20, 2019

@alexec - e2e test added, could you please review it again?

@alexec (Contributor) left a comment

LGTM

Workflow("@expectedfailures/pod-termination-failure.yaml").
When().
SubmitWorkflow().
WaitForWorkflow(120 * time.Second).
@alexec (Contributor) commented on the lines above:

minor - change to 60s

@alexec merged commit ce78227 into argoproj:master Dec 20, 2019
@salanki (Contributor) commented Dec 20, 2019

Thanks to everyone involved!

@whynowy deleted the issue1832 branch December 20, 2019 05:11

Successfully merging this pull request may close these issues:

Workflow steps get stuck in Running state on Kubernetes Node Failure