Zombie workflows (aka "stuck workflows") #4560
Testing this on
Further checks (about every 10 minutes) show that the node count is going up slowly, then pausing for several minutes with the same symptom [1]. Eventually the work stalls around item 1993.
[1]
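For anyone watching a run like this, one quick way to check whether the node count is still advancing is to poll the workflow status (a sketch only; the workflow name, the `argo` namespace, and the use of `jq` are assumptions, and the count will be empty if node status has been offloaded to the database):

```bash
# Print the number of nodes recorded in the workflow status every 10 minutes.
while true; do
  count=$(kubectl get wf my-large-workflow -n argo -o json | jq '.status.nodes | length')
  echo "$(date -u +%FT%TZ) nodes=${count}"
  sleep 600
done
```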
Thank you. Can you try this?
Firstly, can you try
I did have
Seeing same behaviour on
By the way, I do have a couple of "zombie" workflows sitting here since yesterday, in hopes that one of these tests brings them back to life (I don't care about them otherwise, but they can be useful in this way). So far they haven't budged.
Here's another perhaps relevant point: for a workflow that is both (1) stuck, and (2) has a pod that's running but is "frozen" (for a reason unrelated to Argo, e.g. the Python process in the container locks up), I can get the workflow "unstuck" temporarily, to some degree, by forcing the frozen pod to fail. I confirmed this hack works on 4 of my stuck workflows. All 4 of them had exactly one frozen pod (likely a coincidence). If a stuck workflow doesn't have any frozen pods, I have no way to manually kick it this way. To be sure, I am not suggesting that frozen containers are causing the workflows to get stuck; this is merely a "lucky" workaround.
@alexec I'm currently on your
I'm going to try
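The comment above doesn't include the exact command used to force the frozen pod to fail. A minimal sketch of one way to do it with plain kubectl, assuming a hypothetical workflow/pod name and the `argo` namespace (this force-deletes the pod rather than failing its container, so it only illustrates the general idea):

```bash
# List the pods belonging to the stuck workflow (workflow name is hypothetical).
kubectl get pods -n argo -l workflows.argoproj.io/workflow=my-stuck-workflow

# Force-delete the frozen pod so the controller sees a terminal pod event.
kubectl delete pod my-stuck-workflow-1234567890 -n argo --grace-period=0 --force
```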
on
everything just lit up like a Christmas tree! 😻 All "stuck" steps started up (even from a 2-day-old workflow). @alexec not to jinx it, but I think we have a winner!
Can I play this back:
Correct!
We were able to replicate the findings here - we can verify that the CPU usage is way down with hundreds of concurrent workflows running.
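For anyone comparing before/after, one way to spot-check the controller's CPU usage is with `kubectl top` (a sketch; assumes metrics-server is installed, the `argo` namespace, and the default `app: workflow-controller` label on the controller pod):

```bash
# Show current CPU/memory usage of the workflow controller pod(s).
kubectl top pod -n argo -l app=workflow-controller
```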
@tomgoren can you confirm which engineering build you ran please? |
I've created a POC engineering build that offloads and archives workflows to S3 instead of MySQL or Postgres. My hypothesis is that offloading there may be faster for users running large (5000+ node) workflows, and that archiving there may be more useful to many users. On top of this, it may be cheaper for many users. I challenge you to prove me wrong. #4582
Offloading to S3 would be very interesting and welcome. Will try it when I can! Over the past weekend we ended up running workflows of enormous size, on the
Excellent. Can you share a screenshot?
We don't have those workflows in the cluster anymore, and I didn't take a screenshot! I'll try to run one of the larger workflows again over the holidays to take the screenshot.
I think this is now fixed.
A "zombie workflow" is one that starts but does not complete. Pods are scheduled and run to completion, but the workflow is not subsequently updated.
It is as if the workflow controller never sees the pod changes.
Impacted users:

- All users have been running very large workflows.

Typically:

- `insignificant pod change` is not seen in the controller logs.
- `Deadline exceeded` is seen in the logs. Increasing the CPU and memory on the Kubernetes master node may fix this.

Things that don't appear to work (rejected hypotheses):
- `--burst` or `--qps` (QPS) settings.
- `--workflow-workers` or `--pod-workers` settings. These only impact concurrent processing.
- `ALL_POD_CHANGES_SIGNIFICANT=true`. Hypothesis: we're missing significant pod changes.
- `INFORMER_WRITE_BACK=false`.
Questions:

Users should try the following (see the deployment sketch after this list):

- `argoproj/workflow-controller:v2.11.7` - this is faster than v2.11.6 and all previous versions. Suitable for production.
- `argoproj/workflow-controller:latest`.
- `argoproj/workflow-controller:mot` with env `MAX_OPERATION_TIME=30s`. Make sure it logs `defaultRequeueTime=30s maxOperationTime=30s`. Hypothesis: we need more time to schedule pods.
- `argoproj/workflow-controller:easyjson`. Hypothesis: JSON marshaling is very slow.

If none of this works then we need to investigate deeper.
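A minimal sketch of how one of these builds could be rolled out with plain kubectl, taking the `mot` image as an example (the deployment/container name `workflow-controller` and the `argo` namespace are assumptions about a standard install, not part of this issue):

```bash
# Switch the controller to the "mot" engineering build.
kubectl set image deployment/workflow-controller \
  workflow-controller=argoproj/workflow-controller:mot -n argo

# Give the controller more time per reconciliation, per the hypothesis above.
kubectl set env deployment/workflow-controller MAX_OPERATION_TIME=30s -n argo

# Confirm the expected log line appears once the new pod starts.
kubectl logs deploy/workflow-controller -n argo | grep "maxOperationTime=30s"
```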
Related: