Pre-requisites

I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
I have searched existing issues and could not find a match for this bug.
What happened? What did you expect to happen?

This seems related to #13373, but I am reporting it separately because the problem appears specifically when PodDisruptionBudgets interact with workflows that allow only a single concurrent run (effectively making them silent killers).
When a PodDisruptionBudget is applied to a CronWorkflow, the workflow will occasionally continue to report as running even after it has failed because a pod was deleted by something outside the PodDisruptionBudget's control (e.g. EKS swapping a node).
When this happens, some of the task results are left behind, as is the PodDisruptionBudget, and the Argo cron job never resolves itself. Since these workflows have concurrencyPolicy: Forbid, this means we can't run any jobs again.
In the logs, this looks like:

Created PDB resource for workflow.

with no subsequent "Deleted PDB resource for workflow." message after it.
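For context, my understanding of the lifecycle is that the controller creates a PDB named after the workflow when it starts and is expected to delete it again when the workflow finishes. Below is a minimal sketch of roughly what I believe that leaked object looks like; the buildWorkflowPDB helper and the selector label are my own assumptions for illustration, not the controller's actual code.

package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// buildWorkflowPDB is a hypothetical helper showing the shape of the PDB I believe the
// controller creates for a workflow: named after the workflow and selecting its pods by label.
func buildWorkflowPDB(workflowName string, minAvailable int) *policyv1.PodDisruptionBudget {
	min := intstr.FromInt(minAvailable)
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: workflowName},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &min,
			Selector: &metav1.LabelSelector{
				// assumption: pods are selected via the standard workflow label
				MatchLabels: map[string]string{"workflows.argoproj.io/workflow": workflowName},
			},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", buildWorkflowPDB("example-cron-workflow-1234567890", 9999))
}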
When reviewing the cluster controller logs for 3.5.7, these missing-delete cases show the following failure:

Error syncing PodDisruptionBudget <workflow pdb>, requeuing: Operation cannot be fulfilled on poddisruptionbudgets.policy "<workflow pdb>": the object has been modified; please apply your changes to the latest version and try again

However, after upgrading to 3.5.8, these logs no longer show up in the controller. There are some "could not find node" messages, but I don't think that's the issue, since those logs appear both before and after the deletion time and they line up with a reported issue around noisy logging for tasks.
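For reference, the "object has been modified" failure above is a standard optimistic-concurrency conflict, which is usually transient and resolved by re-reading the object and retrying the write. Below is a minimal sketch of that client-go pattern, with a made-up touchPDB helper rather than anything taken from the controller:

package pdbexample

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// touchPDB is a hypothetical example of the usual retry-on-conflict pattern: re-fetch the
// latest PodDisruptionBudget and reapply the change whenever the update hits a conflict.
func touchPDB(ctx context.Context, clientset kubernetes.Interface, namespace, name string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pdb, err := clientset.PolicyV1().PodDisruptionBudgets(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pdb.Labels == nil {
			pdb.Labels = map[string]string{}
		}
		pdb.Labels["example.com/touched"] = "true"
		_, err = clientset.PolicyV1().PodDisruptionBudgets(namespace).Update(ctx, pdb, metav1.UpdateOptions{})
		return err
	})
}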
I looked over the code here (argo-workflows/workflow/controller/operator.go, lines 3791 to 3807 in 52cca7e):

if err != nil {
	woc.log.WithField("err", err).Error("Unable to delete PDB resource for workflow.")
	return err
}
woc.log.Info("Deleted PDB resource for workflow.")
return nil
If the deletePDB call failed, it should have produced an "Unable to delete PDB resource for workflow." message and, per the calling logic, set the workflow phase to "Error". Instead, the workflow stays in "Running" and there is no log.
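To make the expected control flow concrete, here is a toy, self-contained model of how I understand that teardown path; the operationCtx type and its methods are placeholders I made up for illustration, not the real operator API:

package main

import (
	"context"
	"errors"
	"fmt"
)

// operationCtx is a stand-in for the operator's workflow context, reduced to the two
// fields needed to show the expected behaviour.
type operationCtx struct {
	fulfilled bool
	phase     string
}

// deletePDB simulates the PDB teardown failing (stand-in for the real Kubernetes delete call).
func (woc *operationCtx) deletePDB(ctx context.Context) error {
	return errors.New("simulated delete failure")
}

// markError models the expected reaction to a failed teardown: move the workflow to "Error".
func (woc *operationCtx) markError(err error) {
	woc.phase = "Error"
}

func main() {
	woc := &operationCtx{fulfilled: true, phase: "Running"}
	if woc.fulfilled {
		if err := woc.deletePDB(context.Background()); err != nil {
			// Expected: an "Unable to delete PDB resource" log plus phase "Error".
			// Observed in our cluster: no log at all and the phase stuck in "Running".
			woc.markError(err)
		}
	}
	fmt.Println("final phase:", woc.phase) // prints "Error" under the expected behaviour
}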
I have looked through the source and logs as best I can, but I can't find anything glaring that would indicate a failure, except perhaps a crash (which the Backoff package handles and then re-crashes). The controller never restarted, and all other jobs appear to function correctly throughout this issue (and I would have assumed there would be a crash log, but maybe that goes to a different location).
Please note that we do not have this problem if we do not put pod disruption budgets on the workflow; in that case the workflow cleans itself up and restarts without issue.
Version(s)
v3.5.7, v3.5.8
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
# IMPORTANT - this only happens in our prod EKS cluster, so I don't have a minimal workflow for reproduction. I have not been able to get it to happen in smaller scale envs.
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
spec:
  workflowSpec:
    templates:
      - name: long-task
        inputs: {}
        outputs: {}
        metadata: {}
        steps:
          - - name: long-task-to-do
              arguments: {}
              templateRef:
                name: base-workflow
                template: long-task-to-do
      - name: queue-bot-tasks-exit-handler
        inputs: {}
        outputs: {}
        metadata: {}
        steps:
          - - name: queue-bot-tasks-exit-handler
              arguments: {}
              templateRef:
                name: base-workflow
                template: exit-handler
    entrypoint: long-task
    arguments: {}
    onExit: exit-handler
    # This is a very long run process
    activeDeadlineSeconds: 21600
    podDisruptionBudget:
      minAvailable: 9999
  schedule: '*/15 * * * *'
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 0
Logs from the workflow controller
Logs from in your workflow's wait container