Stuck hooks issue when a sync task contains a Job resource with a ttlSecondsAfterFinished field set #21055

Open
3 tasks done
dejanzele opened this issue Dec 4, 2024 · 0 comments · May be fixed by argoproj/gitops-engine#646 or #21113
Labels
bug Something isn't working component:hooks version:2.14 Latest confirmed affected version is 2.14

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

It is common to implement init logic in Kubernetes Job resources and use some kind of hook to run that logic before the main application starts.
For example, in Helm the init Job can be annotated with Helm hooks (e.g. "helm.sh/hook": pre-install and "helm.sh/hook-delete-policy": hook-succeeded,hook-failed) so that migrations run before the Deployment resource creates the actual application.

ArgoCD has a long-standing bug: if a Job has ttlSecondsAfterFinished set to 0 or a low value, the Job is deleted before ArgoCD can mark the hook phase as completed, so the sync gets stuck in the hook phase and cannot progress further.

The infinite loop happens in this part of the code.

When the Job controller deletes the Job resource because its TTL has expired, the syncTask for the hook no longer has a liveObject, so it cannot call the getOperationPhase function here to get the updated status.

The bug happens in the gitops-engine rather than core ArgoCD.

This issue has been mentioned in a couple of places:

To Reproduce

Helm chart used for testing can be found here.
The chart has the following resources:

  • Deployment
  • Job with a ttlSecondsAfterFinished field, a Helm hook annotation, and a hook delete policy.

  1. Install any version of ArgoCD or run it locally using make start-local.
  2. Create the following Application:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd-test
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/dejanzele/argocd-hook-test
    targetRevision: HEAD
    path: hooks
    helm:
      releaseName: argocd-test
      values: |
        job:
          sleepSeconds: 15
          exitCode: 0
          ttlSecondsAfterFinished: 0
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd-test
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
  3. Open the UI, open the created Application, and notice that it is constantly syncing with the message "waiting for completion of hook batch/Job/hello-world-job".

Expected behavior

The PreSync hook completes successfully, the sync finishes, and the Application becomes Healthy.

Screenshots

image

Version

argocd commit 730363f
gitops-engine commit 0371401803996f84bcd70a5f6bb2f0ecc7d7b5d2

Logs

Paste any relevant application logs here.