CertManager becomes SyncError with ArgoCD v1.1.0-rc1 #1826

Closed
masa213f opened this issue Jun 27, 2019 · 17 comments
Labels
bug Something isn't working

Comments

@masa213f
Contributor

Describe the bug
I tried to deploy CertManager stable (v0.8.0) with ArgoCD v1.1.0-rc1.
However, CertManager sometimes ended up in SyncError, and auto-sync stopped.

This error did not occur with Argo CD v1.0.1,
so I suspect a regression in v1.1.0-rc1.

Detail
In our tryout, CertManager's custom resources (Certificate and Issuer) sometimes become Degraded, and the sync of CertManager is then judged as SyncError.

As far as I can tell, the status condition of these resources is False immediately after they are created.
If the ArgoCD health check runs at exactly that moment, the sync is judged as SyncError through the following steps.

  1. The resources are judged as Degraded by this Lua script. (This step is unchanged from v1.0.1.)
  2. The operation is judged as failed by this code. (Probably a behavior change in v1.1.0-rc1?)

With declarative operations, it is normal for resources to be judged as Degraded for a short time.
So I hope auto-sync does not stop in this situation.
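
The two steps above can be modeled roughly as follows. This is a simplified Python sketch of the behavior described in this report, not Argo CD's actual code: every function name is hypothetical, and the assumption that the health script keys on a `Ready` condition is mine rather than something stated in the report.

```python
# Simplified model of the failure path described above.
# All names are hypothetical; this is not Argo CD's actual code.

def assess_health(resource):
    """Step 1: report Degraded while the (assumed) Ready condition is False."""
    for cond in resource.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            if cond.get("status") == "False":
                return "Degraded", cond.get("message", "")
            if cond.get("status") == "True":
                return "Healthy", cond.get("message", "")
    return "Progressing", "waiting for status"


def task_phase(health):
    """Step 2: in v1.1.0-rc1 a Degraded resource appears to fail its task."""
    return {"Healthy": "Succeeded", "Degraded": "Failed"}.get(health, "Running")


def operation_phase(task_phases):
    """The operation only completes once no task is still running."""
    if "Running" in task_phases:
        return "Running"
    if "Failed" in task_phases:
        return "Failed"
    return "Succeeded"


# An Issuer right after creation, using the message seen in the logs.
issuer = {
    "status": {
        "conditions": [
            {
                "type": "Ready",
                "status": "False",
                "message": 'Error getting keypair for CA issuer: '
                           'secret "cert-manager-webhook-ca" not found',
            }
        ]
    }
}
health, message = assess_health(issuer)
phases = ["Succeeded", task_phase(health)]
print(health, "->", operation_phase(phases))
```

In this model a single resource that is checked while its condition is still False is enough to fail the whole operation, even though the resource would become healthy shortly afterwards.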

Version
v1.1.0-rc1

Logs

When resources are judged as Degraded

time="2019-06-27T05:22:37Z" level=info msg="updating resource result, status: 'Synced' -> 'Synced', phase 'Running' -> 'Failed', message 'issuer.certmanager.k8s.io/cert-manager-webhook-ca created' -> 'Error getting keypair for CA issuer: secret \"cert-manager-webhook-ca\" not found'" application=external-dns kind=Issuer name=cert-manager-webhook-ca namespace=external-dns phase=Sync
time="2019-06-27T05:22:37Z" level=info msg="updating resource result, status: 'Synced' -> 'Synced', phase 'Running' -> 'Failed', message 'certificate.certmanager.k8s.io/cert-manager-webhook-webhook-tls created' -> 'Certificate does not exist'" application=external-dns kind=Certificate name=cert-manager-webhook-webhook-tls namespace=external-dns phase=Sync

When the task becomes SyncError

time="2019-06-27T05:23:39Z" level=info msg=tasks application=external-dns isSelectiveSync=false tasks="[Sync/0 resource /Namespace:external-dns/external-dns obj->obj (Synced,Succeeded,namespace/external-dns configured), Sync/0 resource /ServiceAccount:external-dns/cert-manager obj->obj (Synced,Succeeded,serviceaccount/cert-manager created), Sync/0 resource /ServiceAccount:external-dns/cert-manager-cainjector obj->obj (Synced,Succeeded,serviceaccount/cert-manager-cainjector created), Sync/0 resource /ServiceAccount:external-dns/cert-manager-webhook obj->obj (Synced,Succeeded,serviceaccount/cert-manager-webhook created), Sync/0 resource /ServiceAccount:external-dns/external-dns obj->obj (Synced,Succeeded,serviceaccount/external-dns created), Sync/0 resource apiextensions.k8s.io/CustomResourceDefinition:external-dns/certificates.certmanager.k8s.io obj->obj (Synced,Succeeded,customresourcedefinition.apiextensions.k8s.io/certificates.certmanager.k8s.io created), Sync/0 resource apiextensions.k8s.io/CustomResourceDefinition:external-dns/challenges.certmanager.k8s.io obj->obj (Synced,Succeeded,customresourcedefinition.apiextensions.k8s.io/challenges.certmanager.k8s.io created), Sync/0 resource apiextensions.k8s.io/CustomResourceDefinition:external-dns/clusterissuers.certmanager.k8s.io obj->obj (Synced,Succeeded,customresourcedefinition.apiextensions.k8s.io/clusterissuers.certmanager.k8s.io created), Sync/0 resource apiextensions.k8s.io/CustomResourceDefinition:external-dns/dnsendpoints.externaldns.k8s.io obj->obj (Synced,Succeeded,customresourcedefinition.apiextensions.k8s.io/dnsendpoints.externaldns.k8s.io created), Sync/0 resource apiextensions.k8s.io/CustomResourceDefinition:external-dns/issuers.certmanager.k8s.io obj->obj (Synced,Succeeded,customresourcedefinition.apiextensions.k8s.io/issuers.certmanager.k8s.io created), Sync/0 resource apiextensions.k8s.io/CustomResourceDefinition:external-dns/orders.certmanager.k8s.io obj->obj 
(Synced,Succeeded,customresourcedefinition.apiextensions.k8s.io/orders.certmanager.k8s.io created), Sync/0 resource rbac.authorization.k8s.io/ClusterRole:external-dns/cert-manager obj->obj (Synced,Succeeded,clusterrole.rbac.authorization.k8s.io/cert-manager reconciled. clusterrole.rbac.authorization.k8s.io/cert-manager configured), Sync/0 resource rbac.authorization.k8s.io/ClusterRole:external-dns/cert-manager-cainjector obj->obj (Synced,Succeeded,clusterrole.rbac.authorization.k8s.io/cert-manager-cainjector reconciled. clusterrole.rbac.authorization.k8s.io/cert-manager-cainjector configured), Sync/0 resource rbac.authorization.k8s.io/ClusterRole:external-dns/cert-manager-edit obj->obj (Synced,Succeeded,clusterrole.rbac.authorization.k8s.io/cert-manager-edit reconciled. clusterrole.rbac.authorization.k8s.io/cert-manager-edit configured), Sync/0 resource rbac.authorization.k8s.io/ClusterRole:external-dns/cert-manager-view obj->obj (Synced,Succeeded,clusterrole.rbac.authorization.k8s.io/cert-manager-view reconciled. clusterrole.rbac.authorization.k8s.io/cert-manager-view configured), Sync/0 resource rbac.authorization.k8s.io/ClusterRole:external-dns/cert-manager-webhook:webhook-requester obj->obj (Synced,Succeeded,clusterrole.rbac.authorization.k8s.io/cert-manager-webhook:webhook-requester reconciled. clusterrole.rbac.authorization.k8s.io/cert-manager-webhook:webhook-requester configured), Sync/0 resource rbac.authorization.k8s.io/ClusterRole:external-dns/external-dns obj->obj (Synced,Succeeded,clusterrole.rbac.authorization.k8s.io/external-dns created), Sync/0 resource rbac.authorization.k8s.io/ClusterRoleBinding:external-dns/cert-manager obj->obj (Synced,Succeeded,clusterrolebinding.rbac.authorization.k8s.io/cert-manager reconciled. 
clusterrolebinding.rbac.authorization.k8s.io/cert-manager configured), Sync/0 resource rbac.authorization.k8s.io/ClusterRoleBinding:external-dns/cert-manager-cainjector obj->obj (Synced,Succeeded,clusterrolebinding.rbac.authorization.k8s.io/cert-manager-cainjector reconciled. clusterrolebinding.rbac.authorization.k8s.io/cert-manager-cainjector configured), Sync/0 resource rbac.authorization.k8s.io/ClusterRoleBinding:external-dns/cert-manager-webhook:auth-delegator obj->obj (Synced,Succeeded,clusterrolebinding.rbac.authorization.k8s.io/cert-manager-webhook:auth-delegator created), Sync/0 resource rbac.authorization.k8s.io/ClusterRoleBinding:external-dns/external-dns-viewer obj->obj (Synced,Succeeded,clusterrolebinding.rbac.authorization.k8s.io/external-dns-viewer created), Sync/0 resource rbac.authorization.k8s.io/RoleBinding:kube-system/cert-manager-webhook:webhook-authentication-reader obj->obj (Synced,Succeeded,rolebinding.rbac.authorization.k8s.io/cert-manager-webhook:webhook-authentication-reader created), Sync/0 resource /Service:external-dns/cert-manager-webhook obj->obj (Synced,Succeeded,service/cert-manager-webhook created), Sync/0 resource /Service:external-dns/external-dns-metrics obj->obj (Synced,Succeeded,service/external-dns-metrics created), Sync/0 resource apps/Deployment:external-dns/cert-manager obj->obj (Synced,Succeeded,deployment.apps/cert-manager created), Sync/0 resource apps/Deployment:external-dns/cert-manager-cainjector obj->obj (Synced,Succeeded,deployment.apps/cert-manager-cainjector created), Sync/0 resource apps/Deployment:external-dns/cert-manager-webhook obj->obj (Synced,Running,deployment.apps/cert-manager-webhook created), Sync/0 resource apps/Deployment:external-dns/external-dns obj->obj (Synced,Succeeded,deployment.apps/external-dns created), Sync/0 resource apiregistration.k8s.io/APIService:external-dns/v1beta1.admission.certmanager.k8s.io obj->obj 
(Synced,Running,apiservice.apiregistration.k8s.io/v1beta1.admission.certmanager.k8s.io created), Sync/0 resource admissionregistration.k8s.io/ValidatingWebhookConfiguration:external-dns/cert-manager-webhook obj->obj (Synced,Succeeded,validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created), Sync/0 resource certmanager.k8s.io/Certificate:external-dns/cert-manager-webhook-ca obj->obj (Synced,Succeeded,Certificate is up to date and has not expired), Sync/0 resource certmanager.k8s.io/Issuer:external-dns/cert-manager-webhook-ca obj->obj (Synced,Failed,Error getting keypair for CA issuer: secret \"cert-manager-webhook-ca\" not found), Sync/0 resource certmanager.k8s.io/Issuer:external-dns/cert-manager-webhook-selfsign obj->obj (Synced,Succeeded,issuer.certmanager.k8s.io/cert-manager-webhook-selfsign created), Sync/0 resource certmanager.k8s.io/Certificate:external-dns/cert-manager-webhook-webhook-tls obj->obj (Synced,Failed,Certificate does not exist), Sync/1 resource certmanager.k8s.io/ClusterIssuer:external-dns/clouddns nil->obj (,,)]"
time="2019-06-27T05:23:39Z" level=info msg="updating resource result, status: 'Synced' -> 'Synced', phase 'Running' -> 'Succeeded', message 'deployment.apps/cert-manager-webhook created' -> 'deployment.apps/cert-manager-webhook created'" application=external-dns kind=Deployment name=cert-manager-webhook namespace=external-dns phase=Sync
time="2019-06-27T05:23:39Z" level=info msg="updating resource result, status: 'Synced' -> 'Synced', phase 'Running' -> 'Succeeded', message 'apiservice.apiregistration.k8s.io/v1beta1.admission.certmanager.k8s.io created' -> 'all checks passed'" application=external-dns kind=APIService name=v1beta1.admission.certmanager.k8s.io namespace=external-dns phase=Sync
time="2019-06-27T05:23:39Z" level=info msg="Updating operation state. phase: Running -> Failed, message: 'one or more tasks are running' -> 'one or more synchronization tasks completed unsuccessfully'" application=external-dns
time="2019-06-27T05:23:39Z" level=info msg="sync/terminate complete" application=external-dns
time="2019-06-27T05:23:39Z" level=info msg="Sync operation to 62802b64bf4a3df19bce40e4f44354a39655b5b1 failed: one or more synchronization tasks completed unsuccessfully" application=external-dns reason=OperationCompleted type=Warning
@masa213f masa213f added the bug Something isn't working label Jun 27, 2019
@jessesuen
Member

@alexec - I think the health assessment logic is a regression from the previous behavior. We really should not be assessing health unless we are either:

  1. using sync waves and depend on previous wave
  2. using sync hooks and depend on previous hook to complete

@alexec
Contributor

alexec commented Jun 27, 2019

@jessesuen I'm not sure about this. I think what's happening is that the certs become degraded before they become healthy. This can happen in both normal and wave/hook syncs. This would mean that you could not use these in either of those styles at all. I think that's a bug, but a different bug. Let me ponder this.

@alexec alexec self-assigned this Jun 27, 2019
@alexec alexec added this to the v1.1 milestone Jun 27, 2019
@jessesuen
Member

jessesuen commented Jun 27, 2019

If I apply a single resource (no waves, no hooks), which is a Certificate, then as long as kubectl apply returns a zero exit code, the sync should be deemed successful regardless of whether the Certificate is degraded.

Health should only come into play when there are dependencies.
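
A minimal Python sketch of the rule proposed here, for illustration only. The function and parameter names are hypothetical, and what should happen to an unhealthy resource when there is a dependency is left open in this thread; the sketch simply keeps waiting in that case, which is an assumption.

```python
# Hypothetical sketch of the proposed rule: health is consulted only
# when a later wave or hook depends on the resource.

def sync_result(apply_exit_code, health, has_later_wave=False, has_later_hook=False):
    # A failed apply always fails the sync.
    if apply_exit_code != 0:
        return "Failed"
    # No dependencies: a successful apply is enough, health is ignored.
    if not (has_later_wave or has_later_hook):
        return "Succeeded"
    # Dependencies exist: wait until the resource is healthy.
    # (Assumption: anything other than Healthy keeps the task running.)
    return "Succeeded" if health == "Healthy" else "Running"


# A lone Certificate that is momentarily Degraded still syncs successfully.
print(sync_result(0, "Degraded"))
# With a later wave, the same resource holds the sync until it is healthy.
print(sync_result(0, "Degraded", has_later_wave=True))
```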

@alexec
Contributor

alexec commented Jun 27, 2019

I think we should have a point fix for this in v1.1, but I'd like to address the issue of wave-based syncs that flip into degraded before healthy.

@alexec
Contributor

alexec commented Jun 27, 2019

@ishii-masayuki just to check, do you have any hooks in your app?

@masa213f
Contributor Author

@alexec @jessesuen
Thank you for the discussion!
Now, we use only waves. We don't use any hooks.

These are our manifest files for CertManager. We sync the other resources with no wave, and ClusterIssuer is synced as "wave 1", because it clearly depends on the validation webhook.

After many attempts, we now use a workaround like this: we override the default Lua scripts to veil the Degraded condition for our resources.
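
The idea behind such an override can be sketched as follows. The real override is a Lua script, and its contents are not shown in this thread, so this Python model is only an assumption about its logic: a `Ready` condition that is False is reported as Progressing instead of Degraded. The function name and the `Ready` condition type are hypothetical.

```python
# Hypothetical model of a health check that veils the Degraded state.
# The actual workaround is written in Lua; this only illustrates the logic.

def veiled_health(resource):
    for cond in resource.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready" and cond.get("status") == "True":
            return "Healthy"
    # Ready=False, or no condition yet: keep waiting instead of failing.
    return "Progressing"


not_ready = {"status": {"conditions": [{"type": "Ready", "status": "False"}]}}
ready = {"status": {"conditions": [{"type": "Ready", "status": "True"}]}}
print(veiled_health(not_ready), veiled_health(ready))
```

The trade-off is that a resource that is permanently broken never reports Degraded, so the sync would keep waiting rather than fail.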

@masa213f
Contributor Author

In addition, we also need a custom script for APIService in our app.

APIService is always healthy by default (no health check).
So ArgoCD might progress to the next wave in the short window between the webhook's Deployment becoming healthy and the APIService's status becoming true. In this case, applying ClusterIssuer will fail.
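
A health check for APIService could look roughly like this Python sketch. It assumes the check keys on the `Available` condition in the APIService status; the function name and return values are hypothetical, and the custom script actually used in our app is not reproduced here.

```python
# Hypothetical APIService health check: Healthy only once the
# Available condition is True, Progressing otherwise.

def apiservice_health(apiservice):
    for cond in apiservice.get("status", {}).get("conditions", []):
        if cond.get("type") == "Available":
            if cond.get("status") == "True":
                return "Healthy", cond.get("message", "")
            return "Progressing", cond.get("message", "")
    return "Progressing", "waiting for the Available condition"


available = {
    "status": {
        "conditions": [
            {"type": "Available", "status": "True", "message": "all checks passed"}
        ]
    }
}
print(apiservice_health(available))
print(apiservice_health({}))
```

With a check like this, the next wave would start only after the APIService reports that it is available, which closes the timing window described above.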

@jessesuen
Member

> In addition, we also need a custom script for APIService in our app.

Nice. We should make the health check for APIService a built-in (native Go) one so that everyone benefits from it.

@alexec
Contributor

alexec commented Jun 28, 2019

I've created a ticket to make it built-in.

@alexec
Contributor

alexec commented Jun 28, 2019

This won't be fixed by the related PR.

@alexec
Contributor

alexec commented Jun 28, 2019

@ishii-masayuki - you have a workaround. So you don't need a fix anymore?

@alexec alexec removed this from the v1.1 milestone Jun 28, 2019
@masa213f
Contributor Author

masa213f commented Jul 1, 2019

@alexec
Thanks.
I think that a health check for APIService benefits everyone who uses APIService, including cert-manager users.
So I would like the fix to be built in.

I will implement it when I have time. Please bear with me.

@alexec
Contributor

alexec commented Jul 1, 2019

thank you @ishii-masayuki - that'd be fantastic!

@alexec alexec added the other label Jul 2, 2019
@ozankabak

ozankabak commented Jul 8, 2019

What's the status on this? What is the recommended work-around until the fix is available? Is it simply adding the APIService health check to argocd-cm?

@masa213f
Contributor Author

masa213f commented Jul 8, 2019

I'm sorry. I've been a little busy, but I have free time this week.
If this issue is urgent, please fix ...

@alexec
Contributor

alexec commented Jul 8, 2019

> What's the status on this? What is the recommended work-around until the fix is available? Is it simply adding the APIService health check to argocd-cm?

I believe so.

@alexec alexec removed their assignment Jul 8, 2019
@jessesuen
Member

Fixed in #1921
