Portieris fails because it does not reload the new certificate rotated by cert-manager #463

Open
pre opened this issue Aug 2, 2024 · 1 comment

pre commented Aug 2, 2024

When cert-manager rotates the certificate, the new certificate is not loaded by Portieris.

As a result, Portieris keeps using the old certificate and eventually fails with "remote error: tls: bad certificate".

Portieris v0.13.12 is installed via Helm chart with UseCertManager: true in values.yaml.

Logs

  • Here are the logs from the three Portieris Pods running in the cluster.
  • Cert-manager rotated the certificate at 05:09:51.
  • Portieris then started failing with "remote error: tls: bad certificate"; the first such error in the logs below is at 06:34:02.

To debug the issue, I switched the mutation webhook to failurePolicy: Ignore and tried recreating
the Pods. The logs below are about that:

  1. When two Portieris replicas are recreated, they work. If one of them becomes the new leader,
    Portieris will successfully admit the image requests.
  2. By switching back to failurePolicy: Fail, and then terminating these two functional Pods, the old Pod will become the leader.
  3. Once the old Pod becomes the leader, it will again fail with "remote error: tls: bad certificate".

The only way to fix this issue has so far been to temporarily disable the admission webhook, and then recreate the Portieris Pods.

cert-manager

❯ kl -n cert-manager -l app=cert-manager
Defaulted container "cert-manager-controller" out of: cert-manager-controller, install-oneagent (init)
Defaulted container "cert-manager-controller" out of: cert-manager-controller, install-oneagent (init)
I0802 03:32:24.637692       1 reflector.go:351] Caches populated for *v1.ClusterIssuer from k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229
I0802 03:32:24.757350       1 reflector.go:351] Caches populated for *v1.Challenge from k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229
I0802 05:09:51.001795       1 trigger_controller.go:215] "Certificate must be re-issued" logger="cert-manager.certificates-trigger" key="portieris/portieris-certs" reason="Renewing" message="Renewing certificate as renewal was scheduled at 2024-08-02 05:09:51 +0000 UTC"
I0802 05:09:51.001822       1 conditions.go:203] Setting lastTransitionTime for Certificate "portieris-certs" condition "Issuing" to 2024-08-02 05:09:51.001816753 +0000 UTC m=+11379.283695354
I0802 05:09:51.437471       1 controller.go:162] "re-queuing item due to optimistic locking on resource" logger="cert-manager.certificates-key-manager" key="portieris/portieris-certs" error="Operation cannot be fulfilled on certificates.cert-manager.io \"portieris-certs\": the object has been modified; please apply your changes to the latest version and try again"
I0802 05:09:51.762935       1 conditions.go:263] Setting lastTransitionTime for CertificateRequest "portieris-certs-3" condition "Approved" to 2024-08-02 05:09:51.762923817 +0000 UTC m=+11380.044802426
I0802 05:09:52.007428       1 conditions.go:263] Setting lastTransitionTime for CertificateRequest "portieris-certs-3" condition "Ready" to 2024-08-02 05:09:52.007415836 +0000 UTC m=+11380.289294441
I0802 05:09:52.313919       1 controller.go:162] "re-queuing item due to optimistic locking on resource" logger="cert-manager.certificates-readiness" key="portieris/portieris-certs" error="Operation cannot be fulfilled on certificates.cert-manager.io \"portieris-certs\": the object has been modified; please apply your changes to the latest version and try again"
I0802 05:09:52.416001       1 controller.go:162] "re-queuing item due to optimistic locking on resource" logger="cert-manager.certificates-key-manager" key="portieris/portieris-certs" error="Operation cannot be fulfilled on certificates.cert-manager.io \"portieris-certs\": the object has been modified; please apply your changes to the latest version and try again"
I0802 05:09:52.430845       1 controller.go:162] "re-queuing item due to optimistic locking on resource" logger="cert-manager.certificates-readiness" key="portieris/portieris-certs" error="Operation cannot be fulfilled on certificates.cert-manager.io \"portieris-certs\": the object has been modified; please apply your changes to the latest version and try again"

portieris

❯ kg pod
NAME                         READY   STATUS    RESTARTS   AGE
portieris-86cf58bdbb-8gh2l   1/1     Running   0          10h
portieris-86cf58bdbb-pnw46   1/1     Running   0          2d22h
portieris-86cf58bdbb-sjpqh   1/1     Running   0          10h
❯ kl portieris-86cf58bdbb-sjpqh
Defaulted container "portieris" out of: portieris, install-oneagent (init)
I0802 02:00:55.313589       1 main.go:66] Starting portieris v0.13.12
I0802 02:00:55.313915       1 kube.go:57] No --kubeconfig flag found and KUBECONFIG env variable is NOT set, defaulting to in-cluster kube client config
I0802 02:00:55.314302       1 main.go:76] CA not provided at /etc/certs/ca.pem, will use default system pool
I0802 02:00:55.328370       1 webhook.go:129] Starting policy Webhook on port 8000...
2024/08/02 02:01:50 http: TLS handshake error from 100.96.2.3:47582: read tcp 100.96.1.150:8000->100.96.2.3:47582: read: connection reset by peer
I0802 02:03:48.716874       1 controller.go:64] Processing admission request for CREATE on
I0802 02:03:52.929302       1 controller.go:64] Processing admission request for UPDATE on drain-nodes-28709400-jlkb2
I0802 02:04:27.797460       1 controller.go:64] Processing admission request for CREATE on
I0802 02:04:27.897402       1 controller.go:64] Processing admission request for CREATE on
2024/08/02 02:04:27 http: TLS handshake error from 100.96.2.3:34258: EOF
I0802 02:04:28.155166       1 controller.go:64] Processing admission request for CREATE on
I0802 02:09:43.652576       1 controller.go:64] Processing admission request for UPDATE on frontend-web-68f7685479
I0802 03:00:32.335708       1 controller.go:64] Processing admission request for CREATE on
I0802 04:03:14.259740       1 controller.go:64] Processing admission request for UPDATE on frontend-web
I0802 04:03:14.260296       1 controller.go:176] Getting policy for container image: ourregistry.example.com/our-frontend:git-559036cdfbf19e50f4fd0a6aa5d0ec792c51af70   namespace: frontend-pr-1310
E0802 04:03:14.464155       1 secret.go:68] Error: secrets "default-registry-credentials" not found
E0802 04:03:14.464820       1 controller.go:253] secrets "default-registry-credentials" not found
I0802 04:03:14.464837       1 controller.go:145] Allow for images:  [ourregistry.example.com/our-frontend:git-559036cdfbf19e50f4fd0a6aa5d0ec792c51af70]
I0802 04:12:18.072189       1 controller.go:64] Processing admission request for UPDATE on frontend-web
I0802 04:12:18.074693       1 controller.go:176] Getting policy for container image: ourregistry.example.com/our-frontend:git-b9b4fc632a100839c1f038bba85c859ff6441940   namespace: frontend
I0802 04:12:18.281484       1 controller.go:261] ImagePullSecret frontend/default-registry-credentials found
I0802 04:12:18.281618       1 controller.go:145] Allow for images:  [ourregistry.example.com/our-frontend:git-b9b4fc632a100839c1f038bba85c859ff6441940]
I0802 04:12:18.322135       1 controller.go:64] Processing admission request for CREATE on frontend-web-757968dc7f
I0802 04:12:18.828037       1 controller.go:64] Processing admission request for CREATE on
I0802 04:12:42.601730       1 controller.go:64] Processing admission request for UPDATE on frontend-web-6d6c57748f
2024/08/02 06:34:02 http: TLS handshake error from 100.96.2.3:40672: remote error: tls: bad certificate

❯ kl portieris-86cf58bdbb-8gh2l
Defaulted container "portieris" out of: portieris, install-oneagent (init)
I0802 02:00:17.457744       1 main.go:66] Starting portieris v0.13.12
I0802 02:00:17.458043       1 kube.go:57] No --kubeconfig flag found and KUBECONFIG env variable is NOT set, defaulting to in-cluster kube client config
I0802 02:00:17.458598       1 main.go:76] CA not provided at /etc/certs/ca.pem, will use default system pool
I0802 02:00:17.474531       1 webhook.go:129] Starting policy Webhook on port 8000...
2024/08/02 06:40:34 http: TLS handshake error from 100.96.2.3:32860: remote error: tls: bad certificate
2024/08/02 06:41:48 http: TLS handshake error from 100.96.2.3:55028: remote error: tls: bad certificate
[..]
2024/08/02 12:07:56 http: TLS handshake error from 100.96.2.3:49854: remote error: tls: bad certificate
2024/08/02 12:08:41 http: TLS handshake error from 100.96.2.3:53730: remote error: tls: bad certificate
2024/08/02 12:08:42 http: TLS handshake error from 100.96.2.3:53742: remote error: tls: bad certificate

[.. At 12:08:44 another Pod, portieris-86cf58bdbb-q8n5z, was created. The failure below is the last one
    before requests switched to the newly created portieris-86cf58bdbb-q8n5z, which processed them successfully]

2024/08/02 12:09:51 http: TLS handshake error from 100.96.2.3:48708: remote error: tls: bad certificate

❯ kl portieris-86cf58bdbb-q8n5z
Defaulted container "portieris" out of: portieris, install-oneagent (init)
I0802 12:08:44.864114       1 main.go:66] Starting portieris v0.13.12
I0802 12:08:44.864384       1 kube.go:57] No --kubeconfig flag found and KUBECONFIG env variable is NOT set, defaulting to in-cluster kube client config
I0802 12:08:44.865084       1 main.go:76] CA not provided at /etc/certs/ca.pem, will use default system pool
I0802 12:08:44.879743       1 webhook.go:129] Starting policy Webhook on port 8000...
I0802 12:09:19.017647       1 controller.go:64] Processing admission request for UPDATE on portieris-86cf58bdbb-s2r5n

[.. request failed 5 seconds ago at old Pod portieris-86cf58bdbb-8gh2l but succeeds now]

I0802 12:09:56.402993       1 controller.go:64] Processing admission request for UPDATE on redis-master-79c8964f6c-nx4j4
I0802 12:09:56.555898       1 controller.go:64] Processing admission request for CREATE on


❯ kg pod
NAME                         READY   STATUS    RESTARTS   AGE
portieris-86cf58bdbb-8gh2l   1/1     Running   0          10h
portieris-86cf58bdbb-flnhz   1/1     Running   0          6m10s
portieris-86cf58bdbb-q8n5z   1/1     Running   0          6m10s

❯ k delete pod portieris-86cf58bdbb-q8n5z &
> k delete pod portieris-86cf58bdbb-flnhz &

Deleting the two recently created functional Pods causes new image admission requests to go to
the old Pod portieris-86cf58bdbb-8gh2l.

The old Pod still fails with "remote error: tls: bad certificate".

Certificates

❯ kg certificate
NAME              READY   SECRET            AGE
portieris-certs   True    portieris-certs   120d

❯ kg secret
NAME              TYPE                DATA   AGE
portieris-certs   kubernetes.io/tls   3      120d

Portieris' deployment has:

    volumeMounts:
    - mountPath: /etc/certs
      name: portieris-certs
      readOnly: true
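
For context, here is a minimal Go sketch of one way a webhook server could pick up a rotated key pair from a mounted Secret without a restart. It is not Portieris' actual code. It relies on two things I am confident about: the kubelet eventually refreshes the files of a mounted Secret volume (when it is not mounted via subPath), and Go's tls.Config has a GetCertificate callback that is invoked per handshake. The file names tls.crt and tls.key, the helper names, and the self-signed generator standing in for cert-manager are all assumptions made for illustration.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"os"
	"path/filepath"
	"time"
)

// reloadOnHandshake returns a tls.Config.GetCertificate callback that reads
// the key pair from disk on every handshake, so rotated files are picked up.
func reloadOnHandshake(certPath, keyPath string) func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	return func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
		pair, err := tls.LoadX509KeyPair(certPath, keyPath)
		if err != nil {
			return nil, err
		}
		return &pair, nil
	}
}

// writeSelfSigned stands in for cert-manager: it writes a throwaway key pair.
func writeSelfSigned(certPath, keyPath string, serial int64) error {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: "portieris.portieris.svc"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return err
	}
	certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM := pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	if err := os.WriteFile(certPath, certPEM, 0o600); err != nil {
		return err
	}
	return os.WriteFile(keyPath, keyPEM, 0o600)
}

// servedSerial returns the serial of the certificate the callback would serve.
func servedSerial(get func(*tls.ClientHelloInfo) (*tls.Certificate, error)) (int64, error) {
	pair, err := get(nil) // the callback ignores the ClientHello
	if err != nil {
		return 0, err
	}
	leaf, err := x509.ParseCertificate(pair.Certificate[0])
	if err != nil {
		return 0, err
	}
	return leaf.SerialNumber.Int64(), nil
}

// demo simulates one rotation under a callback that is created only once.
func demo() (before, after int64, err error) {
	dir, err := os.MkdirTemp("", "certs")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	certPath, keyPath := filepath.Join(dir, "tls.crt"), filepath.Join(dir, "tls.key")
	get := reloadOnHandshake(certPath, keyPath)
	if err = writeSelfSigned(certPath, keyPath, 1); err != nil {
		return 0, 0, err
	}
	if before, err = servedSerial(get); err != nil {
		return 0, 0, err
	}
	if err = writeSelfSigned(certPath, keyPath, 2); err != nil {
		return 0, 0, err
	}
	after, err = servedSerial(get)
	return before, after, err
}

func main() {
	fmt.Println(demo())
}
```

In a real server the callback would be set as GetCertificate on the tls.Config of the http.Server, in place of a key pair loaded once at startup. Reading from disk on every handshake keeps the sketch short; caching the pair and reloading only when the file changes would avoid the extra I/O.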

Error

 failed calling webhook "trust.hooks.securityenforcement.admission.cloud.ibm.com": failed to call webhook: Post "https://portieris.portieris.svc:443/admit?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority
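
To illustrate where this x509 error could come from, here is a small Go sketch. It rests on an assumption that the issue does not confirm: that the rotation also changed the trust anchor the API server uses for the webhook (for example with a self-signed issuer, where the renewed certificate is its own CA). Under that assumption, a Pod still serving the old certificate no longer verifies against the new bundle. All helper names are hypothetical, and the certificates are throwaway self-signed ones generated in memory.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// host is the Service DNS name taken from the webhook error message.
const host = "portieris.portieris.svc"

// newSelfSigned creates a throwaway self-signed certificate (assumed issuer type).
func newSelfSigned(serial int64) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(serial),
		Subject:               pkix.Name{CommonName: host},
		DNSNames:              []string{host},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// verifyAgainst checks the served certificate against a pool that trusts
// only the given certificate, roughly what a client with a CA bundle does.
func verifyAgainst(served, trusted *x509.Certificate) error {
	pool := x509.NewCertPool()
	pool.AddCert(trusted)
	_, err := served.Verify(x509.VerifyOptions{Roots: pool, DNSName: host})
	return err
}

// isUnknownAuthority reports whether err is an x509.UnknownAuthorityError.
func isUnknownAuthority(err error) bool {
	var ua x509.UnknownAuthorityError
	return errors.As(err, &ua)
}

func main() {
	oldCert, err := newSelfSigned(1)
	if err != nil {
		panic(err)
	}
	newCert, err := newSelfSigned(2)
	if err != nil {
		panic(err)
	}
	fmt.Println("new cert vs new bundle:", verifyAgainst(newCert, newCert))
	fmt.Println("old cert vs new bundle:", verifyAgainst(oldCert, newCert))
}
```

If the assumption holds, this would also be consistent with the "remote error: tls: bad certificate" lines in the Portieris logs, since a client that rejects the server certificate sends that alert back during the handshake.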

pre commented Oct 9, 2024

A workaround with Portieris:

  • Disable image signature verification in the portieris namespace (to allow Pods to be recreated there even when signature verification fails after certificate rotation)
  • Recreate Portieris Pods with stakater/reloader when the certificate has been rotated

I feel bad about the complexity of needing both stakater/reloader and Portieris to be operational just to avoid locking down the cluster because of a bug in Portieris that doesn't seem to be getting fixed.

Possible alternatives for Portieris

  • Kyverno

    • CNCF graduated
    • generic policy engine
    • supports image admission as a beta feature
  • Connaisseur

    • company backed, not in CNCF
    • image admission only
  • sigstore/policy-controller
