
TLS Handshake error with vault agent injector #275

Closed
rishabh-arya95 opened this issue Aug 3, 2021 · 12 comments · Fixed by #350
Labels
bug Something isn't working

Comments

@rishabh-arya95

rishabh-arya95 commented Aug 3, 2021

I am running the Vault Agent Injector with auto-TLS enabled, configured to use an external Vault server running on my host:

helm install vault hashicorp/vault \
    --set "injector.externalVaultAddr=http://${HOST_PRIVATE_IP}:8200"

Everything was working fine, but after about 24 hours I suddenly started getting this bad-certificate error.

I have even tried using the `vault.hashicorp.com/tls-skip-verify` annotation, but the result is the same.
These are the agent injector logs:

kubectl logs -f vault-agent-injector-688d969fd6-fnxg5 -n vault
2021-08-02T13:21:51.952Z [INFO]  handler: Starting handler..
2021-08-02T13:21:51.961Z [INFO]  handler.auto-tls: Generated CA
Listening on ":8080"...
2021-08-02T13:21:51.968Z [INFO]  handler.certwatcher: Updated certificate bundle received. Updating certs...
2021-08-02T14:24:33.851Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T14:29:55.076Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T14:39:55.056Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T14:40:23.280Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T14:43:21.314Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T14:43:46.477Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T14:57:04.122Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T14:57:27.107Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T15:01:42.361Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T15:02:09.823Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T15:11:48.065Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T15:21:37.688Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T15:35:41.022Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T15:43:42.305Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T15:44:04.952Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T15:45:44.332Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T15:50:16.360Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T15:51:39.813Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T15:55:00.744Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T15:55:18.417Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T16:03:29.690Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T16:04:04.751Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T16:04:25.749Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T16:08:38.321Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T16:12:49.566Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T16:19:58.982Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T16:20:17.434Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-02T16:21:46.631Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-03T07:49:19.399Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-03T07:52:39.347Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-03T10:49:37.464Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-03T10:50:45.909Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-03T11:04:06.630Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-03T11:05:08.257Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-03T11:11:42.147Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-03T11:12:00.599Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-03T11:18:34.290Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-03T11:20:16.163Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-03T11:23:51.858Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-03T13:01:44.956Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-03T13:36:24.566Z [ERROR] handler: http: TLS handshake error from 172.17.0.1:17971: remote error: tls: bad certificate
2021-08-03T13:51:32.001Z [ERROR] handler: http: TLS handshake error from 172.17.0.1:20802: remote error: tls: bad certificate
2021-08-03T13:56:14.122Z [ERROR] handler: http: TLS handshake error from 172.17.0.1:36180: remote error: tls: bad certificate
2021-08-03T13:57:10.726Z [ERROR] handler: http: TLS handshake error from 172.17.0.1:4425: remote error: tls: bad certificate
2021-08-03T14:01:01.632Z [ERROR] handler: http: TLS handshake error from 172.17.0.1:10077: remote error: tls: bad certificate
2021-08-03T14:01:26.954Z [ERROR] handler: http: TLS handshake error from 172.17.0.1:50072: remote error: tls: bad certificate
2021-08-03T14:01:54.899Z [ERROR] handler: http: TLS handshake error from 172.17.0.1:17536: remote error: tls: bad certificate
2021-08-03T14:12:29.850Z [ERROR] handler: http: TLS handshake error from 172.17.0.1:12749: remote error: tls: bad certificate
2021-08-03T14:12:52.509Z [ERROR] handler: http: TLS handshake error from 172.17.0.1:1626: remote error: tls: bad certificate
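
One way to check whether the CA bundle registered with the Kubernetes API server has drifted from what the injector is actually serving is to compare their validity dates. A rough sketch, assuming the default Helm release name `vault` installed into the `vault` namespace (resource names may differ in other setups):

```sh
# Decode the CA bundle the API server uses to verify the injector webhook
# and print its validity window.
kubectl get mutatingwebhookconfiguration vault-agent-injector-cfg \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' \
  | base64 -d | openssl x509 -noout -subject -dates

# Compare it with the certificate the injector pod is serving right now
# (the injector listens on container port 8080, per the log above).
kubectl -n vault port-forward deploy/vault-agent-injector 8443:8080 &
sleep 2
openssl s_client -connect 127.0.0.1:8443 </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -dates
kill %1
```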
rishabh-arya95 added the bug label Aug 3, 2021
@RakeshRaj97

Did you get this sorted?

@birelian

Hi,

Any update on this? We can confirm it's happening with 0.16.0 deployed on AKS 1.21.

Thanks!

@kevinlmadison

Having the same issue with Vault 1.10 on Kubernetes 1.22.6 (RKE2).

@swenson

swenson commented May 20, 2022

The auto-TLS certificate regenerates every 24 hours, which sounds like it is probably related to the problem.

I'm having trouble reproducing this, even on the helm chart version 0.16.0.

Are there any other steps you take to see this issue?

What I am doing:

* Vault running on my machine
* Set up the Kubernetes auth method as per https://learn.hashicorp.com/tutorials/vault/kubernetes-external-vault?in=vault/kubernetes#install-the-vault-helm-chart-configured-to-address-an-external-vault
* Get the injector running with

helm install vault vault --repo https://helm.releases.hashicorp.com \
  --version=0.16.0 \
  --set server.enabled=false \
  --set injector.enabled=true \
  --set "injector.externalVaultAddr=http://192.168.65.2:8200"

* Continually deploy and delete a pod that injects a secret (a minimal loop for this step is sketched at the end of this comment)

But I never see the failure mentioned after the certificate is refreshed.
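
A minimal version of that deploy/delete loop might look like the sketch below. The role name (`my-role`) and secret path (`secret/data/myapp/config`) are placeholders and need to match an existing Kubernetes auth role and secret in Vault:

```sh
# Repeatedly create and delete a pod that requests secret injection, so the
# mutating webhook keeps getting exercised across the 24h certificate rotation.
while true; do
  cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: injector-test
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "my-role"
    vault.hashicorp.com/agent-inject-secret-config: "secret/data/myapp/config"
spec:
  containers:
    - name: app
      image: nginx
EOF
  sleep 60
  kubectl delete pod injector-test --wait=true
done
```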

@xanmanning

We are seeing the same on the clusters we've upgraded; however, it seems less frequent on the two clusters where we have a one-minute cronjob continuously deploying and deleting a pod that injects a secret.

Running on GKE 1.21

Deployment is effectively:

helm install vault vault --repo https://helm.releases.hashicorp.com \
  --set server.enabled=false \
  --set injector.enabled=true \
  --set injector.image.tag=0.16.0 \
  --set "injector.externalVaultAddr=https://SOME_VAULT_ADDRESS:8200"

I'm going to perform some tests on a K3D cluster to see if there's a pattern.

@birelian


Sorry @swenson for not being precise. When I said 0.16.0 I was talking about the vault-k8s version. The configuration that failed was Helm Chart vault-0.18.0 with Vault 1.9.6 and injector 0.16.0. This combination was failing in two different AKS clusters running K8s 1.21.

After we downgraded the injector to 0.15.0, the error seems to be gone in both clusters (at least during the last week).

Thanks!

@eddiehoffman


We have the same issue and have had to revert to 0.15.0.

@xanmanning

Managed to re-create this using a k3d cluster locally.

I created a cluster and deployed Vault + the Vault Agent Injector.

I set up a cronjob pulling Vault secrets every minute for over 24 hours with no issue. I then stopped the cronjob and noted the time the certificate was last updated (15:58 UTC on the 22nd), waited until ~16:57 UTC on the 23rd (about 5 minutes ago), and ran a job from my cronjob.

Screenshot from 2022-05-23 17-56-46

Screenshot from 2022-05-23 18-02-36

swenson pushed a commit that referenced this issue May 23, 2022
Once the `time.NewTimer()` expires, calls to `timer.Stop()` will return
`false`, but the channel will have nothing in it, causing `<-timer.C` to
hang forever.

This is hinted at by the docs, even though they suggest `timer.Stop()`
should return true in that case.

We change to a non-blocking drain so that we won't block forever.

This manifests in never updating the certificate after it expires,
causing TLS handshake errors.

Fixes #275
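
For reference, the hang and the non-blocking drain described in the commit message look roughly like this. A minimal standalone sketch of the general `time.Timer` idiom, not the actual vault-k8s certificate-watcher code:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	d := 50 * time.Millisecond
	timer := time.NewTimer(d)

	// Simulate the state after the timer has already fired and its tick has
	// been consumed elsewhere (e.g. by a select loop).
	<-timer.C

	// The blocking drain would hang here: Stop() returns false because the
	// timer already fired, but timer.C is empty, so "<-timer.C" never returns.
	//
	//   if !timer.Stop() {
	//       <-timer.C // blocks forever
	//   }

	// Non-blocking drain: try to empty the channel, but never wait on it.
	if !timer.Stop() {
		select {
		case <-timer.C:
		default:
		}
	}

	// The timer can now be reset safely and will fire again.
	timer.Reset(d)
	fmt.Println("fired again at", <-timer.C)
}
```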
@swenson

swenson commented May 23, 2022

I believe we have found the underlying cause for this and fixed it in the last few PRs.

I think we'll cut a new release of vault-k8s soon to address these issues (not sure exactly when, but I'd like to do it after this week, possibly once a few more fixes get in).

@birelian


Thank you!

@Preen

Preen commented May 25, 2022

Can confirm I also had the problems described above and that downgrading to 0.15 worked.

@lucasscheepers

Still experiencing this problem: the Vault Agent Injector throws a `tls: bad certificate` error every 24 hours.

@swenson In which version did you fix this bug?
