This repository has been archived by the owner on Jun 26, 2023. It is now read-only.

HNC : logs "http: TLS handshake error from x:x remote error: tls: bad certificate" #1255

Closed
ledroide opened this issue Nov 5, 2020 · 13 comments

Comments

@ledroide

ledroide commented Nov 5, 2020

Hello,
The manager container of the hnc-controller-manager deployment continuously emits many log lines like these:

2020/11/05 15:45:55 http: TLS handshake error from 10.233.98.0:25105: remote error: tls: bad certificate
2020/11/05 15:45:55 http: TLS handshake error from 10.233.102.0:44592: remote error: tls: bad certificate
2020/11/05 15:45:55 http: TLS handshake error from 10.233.92.0:49653: remote error: tls: bad certificate
2020/11/05 15:45:55 http: TLS handshake error from 10.233.98.0:36010: remote error: tls: bad certificate
2020/11/05 15:45:55 http: TLS handshake error from 10.233.98.0:6684: remote error: tls: bad certificate
2020/11/05 15:45:55 http: TLS handshake error from 10.233.102.0:45771: remote error: tls: bad certificate
2020/11/05 15:45:55 http: TLS handshake error from 10.233.92.0:37314: remote error: tls: bad certificate
2020/11/05 15:45:55 http: TLS handshake error from 10.233.102.0:8601: remote error: tls: bad certificate
2020/11/05 15:45:55 http: TLS handshake error from 10.233.92.0:15705: remote error: tls: bad certificate
2020/11/05 15:45:55 http: TLS handshake error from 10.233.98.0:50125: remote error: tls: bad certificate
2020/11/05 15:45:55 http: TLS handshake error from 10.233.92.0:11056: remote error: tls: bad certificate
2020/11/05 15:45:55 http: TLS handshake error from 10.233.102.0:59676: remote error: tls: bad certificate
2020/11/05 15:45:55 http: TLS handshake error from 10.233.102.0:17264: remote error: tls: bad certificate
2020/11/05 15:45:55 http: TLS handshake error from 10.233.102.0:24978: remote error: tls: bad certificate
2020/11/05 15:45:55 http: TLS handshake error from 10.233.98.0:28812: remote error: tls: bad certificate
2020/11/05 15:45:55 http: TLS handshake error from 10.233.98.0:15634: remote error: tls: bad certificate
2020/11/05 15:45:55 http: TLS handshake error from 10.233.92.0:7157: remote error: tls: bad certificate
2020/11/05 15:45:55 http: TLS handshake error from 10.233.92.0:41785: remote error: tls: bad certificate

Do these indicate an actual issue? If not, how can we stop the handshake attempts, or at least stop logging them?

@adrianludwin
Contributor

adrianludwin commented Nov 5, 2020 via email

@ledroide
Author

ledroide commented Nov 6, 2020

  • I did not modify the hnc-manager.yaml all-in-one file
  • container image name is gcr.io/k8s-staging-multitenancy/hnc-manager:v0.6.0
  • these log lines are still appearing continuously after 22 hours of running
$ kubectl get deploy/hnc-controller-manager -o wide -n hnc-system
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS                IMAGES                                                                                         SELECTOR
hnc-controller-manager   1/1     1            1           22h   manager,kube-rbac-proxy   gcr.io/k8s-staging-multitenancy/hnc-manager:v0.6.0,gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0   control-plane=controller-manager
$ kubectl logs --tail 12 deploy/hnc-controller-manager -c manager -n hnc-system
2020/11/06 14:02:15 http: TLS handshake error from 10.233.102.0:10357: remote error: tls: bad certificate
{"level":"info","ts":1604671335.157892,"logger":"cert-rotation","msg":"CRD subnamespaceanchors.hnc.x-k8s.io is being deleted"}
{"level":"info","ts":1604671335.254194,"logger":"cert-rotation","msg":"CRD hierarchyconfigurations.hnc.x-k8s.io is being deleted"}
2020/11/06 14:02:15 http: TLS handshake error from 10.233.102.0:38357: remote error: tls: bad certificate
{"level":"info","ts":1604671335.2594817,"logger":"cert-rotation","msg":"ensuring CA cert on ValidatingWebhookConfiguration"}
2020/11/06 14:02:15 http: TLS handshake error from 10.233.102.0:40448: remote error: tls: bad certificate
2020/11/06 14:02:15 http: TLS handshake error from 10.233.102.0:25077: remote error: tls: bad certificate
2020/11/06 14:02:15 http: TLS handshake error from 10.233.102.0:18810: remote error: tls: bad certificate
2020/11/06 14:02:15 http: TLS handshake error from 10.233.98.0:56772: remote error: tls: bad certificate
2020/11/06 14:02:15 http: TLS handshake error from 10.233.92.0:52681: remote error: tls: bad certificate
2020/11/06 14:02:15 http: TLS handshake error from 10.233.98.0:50827: remote error: tls: bad certificate
2020/11/06 14:02:15 http: TLS handshake error from 10.233.92.0:29824: remote error: tls: bad certificate

I cannot set a parent for a namespace:

$ kubectl hns --version
kubectl-hns version v0.6.0
$ kubectl hns tree webs
Error reading hierarchy for webs: conversion webhook for hnc.x-k8s.io/v1alpha1, Kind=HierarchyConfiguration failed: Post "https://hnc-webhook-service.hnc-system.svc:443/convert?timeout=30s": x509: certificate signed by unknown authority
$ kubectl get ns sbr01
NAME    STATUS   AGE
sbr01   Active   14d
$ kubectl hns set sbr01 --parent webs
Error reading hierarchy for sbr01: conversion webhook for hnc.x-k8s.io/v1alpha1, Kind=HierarchyConfiguration failed: Post "https://hnc-webhook-service.hnc-system.svc:443/convert?timeout=30s": x509: certificate signed by unknown authority
$ kubectl get hncconfiguration 
Error from server: conversion webhook for hnc.x-k8s.io/v1alpha1, Kind=HNCConfiguration failed: Post "https://hnc-webhook-service.hnc-system.svc:443/convert?timeout=30s": x509: certificate signed by unknown authority

The errors above may not be the same issue as the handshake errors, but either way HNC does not work, and all of the messages mention certificates.

Serge
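One way to narrow down an "x509: certificate signed by unknown authority" error like the ones above is to compare the CA bundle configured on the CRD's conversion webhook with the CA held by the webhook's certificate secret. The sketch below is not taken from this thread: the secret name hnc-webhook-server-cert, its ca.crt key, and the expectation that both values are byte-identical once certificate rotation has completed are all assumptions to verify against your own install.

```shell
#!/bin/sh
# Sketch: does the caBundle on a CRD's conversion webhook match the CA
# stored in the webhook secret? Nothing here modifies the cluster.
KUBECTL="${KUBECTL:-kubectl}"
HNC_NS="${HNC_NS:-hnc-system}"
# ASSUMPTION: secret name and the "ca.crt" key are not confirmed by this thread.
HNC_SECRET="${HNC_SECRET:-hnc-webhook-server-cert}"

# Print the base64 caBundle configured on a CRD's conversion webhook
# (field path as served by apiextensions.k8s.io/v1).
crd_ca_bundle() {
  $KUBECTL get crd "$1" \
    -o jsonpath='{.spec.conversion.webhook.clientConfig.caBundle}'
}

# Print the base64 CA certificate stored in the webhook secret.
secret_ca() {
  $KUBECTL get secret "$HNC_SECRET" -n "$HNC_NS" \
    -o jsonpath='{.data.ca\.crt}'
}

# Succeed only if both values are non-empty and identical.
ca_matches() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Report the comparison result for one CRD.
check_crd_ca() {
  if ca_matches "$(crd_ca_bundle "$1")" "$(secret_ca)"; then
    echo "$1: caBundle matches secret"
  else
    echo "$1: caBundle MISSING or DIFFERENT"
  fi
}
```

For example, after sourcing the functions: check_crd_ca hierarchyconfigurations.hnc.x-k8s.io. A mismatch or an empty caBundle would be consistent with the API server rejecting the webhook's certificate, but it does not by itself explain why the two diverged.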

@adrianludwin
Contributor

adrianludwin commented Nov 6, 2020 via email

@ledroide
Author

ledroide commented Nov 6, 2020

That's what I suspected at first, but the version check is in my output:

$ kubectl hns --version
kubectl-hns version v0.6.0

I had some difficulties during the upgrade from v0.5.0 to v0.6.0 (messages about remaining CRDs), so I deleted all resources from the v0.5.0 manifests, then deleted the hnc-system namespace, then deployed v0.6.0 from the new all-in-one manifest.

Here are the API resources and versions:

$ kubectl api-resources | grep -i hnc
hierarchyconfigurations                                 hnc.x-k8s.io                   true         HierarchyConfiguration
hncconfigurations                                       hnc.x-k8s.io                   false        HNCConfiguration
subnamespaceanchors               subns                 hnc.x-k8s.io                   true         SubnamespaceAnchor
$ kubectl api-versions | grep -i hnc
hnc.x-k8s.io/v1alpha2
$ kubectl get crd -o wide | grep hnc
hierarchyconfigurations.hnc.x-k8s.io             2020-10-01T09:23:18Z
hncconfigurations.hnc.x-k8s.io                   2020-10-01T09:23:18Z
subnamespaceanchors.hnc.x-k8s.io                 2020-10-01T09:23:18Z

@adrianludwin
Contributor

adrianludwin commented Nov 6, 2020 via email

@adrianludwin
Contributor

adrianludwin commented Nov 6, 2020 via email

@yiqigao217
Contributor

It looks like you were upgrading when the certs were not there, so the conversion webhooks cannot work either. Before your upgrade, did the validating webhooks work for you in v0.5?

@ledroide
Author

@yiqigao217: yes, the validating webhooks worked with HNC v0.5.0.

@adrianludwin: I have deleted validatingwebhookconfigurations.admissionregistration.k8s.io/hnc-validating-webhook-configuration and re-created hnc-system with v0.6.0.

Here is the situation now:

$ kubectl hns tree webs
Error reading hierarchy for webs: conversion webhook for hnc.x-k8s.io/v1alpha1, Kind=HierarchyConfiguration failed: Post "https://hnc-webhook-service.hnc-system.svc:443/convert?timeout=30s": x509: certificate signed by unknown authority
$ kubectl create namespace level01
$ kubectl create namespace level02
$ kubectl hns tree level01
level01
$ kubectl hns set level02 --parent level01
Setting the parent of level02 to level01
Could not update the hierarchical configuration of level02.
Reason: create not allowed while custom resource definition is terminating
$ kubectl get customresourcedefinition,validatingwebhookconfiguration -o wide | grep hnc
customresourcedefinition.apiextensions.k8s.io/hierarchyconfigurations.hnc.x-k8s.io             2020-10-01T09:23:18Z
customresourcedefinition.apiextensions.k8s.io/hncconfigurations.hnc.x-k8s.io                   2020-10-01T09:23:18Z
customresourcedefinition.apiextensions.k8s.io/subnamespaceanchors.hnc.x-k8s.io                 2020-10-01T09:23:18Z
validatingwebhookconfiguration.admissionregistration.k8s.io/hnc-validating-webhook-configuration   5          2d
$ kubectl logs --tail 6 deploy/hnc-controller-manager -c manager -n hnc-system
{"level":"info","ts":1605197748.854099,"logger":"cert-rotation","msg":"CRD hierarchyconfigurations.hnc.x-k8s.io is being deleted"}
{"level":"info","ts":1605197748.8579118,"logger":"cert-rotation","msg":"ensuring CA cert on ValidatingWebhookConfiguration"}
2020/11/12 16:15:48 http: TLS handshake error from 10.233.102.0:62683: remote error: tls: bad certificate
2020/11/12 16:15:48 http: TLS handshake error from 10.233.92.0:22964: remote error: tls: bad certificate
2020/11/12 16:15:48 http: TLS handshake error from 10.233.102.0:11259: remote error: tls: bad certificate
2020/11/12 16:15:48 http: TLS handshake error from 10.233.98.0:37281: remote error: tls: bad certificate
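The "CRD ... is being deleted" log lines and the "create not allowed while custom resource definition is terminating" error both point at CRDs that are stuck in deletion. A minimal sketch for listing them is shown below; it relies on kubectl's custom-columns output printing <none> for a missing field, and the function names are only illustrative.

```shell
#!/bin/sh
# Sketch: list HNC CRDs that have a deletionTimestamp set (terminating).
KUBECTL="${KUBECTL:-kubectl}"

# Print name, deletionTimestamp and finalizers for every CRD, one per line.
# custom-columns prints "<none>" when a field is absent.
crd_deletion_status() {
  $KUBECTL get crd --no-headers \
    -o custom-columns='NAME:.metadata.name,DELETING:.metadata.deletionTimestamp,FINALIZERS:.metadata.finalizers'
}

# Read that listing on stdin and keep only the names of HNC CRDs whose
# deletionTimestamp column is set.
terminating_hnc_crds() {
  awk '$1 ~ /\.hnc\.x-k8s\.io$/ && $2 != "<none>" { print $1 }'
}
```

Usage: crd_deletion_status | terminating_hnc_crds. Any name printed is a CRD the API server is waiting to delete; the FINALIZERS column of crd_deletion_status shows what it is waiting on.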

@adrianludwin
Contributor

adrianludwin commented Nov 12, 2020 via email

@ledroide
Author

ledroide commented Nov 13, 2020

Solved.
TL;DR:

  • delete the HNC manager using the all-in-one manifest: kubectl delete -f hnc-manager.yaml
  • delete the webhook configuration: kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io/hnc-validating-webhook-configuration
  • check what remains stuck: kubectl get customresourcedefinition,validatingwebhookconfiguration -o wide | grep hnc
  • manually edit (kubectl edit) each remaining CRD, find the finalizers: array and delete all of its entries; this should let the CRD actually be deleted
  • re-install the HNC manager: kubectl apply -f hnc-manager.yaml
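The steps above can be sketched as a small script. This is an illustration, not the exact commands that were run: it swaps the interactive kubectl edit for a non-interactive kubectl patch that empties the finalizers list, adds --wait=false so the first delete does not block on stuck CRDs, and defaults to a dry run that only prints the commands. The run helper and the DRY_RUN and MANIFEST variables are hypothetical names. Clearing finalizers by hand bypasses the normal cleanup, so use it only when the CRDs are already stuck.

```shell
#!/bin/sh
# Sketch of the recovery steps. DRY_RUN=1 (default) only prints commands.
KUBECTL="${KUBECTL:-kubectl}"
MANIFEST="${MANIFEST:-hnc-manager.yaml}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# The three HNC CRDs listed earlier in this thread.
hnc_crds="hierarchyconfigurations.hnc.x-k8s.io hncconfigurations.hnc.x-k8s.io subnamespaceanchors.hnc.x-k8s.io"

recover_hnc() {
  # 1. Delete everything from the all-in-one manifest without waiting
  #    for stuck finalizers.
  run $KUBECTL delete -f "$MANIFEST" --ignore-not-found --wait=false
  # 2. Delete the validating webhook configuration.
  run $KUBECTL delete validatingwebhookconfiguration \
    hnc-validating-webhook-configuration --ignore-not-found
  # 3. Empty the finalizers list of each HNC CRD so deletion can complete.
  #    A CRD that is already gone makes this command fail; that is harmless.
  for crd in $hnc_crds; do
    run $KUBECTL patch crd "$crd" --type=merge \
      -p '{"metadata":{"finalizers":[]}}'
  done
  # 4. Re-install from the same manifest.
  run $KUBECTL apply -f "$MANIFEST"
}
```

Calling recover_hnc prints the six commands for review; running it again with DRY_RUN=0 executes them. Checking what remains between steps 2 and 3 is still worth doing by hand with the kubectl get command from the list.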

Details :

Before manually removing the finalizers from the remaining customresourcedefinitions (after deleting the HNC controller):

$ kubectl get subns --all-namespaces
Error from server: conversion webhook for hnc.x-k8s.io/v1alpha1, Kind=SubnamespaceAnchor failed: Post "https://hnc-webhook-service.hnc-system.svc:443/convert?timeout=30s": service "hnc-webhook-service" not found

After the CRDs were deleted, before re-installing:

$ kubectl get subns --all-namespaces
Error from server (NotFound): Unable to list "hnc.x-k8s.io/v1alpha2, Resource=subnamespaceanchors": the server could not find the requested resource (get subnamespaceanchors.hnc.x-k8s.io)

After re-installing:
There are some warnings you should consider (I guess they are unrelated to this issue, but they may not be). I'm running Kubernetes 1.19.3.

Warning: apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
Warning: admissionregistration.k8s.io/v1beta1 ValidatingWebhookConfiguration is deprecated in v1.16+, unavailable in v1.22+; use admissionregistration.k8s.io/v1 ValidatingWebhookConfiguration

The HNC controller v0.6.0 is now re-created.

$ kubectl get subns --all-namespaces
NAMESPACE   NAME    AGE
webs        sbr01   20d

$ kubectl hns tree webs
webs
└── [s] sbr01

$ kubectl hns set level02 --parent level01
Setting the parent of level02 to level01
Succesfully updated 1 property of the hierarchical configuration of level02

$ kubectl hns tree level01
level01
└── level02

Logs look much better :

$ kubectl logs --tail 5 deploy/hnc-controller-manager -c manager -n hnc-system
{"level":"info","ts":1605273919.0605707,"logger":"reconcilers.Hierarchy","msg":"New namespace found","rid":242,"ns":"sbr01"}
{"level":"info","ts":1605273919.2598712,"logger":"reconcilers.RoleBinding","msg":"Propagating object","rid":257,"trigger":"sbr01/gitlab-webs-poc-webs"}
{"level":"info","ts":1605273919.2602034,"logger":"reconcilers.RoleBinding","msg":"Propagating object","rid":258,"trigger":"sbr01/sbrouet-poc-webs"}
{"level":"info","ts":1605274029.4946737,"logger":"validators.Hierarchy","msg":"Checking authz","ns":"level02","user":"gailuron","object":"level01","reason":"proposed parent"}
{"level":"info","ts":1605274029.5124876,"logger":"reconcilers.Hierarchy","msg":"Creating hierarchyconfiguration","rid":270,"ns":"level01","conditions":0}

Problem is solved. Thanks @adrianludwin

@adrianludwin
Contributor

Ugh, sorry you ran into so much trouble. I've filed #1270 to fix the warnings.

I'm not sure what caused the problems in the first place, but once you delete the deployment, it's not surprising that the CRD conversion webhooks fail. It's usually best to delete the CRs before the deployment because the manager is what typically removes the finalizers - but if we get into a bad enough state, it might stop doing the right thing.

Please let me know if you see anything like this again.

@vikas027

vikas027 commented Jun 17, 2021

I can confirm this. The issue went away after upgrading to v0.8.0 (from v0.7.0), but I had to delete all resources and recreate them.

Update: I think I spoke too soon; it has started throwing errors again.

@adrianludwin
Contributor

@vikas027 what was the prior version of HNC, was it v0.6.0 or v0.7.0? And had HNC been working despite the errors, or was it broken?

Only v0.6.0 had the CRD conversion webhooks in it (they were removed in v0.7.0) so if you saw this problem in v0.7.0, I'm leaning more towards it being a K8s issue than an HNC issue.
