Watches always fail to start once when agent restarts #1035

Closed
antoninbas opened this issue Aug 5, 2020 · 2 comments · Fixed by #1042
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@antoninbas
Contributor

Describe the bug
I don't know if this is a bug, but when force restarting an Antrea Agent on a Node (by deleting the previous Pod), I always see these logs on restart:

W0805 00:49:26.646142       1 client.go:102] Didn't get CA certificate from ConfigMap. May not be able to verify server cert
I0805 00:49:26.646293       1 client.go:107] No antrea kubeconfig file was specified. Falling back to in-cluster config
I0805 00:49:26.646382       1 client.go:127] Updating Antrea client with the new CA bundle
I0805 00:49:26.647914       1 log_file.go:134] Starting log file monitoring. Maximum log file number is 4
I0805 00:49:26.648187       1 server.go:538] Starting CNI server
I0805 00:49:26.648344       1 server.go:548] CNI server is listening ...
I0805 00:49:26.649045       1 agent.go:46] Starting Antrea Agent Monitor
I0805 00:49:26.649507       1 configmap_cafile_content.go:202] Starting antrea-ca::kube-system::antrea-ca::ca.crt
I0805 00:49:26.649523       1 client.go:81] t.go:202] Starting antrea-ca::kube-system::antrea-ca::ca.crt
I0805 00:49:26.649530       1 shared_informer.go:223] Waiting for caches to sync for antrea-ca::kube-system::antrea-ca::ca.crt
I0805 00:49:26.649535       1 configmap_cafile_content.go:209]  Waiting for caches to sync for antrea-ca::kube-system::antrea-ca::ca.crt
I0805 00:49:26.651516       1 node_route_controller.go:263] Starting AntreaAgentNodeRouteController
I0805 00:49:26.651529       1 node_route_controller.go:266] Waiting for caches to sync for AntreaAgentNodeRouteController
I0805 00:49:26.651721       1 networkpolicy_controller.go:309] Waiting for all watchers to complete full sync
I0805 00:49:26.651834       1 networkpolicy_controller.go:468] Starting watch for NetworkPolicy
I0805 00:49:26.652799       1 networkpolicy_controller.go:468] Starting watch for AppliedToGroup
I0805 00:49:26.653091       1 networkpolicy_controller.go:468] Starting watch for AddressGroup
W0805 00:49:26.658270       1 networkpolicy_controller.go:471] Failed to start watch for AddressGroup: Get https://10.106.23.99:443/apis/networking.antrea.tanzu.vmware.com/v1beta1/addressgroups?fieldSelector=nodeName%3Dk8s-node-master&watch=true: x509: certificate signed by unknown authority
W0805 00:49:26.660306       1 networkpolicy_controller.go:471] Failed to start watch for AppliedToGroup: Get https://10.106.23.99:443/apis/networking.antrea.tanzu.vmware.com/v1beta1/appliedtogroups?fieldSelector=nodeName%3Dk8s-node-master&watch=true: x509: certificate signed by unknown authority
W0805 00:49:26.660862       1 networkpolicy_controller.go:471] Failed to start watch for NetworkPolicy: Get https://10.106.23.99:443/apis/networking.antrea.tanzu.vmware.com/v1beta1/networkpolicies?fieldSelector=nodeName%3Dk8s-node-master&watch=true: x509: certificate signed by unknown authority
I0805 00:49:26.750596       1 shared_informer.go:230] Caches are synced for antrea-ca::kube-system::antrea-ca::ca.crt
I0805 00:49:26.750622       1 configmap_cafile_content.go:209]  Caches are synced for antrea-ca::kube-system::antrea-ca::ca.crt
I0805 00:49:26.750943       1 client.go:107] No antrea kubeconfig file was specified. Falling back to in-cluster config
I0805 00:49:26.751165       1 client.go:127] Updating Antrea client with the new CA bundle

Notice the three "Failed to start watch" warnings, each ending in "x509: certificate signed by unknown authority".

The call to this function is failing:
https://github.com/vmware-tanzu/antrea/blob/5be570b203cde54f42eaeb8f52a4353f846ff214/pkg/agent/client.go#L72-L76
I believe this is because the apiserver code uses the ConfigMap lister before the cache has synced (not 100% sure): https://github.com/kubernetes/apiserver/blob/7b7ecfc9c50835ea75f5dfe2abd93036cf9628cf/pkg/server/dynamiccertificates/configmap_cafile_content.go#L141-L146

We could ignore the error in RunOnce, as is done here: https://github.com/kubernetes/apiserver/blob/7b7ecfc9c50835ea75f5dfe2abd93036cf9628cf/pkg/server/dynamiccertificates/configmap_cafile_content.go#L189-L195

But the watches would still fail. @tnqn do you have an idea on how we can avoid these warnings?
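
A minimal sketch of the suspected race, using only the Go standard library. It is not the apiserver or Antrea code: fakeInformer, sync, and get are hypothetical stand-ins for a shared informer and its lister, and the key and bundle value are placeholders. It only illustrates the suspicion above, that a read from the cache before the initial sync returns "not found" even though the ConfigMap exists on the server.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNotFound mimics the error a lister returns when an object is not in its cache.
var errNotFound = errors.New("configmap not found in cache")

// fakeInformer is a hypothetical stand-in for a shared informer plus lister.
type fakeInformer struct {
	mu     sync.RWMutex
	store  map[string]string
	synced chan struct{} // closed once the initial list has been stored
}

func newFakeInformer() *fakeInformer {
	return &fakeInformer{store: map[string]string{}, synced: make(chan struct{})}
}

// sync simulates the initial list completing: it stores the object and
// then signals that the cache has synced. Call it once per informer.
func (f *fakeInformer) sync(key, value string) {
	f.mu.Lock()
	f.store[key] = value
	f.mu.Unlock()
	close(f.synced)
}

// get reads from the local cache only; it never contacts the server.
func (f *fakeInformer) get(key string) (string, error) {
	f.mu.RLock()
	defer f.mu.RUnlock()
	v, ok := f.store[key]
	if !ok {
		return "", errNotFound
	}
	return v, nil
}

func main() {
	const key = "kube-system/antrea-ca"
	inf := newFakeInformer()

	// Reading before the cache has synced fails, even though the object
	// exists on the (simulated) server.
	_, err := inf.get(key)
	fmt.Println("before sync:", err)

	// Once the initial list has landed, the same read succeeds.
	go inf.sync(key, "placeholder CA bundle")
	<-inf.synced
	v, err := inf.get(key)
	fmt.Println("after sync:", v, err)
}
```

Ignoring the first error, as suggested for RunOnce, would hide the CA warning in this model but would not make the bundle available any sooner, which is consistent with the watches still failing.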

To Reproduce
Deploy Antrea on a cluster. Delete an Antrea Agent Pod on a Node. Wait for the Agent to restart and look at the logs.

Expected
No warning since the antrea-ca ConfigMap already exists.

Actual behavior
Warnings about failure to retrieve certificate, and consequently warnings about failure to start the watches.

Versions:
Antrea v0.9.0-dev

@antoninbas antoninbas added kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Aug 5, 2020
@tnqn
Member

tnqn commented Aug 5, 2020

@antoninbas thanks for filing this, I think your analysis makes sense. ConfigMapCAController.RunOnce() seems useless here, since the cache is certainly not synced yet when it is called.
I think we could add a wait to networkPolicyController.Run to avoid the warnings.
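
One way such a wait could look, as a sketch under stated assumptions and not necessarily what the eventual fix does. waitForCA and startWatches are hypothetical helpers, and the ready channel stands in for whatever signal indicates that the CA bundle has been loaded. The watches are started only after the signal arrives or a timeout expires.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForCA blocks until ready is closed or ctx is done.
// Hypothetical helper: "ready" stands in for a "CA bundle loaded" signal.
func waitForCA(ctx context.Context, ready <-chan struct{}) error {
	select {
	case <-ready:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// startWatches waits for the CA signal (up to timeout) and only then
// calls start once per resource, in order. It starts nothing on timeout.
func startWatches(ready <-chan struct{}, timeout time.Duration, resources []string, start func(string)) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := waitForCA(ctx, ready); err != nil {
		return fmt.Errorf("CA bundle not ready: %w", err)
	}
	for _, r := range resources {
		start(r)
	}
	return nil
}

func main() {
	ready := make(chan struct{})

	// Simulate the CA cache syncing shortly after startup.
	go func() {
		time.Sleep(100 * time.Millisecond)
		close(ready)
	}()

	resources := []string{"NetworkPolicy", "AppliedToGroup", "AddressGroup"}
	err := startWatches(ready, 5*time.Second, resources, func(r string) {
		fmt.Println("Starting watch for", r)
	})
	if err != nil {
		fmt.Println(err)
	}
}
```

The timeout is a design choice in this sketch: without it, an Agent whose CA never becomes available would block forever instead of reporting an error.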

@tnqn
Member

tnqn commented Aug 5, 2020

I opened a quick PR to fix it: #1042

@tnqn tnqn self-assigned this Aug 5, 2020
@tnqn tnqn closed this as completed in #1042 Aug 8, 2020