Watches always fail to start once when agent restarts #1035

antoninbas · 2020-08-05T01:37:55Z

Describe the bug
I don't know if this is a bug, but when force restarting an Antrea Agent on a Node (by deleting the previous Pod), I always see these logs on restart:

W0805 00:49:26.646142       1 client.go:102] Didn't get CA certificate from ConfigMap. May not be able to verify server cert
I0805 00:49:26.646293       1 client.go:107] No antrea kubeconfig file was specified. Falling back to in-cluster config
I0805 00:49:26.646382       1 client.go:127] Updating Antrea client with the new CA bundle
I0805 00:49:26.647914       1 log_file.go:134] Starting log file monitoring. Maximum log file number is 4
I0805 00:49:26.648187       1 server.go:538] Starting CNI server
I0805 00:49:26.648344       1 server.go:548] CNI server is listening ...
I0805 00:49:26.649045       1 agent.go:46] Starting Antrea Agent Monitor
I0805 00:49:26.649507       1 configmap_cafile_content.go:202] Starting antrea-ca::kube-system::antrea-ca::ca.crt
I0805 00:49:26.649523       1 client.go:81] t.go:202] Starting antrea-ca::kube-system::antrea-ca::ca.crt
I0805 00:49:26.649530       1 shared_informer.go:223] Waiting for caches to sync for antrea-ca::kube-system::antrea-ca::ca.crt
I0805 00:49:26.649535       1 configmap_cafile_content.go:209]  Waiting for caches to sync for antrea-ca::kube-system::antrea-ca::ca.crt
I0805 00:49:26.651516       1 node_route_controller.go:263] Starting AntreaAgentNodeRouteController
I0805 00:49:26.651529       1 node_route_controller.go:266] Waiting for caches to sync for AntreaAgentNodeRouteController
I0805 00:49:26.651721       1 networkpolicy_controller.go:309] Waiting for all watchers to complete full sync
I0805 00:49:26.651834       1 networkpolicy_controller.go:468] Starting watch for NetworkPolicy
I0805 00:49:26.652799       1 networkpolicy_controller.go:468] Starting watch for AppliedToGroup
I0805 00:49:26.653091       1 networkpolicy_controller.go:468] Starting watch for AddressGroup
W0805 00:49:26.658270       1 networkpolicy_controller.go:471] Failed to start watch for AddressGroup: Get https://10.106.23.99:443/apis/networking.antrea.tanzu.vmware.com/v1beta1/addressgroups?fieldSelector=nodeName%3Dk8s-node-master&watch=true: x509: certificate signed by unknown authority
W0805 00:49:26.660306       1 networkpolicy_controller.go:471] Failed to start watch for AppliedToGroup: Get https://10.106.23.99:443/apis/networking.antrea.tanzu.vmware.com/v1beta1/appliedtogroups?fieldSelector=nodeName%3Dk8s-node-master&watch=true: x509: certificate signed by unknown authority
W0805 00:49:26.660862       1 networkpolicy_controller.go:471] Failed to start watch for NetworkPolicy: Get https://10.106.23.99:443/apis/networking.antrea.tanzu.vmware.com/v1beta1/networkpolicies?fieldSelector=nodeName%3Dk8s-node-master&watch=true: x509: certificate signed by unknown authority
I0805 00:49:26.750596       1 shared_informer.go:230] Caches are synced for antrea-ca::kube-system::antrea-ca::ca.crt
I0805 00:49:26.750622       1 configmap_cafile_content.go:209]  Caches are synced for antrea-ca::kube-system::antrea-ca::ca.crt
I0805 00:49:26.750943       1 client.go:107] No antrea kubeconfig file was specified. Falling back to in-cluster config
I0805 00:49:26.751165       1 client.go:127] Updating Antrea client with the new CA bundle

Notice the warnings.

The call to this function is failing:
https://github.com/vmware-tanzu/antrea/blob/5be570b203cde54f42eaeb8f52a4353f846ff214/pkg/agent/client.go#L72-L76
I believe this is because the apiserver code uses the ConfigMap lister before the cache has synced (not 100% sure): https://github.com/kubernetes/apiserver/blob/7b7ecfc9c50835ea75f5dfe2abd93036cf9628cf/pkg/server/dynamiccertificates/configmap_cafile_content.go#L141-L146

We could ignore the error in RunOnce, as is done here: https://github.com/kubernetes/apiserver/blob/7b7ecfc9c50835ea75f5dfe2abd93036cf9628cf/pkg/server/dynamiccertificates/configmap_cafile_content.go#L189-L195

But the watches would still fail. @tnqn do you have an idea on how we can avoid these warnings?
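To illustrate the suspected cause, here is a minimal, self-contained sketch of a cache-backed lookup that fails until the initial sync has completed. It does not use the real apiserver or client-go types: `caCache`, `markSynced`, `get`, and `errNotFound` are hypothetical names introduced only for this example, and the real lister may behave differently in detail.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNotFound is a hypothetical error standing in for a lister cache miss.
var errNotFound = errors.New("configmap not found in cache")

// caCache is a toy stand-in for an informer-backed ConfigMap lister.
// It is an assumption for illustration, not the apiserver implementation.
type caCache struct {
	mu     sync.RWMutex
	synced bool
	bundle []byte
}

// markSynced simulates the informer's initial list completing.
func (c *caCache) markSynced(bundle []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.bundle = bundle
	c.synced = true
}

// get reads from the local cache only; it never falls back to the API server,
// so it fails until markSynced has been called.
func (c *caCache) get() ([]byte, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if !c.synced {
		return nil, errNotFound
	}
	return c.bundle, nil
}

func main() {
	cache := &caCache{}

	// A one-shot load before the cache has synced fails, even though
	// the ConfigMap already exists in the cluster.
	_, err := cache.get()
	fmt.Println("before sync:", err)

	// Once the initial sync has happened, the same call succeeds.
	cache.markSynced([]byte("fake-ca-bundle"))
	bundle, err := cache.get()
	fmt.Println("after sync:", string(bundle), err)
}
```

If this is indeed what happens, ignoring the first error only hides the first warning: any request sent before the sync still has no CA bundle to verify the server certificate with.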

To Reproduce
Deploy Antrea on a cluster. Delete an Antrea Agent Pod on a Node. Wait for the Agent to restart and look at the logs.

Expected
No warning since the antrea-ca ConfigMap already exists.

Actual behavior
Warnings about failure to retrieve certificate, and consequently warnings about failure to start the watches.

Versions:
Antrea v0.9.0-dev


tnqn · 2020-08-05T16:35:37Z

@antoninbas thanks for filing it, I think your analysis makes sense. ConfigMapCAController.RunOnce() seems to have no effect here, since the cache is certainly not synced yet when it is called.
I think we could add a wait to networkPolicyController.Run to avoid the warnings.
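A minimal sketch of what such a wait could look like, using only the standard library. This is not the code from the PR: `clientProvider`, `markReady`, `waitReady`, and `run` are hypothetical names, and the assumption is that something signals readiness once the CA bundle has been loaded.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// clientProvider is a hypothetical type that signals when the client
// (that is, its CA bundle) is ready to use.
type clientProvider struct {
	once  sync.Once
	ready chan struct{}
}

func newClientProvider() *clientProvider {
	return &clientProvider{ready: make(chan struct{})}
}

// markReady is called once the CA bundle has been loaded.
// It is safe to call more than once.
func (p *clientProvider) markReady() {
	p.once.Do(func() { close(p.ready) })
}

// waitReady blocks until the client is ready or stopCh is closed.
// It returns true only if the client became ready.
func (p *clientProvider) waitReady(stopCh <-chan struct{}) bool {
	select {
	case <-p.ready:
		return true
	case <-stopCh:
		return false
	}
}

// run stands in for a controller's Run method: it waits for the client
// to be ready and only then starts the watches.
func run(p *clientProvider, stopCh <-chan struct{}, startWatch func(name string)) {
	if !p.waitReady(stopCh) {
		return
	}
	for _, name := range []string{"NetworkPolicy", "AppliedToGroup", "AddressGroup"} {
		startWatch(name)
	}
}

func main() {
	p := newClientProvider()
	stopCh := make(chan struct{})

	// Simulate the CA bundle becoming available shortly after startup.
	go func() {
		time.Sleep(100 * time.Millisecond)
		p.markReady()
	}()

	run(p, stopCh, func(name string) {
		fmt.Println("Starting watch for", name)
	})
}
```

Closing a channel to broadcast readiness lets any number of goroutines wait on it, and selecting on stopCh keeps shutdown from blocking if the client never becomes ready.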

tnqn · 2020-08-05T16:42:25Z

I opened a quick PR to fix it: #1042

antoninbas added the kind/bug and priority/important-longterm labels Aug 5, 2020

tnqn mentioned this issue Aug 5, 2020

Wait for Antrea client to be ready before starting watches #1042

Merged

tnqn self-assigned this Aug 5, 2020

tnqn closed this as completed in #1042 Aug 8, 2020
