This repository has been archived by the owner on Mar 5, 2024. It is now read-only.

Issue on k8s/kops 1.17 clusters #378

Closed
jhohertz opened this issue Feb 14, 2020 · 4 comments

Comments

@jhohertz

I have been tracking an issue that appears only on Kubernetes 1.17; the same setup works fine on 1.15 and 1.16. The clusters are launched with kops, and the kops/k8s version is the only thing that changed between the working 1.16 clusters and the non-working 1.17 clusters.

The kiam-agent is unable to establish a link to the kiam-servers and starts crashlooping.

After enabling the gRPC debug env vars, I see the following in the logs, which suggests a DNS lookup failure (we use CoreDNS on all of these k8s versions):

{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2020-02-14T18:13:41Z"}
INFO: 2020/02/14 18:13:41 parsed scheme: "dns"
INFO: 2020/02/14 18:13:46 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.kiam-server on 100.64.0.10:53: dial udp 100.64.0.10:53: operation was canceled.
WARNING: 2020/02/14 18:13:46 grpc: failed dns A record lookup due to lookup kiam-server on 100.64.0.10:53: dial udp 100.64.0.10:53: operation was canceled.
INFO: 2020/02/14 18:13:46 ccResolverWrapper: got new service config: 
INFO: 2020/02/14 18:13:46 ccResolverWrapper: sending new addresses to cc: []
{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2020-02-14T18:13:46Z"}

I've scoured the k8s and kops changelogs looking for possible changes that would cause this but have yet to find anything that seems relevant.

Has anyone else experienced this yet?

@jhohertz
Author

Just found this bug, which seems likely to be related: kubernetes/kubernetes#87852

@jhohertz
Author

Update: By some accounts this is specific to the flannel/canal CNI with the vxlan backend, and further testing seems to support that.

@jhohertz
Author

So the problem isn't really with kiam, see: flannel-io/flannel#1243

However, it might be worth warning people, as I suspect flannel/vxlan is not that uncommon.

@jhohertz
Author

Closing since the problem is fixed elsewhere.
