Issue on k8s/kops 1.17 clusters #378

jhohertz · 2020-02-14T20:27:13Z

I have been tracking an issue that appears only on Kubernetes 1.17; the same setup works fine on 1.15 and 1.16. The clusters are launched with kops, and the kops/k8s version is the only thing that changed between the working 1.16 clusters and the non-working 1.17 clusters.

The kiam-agent is unable to establish a link to the kiam-servers and starts crashlooping.

With the gRPC debug env vars enabled, I see this in the logs, which suggests a DNS lookup failure (we use CoreDNS on all of these k8s versions):

{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2020-02-14T18:13:41Z"}
INFO: 2020/02/14 18:13:41 parsed scheme: "dns"
INFO: 2020/02/14 18:13:46 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.kiam-server on 100.64.0.10:53: dial udp 100.64.0.10:53: operation was canceled.
WARNING: 2020/02/14 18:13:46 grpc: failed dns A record lookup due to lookup kiam-server on 100.64.0.10:53: dial udp 100.64.0.10:53: operation was canceled.
INFO: 2020/02/14 18:13:46 ccResolverWrapper: got new service config: 
INFO: 2020/02/14 18:13:46 ccResolverWrapper: sending new addresses to cc: []
{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2020-02-14T18:13:46Z"}

I've scoured the k8s and kops changelogs looking for possible changes that would cause this but have yet to find anything that seems relevant.

Has anyone else experienced this yet?


jhohertz · 2020-02-14T21:05:15Z

Just found this bug, which seems likely to be related: kubernetes/kubernetes#87852

jhohertz · 2020-02-18T15:50:26Z

Update: By some accounts this is specific to the flannel/canal CNI with the vxlan backend, and further testing seems to support that.

jhohertz · 2020-02-18T19:23:59Z

So the problem isn't really with kiam, see: flannel-io/flannel#1243

However it might be worth warning people as I suspect flannel/vxlan is not that uncommon.
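For anyone who wants to check whether a node is affected, here is a minimal sketch (not part of kiam) that sends a single UDP DNS query straight to the cluster DNS address seen in the log above (100.64.0.10:53) and reports whether anything comes back. The function names `build_dns_query` and `probe_dns` are made up for illustration, and it only tells you whether the DNS service IP answers from wherever you run it (for example the host network namespace), not why it fails.

```python
import socket
import struct


def build_dns_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet: one question, recursion desired."""
    # Header: id, flags (RD bit set), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Each label is length-prefixed; the name ends with a zero-length label.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)  # class IN
    return header + question


def probe_dns(server: str, name: str, timeout: float = 5.0, port: int = 53) -> str:
    """Send one UDP query and return 'answered', 'timeout', or the OS error."""
    query = build_dns_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except socket.timeout:
            return "timeout"
        except OSError as exc:
            return f"error: {exc}"
    # Any reply that echoes our query id (even NXDOMAIN) proves reachability.
    if len(reply) >= 2 and reply[:2] == query[:2]:
        return "answered"
    return "unexpected reply"
```

Calling `probe_dns("100.64.0.10", "kiam-server")` on a node and again from inside an ordinary pod gives a quick comparison; substitute your own cluster DNS IP and service name. A "timeout" on the node but "answered" in the pod would be consistent with the behaviour described in the flannel issue.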

jhohertz · 2020-05-11T13:23:06Z

Closing since the problem is fixed elsewhere.

jhohertz mentioned this issue Feb 14, 2020

1.17 alpha versions causing regression for kiam? kubernetes/kops#8562

Closed

jhohertz mentioned this issue Feb 14, 2020

dnsPolicy in hostNetwork not working as expected kubernetes/kubernetes#87852

Closed

Capitrium mentioned this issue Mar 23, 2020

ClusterIP services not accessible when using flannel CNI from host machines in Kubernetes flannel-io/flannel#1243

Closed

jhohertz closed this as completed May 11, 2020
