Summary
Kuma is not working on GKE with Kubernetes versions 1.19 and above. There are two issues with the same root cause: the first is the kuma-init container failing, and the second is kuma-cni failing to configure the network for new containers.
Both issues are present only on GKE clusters whose worker nodes use the COS_CONTAINERD image. The problem is highly visible now because Google removed the Docker container runtime going forward with Kubernetes 1.19, and COS_CONTAINERD became the default image when creating new clusters.
I want to point out that clusters with nodes using the UBUNTU_CONTAINERD image work fine, so using it instead of COS_CONTAINERD may be a temporary workaround if we really want to deploy Kuma on GKE before we fix the issue.
The cause of the issues
The image used by default to create cluster nodes on GKE (COS_CONTAINERD) doesn't contain the ip6tables_nat kernel module, which is needed by kuma-cni and by our transparent-proxy mechanism.
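A quick way to verify this after SSH-ing into a worker node; this is a sketch, and the exact module name used in the modprobe line is an assumption on my part:
lsmod | grep ip6table        # on COS_CONTAINERD no nat-related module shows up
sudo modprobe ip6table_nat   # assumed module name; expected to fail on COS_CONTAINERD, works on UBUNTU_CONTAINERD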
Details
Context
Some time ago I was investigating an issue where Kuma was not working on GKE with Kubernetes 1.19.
During the investigation:
I realized that network interfaces on the nodes have both IPv4 and IPv6 addresses. This is especially weird as GKE does not support IPv6.
As a result, the ip6tables-restore command in kuma-init was failing.
As a quick solution I created #1947. It forces the use of IPv4 only when both addresses (IPv4 and IPv6) are assigned. This was an ugly solution, so we haven't merged it yet. After some internal discussions about how to solve it, our conclusion was to disable IPv6 on GCP/GKE only. That work has not been done yet.
I was curious what the root cause of the ip6tables-restore failure was. I dug deeper and concluded that the probable cause is missing kernel modules. I didn't have more time to investigate it further, so I left it there.
A few weeks later I started investigating another issue related to GKE. One of our community members wrote that kuma-cni is not working on GKE (ref. https://kuma-mesh.slack.com/archives/CN2GN4HE1/p1621526104028000).
Investigation
At this point I was thinking these were separate issues, but I had a guess that the CNI one was also related to IPv6. Both issues result in failing pods with injected Kuma.
At this point we can see that our pods with injected Kuma don't want to start:
without CNI enabled
kubectl describe pod (kubectl get pod -l app=foo -o template --template "{{(index .items 0).metadata.name}}")
# [...]
# Init Containers:
#   kuma-init:
#     [...]
#     State:          Waiting
#       Reason:       CrashLoopBackOff
kubectl logs deploy/foo -c kuma-init
# kumactl is about to apply the iptables rules that will enable transparent proxying on the machine. The SSH connection may drop. If that happens, just reconnect again.
with CNI enabled
kubectl describe pod (kubectl get pod -l app=foo -o template --template "{{(index .items 0).metadata.name}}")
# [...]
# Events:
#   Type     Reason                  Age                 From               Message
#   ----     ------                  ----                ----               -------
#   Normal   Scheduled               64s                 default-scheduler  Successfully assigned default/foo-9fc9cff76-ml8mr to gke-cluster-2-default-pool-79de5cf6-g75l
#   Warning  FailedCreatePodSandBox  63s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "5cc29917d6882301caf2651202fbd21f6315b8cc6bb9f92774e024f8cb024201": exit status 1
#   Warning  FailedCreatePodSandBox  62s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "c351b66881e8a2d047c3a2f0b6bf44d00cce3d3f2c16d66250e4799f9b120864": exit status 1
#   Warning  FailedCreatePodSandBox  61s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "bc8588aa3db93b1e59c2184456ab11b002518843afbbce6d6124c48a26e4061d": exit status 1
#   Warning  FailedCreatePodSandBox  60s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "832e78f2801a45764224307eb3e6d3413865093f9b086d19bf489b4ebd7c6dbf": exit status 1
#   Warning  FailedCreatePodSandBox  59s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "9dbb996ba8ab820251f696f2ec61c47bad18af27e17ef8b0ef5de93748eeecec": exit status 1
#   Warning  FailedCreatePodSandBox  58s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "582e929f25e0d7ce7927ced1201fe1a18eca2eb5b3a914db4f3cdfecc62f62d9": exit status 1
#   Warning  FailedCreatePodSandBox  57s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "79ca2a01bf19f6d5d2d1465d65d05b47160cf7e7572963f4e8e6c964a319250b": exit status 1
#   Warning  FailedCreatePodSandBox  56s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "395a76bda5b9febdb0b04b311552305c159166cca8fc8adfc8cb9600850f943d": exit status 1
#   Warning  FailedCreatePodSandBox  55s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "79ed50635f9cc1f70ea67cc3f90932c78e0e223768f20251594226c81a6aa949": exit status 1
#   Warning  FailedCreatePodSandBox  39s (x16 over 54s)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "061acb8e6f45a69f5eed09b991c025e1140c8b7264f76a51c8cbdad039beec4b": exit status 1
The most helpful place during my investigation was the /tmp/kuma-cni.log file on the worker node.
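For reference, one way to read that file from your workstation; the node name is a placeholder and the zone is the one used in the reproduction steps below:
gcloud compute ssh [worker-node-name] --zone us-central1-c --command "sudo cat /tmp/kuma-cni.log"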
Steps To Reproduce
Comments
I'm using the fish shell, so you may have to adjust some bits to make the commands work in other shells
Everything inside square brackets [] needs to be replaced by actual values
As the initial problem described by the community member was present in multizone mode, I was reproducing it with this mode in mind. That doesn't mean the issue is not present in standalone mode, though.
Expose GCP project name where all the testing will happen:
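The commands below reference this variable; a minimal sketch in fish (replace the bracketed value):
set -x GCP_PROJECT_NAME [name-of-your-gcp-project]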
Create two GKE clusters with Kubernetes version >= 1.19
for i in 1 2;
    gcloud beta container --project "$GCP_PROJECT_NAME" clusters create "cluster-$i" \
        --zone "us-central1-c" \
        --no-enable-basic-auth \
        --cluster-version "1.19.9-gke.1400" \
        --release-channel "regular" \
        --machine-type "n1-standard-2" \
        --image-type "COS_CONTAINERD" \
        --disk-type "pd-standard" \
        --disk-size "100" \
        --num-nodes "1" \
        --enable-network-policy \
        --node-locations "us-central1-c" &
end
wait
Update your local kubeconfig with the clusters just created
for i in 1 2;
    gcloud container clusters get-credentials "cluster-$i" --zone us-central1-c --project "$GCP_PROJECT_NAME"
end
Expose variables with the names of the just-created contexts/clusters. The names will be returned after the first step finishes.
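The exact command is not included here; a sketch in fish, assuming the usual GKE context naming pattern gke_<project>_<zone>_<cluster> and hypothetical variable names CLUSTER_1/CLUSTER_2 reused in the sketches below:
set -x CLUSTER_1 "gke_"$GCP_PROJECT_NAME"_us-central1-c_cluster-1"
set -x CLUSTER_2 "gke_"$GCP_PROJECT_NAME"_us-central1-c_cluster-2"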
Install Kuma on the first cluster
a. without CNI enabled to reproduce the issue with kuma-init
b. with CNI enabled to reproduce the issue with kuma-cni
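The original commands are not included here; a sketch of how this could look with kumactl in multizone (global) mode, assuming Kuma 1.1.x flags and the CLUSTER_1 variable defined above:
# a. global control plane without CNI
kumactl install control-plane --mode=global | kubectl --context "$CLUSTER_1" apply -f -
# b. global control plane with CNI enabled
kumactl install control-plane --mode=global --cni-enabled | kubectl --context "$CLUSTER_1" apply -f -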
Wait enough time for the kuma-global-remote-sync service to receive an external address. You can check it by running the command:
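A sketch of such a check, assuming Kuma's default kuma-system namespace and the CLUSTER_1 variable defined above:
kubectl --context "$CLUSTER_1" get service kuma-global-remote-sync -n kuma-system
# wait until EXTERNAL-IP shows an address instead of <pending>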
Expose the kuma-global-remote-sync address
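A sketch in fish; GLOBAL_SYNC_ADDRESS is a hypothetical variable name used in the next step:
set -x GLOBAL_SYNC_ADDRESS (kubectl --context "$CLUSTER_1" get service kuma-global-remote-sync -n kuma-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}')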
Install Kuma on the second cluster
a. without CNI enabled to reproduce the issue with kuma-init
b. with CNI enabled to reproduce the issue with kuma-cni
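Again only a sketch, assuming the remote-mode flags from Kuma 1.1.x and the variables defined above:
# a. remote (zone) control plane without CNI
kumactl install control-plane --mode=remote --zone=cluster-2 --kds-global-address grpcs://$GLOBAL_SYNC_ADDRESS:5685 | kubectl --context "$CLUSTER_2" apply -f -
# b. the same with CNI enabled
kumactl install control-plane --mode=remote --zone=cluster-2 --cni-enabled --kds-global-address grpcs://$GLOBAL_SYNC_ADDRESS:5685 | kubectl --context "$CLUSTER_2" apply -f -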
Annotate default namespace to inject Kuma for new pods
kubectl annotate namespace default kuma.io/sidecar-injection=enabled --overwrite
Create some test deployment on the second cluster
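The original manifest is not included; any deployment works, for example a minimal one matching the foo name used in the kubectl commands earlier in this issue:
kubectl --context "$CLUSTER_2" create deployment foo --image=nginx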
Additional Details & Logs
Kuma version: 1.1.6
Environment: GKE with Kubernetes 1.19
Installation: helm and kumactl