[Summary] Issues on GKE (kuma-cni and kuma-init) #2046

Closed
bartsmykla opened this issue May 26, 2021 · 1 comment · Fixed by #2051
@bartsmykla (Contributor)

Summary

Kuma does not work on GKE with Kubernetes versions 1.19 and above. There are two issues with the same root cause: the kuma-init container fails, and kuma-cni fails to configure networking for new containers.

Both issues are present only on GKE clusters whose worker nodes use the COS_CONTAINERD image. The problem is highly visible now because Google removed the Docker container runtime starting with Kubernetes 1.19, and COS_CONTAINERD became the default image when creating new clusters.

I want to point out that clusters with nodes using the UBUNTU_CONTAINERD image work fine, so using it instead of COS_CONTAINERD may be a temporary workaround for deploying Kuma on GKE until we fix the issue.

The cause of the issues

The image used by default for GKE cluster nodes (COS_CONTAINERD) does not contain the ip6table_nat kernel module, which is needed by kuma-cni and by our transparent-proxy mechanism.
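
A quick way to confirm this is to check the module on a worker node. This is only a sketch: the node name is a placeholder (take it from kubectl get nodes), and the zone/project match the repro steps below.

    # list the worker nodes to get a node name
    kubectl get nodes

    # SSH to the node and check whether ip6table_nat is loaded or can be loaded;
    # on the affected COS_CONTAINERD nodes modprobe fails because the module is not shipped
    gcloud compute ssh [node_name] --zone us-central1-c --project "$GCP_PROJECT_NAME" \
      -- 'lsmod | grep ip6table_nat; sudo modprobe ip6table_nat'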

Details

Context

Some time ago I was investigating an issue where Kuma was not working on GKE with Kubernetes 1.19.

During the investigation:

  1. I realized that network interfaces on the nodes have both IPv4 and IPv6 addresses, which is especially odd as GKE does not support IPv6 (see the sketch after this list).

  2. As a result, the ip6tables-restore command run by kuma-init was failing.
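
A simple way to see the dual addresses is to list the addresses on a node's interfaces (a sketch; the node name is a placeholder):

    # inet lines are IPv4 addresses, inet6 lines are IPv6 addresses
    gcloud compute ssh [node_name] --zone us-central1-c --project "$GCP_PROJECT_NAME" -- 'ip addr show'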

As a quick solution I created #1947, which forces the use of IPv4 only when both an IPv4 and an IPv6 address are assigned. This was an ugly solution, so we haven't merged it yet. After some internal discussions about how to solve it, our conclusion was to disable IPv6 on GCP/GKE only. That work has not been done yet.

I was curious about the root cause of the ip6tables-restore command failing. I dug deeper and concluded that the probable cause was missing kernel modules. I didn't have more time to investigate further, so I left it there.
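
For reference, a minimal probe that reflects this conclusion (a sketch; run it on the node or in a privileged container with host access, and the exact error text depends on the iptables backend):

    # listing the IPv6 nat table requires the ip6table_nat module; on an affected
    # COS_CONTAINERD node this errors out instead of printing the (empty) chains
    sudo ip6tables -t nat -L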

A few weeks later I started investigating another GKE-related issue. One of our community members reported that kuma-cni was not working on GKE (ref. https://kuma-mesh.slack.com/archives/CN2GN4HE1/p1621526104028000).

Investigation

At this point I thought these were separate issues, but I suspected the CNI one was also related to IPv6. Both issues result in failing pods with Kuma injected.

At this point we can see that pods with Kuma injected don't start:

  • without CNI enabled

    kubectl describe pod (kubectl get pod -l app=foo -o template --template "{{(index .items 0).metadata.name}}")
    
    # [...]
    # Init Containers:
    #   kuma-init:
    #      [...]
    #      State:          Waiting
    #      Reason:       CrashLoopBackOff
    kubectl logs deploy/foo -c kuma-init
    
    # kumactl is about to apply the iptables rules that will enable transparent proxying on the machine. The SSH connection may drop. If that happens, just reconnect again.
  • with CNI enabled

    kubectl describe pod (kubectl get pod -l app=foo -o template --template "{{(index .items 0).metadata.name}}")
    
    # [...]
    # Events:
    #   Type     Reason                  Age                 From               Message
    #   ----     ------                  ----                ----               -------
    #   Normal   Scheduled               64s                 default-scheduler  Successfully assigned default/foo-9fc9cff76-ml8mr to gke-cluster-2-default-pool-79de5cf6-g75l
    #   Warning  FailedCreatePodSandBox  63s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "5cc29917d6882301caf2651202fbd21f6315b8cc6bb9f92774e024f8cb024201": exit status 1
    #   Warning  FailedCreatePodSandBox  62s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "c351b66881e8a2d047c3a2f0b6bf44d00cce3d3f2c16d66250e4799f9b120864": exit status 1
    #   Warning  FailedCreatePodSandBox  61s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "bc8588aa3db93b1e59c2184456ab11b002518843afbbce6d6124c48a26e4061d": exit status 1
    #   Warning  FailedCreatePodSandBox  60s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "832e78f2801a45764224307eb3e6d3413865093f9b086d19bf489b4ebd7c6dbf": exit status 1
    #   Warning  FailedCreatePodSandBox  59s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "9dbb996ba8ab820251f696f2ec61c47bad18af27e17ef8b0ef5de93748eeecec": exit status 1
    #   Warning  FailedCreatePodSandBox  58s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "582e929f25e0d7ce7927ced1201fe1a18eca2eb5b3a914db4f3cdfecc62f62d9": exit status 1
    #   Warning  FailedCreatePodSandBox  57s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "79ca2a01bf19f6d5d2d1465d65d05b47160cf7e7572963f4e8e6c964a319250b": exit status 1
    #   Warning  FailedCreatePodSandBox  56s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "395a76bda5b9febdb0b04b311552305c159166cca8fc8adfc8cb9600850f943d": exit status 1
    #   Warning  FailedCreatePodSandBox  55s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "79ed50635f9cc1f70ea67cc3f90932c78e0e223768f20251594226c81a6aa949": exit status 1
    #   Warning  FailedCreatePodSandBox  39s (x16 over 54s)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "061acb8e6f45a69f5eed09b991c025e1140c8b7264f76a51c8cbdad039beec4b": exit status 1

The most helpful place during my investigation was the /tmp/kuma-cni.log file on the worker node.
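
To read it, SSH to the worker node (the node name is a placeholder taken from kubectl get nodes):

    # tail the CNI plugin log on the worker node
    gcloud compute ssh [node_name] --zone us-central1-c --project "$GCP_PROJECT_NAME" \
      -- 'sudo tail -n 100 /tmp/kuma-cni.log'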

Steps To Reproduce

Comments

  • I'm using the fish shell, so you may have to adjust some bits to make the commands work in other shells
  • Everything inside square brackets [] needs to be replaced by actual values
  • As the initial problem described by the community member occurred in multizone mode, I reproduced it with this mode in mind. That doesn't mean the issue is not present in standalone mode, though.
  1. Expose GCP project name where all the testing will happen:

    set -gx GCP_PROJECT_NAME [the_real_name_of_your_gcp_project]
  2. Create two GKE clusters with Kubernetes version >= 1.19

    for i in 1 2;
      gcloud beta container --project "$GCP_PROJECT_NAME" clusters create "cluster-$i" \
        --zone "us-central1-c" \
        --no-enable-basic-auth \
        --cluster-version "1.19.9-gke.1400" \
        --release-channel "regular" \
        --machine-type "n1-standard-2" \
        --image-type "COS_CONTAINERD" \
        --disk-type "pd-standard" \
        --disk-size "100" \
        --num-nodes "1" \
        --enable-network-policy \
        --node-locations "us-central1-c" &
    end
    
    wait
  3. Update your local kubeconfig with the clusters just created

    for i in 1 2;
      gcloud container clusters get-credentials "cluster-$i" --zone us-central1-c --project "$GCP_PROJECT_NAME"
    end
  4. Expose variables with the names of the just-created contexts/clusters. The names are printed when the previous steps finish.

    set -gx C1 [the_name_of_the_first_cluster]
    set -gx C2 [the_name_of_the_second_cluster]
  5. Install Kuma on the first cluster

    a. without CNI enabled, to reproduce the issue with kuma-init

    kubectl config use-context "$C1"
    
    kubectl create ns kuma-system
    
    helm install --debug --namespace kuma-system --set controlPlane.mode="global" kuma kuma/kuma

    b. with CNI enabled, to reproduce the issue with kuma-cni

    kubectl config use-context "$C1"
    
    kubectl create ns kuma-system
    
    helm install --debug --namespace kuma-system --set cni.enabled=true,cni.chained=true,cni.netDir="/etc/cni/net.d",cni.binDir="/home/kubernetes/bin",cni.confName="10-calico.conflist",cni.logLevel="debug",controlPlane.mode="global" kuma kuma/kuma
  6. Wait until the kuma-global-remote-sync service receives an external IP address. You can check it by running:

    kubectl get service kuma-global-remote-sync -n kuma-system -o template --template="{{(index .status.loadBalancer.ingress 0).ip}}" 
  7. Expose the kuma-global-remote-sync address

    set -gx KDS_ADDRESS (kubectl get service kuma-global-remote-sync -n kuma-system -o template --template="{{(index .status.loadBalancer.ingress 0).ip}}")
  8. Install Kuma on the second cluster

    a. without CNI enabled, to reproduce the issue with kuma-init

    kubectl config use-context "$C2"
    
    kubectl create ns kuma-system
    
    helm install --debug --namespace kuma-system --set controlPlane.mode="remote",controlPlane.zone="remote-1",controlPlane.kdsGlobalAddress="grpcs://$KDS_ADDRESS:5685",ingress.enabled=true kuma kuma/kuma

    b. with CNI enabled, to reproduce the issue with kuma-cni

    kubectl config use-context "$C2"
    
    kubectl create ns kuma-system
    
    helm install --debug --namespace kuma-system --set cni.enabled=true,cni.chained=true,cni.netDir="/etc/cni/net.d",cni.binDir="/home/kubernetes/bin",cni.confName="10-calico.conflist",cni.logLevel="debug",controlPlane.mode="remote",controlPlane.zone="remote-1",controlPlane.kdsGlobalAddress="grpcs://$KDS_ADDRESS:5685",ingress.enabled=true kuma kuma/kuma
  9. Annotate the default namespace so Kuma is injected into new pods

    kubectl annotate namespace default kuma.io/sidecar-injection=enabled --overwrite
  10. Create some test deployment on the second cluster

    kubectl create deployment foo --image=ubuntu:focal -- sleep infinity 
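
You can then watch the pod to observe the failure: without CNI it ends up in CrashLoopBackOff on kuma-init, with CNI it stays in ContainerCreating with the FailedCreatePodSandBox events shown above.

    kubectl get pods -l app=foo -w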

Additional Details & Logs

  • Version: 1.1.6
  • Platform and Operating System: GKE 1.19
  • Installation Method: helm and kumactl
@bartsmykla bartsmykla self-assigned this May 26, 2021
@bartsmykla bartsmykla added the bug label May 26, 2021
@bartsmykla (Contributor, Author)

Following up.

I found that the newest versions of the COS images should have this module enabled, according to https://cos.googlesource.com/cos/overlays/board-overlays/+/7c1f05c2df5afaecac577d464a63b88a2d28092d and https://cloud.google.com/container-optimized-os/docs/release-notes#cos-dev-93-16340-0-0.
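
A quick way to check which COS release your nodes run (and so whether they should already ship the module) is, for example:

    # the OS-IMAGE column shows the Container-Optimized OS release on each node
    kubectl get nodes -o wide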
