[Summary] Issues on GKE (kuma-cni and kuma-init) #2046

Closed
bartsmykla opened this issue May 26, 2021 · 1 comment · Fixed by #2051
@bartsmykla (Contributor)

Summary

Kuma does not work on GKE with Kubernetes versions 1.19 and above. There are two issues with the same root cause: the kuma-init container fails, and kuma-cni fails to configure networking for new containers.

Both issues are present only on GKE clusters whose worker nodes use the COS_CONTAINERD image. The problem is highly visible now because Google removed the Docker container runtime starting with Kubernetes 1.19, and COS_CONTAINERD became the default image when creating new clusters.

I want to point out that clusters with nodes using the UBUNTU_CONTAINERD image work fine, so using it instead of COS_CONTAINERD may be a temporary workaround for deploying Kuma on GKE until we fix the issue.

The cause of the issues

The image used by default for GKE cluster nodes (COS_CONTAINERD) does not contain the ip6table_nat kernel module, which is needed by kuma-cni and by our transparent-proxy mechanism.
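
A quick way to confirm this is to check the module on a worker node. This is only a sketch: the node name is a placeholder (take it from kubectl get nodes), and the zone/project match the repro steps below.

    # list the worker nodes to get a node name
    kubectl get nodes

    # SSH to the node and check whether ip6table_nat is loaded or can be loaded;
    # on the affected COS_CONTAINERD nodes modprobe fails because the module is not shipped
    gcloud compute ssh [node_name] --zone us-central1-c --project "$GCP_PROJECT_NAME" \
      -- 'lsmod | grep ip6table_nat; sudo modprobe ip6table_nat'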

Details

Context

Some time ago I was investigating an issue where Kuma was not working on GKE with Kubernetes 1.19.

During the investigation:

  1. I realized that network interfaces on the nodes have both IPv4 and IPv6 addresses, which is especially odd as GKE does not support IPv6 (see the sketch after this list).

  2. As a result, the ip6tables-restore command run by kuma-init was failing.
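
A simple way to see the dual addresses is to list the addresses on a node's interfaces (a sketch; the node name is a placeholder):

    # inet lines are IPv4 addresses, inet6 lines are IPv6 addresses
    gcloud compute ssh [node_name] --zone us-central1-c --project "$GCP_PROJECT_NAME" -- 'ip addr show'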

As a quick solution I created #1947, which forces the use of IPv4 only when both an IPv4 and an IPv6 address are assigned. This was an ugly solution, so we haven't merged it yet. After some internal discussions about how to solve it, our conclusion was to disable IPv6 on GCP/GKE only. That work has not been done yet.

I was curious about the root cause of the ip6tables-restore command failing. I dug deeper and concluded that the probable cause was missing kernel modules. I didn't have more time to investigate further, so I left it there.
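
For reference, a minimal probe that reflects this conclusion (a sketch; run it on the node or in a privileged container with host access, and the exact error text depends on the iptables backend):

    # listing the IPv6 nat table requires the ip6table_nat module; on an affected
    # COS_CONTAINERD node this errors out instead of printing the (empty) chains
    sudo ip6tables -t nat -L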

A few weeks later I started investigating another GKE-related issue. One of our community members reported that kuma-cni was not working on GKE (ref. https://kuma-mesh.slack.com/archives/CN2GN4HE1/p1621526104028000).

Investigation

At this point I thought these were separate issues, but I suspected the CNI one was also related to IPv6. Both issues result in failing pods with Kuma injected.

At this point we can see that pods with Kuma injected don't start:

  • without CNI enabled

    kubectl describe pod (kubectl get pod -l app=foo -o template --template "{{(index .items 0).metadata.name}}")
    
    # [...]
    # Init Containers:
    #   kuma-init:
    #      [...]
    #      State:          Waiting
    #      Reason:       CrashLoopBackOff
    kubectl logs deploy/foo -c kuma-init
    
    # kumactl is about to apply the iptables rules that will enable transparent proxying on the machine. The SSH connection may drop. If that happens, just reconnect again.
  • with CNI enabled

    kubectl describe pod (kubectl get pod -l app=foo -o template --template "{{(index .items 0).metadata.name}}")
    
    # [...]
    # Events:
    #   Type     Reason                  Age                 From               Message
    #   ----     ------                  ----                ----               -------
    #   Normal   Scheduled               64s                 default-scheduler  Successfully assigned default/foo-9fc9cff76-ml8mr to gke-cluster-2-default-pool-79de5cf6-g75l
    #   Warning  FailedCreatePodSandBox  63s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "5cc29917d6882301caf2651202fbd21f6315b8cc6bb9f92774e024f8cb024201": exit status 1
    #   Warning  FailedCreatePodSandBox  62s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "c351b66881e8a2d047c3a2f0b6bf44d00cce3d3f2c16d66250e4799f9b120864": exit status 1
    #   Warning  FailedCreatePodSandBox  61s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "bc8588aa3db93b1e59c2184456ab11b002518843afbbce6d6124c48a26e4061d": exit status 1
    #   Warning  FailedCreatePodSandBox  60s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "832e78f2801a45764224307eb3e6d3413865093f9b086d19bf489b4ebd7c6dbf": exit status 1
    #   Warning  FailedCreatePodSandBox  59s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "9dbb996ba8ab820251f696f2ec61c47bad18af27e17ef8b0ef5de93748eeecec": exit status 1
    #   Warning  FailedCreatePodSandBox  58s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "582e929f25e0d7ce7927ced1201fe1a18eca2eb5b3a914db4f3cdfecc62f62d9": exit status 1
    #   Warning  FailedCreatePodSandBox  57s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "79ca2a01bf19f6d5d2d1465d65d05b47160cf7e7572963f4e8e6c964a319250b": exit status 1
    #   Warning  FailedCreatePodSandBox  56s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "395a76bda5b9febdb0b04b311552305c159166cca8fc8adfc8cb9600850f943d": exit status 1
    #   Warning  FailedCreatePodSandBox  55s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "79ed50635f9cc1f70ea67cc3f90932c78e0e223768f20251594226c81a6aa949": exit status 1
    #   Warning  FailedCreatePodSandBox  39s (x16 over 54s)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "061acb8e6f45a69f5eed09b991c025e1140c8b7264f76a51c8cbdad039beec4b": exit status 1

The most helpful place during my investigation was the /tmp/kuma-cni.log file on the worker node.
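
To read it, SSH to the worker node (the node name is a placeholder taken from kubectl get nodes):

    # tail the CNI plugin log on the worker node
    gcloud compute ssh [node_name] --zone us-central1-c --project "$GCP_PROJECT_NAME" \
      -- 'sudo tail -n 100 /tmp/kuma-cni.log'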

Steps To Reproduce

Comments

  • I'm using the fish shell, so you may have to adjust some bits to make the commands work in other shells
  • Everything inside square brackets [] needs to be replaced by actual values
  • As the initial problem described by the community member occurred in multizone mode, I reproduced it with this mode in mind. That doesn't mean the issue is not present in standalone mode, though.
  1. Expose GCP project name where all the testing will happen:

    set -gx GCP_PROJECT_NAME [the_real_name_of_your_gcp_project]
  2. Create two GKE clusters with Kubernetes version >= 1.19

    for i in 1 2;
      gcloud beta container --project "$GCP_PROJECT_NAME" clusters create "cluster-$i" \
        --zone "us-central1-c" \
        --no-enable-basic-auth \
        --cluster-version "1.19.9-gke.1400" \
        --release-channel "regular" \
        --machine-type "n1-standard-2" \
        --image-type "COS_CONTAINERD" \
        --disk-type "pd-standard" \
        --disk-size "100" \
        --num-nodes "1" \
        --enable-network-policy \
        --node-locations "us-central1-c" &
    end
    
    wait
  3. Update your local kubeconfig with the clusters just created

    for i in 1 2;
      gcloud container clusters get-credentials "cluster-$i" --zone us-central1-c --project "$GCP_PROJECT_NAME"
    end
  4. Expose variables with the names of the just-created contexts/clusters. The names are printed when the previous steps finish.

    set -gx C1 [the_name_of_the_first_cluster]
    set -gx C2 [the_name_of_the_second_cluster]
  5. Install Kuma on the first cluster

    a. without CNI enabled, to reproduce the issue with kuma-init

    kubectl config use-context "$C1"
    
    kubectl create ns kuma-system
    
    helm install --debug --namespace kuma-system --set controlPlane.mode="global" kuma kuma/kuma

    b. with CNI enabled, to reproduce the issue with kuma-cni

    kubectl config use-context "$C1"
    
    kubectl create ns kuma-system
    
    helm install --debug --namespace kuma-system --set cni.enabled=true,cni.chained=true,cni.netDir="/etc/cni/net.d",cni.binDir="/home/kubernetes/bin",cni.confName="10-calico.conflist",cni.logLevel="debug",controlPlane.mode="global" kuma kuma/kuma
  6. Wait until the kuma-global-remote-sync service receives an external IP address. You can check it by running:

    kubectl get service kuma-global-remote-sync -n kuma-system -o template --template="{{(index .status.loadBalancer.ingress 0).ip}}" 
  7. Expose the kuma-global-remote-sync address

    set -gx KDS_ADDRESS (kubectl get service kuma-global-remote-sync -n kuma-system -o template --template="{{(index .status.loadBalancer.ingress 0).ip}}")
  8. Install Kuma on the second cluster

    a. without CNI enabled, to reproduce the issue with kuma-init

    kubectl config use-context "$C2"
    
    kubectl create ns kuma-system
    
    helm install --debug --namespace kuma-system --set controlPlane.mode="remote",controlPlane.zone="remote-1",controlPlane.kdsGlobalAddress="grpcs://$KDS_ADDRESS:5685",ingress.enabled=true kuma kuma/kuma

    b. with CNI enabled, to reproduce the issue with kuma-cni

    kubectl config use-context "$C2"
    
    kubectl create ns kuma-system
    
    helm install --debug --namespace kuma-system --set cni.enabled=true,cni.chained=true,cni.netDir="/etc/cni/net.d",cni.binDir="/home/kubernetes/bin",cni.confName="10-calico.conflist",cni.logLevel="debug",controlPlane.mode="remote",controlPlane.zone="remote-1",controlPlane.kdsGlobalAddress="grpcs://$KDS_ADDRESS:5685",ingress.enabled=true kuma kuma/kuma
  9. Annotate the default namespace so Kuma is injected into new pods

    kubectl annotate namespace default kuma.io/sidecar-injection=enabled --overwrite
  10. Create some test deployment on the second cluster

    kubectl create deployment foo --image=ubuntu:focal -- sleep infinity 
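
You can then watch the pod to observe the failure: without CNI it ends up in CrashLoopBackOff on kuma-init, with CNI it stays in ContainerCreating with the FailedCreatePodSandBox events shown above.

    kubectl get pods -l app=foo -w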

Additional Details & Logs

  • Version: 1.1.6
  • Platform and Operating System: GKE 1.19
  • Installation Method: helm and kumactl
@bartsmykla bartsmykla self-assigned this May 26, 2021
@bartsmykla bartsmykla added the bug label May 26, 2021
@bartsmykla (Contributor, Author)

Following up.

I found that the newest versions of the COS images should have this module enabled, according to https://cos.googlesource.com/cos/overlays/board-overlays/+/7c1f05c2df5afaecac577d464a63b88a2d28092d and https://cloud.google.com/container-optimized-os/docs/release-notes#cos-dev-93-16340-0-0.
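
A quick way to check which COS release your nodes run (and so whether they should already ship the module) is, for example:

    # the OS-IMAGE column shows the Container-Optimized OS release on each node
    kubectl get nodes -o wide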
