rke2-canal pod is not running due to incompatible ipset protocol version #4145

Closed
prashanth-nani opened this issue Apr 27, 2023 · 11 comments · Fixed by rancher/image-build-calico#40

prashanth-nani commented Apr 27, 2023

Environmental Info:
RKE2 Version:
v1.26.4+rke2r1

Node(s) CPU architecture, OS, and Version:
Linux k8s-agent16 5.15.0-70-generic #77-Ubuntu SMP Tue Mar 21 14:02:37 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

2 servers, 16 agents all running Ubuntu 22.04

Describe the bug:

rke2-canal pods on some agents are not starting. The calico-node container logs contain the following:

2023-04-27 12:01:53.719 [WARNING][2437501] felix/ipsets.go 319: Failed to resync with dataplane error=exit status 1 family="inet"
2023-04-27 12:01:53.752 [INFO][2437501] felix/ipsets.go 309: Retrying after an ipsets update failure... family="inet"
2023-04-27 12:01:53.753 [ERROR][2437501] felix/ipsets.go 569: Bad return code from 'ipset list'. error=exit status 1 family="inet" stderr="ipset v6.36: Kernel support protocol versions 6-7 while userspace supports protocol versions 6-6\nKernel and userspace incompatible: settype hash:net with revision 7 not supported by userspace.\n"
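
For reference, these lines can be pulled from the calico-node container of an affected pod with something like the following (the pod name is a placeholder; use the pod that is stuck at 1/2):

# <pod-name> is the affected rke2-canal pod
kubectl -n kube-system logs <pod-name> -c calico-node | grep -i ipset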

There was a similar issue reported at projectcalico/calico#5011, but there it is stated that the problem only occurs when kube-proxy runs in ipvs mode and should not affect clusters where the proxy mode is iptables. I have confirmed that the proxy mode here is iptables. Here are the logs from the kube-proxy pod.

I0424 19:10:54.248089       1 server.go:224] "Warning, all flags other than --config, --write-config-to, and --cleanup are deprecated, please begin using a config file ASAP"
I0424 19:10:54.257660       1 node.go:163] Successfully retrieved node IP: 192.168.39.77
I0424 19:10:54.257687       1 server_others.go:109] "Detected node IP" address="192.168.39.77"
I0424 19:10:54.294553       1 server_others.go:176] "Using iptables Proxier"
I0424 19:10:54.294622       1 server_others.go:183] "kube-proxy running in dual-stack mode" ipFamily=IPv4
I0424 19:10:54.294646       1 server_others.go:184] "Creating dualStackProxier for iptables"
I0424 19:10:54.294680       1 server_others.go:465] "Detect-local-mode set to ClusterCIDR, but no IPv6 cluster CIDR defined, , defaulting to no-op detect-local for IPv6"
I0424 19:10:54.294748       1 proxier.go:242] "Setting route_localnet=1 to allow node-ports on localhost; to change this either disable iptables.localhostNodePorts (--iptables-localhost-nodeports) or set nodePortAddresses (--nodeport-addresses) to filter loopback addresses"
I0424 19:10:54.295311       1 server.go:655] "Version info" version="v1.26.4+rke2r1"
I0424 19:10:54.295341       1 server.go:657] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I0424 19:10:54.296172       1 config.go:226] "Starting endpoint slice config controller"
I0424 19:10:54.296195       1 shared_informer.go:270] Waiting for caches to sync for endpoint slice config
I0424 19:10:54.296235       1 config.go:444] "Starting node config controller"
I0424 19:10:54.296255       1 shared_informer.go:270] Waiting for caches to sync for node config
I0424 19:10:54.296254       1 config.go:317] "Starting service config controller"
I0424 19:10:54.296275       1 shared_informer.go:270] Waiting for caches to sync for service config
I0424 19:10:54.397015       1 shared_informer.go:277] Caches are synced for node config
I0424 19:10:54.397063       1 shared_informer.go:277] Caches are synced for endpoint slice config
I0424 19:10:54.397175       1 shared_informer.go:277] Caches are synced for service config
E0425 16:30:24.197986       1 service_health.go:187] "Healthcheck closed" err="accept tcp [::]:32666: use of closed network connection" service="istio-system/istio-ingressgateway"
E0425 16:30:24.198068       1 service_health.go:187] "Healthcheck closed" err="accept tcp [::]:32675: use of closed network connection" service="istio-system/istio-internal-ingressgateway"
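
A quicker way to confirm the active proxy mode than reading the logs is to query kube-proxy's metrics endpoint on the node (assuming the default metrics port 10249):

# run on the node; prints "iptables" or "ipvs"
curl -s http://127.0.0.1:10249/proxyMode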

Steps To Reproduce:

Installed RKE2 using the following steps:

sudo swapoff -a

hostnamectl set-hostname k8s-master01

# add the master node details in every node
vi /etc/hosts
192.168.39.5 k8s-master01


# kubectl install on Debian based distributions
sudo apt update
sudo apt install -y apt-transport-https ca-certificates curl
sudo curl -fsSLo /usr/share/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg
echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update
sudo apt install -y kubectl

#network_bridges
sudo tee -a /etc/sysctl.d/99-kubernetes.conf <<EOF
net.bridge.bridge-nf-call-ip6tables = 1 
net.bridge.bridge-nf-call-iptables = 1
EOF

cat >>/etc/sysctl.d/kubernetes.conf<<EOF
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF

sysctl --system

curl -sfL https://get.rke2.io | INSTALL_RKE2_CHANNEL=latest sh -

# first server node 
systemctl enable rke2-server
systemctl start rke2-server

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml PATH=$PATH:/var/lib/rancher/rke2/bin

cat /var/lib/rancher/rke2/server/node-token

# second server node
mkdir -p /etc/rancher/rke2
vi /etc/rancher/rke2/config.yaml
server: https://192.168.39.2:9345
token: <TOKEN_FROM_THE_ABOVE_CAT_COMMAND>

systemctl enable rke2-server
systemctl start rke2-server

# agent node
mkdir -p /etc/rancher/rke2
vi /etc/rancher/rke2/config.yaml

server: https://192.168.39.2:9345
token: <TOKEN_FROM_THE_ABOVE_CAT_COMMAND>

systemctl enable rke2-agent
systemctl start rke2-agent

Expected behavior:
Running kubectl get pod -n kube-system should show all rke2-canal pods as Running.

Actual behavior:

Some of the rke2-canal pods are stuck at 1/2 Ready.
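
The affected pods can be listed with something like the following (the k8s-app=canal label is an assumption based on the upstream canal manifest):

kubectl get pods -n kube-system -l k8s-app=canal -o wide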

Additional context / logs:

On the host:

# ipset version
ipset v7.15, protocol version: 7

In the calico-node container of the rke2-canal pod running on the same host:

sh-4.4# ipset version
ipset v6.36, protocol version: 6

Note that this behavior is observed on only one server node and one agent node; all other nodes are working fine. The one thing common to both of these nodes is that the output of ipset list contains sets with Revision: 7.
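
A terse listing makes it easier to spot which sets are at revision 7 without dumping all the members (ipset's -t flag prints headers only):

# run on the host as root
ipset list -t | grep -E '^(Name|Revision)'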

Output of ipset list from the problematic agent node:

Name: cali40all-ipam-pools
Type: hash:net
Revision: 7
Header: family inet hashsize 1024 maxelem 1048576 bucketsize 12 initval 0x54a33cf9
Size in memory: 504
References: 0
Number of entries: 1
Members:
10.42.0.0/16

Name: cali40masq-ipam-pools
Type: hash:net
Revision: 7
Header: family inet hashsize 1024 maxelem 1048576 bucketsize 12 initval 0xdd3fa3ae
Size in memory: 504
References: 0
Number of entries: 1
Members:
10.42.0.0/16

Name: cali40this-host
Type: hash:ip
Revision: 5
Header: family inet hashsize 1024 maxelem 1048576 bucketsize 12 initval 0x2d605c42
Size in memory: 360
References: 0
Number of entries: 4
Members:
192.168.39.77
10.42.180.64
127.0.0.1
127.0.0.0

Name: cali40all-vxlan-net
Type: hash:net
Revision: 7
Header: family inet hashsize 1024 maxelem 1048576 bucketsize 12 initval 0x17e7d094
Size in memory: 1320
References: 0
Number of entries: 18
Members:
192.168.39.99
192.168.39.91
192.168.39.154
192.168.39.72
192.168.39.3
192.168.39.79
192.168.39.98
192.168.39.74
192.168.39.78
192.168.39.96
192.168.39.92
192.168.39.94
192.168.39.151
192.168.39.93
192.168.39.2
192.168.39.1
192.168.39.97
192.168.39.95

Output of ipset list from the node that is working fine:

Name: cali40all-ipam-pools
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 1048576
Size in memory: 504
References: 1
Number of entries: 1
Members:
172.16.0.0/16

Name: cali40masq-ipam-pools
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 1048576
Size in memory: 504
References: 1
Number of entries: 1
Members:
172.16.0.0/16

Name: cali40this-host
Type: hash:ip
Revision: 4
Header: family inet hashsize 1024 maxelem 1048576
Size in memory: 360
References: 0
Number of entries: 4
Members:
10.42.30.0
127.0.0.1
192.168.39.96
127.0.0.0
@brandond (Member)

In my experience the ipset kernel/userspace mismatch is only critical if something else adds ipsets with a newer version than the userspace tool supports. This matches your observation that the broken host has Revision: 7 ipsets present. Do you have some other tool on this host that is using the host ipset binary to add newer-revision ipsets before Canal starts?
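
One rough way to check for host tooling that might create ipsets at boot is to look for firewall-related units, though this is only a heuristic since a script or cron job could also invoke ipset:

systemctl list-unit-files | grep -Ei 'ipset|firewalld|ufw'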

@brandond (Member)

cc @rbrtbnfgl @manuelbuil: we do need to bump the ipset version bundled with Canal, though; v6.36 is from March 2018...

@prashanth-nani (Author)

@brandond This is a fresh Ubuntu installation with no other tools installed on it. Is there a way to figure out the source of the ipsets with Revision: 7?

@brandond (Member)

I'm not sure. The ipset names do suggest that they also came from Calico, but I'm not sure why some would have been created with the host ipset version and others with the one bundled in the image. I will defer to the network team tagged above.
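
One way to narrow it down on a reproducing node would be to audit executions of the host ipset binary before Canal starts (a sketch, assuming auditd is installed; it will not catch sets created purely via netlink without the CLI):

# watch executions of the host ipset binary
sudo auditctl -w /usr/sbin/ipset -p x -k ipset-exec
# after reproducing, see which process/user ran it
sudo ausearch -k ipset-exec -i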

@prashanth-nani (Author) commented Apr 27, 2023

@brandond Also, rke2-canal pods use the rancher/hardened-calico image for the calico-node container. As observed here, rancher/hardened-calico uses registry.suse.com/bci/bci-base:15.3.17.20.12 as its base image, and the zypper package manager in bci-base:15.3 installs ipset v6.36. The latest bci-base:15.4.27.14.53 installs ipset v7.15. I'm assuming that updating the base image of rancher/hardened-calico to bci-base:15.4 should resolve this issue.
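
This can be checked directly against the two base images (a sketch, assuming Docker is available and the BCI repositories are reachable):

# compare the ipset version zypper installs in each base image
docker run --rm registry.suse.com/bci/bci-base:15.3.17.20.12 sh -c 'zypper -n in ipset >/dev/null && ipset version'
docker run --rm registry.suse.com/bci/bci-base:15.4.27.14.53 sh -c 'zypper -n in ipset >/dev/null && ipset version'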

@brandond (Member)

Yes, there is an effort in progress to update the base across all of our hardened images.

@brandond (Member) commented May 3, 2023

Reopening for QA. (Reminder not to use fix/fixes/etc. alongside an issue number in PR descriptions, @rbrtbnfgl.)

@rbrtbnfgl (Contributor)

/backport v1.26.5+rke2r1

@rbrtbnfgl (Contributor)

/backport v1.25.10+rke2r1

@rbrtbnfgl (Contributor)

/backport v1.24.14+rke2r1

@bguzman-3pillar (Contributor) commented May 23, 2023

Validated with:

$ rke2 -v
rke2 version v1.27.2-rc2+rke2r1 (c22c7d7196c938c32f7a8f0afe2e1c20791e5e73)
go version go1.20.4 X:boringcrypto

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Cluster Configuration:

1 server
Testing Steps:
1. Copy config.yaml

$ sudo mkdir -p /etc/rancher/rke2 && sudo cp config.yaml /etc/rancher/rke2

2. Install RKE2
3. Check the image version

Validation Results:

  • Canal image used for validation:
$ kubectl describe pod/rke2-canal-vwkls -n kube-system | grep "calico"
    Image:         rancher/hardened-calico:v3.25.1-build20230512
    Image ID:      docker.io/rancher/hardened-calico@sha256:1f53576f9d9cd64887e3bda5714eb5e0b3bf082590549926c000f8faa8260fe8
    Image:         rancher/hardened-calico:v3.25.1-build20230512
    Image ID:      docker.io/rancher/hardened-calico@sha256:1f53576f9d9cd64887e3bda5714eb5e0b3bf082590549926c000f8faa8260fe8
  calico-node:
    Image:         rancher/hardened-calico:v3.25.1-build20230512
    Image ID:      docker.io/rancher/hardened-calico@sha256:1f53576f9d9cd64887e3bda5714eb5e0b3bf082590549926c000f8faa8260fe8
    Liveness:   exec [/usr/bin/calico-node -felix-live] delay=10s timeout=10s period=10s #success=1 #failure=6
      /var/lib/calico from var-lib-calico (rw)
      /var/log/calico/cni from cni-log-dir (ro)
      /var/run/calico from var-run-calico (rw)
  var-run-calico:
    Path:          /var/run/calico
  var-lib-calico:
    Path:          /var/lib/calico
    Path:          /var/log/calico/cni
  Normal  Pulling    6m8s   kubelet            Pulling image "rancher/hardened-calico:v3.25.1-build20230512"
  Normal  Pulled     5m57s  kubelet            Successfully pulled image "rancher/hardened-calico:v3.25.1-build20230512" in 10.109396447s (10.109412828s including waiting)
  Normal  Pulled     5m54s  kubelet            Container image "rancher/hardened-calico:v3.25.1-build20230512" already present on machine
  Normal  Pulled     5m52s  kubelet            Container image "rancher/hardened-calico:v3.25.1-build20230512" already present on machine
  Normal  Created    5m52s  kubelet            Created container calico-node
  Normal  Started    5m52s  kubelet            Started container calico-node
$ kubectl get node,pod -A
NAME                  STATUS   ROLES                       AGE     VERSION
node/ip-172-31-5-96   Ready    control-plane,etcd,master   9m15s   v1.27.2+rke2r1

NAMESPACE     NAME                                                       READY   STATUS      RESTARTS   AGE
kube-system   pod/cloud-controller-manager-ip-172-31-5-96                1/1     Running     0          9m
kube-system   pod/etcd-ip-172-31-5-96                                    1/1     Running     0          8m38s
kube-system   pod/helm-install-rke2-canal-vs85k                          0/1     Completed   0          8m56s
kube-system   pod/helm-install-rke2-coredns-7k78p                        0/1     Completed   0          8m56s
kube-system   pod/helm-install-rke2-ingress-nginx-5dpmd                  0/1     Completed   0          8m56s
kube-system   pod/helm-install-rke2-metrics-server-sxfqv                 0/1     Completed   0          8m56s
kube-system   pod/helm-install-rke2-snapshot-controller-c7w6d            0/1     Completed   0          8m56s
kube-system   pod/helm-install-rke2-snapshot-controller-crd-m4xmj        0/1     Completed   0          8m56s
kube-system   pod/helm-install-rke2-snapshot-validation-webhook-2wzjz    0/1     Completed   0          8m56s
kube-system   pod/kube-apiserver-ip-172-31-5-96                          1/1     Running     0          9m8s
kube-system   pod/kube-controller-manager-ip-172-31-5-96                 1/1     Running     0          9m2s
kube-system   pod/kube-proxy-ip-172-31-5-96                              1/1     Running     0          9m5s
kube-system   pod/kube-scheduler-ip-172-31-5-96                          1/1     Running     0          9m2s
kube-system   pod/rke2-canal-vwkls                                       2/2     Running     0          8m48s
kube-system   pod/rke2-coredns-rke2-coredns-5896cccb79-ckv9s             1/1     Running     0          8m49s
kube-system   pod/rke2-coredns-rke2-coredns-autoscaler-f6766cdc9-ss472   1/1     Running     0          8m49s
kube-system   pod/rke2-ingress-nginx-controller-q6bpd                    1/1     Running     0          7m59s
kube-system   pod/rke2-metrics-server-5688df6c76-2v2bs                   1/1     Running     0          8m9s
kube-system   pod/rke2-snapshot-controller-7d6476d7cb-pjcpx              1/1     Running     0          8m10s
kube-system   pod/rke2-snapshot-validation-webhook-6c5f6cf5d8-vtdpt      1/1     Running     0          8m13s
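
As an additional check, the ipset version comparison from the issue description can be repeated inside the new image (pod name taken from the describe output above); it should now report ipset v7.x, matching the host:

kubectl -n kube-system exec rke2-canal-vwkls -c calico-node -- ipset version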
