
[Backport release-1.24] rke2-canal pod is not running due to incompatible ipset protocol version #4217

Closed
rancherbot opened this issue May 12, 2023 · 1 comment
This is a backport issue for #4145, automatically created via rancherbot by @rbrtbnfgl

Original issue description:

Environmental Info:
RKE2 Version:
v1.26.4+rke2r1

Node(s) CPU architecture, OS, and Version:
Linux k8s-agent16 5.15.0-70-generic #77-Ubuntu SMP Tue Mar 21 14:02:37 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

2 servers, 16 agents all running Ubuntu 22.04

Describe the bug:

rke2-canal pods on some agents are not starting. The pod logs contain the following.

2023-04-27 12:01:53.719 [WARNING][2437501] felix/ipsets.go 319: Failed to resync with dataplane error=exit status 1 family="inet"
2023-04-27 12:01:53.752 [INFO][2437501] felix/ipsets.go 309: Retrying after an ipsets update failure... family="inet"
2023-04-27 12:01:53.753 [ERROR][2437501] felix/ipsets.go 569: Bad return code from 'ipset list'. error=exit status 1 family="inet" stderr="ipset v6.36: Kernel support protocol versions 6-7 while userspace supports protocol versions 6-6\nKernel and userspace incompatible: settype hash:net with revision 7 not supported by userspace.\n"
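This error means the host's newer ipset userspace (protocol 7) has created hash:net sets at revision 7, which the ipset v6.36 bundled in the calico-node container (protocol 6) cannot parse, so the ipset list call felix uses to resync aborts. A minimal way to locate such sets on an affected host (a sketch; the grep pattern simply matches the list output shown further below):

# On the host: print the Name and Type of every set stored at revision 7,
# which a protocol-6 userspace cannot read back.
ipset list | grep -B2 'Revision: 7'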

There was a similar issue reported at projectcalico/calico#5011, but it is said to occur only when kube-proxy runs in ipvs mode and should not affect clusters whose proxy-mode is iptables. I have confirmed that this cluster's proxy-mode is iptables. Here are the logs from the kube-proxy pod.

I0424 19:10:54.248089       1 server.go:224] "Warning, all flags other than --config, --write-config-to, and --cleanup are deprecated, please begin using a config file ASAP"
I0424 19:10:54.257660       1 node.go:163] Successfully retrieved node IP: 192.168.39.77
I0424 19:10:54.257687       1 server_others.go:109] "Detected node IP" address="192.168.39.77"
I0424 19:10:54.294553       1 server_others.go:176] "Using iptables Proxier"
I0424 19:10:54.294622       1 server_others.go:183] "kube-proxy running in dual-stack mode" ipFamily=IPv4
I0424 19:10:54.294646       1 server_others.go:184] "Creating dualStackProxier for iptables"
I0424 19:10:54.294680       1 server_others.go:465] "Detect-local-mode set to ClusterCIDR, but no IPv6 cluster CIDR defined, , defaulting to no-op detect-local for IPv6"
I0424 19:10:54.294748       1 proxier.go:242] "Setting route_localnet=1 to allow node-ports on localhost; to change this either disable iptables.localhostNodePorts (--iptables-localhost-nodeports) or set nodePortAddresses (--nodeport-addresses) to filter loopback addresses"
I0424 19:10:54.295311       1 server.go:655] "Version info" version="v1.26.4+rke2r1"
I0424 19:10:54.295341       1 server.go:657] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I0424 19:10:54.296172       1 config.go:226] "Starting endpoint slice config controller"
I0424 19:10:54.296195       1 shared_informer.go:270] Waiting for caches to sync for endpoint slice config
I0424 19:10:54.296235       1 config.go:444] "Starting node config controller"
I0424 19:10:54.296255       1 shared_informer.go:270] Waiting for caches to sync for node config
I0424 19:10:54.296254       1 config.go:317] "Starting service config controller"
I0424 19:10:54.296275       1 shared_informer.go:270] Waiting for caches to sync for service config
I0424 19:10:54.397015       1 shared_informer.go:277] Caches are synced for node config
I0424 19:10:54.397063       1 shared_informer.go:277] Caches are synced for endpoint slice config
I0424 19:10:54.397175       1 shared_informer.go:277] Caches are synced for service config
E0425 16:30:24.197986       1 service_health.go:187] "Healthcheck closed" err="accept tcp [::]:32666: use of closed network connection" service="istio-system/istio-ingressgateway"
E0425 16:30:24.198068       1 service_health.go:187] "Healthcheck closed" err="accept tcp [::]:32675: use of closed network connection" service="istio-system/istio-internal-ingressgateway"
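As an extra check, kube-proxy reports its active mode on its metrics port (10249 by default); this sketch assumes the default bind address:

# Should print "iptables" on this cluster
curl -s http://localhost:10249/proxyMode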

Steps To Reproduce:

Installed RKE2 using the following steps

sudo swapoff -a

hostnamectl set-hostname k8s-master01

# add the master node details in every node
vi /etc/hosts
192.168.39.5 k8s-master01


# kubectl install on Debian based distributions
sudo apt update
sudo apt install -y apt-transport-https ca-certificates curl
sudo curl -fsSLo /usr/share/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg
echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update
sudo apt install -y kubectl

#network_bridges
sudo tee -a /etc/sysctl.d/99-kubernetes.conf <<EOF
net.bridge.bridge-nf-call-ip6tables = 1 
net.bridge.bridge-nf-call-iptables = 1
EOF

cat >>/etc/sysctl.d/kubernetes.conf<<EOF
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF

sysctl --system
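As an optional sanity check (a sketch; assumes the br_netfilter module is available on the host), confirm the bridge netfilter keys took effect:

# br_netfilter must be loaded for the net.bridge.* keys to exist
sudo modprobe br_netfilter
sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables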

curl -sfL https://get.rke2.io | INSTALL_RKE2_CHANNEL=latest sh -

# first server node 
systemctl enable rke2-server
systemctl start rke2-server

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml PATH=$PATH:/var/lib/rancher/rke2/bin

cat /var/lib/rancher/rke2/server/node-token

# second server node
mkdir -p /etc/rancher/rke2
vi /etc/rancher/rke2/config.yaml
server: https://192.168.39.2:9345
token: <TOKEN_FROM_THE_ABOVE_CAT_COMMAND>

systemctl enable rke2-server
systemctl start rke2-server

# agent node
mkdir -p /etc/rancher/rke2
vi /etc/rancher/rke2/config.yaml

server: https://192.168.39.2:9345
token: <TOKEN_FROM_THE_ABOVE_CAT_COMMAND>

systemctl enable rke2-agent
systemctl start rke2-agent
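At this point the cluster state can be checked from the first server (a sketch; the k8s-app=canal label selector is an assumption about the rke2-canal daemonset):

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get pods -n kube-system -l k8s-app=canal -o wide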

Expected behavior:
Running kubectl get pod -n kube-system should result in all rke2-canal pods running successfully.

Actual behavior:

Some of the rke2-canal pods are stuck at 1/2 Ready.

Additional context / logs:

On host:

# ipset version
ipset v7.15, protocol version: 7

In the calico-node container of the rke2-canal pod running on the same host:

sh-4.4# ipset version
ipset v6.36, protocol version: 6

Note that this behavior is observed on only one server node and one agent node; all other nodes are working fine. The one thing both affected nodes have in common is that their ipset list output contains sets with Revision: 7.

Output of ipset list from the problematic agent node:

Name: cali40all-ipam-pools
Type: hash:net
Revision: 7
Header: family inet hashsize 1024 maxelem 1048576 bucketsize 12 initval 0x54a33cf9
Size in memory: 504
References: 0
Number of entries: 1
Members:
10.42.0.0/16

Name: cali40masq-ipam-pools
Type: hash:net
Revision: 7
Header: family inet hashsize 1024 maxelem 1048576 bucketsize 12 initval 0xdd3fa3ae
Size in memory: 504
References: 0
Number of entries: 1
Members:
10.42.0.0/16

Name: cali40this-host
Type: hash:ip
Revision: 5
Header: family inet hashsize 1024 maxelem 1048576 bucketsize 12 initval 0x2d605c42
Size in memory: 360
References: 0
Number of entries: 4
Members:
192.168.39.77
10.42.180.64
127.0.0.1
127.0.0.0

Name: cali40all-vxlan-net
Type: hash:net
Revision: 7
Header: family inet hashsize 1024 maxelem 1048576 bucketsize 12 initval 0x17e7d094
Size in memory: 1320
References: 0
Number of entries: 18
Members:
192.168.39.99
192.168.39.91
192.168.39.154
192.168.39.72
192.168.39.3
192.168.39.79
192.168.39.98
192.168.39.74
192.168.39.78
192.168.39.96
192.168.39.92
192.168.39.94
192.168.39.151
192.168.39.93
192.168.39.2
192.168.39.1
192.168.39.97
192.168.39.95

Output of ipset list from the node that is working fine:

Name: cali40all-ipam-pools
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 1048576
Size in memory: 504
References: 1
Number of entries: 1
Members:
172.16.0.0/16

Name: cali40masq-ipam-pools
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 1048576
Size in memory: 504
References: 1
Number of entries: 1
Members:
172.16.0.0/16

Name: cali40this-host
Type: hash:ip
Revision: 4
Header: family inet hashsize 1024 maxelem 1048576
Size in memory: 360
References: 0
Number of entries: 4
Members:
10.42.30.0
127.0.0.1
192.168.39.96
127.0.0.0
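Based on the observation above, one possible workaround sketch (destructive and not verified here; only attempt it on sets showing References: 0, and expect felix to recreate them afterwards at a revision its bundled ipset supports):

# On the affected host, destroy the stale revision-7 cali* sets so the
# container's older ipset userspace can recreate them from scratch.
for s in $(sudo ipset -n list | grep '^cali'); do sudo ipset destroy "$s"; done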
@aganesh-suse

OS: Ubuntu 22.04

rke2 -v

rke2 version v1.24.14-rc1+rke2r1 (1c1a6f6f0d87e5935ef3def0828777d1d30c5210)
go version go1.19.9 X:boringcrypto

Setup on 1 main server:

sudo mkdir -p /etc/rancher/rke2
sudo bash -c 'cat <<EOF> /etc/rancher/rke2/config.yaml
cni: canal
write-kubeconfig-mode: "0644"
token: secret
EOF'

Steps followed:

  1. Installed rke2
  2. Verified the calico image version:
kubectl describe pod rke2-canal-tjvdm -n kube-system | grep calico
    Image:         rancher/hardened-calico:v3.25.1-build20230512
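A quick way to confirm the fix from inside the updated container (a sketch; it reuses the pod name from the step above) would be:

# Should now report a protocol-7-capable ipset userspace
kubectl exec -n kube-system rke2-canal-tjvdm -c calico-node -- ipset version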
