UDP: bad checksum on VXLAN interface #1279

dmitry-irtegov · 2020-04-07T11:06:43Z

On a k8s 1.17 cluster with RHEL 7 nodes, service IPs for pods on other nodes are not accessible.
Pod IPs seem to work fine. Most noticeably, CoreDNS does not work.
The target node's dmesg is filled with messages like:

[ 1423.035722] UDP: bad checksum. From 172.16.9.99:27503 to 172.16.80.252:8472 ulen 75
[ 1426.036873] UDP: bad checksum. From 172.16.9.99:4894 to 172.16.80.252:8472 ulen 75
[ 1427.537392] UDP: bad checksum. From 172.16.9.99:32693 to 172.16.80.252:8472 ulen 75
[ 1429.037910] UDP: bad checksum. From 172.16.9.99:46133 to 172.16.80.252:8472 ulen 75
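
To confirm that the drops are all aimed at the VXLAN port, the kernel messages can be grouped by source and destination. The following is a minimal sketch, not part of the original report; the function name `summarize_bad_checksums` is made up for illustration, and it assumes the message layout shown above.

```shell
#!/bin/sh
# Summarize "UDP: bad checksum" kernel messages read from stdin.
# Prints one line per "source-IP -> destination-IP:port" pair, prefixed by its count.
# summarize_bad_checksums is a hypothetical helper name, not an existing tool.
summarize_bad_checksums() {
    awk '
        /UDP: bad checksum/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "From") src = $(i + 1)
                if ($i == "to")   dst = $(i + 1)
            }
            sub(/:[0-9]+$/, "", src)   # drop the ephemeral source port
            count[src " -> " dst]++
        }
        END {
            for (k in count) print count[k], k
        }
    '
}

# Typical use on a node (needs permission to read the kernel log):
#   dmesg | summarize_bad_checksums
```

If every line ends in `:8472`, the failures are limited to flannel's encapsulated traffic rather than ordinary UDP.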

Turning off TX checksum offloading on the flannel.1 interface fixes the issue:

[root@ip-172-16-102-241 ~]# nslookup www.google.com 100.64.0.10
;; connection timed out; no servers could be reached
[root@ip-172-16-102-241 ~]# ethtool -K flannel.1 tx-checksum-ip-generic off
Actual changes:
tx-checksumming: off
	tx-checksum-ip-generic: off
tcp-segmentation-offload: off
	tx-tcp-segmentation: off [requested on]
	tx-tcp-ecn-segmentation: off [requested on]
	tx-tcp6-segmentation: off [requested on]
	tx-tcp-mangleid-segmentation: off [requested on]
udp-fragmentation-offload: off [requested on]
[root@ip-172-16-102-241 ~]# nslookup www.google.com 100.64.0.10
Server:		100.64.0.10
Address:	100.64.0.10#53
Non-authoritative answer:
Name:	www.google.com
Address: 172.217.13.228
Name:	www.google.com
Address: 2607:f8b0:4004:80a::2004
[root@ip-172-16-102-241 ~]#
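
The same workaround can be scripted so that it only runs when the offload is still enabled. This is a rough sketch under stated assumptions: the helper names `tx_csum_generic_state` and `disable_tx_csum_if_needed` are invented here, and the parser assumes the usual `feature: state` layout printed by `ethtool -k` (lowercase `-k` lists features, uppercase `-K` changes them).

```shell
#!/bin/sh
# Read `ethtool -k <iface>` output on stdin and print the state
# ("on" or "off") of the tx-checksum-ip-generic feature.
# Both function names below are hypothetical helpers.
tx_csum_generic_state() {
    awk '$1 == "tx-checksum-ip-generic:" { print $2; exit }'
}

# Apply the workaround only when the offload is still enabled.
# The interface defaults to flannel.1, the one named in this issue.
disable_tx_csum_if_needed() {
    iface="${1:-flannel.1}"
    state=$(ethtool -k "$iface" | tx_csum_generic_state)
    if [ "$state" = "on" ]; then
        ethtool -K "$iface" tx-checksum-ip-generic off
    fi
}

# Typical use, as root on each node:
#   disable_tx_csum_if_needed flannel.1
```

Settings applied with `ethtool -K` generally do not survive a reboot or the interface being recreated, so a script like this would likely need to run again after either event.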

Other people also hit this: https://t.du9l.com/2020/03/kubernetes-flannel-udp-packets-dropped-for-wrong-checksum-workaround/

This happens both with cni-canal and pure cni-flannel, so we decided to report the issue here.

Expected Behavior

I should not have to adjust interface settings to get flannel to work.

Your Environment

Flannel version: 0.11.0
Backend used (e.g. vxlan or udp): vxlan
Etcd version: 3.4.3
Kubernetes version (if used): 1.17.4
Operating System and version: RHEL 7.8
Link to your project (optional): https://www.kublr.com

CMajeri · 2020-05-14T03:32:22Z

We hit this too. It was an absolute pain to figure out.

It seems to affect only service IPs (I'm guessing because of masquerading?), and specifically UDP: nslookup doesn't work, but nslookup in TCP mode does.

If anyone knows where this bug comes from (besides checksum offloading) I'd be very interested.
I'll keep looking for a bit, but I'm not great with networking.

Our environment:

flannel: 0.9.0 (vxlan mode)
kubernetes: 1.16.9 (kube-proxy in iptables mode)
OS: CentOS 8
etcd: 3.4.7

The weird thing is that we're running almost the same versions of things (etcd is different, but I really doubt that is the cause) on a Fedora 30 server and things work fine. The settings are the same, and while the routing tables differ, the basic idea is the same.
Could it be something kernel- or virtio-related?

holooloo · 2020-05-26T09:53:45Z

Check the iptables versions on CentOS 7 and 8.
The version must be higher than 1.6.2.

ksancheti · 2020-05-26T16:11:46Z

We are facing the same issue as mentioned by @dmitry-irtegov and @CMajeri.

Environment:

Flannel version: 0.11.0
Backend used (e.g. vxlan or udp): vxlan
Etcd version: 3.4.3
Kubernetes version: 1.18.2 (installed using kubeadm)
Operating System and version: Ubuntu 18.04.4
iptables version: 1.6.1

This workaround worked for us:

ethtool -K flannel.1 tx-checksum-ip-generic off

brucedlg · 2020-06-22T19:01:49Z

It's definitely related to this one: kubernetes/kubernetes#88986. The fix, kubernetes/kubernetes#92035, has a good description of the issue. A change to the iptables rules exposed an existing kernel bug, especially in RHEL 7.

Here is another workaround for the issue that does not require turning off checksum offload:

sudo iptables -A OUTPUT -p udp -m udp --dport 8472 -j MARK --set-xmark 0x0

UDP port 8472 is flannel's default port for encapsulated packets. The rule clears the mark so that SNAT is not applied to the encapsulating packet, which avoids double SNAT.
This assumes that you use iptables. ipvs should have similar commands.
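
Because `-A` appends a new copy of the rule each time it runs, a scripted version may want to check for the rule first. The sketch below assumes an iptables build that supports `-C` (check); the function name `ensure_vxlan_unmark_rule` is made up for illustration, and the rule itself is the one given above.

```shell
#!/bin/sh
# Add the unmark rule for VXLAN traffic only if it is not already present.
# The port defaults to 8472, the flannel port mentioned above.
# ensure_vxlan_unmark_rule is a hypothetical helper name.
ensure_vxlan_unmark_rule() {
    port="${1:-8472}"
    # `iptables -C` exits 0 when an identical rule already exists.
    if ! iptables -C OUTPUT -p udp -m udp --dport "$port" -j MARK --set-xmark 0x0 2>/dev/null; then
        iptables -A OUTPUT -p udp -m udp --dport "$port" -j MARK --set-xmark 0x0
    fi
}

# Typical use, as root on each node:
#   ensure_vxlan_unmark_rule 8472
```

Rules added this way are not saved across reboots unless they are persisted separately.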

stale · 2023-01-26T01:22:31Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

simondeting · 2024-04-25T15:18:22Z

Check the iptables versions on CentOS 7 and 8; they must be higher than 1.6.2

Can you tell me why, or give a link, please?

Capitrium mentioned this issue Apr 13, 2020

Disable tx and rx offloading on VXLAN interfaces #1282

Closed

hakman mentioned this issue May 6, 2020

Disable TX checksum offload for Flannel VXLAN kubernetes/kops#9074

Merged

HugoRh mentioned this issue May 9, 2020

DNS timeout in 2.13.0 with flannel/canal on rhel7 kubernetes-sigs/kubespray#6115

Closed

champtar mentioned this issue May 26, 2020

Intra-cluster access is slow metallb/metallb#528

Closed

txtdevelop mentioned this issue Feb 26, 2021

Upgrade to 20.10 breaks swarm network moby/moby#41775

Open

oilbeater added a commit to kubeovn/kube-ovn that referenced this issue Aug 17, 2021

fix: bad udp checksum when access nodeport

dcda11d

Similar to flannel-io/flannel#1279, unmark output to bypass kernel bug and enable checksum for better performance.

oilbeater mentioned this issue Aug 17, 2021

fix: bad udp checksum when access nodeport kubeovn/kube-ovn#975

Merged

oilbeater added a commit to kubeovn/kube-ovn that referenced this issue Aug 18, 2021

fix: bad udp checksum when access nodeport

2560987

Similar to flannel-io/flannel#1279, unmark output to bypass kernel bug and enable checksum for better performance. (cherry picked from commit dcda11d)

brandond mentioned this issue Aug 26, 2021

the overlay network between nodes seems to be broken in Debian 11 (bullseye) k3s-io/k3s#3863

Closed

kwinkel mentioned this issue Nov 18, 2021

Monitoring (100.0.0+up16.6.0): cattle-monitoring-system/rancher-monitoring-kube-proxy failed to resolve DNS rancher/rancher#35327

Closed

maxpain mentioned this issue Jun 16, 2022

VXLAN: bad UDP checksums kubernetes-sigs/kubespray#8992

Closed

This was referenced Dec 20, 2022

UDP access to a service from another node is broken with hostNetworking k3s-io/k3s#6664

Closed

[BUG] CIS scan on k3s clusters running for too long before it gets completed. rancher/rancher#39839

Closed

jan-lucansky mentioned this issue Jan 11, 2023

Pod can't access clusterip service for another pod with endpoint on the same node #1702

Closed

stale bot added the wontfix label Jan 26, 2023

brandond mentioned this issue Feb 7, 2023

Flannel CNI / DNS Issues with in-cluster Service Names when "hostNetwork: true" k3s-io/k3s#6880

Closed

stale bot closed this as completed Feb 16, 2023

ejweber mentioned this issue Aug 21, 2023

[BUG] Failed Statefulset Pod Creation with RWX Workload on Longhorn v1.3.3 and SLES 15 SP5 longhorn/longhorn#6494

Closed

AkihiroSuda mentioned this issue Apr 4, 2024

troubleshooting.md: add ethtool -K flannel.1 tx-checksum-ip-generic off for NAT #1929

Merged

brandond mentioned this issue Jul 11, 2024

Networking issue between nodes rancher/rke2#6307

Closed

zcq98 mentioned this issue Oct 18, 2024

fix: udp bad checksum on VXLAN interface kubeovn/kube-ovn#4639

Merged
