DNS timeout in 2.13.0 with flannel/canal on rhel7 #6115
Comments
Hello @HugoRh, I feel bad because we identified this and applied this workaround in our tests (tests/testcases/040_check-network-adv.yml:19). Sorry for not documenting it in the release notes...
Anyhow, I guess we can close this, as it is indeed a flannel bug and has nothing to do with kubespray :)
Hi @floryut, thanks!!
Also, the test might be modified as follows (sorry if the syntax is bad, I'm new to Ansible):
tasks:
  - name: Flannel | Disable tx and rx offloading on VXLAN interfaces (see https://github.com/coreos/flannel/pull/1282)
    shell: "ethtool --offload flannel.1 rx off tx off"
    ignore_errors: true
    when: (kube_network_plugin|default('calico') == 'flannel') or (kube_network_plugin|default('calico') == 'canal')
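For nodes handled outside Ansible, the same condition and command could be sketched as a small shell helper. This is only an illustration: `plugin_needs_offload_fix` and `disable_vxlan_offload` are hypothetical names, and the plugin name is passed as an argument rather than read from the Kubespray inventory.

```shell
#!/bin/sh
# Hypothetical helpers mirroring the task's `when` condition and command.

# Succeeds when the plugin is flannel or canal; defaults to calico like the task.
plugin_needs_offload_fix() {
  case "${1:-calico}" in
    flannel|canal) return 0 ;;
    *) return 1 ;;
  esac
}

# Disable rx/tx offloading on flannel.1 for matching plugins.
# Failures are ignored, similar to `ignore_errors: true`.
disable_vxlan_offload() {
  if plugin_needs_offload_fix "$1"; then
    ethtool --offload flannel.1 rx off tx off || true
  fi
}
```

Run as root on a node, it would be called as `disable_vxlan_offload canal`.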
I would go with something more like
I might have a clue here.
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@floryut: Closing this issue. In response to this:
Environment:
OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
Linux 3.10.0-957.27.2.el7.x86_64 x86_64
NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.6"
PRETTY_NAME="Red Hat Enterprise Linux"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.6:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.6
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.6"
Version of Ansible (ansible --version):
ansible 2.9.6
config file = /data/caas/deploy-k8s/kubespray/ansible.cfg
configured module search path = [u'/data/caas/deploy-k8s/kubespray/library']
ansible python module location = /opt/rh/python27/root/usr/lib/python2.7/site-packages/ansible
executable location = /opt/rh/python27/root/usr/bin/ansible
python version = 2.7.16 (default, Jun 28 2019, 16:57:28) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
Version of Python (python --version):
Python 2.7.16
Kubespray version (commit) (git rev-parse --short HEAD): 01dbc90
Network plugin used:
canal
Anything else we need to know:
I upgraded my test cluster from 2.12.5 to 2.13.0 without issue.
Yet soon after my cluster was available again, many pods were in CrashLoopBackOff status.
I quickly noticed an issue at the DNS level with the nodelocaldns pods:
After a very long analysis (I suspected calico for a long time), I noticed many messages like the ones below in dmesg:
Long story short, this is a bug in flannel: flannel-io/flannel#1279
The workaround mentioned in that issue does work (I applied it on all the nodes as root):
# ethtool -K flannel.1 tx-checksum-ip-generic off
Actual changes:
tx-checksumming: off
        tx-checksum-ip-generic: off
tcp-segmentation-offload: off
        tx-tcp-segmentation: off [requested on]
        tx-tcp-ecn-segmentation: off [requested on]
        tx-tcp6-segmentation: off [requested on]
        tx-tcp-mangleid-segmentation: off [requested on]
udp-fragmentation-offload: off [requested on]
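To confirm on each node that the setting actually took effect, the feature list printed by `ethtool -k` can be filtered. The helper below is a minimal sketch; `offload_state` is a hypothetical name, and it assumes the usual `feature: on|off` line format of `ethtool -k`.

```shell
#!/bin/sh
# Reads `ethtool -k <iface>` output on stdin and prints the state
# ("on" or "off") of tx-checksum-ip-generic. Prints nothing if the
# feature line is absent. offload_state is a hypothetical helper name.
offload_state() {
  sed -n 's/^[[:space:]]*tx-checksum-ip-generic:[[:space:]]*\([a-z]*\).*$/\1/p'
}
```

A typical call would be `ethtool -k flannel.1 | offload_state`, which should print `off` once the workaround is applied.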
The fix was merged only recently, so it will ship in flannel 0.13.
Until then, this workaround is the only solution.
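A script that applies the workaround could skip nodes already running a fixed flannel. The sketch below assumes, as expected above but not verified here, that 0.13.0 is the first release containing the fix. `flannel_needs_workaround` is a hypothetical helper; it expects a full MAJOR.MINOR.PATCH version string and relies on GNU `sort -V`.

```shell
#!/bin/sh
# Succeeds when the given flannel version is older than 0.13.0.
# Assumption: 0.13.0 is the first release with the fix.
# Requires GNU sort for the -V (version sort) option.
flannel_needs_workaround() {
  version="${1#v}"   # accept "v0.12.0" as well as "0.12.0"
  if [ "$version" = "0.13.0" ]; then
    return 1
  fi
  # The smaller of the two versions sorts first.
  oldest=$(printf '%s\n%s\n' "$version" "0.13.0" | sort -V | head -n 1)
  [ "$oldest" = "$version" ]
}
```

For example, `flannel_needs_workaround v0.12.0 && echo "apply ethtool workaround"` would print the message, while the same call with `0.13.0` would not.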