Document Vsphere caveats #3143

Closed
andrewrynhard opened this issue Feb 12, 2021 · 9 comments

@andrewrynhard
Member

@andrewrynhard
Member Author

@alex1989hu @mologie Am I missing anything?

@mologie

mologie commented Feb 12, 2021

Oh, your mail reminds me I wanted to file this one. Sorry about that, and thanks for the heads-up!

  • Not sure if vSphere 7.X is really a requirement, I recall @alex1989hu mentioned he'd previously run Talos under 6.X
  • vSphere 7.X contains the vmxnet bug
  • The vmxnet bug affects only VXLAN-based CNIs, including the default Flannel. Calico in IPIP mode works fine.
  • The workaround is to use E1000 instead of vmxnet3.
  • open-vm-tools / talos-vmtoolsd is required only for vSphere CPI+CSI support and clean shutdown; Talos itself works out of the box

@andrewrynhard
Member Author

Thanks @mologie !

@alex1989hu
Contributor

Many conditions must be met, not just VMXNET3 and vSphere 7.x. I will try to summarize those conditions later.

@rgl
Contributor

rgl commented Aug 26, 2021

Having the status/caveats at, e.g., https://www.talos.dev/docs/v0.12/virtualized-platforms/vmware/ would be great.

Also showing how to integrate with https://github.com/kubernetes-sigs/vsphere-csi-driver would be pretty awesome.

@luqelinux

luqelinux commented Nov 5, 2021

The lack of network connectivity from pods affected my setup on vSphere 7.0u2 with ESXi 7.0u1 hosts.
Everything worked fine until talos-vmtoolsd (https://github.com/mologie/talos-vmtoolsd) release v0.2 was deployed on a Talos v0.13.0 cluster. Switching the network adapters to E1000 fixed it.

@CompPhy

CompPhy commented Mar 14, 2024

As someone who ran into this problem recently, I have to admit I agree with the sentiment here. I didn't even see this thread until I had found my own workaround, because these issues aren't clear in the documentation. For those who are curious, you can make VXLAN work in vSphere without having to move away from VMXNET3 interfaces. However, the workaround below might lose VXLAN offloading support; I'm not sure how to verify that.

My experience, on VMware ESXi 7.0.3 (build 20328353), was that VXLAN packets going between hosts were just not routed at all. Any communication within a single node was fine, but anything attempting to cross nodes would just time out. All of my ESXi host network layers and Kubernetes hosts are in the same subnet and VLAN, so I could immediately rule out those types of issues. That left me a little stumped: I could ping between hosts, but any TCP traffic would just drop.

Once I realized that ESXi was trying to manage VXLAN traffic offloading, I took a shot in the dark that worked out as a good solution. I just changed the flannel configuration to move VXLAN traffic onto a different port. All my problems with VXLAN routing disappeared and things seem to be working fine now.

WORKAROUND:

kubectl edit configmap/kube-flannel-cfg -n kube-system
    # Change data -> net-conf.json -> Backend -> Port to a non-standard port
    # e.g. "Port": 4799   (Default is 4789)
kubectl rollout restart daemonset/kube-flannel -n kube-system
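
For illustration, the edit described above amounts to changing one key in the JSON stored under net-conf.json. The following is a minimal Python sketch of that change, not something from the Talos or flannel tooling. The function name `set_vxlan_port` and the sample document are assumptions made for the example; check the actual ConfigMap contents in your own cluster before relying on them.

```python
import json


def set_vxlan_port(net_conf_text: str, port: int) -> str:
    """Return the net-conf.json text with Backend.Port set to `port`.

    Hypothetical helper for illustration only; it edits the JSON text
    and does not talk to the cluster.
    """
    conf = json.loads(net_conf_text)
    # Create the Backend section if it is missing, then override the port.
    backend = conf.setdefault("Backend", {})
    backend["Port"] = port
    return json.dumps(conf, indent=2)


# Assumed sample of the net-conf.json value (the network CIDR is a placeholder).
sample = '{"Network": "10.244.0.0/16", "Backend": {"Type": "vxlan", "Port": 4789}}'

patched = set_vxlan_port(sample, 4799)
print(patched)
```

The output is what would be pasted back into the ConfigMap before restarting the DaemonSet; all other keys are left untouched.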

The only caveat here is that running talosctl upgrade-k8s will revert this configuration. I have yet to find a way to customize the bootstrap manifest for flannel in this regard.

LONG TERM SOLUTION:
As a proposed solution here, maybe Talos devs can add cluster config options for customizing net-conf.json? Another good use case here might be better support for flannel backend options. For example, flannel also supports things like host-gw and wireguard, instead of VXLAN. https://github.com/flannel-io/flannel/blob/master/Documentation/backends.md
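
To make the proposal a bit more concrete, here is a rough Python sketch of the net-conf.json documents such options could render. It is only an illustration: `flannel_net_conf` is a hypothetical helper, the network CIDR is a placeholder, and nothing like this exists in the Talos config today. The backend type names come from the flannel backends documentation linked above.

```python
import json


def flannel_net_conf(network: str, backend_type: str, **backend_opts) -> str:
    """Render a flannel net-conf.json document (hypothetical helper).

    `backend_opts` are passed through as extra keys of the Backend
    section, e.g. Port=4799 for the vxlan backend.
    """
    backend = {"Type": backend_type, **backend_opts}
    return json.dumps({"Network": network, "Backend": backend}, indent=2)


# host-gw needs direct layer 2 connectivity between nodes and avoids VXLAN.
host_gw = flannel_net_conf("10.244.0.0/16", "host-gw")

# vxlan on a non-standard port, matching the workaround above.
vxlan_alt = flannel_net_conf("10.244.0.0/16", "vxlan", Port=4799)

print(host_gw)
print(vxlan_alt)
```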

I do realize that one option is to disable Talos management of flannel, and implement your own custom CNI. However, the Talos implementation is already fairly well configured, and just exposing a few additional options could provide some needed flexibility.


This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Sep 11, 2024

This issue was closed because it has been stalled for 7 days with no activity.

@github-actions github-actions bot closed this as not planned Sep 16, 2024