Don't rely on setting ethtool tx off on guest interfaces #1255
Bryan's statement is correct, although it is easy to read it as though it is describing a bug in the Linux TCP stack. In fact, it is working as designed (and you can replace Bryan's "sometimes" with "always"): the kernel will delegate TCP segmentation and checksumming to the network interface if possible. It doesn't do something different just because you are sniffing the traffic. So if you capture outgoing traffic with a raw socket (e.g. pcap), you see what the kernel sent to the network interface, not what would actually appear on the wire. If you run an iperf TCP bandwidth test to a VM over a virtual bridge, and look at the traffic in wireshark, two things can be observed:
The same effects occur in the context of weave. Weave captures packets via pcap with incorrect checksums, and relays them to the other end with incorrect checksums. But the kernel does not verify the checksums of the injected packets (because they count as locally produced?), so that is not a problem.

The effective segment size, on the other hand, is a problem. As soon as the kernel produces an over-large TCP packet (with DF set, as PMTU discovery is routine for TCP), weave drops it and sends back an ICMP "fragmentation needed". The kernel sees this and drops the effective segment size on the TCP connection down to the one you might expect. The data is resent and gets through. But on the next TCP packet it tries to grow the segment size once again, ... and so on. The data gets through, but it is slow.

I expect there are various ways to influence this kernel behaviour, but if the point is to make weave work well for a virtual bridge in its default state, we need to fix it within the weave router. And finding a way to do that that is simple and clean seems challenging. A simple hack might be to ignore DF on TCP packets, so that the over-large TCP packets simply pass through (that won't work if Linux checks that injected packets conform to the nominal MTU, but I find that unlikely).

Fast datapath is not affected by this issue when the VXLAN encapsulation is handled by the kernel. I need to check whether we receive over-large packets on ODP misses (if so, the issue would re-appear for a connection using the sleeve fallback).
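The drop-and-signal behaviour described above (an over-large DF packet answered with an ICMP "fragmentation needed" carrying the next-hop MTU) can be sketched in Python. This is an illustrative reconstruction following RFC 1191, not weave's actual code; all function names here are mine.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones'-complement checksum over 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return (~total) & 0xFFFF

def frag_needed(next_hop_mtu: int, original_datagram: bytes) -> bytes:
    """Build an ICMP type 3, code 4 ("fragmentation needed and DF set")
    message. Per RFC 1191 the payload echoes the original IP header plus
    the first 8 bytes of its data, and the header carries the next-hop MTU.
    Assumes a 20-byte IP header (no options) for simplicity.
    """
    payload = original_datagram[:28]  # IP header (20 bytes) + 8 data bytes
    header = struct.pack("!BBHHH", 3, 4, 0, 0, next_hop_mtu)
    csum = internet_checksum(header + payload)
    header = struct.pack("!BBHHH", 3, 4, csum, 0, next_hop_mtu)
    return header + payload
```

A receiver verifying the message should see the checksum over the whole ICMP message fold to zero.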
To get back to the actual question:
As discussed, it's not a bug. Bridge networking works as intended, and there is no issue for Docker users to notice. More generally, you might wonder why flannel's udp backend is not affected. It is because: a) flannel does not bother with MTUs, DF, and generating frag-needed packets; and possibly also b) flannel's udp backend uses a tun device, and maybe this issue does not occur for tun devices (if the kernel does the GSO segmentation step before delivering to the tun recipient).
Thank you for the analysis @dpw. On the basis that this is due to how weave operates, and that it prevents weave from working with other tooling, I think this ought to be addressed.
The only interface is a veth, and, from my reading of the kernel code, veths don't implement segmentation and checksumming. From offline discussion, a better interpretation is that, given the expected use of veths, they are self-consistent: they don't care how big the packets are and they don't check checksums for in-memory copies.
Another way of looking at it is that the kernel defers segmentation/checksumming as late as possible before an outgoing packet hits the wire. If the outgoing device hardware supports it, then it is left to the hardware. If the hardware doesn't support it, then the kernel does it just before handing it to the hardware (GSO). But for virtual devices like veths and bridges, it can be deferred entirely: if the packet reaches a physical device, it gets handled then; if it doesn't, why bother?
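The "defer as late as possible" behaviour can be modelled with a toy sketch. The device representation and the MSS value here are illustrative assumptions of mine, not any kernel API:

```python
MSS = 1448  # 1500-byte MTU minus 40 bytes of IPv4+TCP headers, no options

def segments_on_wire(payload: bytes, device: dict, mss: int = MSS) -> list[bytes]:
    """Toy model of where segmentation happens for one outgoing device.

    - virtual devices (veth, bridge): segmentation is deferred entirely,
      so the super-packet passes through intact;
    - physical devices: the payload gets cut into MSS-sized segments,
      either by the NIC (TSO) or by the kernel in software (GSO) --
      both are modelled the same way here, as the on-wire result is the same.
    """
    if device.get("virtual"):
        return [payload]  # deferred: one big packet, as pcap would capture it
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]
```

This matches what the thread describes: a sniffer on a veth sees one over-large packet, while the same payload leaving a physical NIC appears as several wire-sized segments.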
IIRC, if an injected packet has a non-local MAC then the checksum is inspected and the packet is dropped if it's wrong. Disabling checksum offloading is not about changing things on the capturing side, it's about allowing injection to work by ensuring the checksums are valid.
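The injection-side check described here (inspect the TCP checksum and drop the packet if it's wrong) amounts to standard receiver-side verification: the checksum over the IPv4 pseudo-header plus the whole segment must fold to zero. A minimal sketch, with all names my own:

```python
import struct

def rfc1071(data: bytes) -> int:
    """RFC 1071 ones'-complement checksum over 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    s = 0
    for i in range(0, len(data), 2):
        s += (data[i] << 8) | data[i + 1]
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return (~s) & 0xFFFF

def tcp_checksum_ok(src_ip: bytes, dst_ip: bytes, segment: bytes) -> bool:
    """Verify a TCP checksum the way a receiver would: the checksum over
    the IPv4 pseudo-header (src, dst, zero, protocol 6, TCP length)
    plus the whole segment must come out as zero."""
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 6, len(segment))
    return rfc1071(pseudo + segment) == 0
```

With checksum offload left on, a pcap-captured segment would fail this check, since the kernel handed it over with the checksum still unfilled.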
Are you suggesting this happens in the kernel?
If this explanation were correct, then surely it would not work at all without …

Disabling checksum offload necessarily disables segmentation offload (even if you use the more fine-grained options to ethtool, disabling the former disables the latter). But I believe it is segmentation offload that is the cause of the low throughput without ethtool tx off.
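The dependency claimed here (clearing tx checksum offload also clears TSO, because segmentation requires the device to fill in checksums) can be captured in a toy model. The dependency table is a deliberate simplification of ethtool's real feature graph, and the function name is mine:

```python
# Simplified assumption for illustration: TSO depends on tx checksum
# offload, so turning "tx" off must also force "tso" off.
DEPENDS_ON = {"tso": {"tx"}}

def ethtool_K(features: dict, name: str, on: bool) -> dict:
    """Toy model of `ethtool -K <dev> <feature> off`: when a feature is
    disabled, every feature that depends on it is disabled too."""
    features = dict(features)
    features[name] = on
    if not on:
        for feature, deps in DEPENDS_ON.items():
            if name in deps:
                features[feature] = False
    return features
```

In this model, disabling tso alone leaves tx checksumming on, but disabling tx drags tso down with it, which is why the two effects are hard to separate experimentally.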
Well, doubtless you know more about this than me. Yes, I believe it happens in the kernel, but I never had time to go chasing through the source. Ahh, fair enough - I never had time to properly dig into the segmentation options. I think you can just turn off the seg offload though, can't you? If that is the case then you could test with seg offload disabled, but with the "wrong" checksums, and see what happens.
I am quite sure I spent a fair amount of time narrowing the options to the minimum, which suggests that disabling checksum offloading is necessary.
For my edification - what is the mechanism that implements this doubling? Is it a way for the kernel to dynamically discover the maximum offloadable write that a TSO supporting NIC will accept? |
I'm not sure, but I suspect it is just the TCP congestion window growing during the "slow start" phase.
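Slow start, as suspected here, would indeed produce a doubling: each ACKed segment grows the congestion window by one MSS, which doubles the window once per round trip. A toy model, with illustrative numbers (an initial window of 10 segments and a 1448-byte MSS are my assumptions, not anything from the thread):

```python
def cwnd_per_rtt(rtts: int, init_segments: int = 10, mss: int = 1448) -> list[int]:
    """Congestion window (in bytes) at the start of each round trip
    during slow start: one extra MSS per ACK means the window doubles
    every RTT until ssthresh or loss intervenes (not modelled here)."""
    cwnd = init_segments * mss
    windows = []
    for _ in range(rtts):
        windows.append(cwnd)
        cwnd *= 2  # one MSS per ACK across a full window == doubling
    return windows
```

This exponential growth would explain why the kernel keeps retrying ever larger super-packets after each ICMP frag-needed forces it back down.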
I just tried replacing … So I continue to believe that it is segmentation offload that is the culprit, and the checksum issue is incidental.
If you leave checksum offloading enabled, then tcpdump shows that all injected packets have incorrect TCP checksums. (Well, maybe about 1 in 65536 have a correct checksum; even a stopped clock tells the right time twice a day.) |
Ahh cool; not sure where I got the idea that it was about packets being dropped on the injection side, then.
In most cases we are now no longer using pcap, so this issue is less important. I just tested:
Just to note that, post #2307, we are doing the …

Doing "get Docker to use the Weave bridge" in …
In https://www.weave.works/blog/bridge-over-troubled-weavers/, Bryan says: …
In configurations where weave does not create veth pairs (e.g., when working with CNI https://www.weave.works/blog/weave-and-rkt/), there is no opportunity to run ethtool to do this configuration. It is hard to see this as a limitation of the rest of the world, rather than a limitation of "weave as designed".
So the question is, is this a problem for all users of bridge networking? (For instance, Docker -- a search of issues suggests it's not, or not yet recognised anyway)
If it's just a problem with weave, how can it be fixed?