
Antrea-1.4 connectivity between Windows pods and Linux pods fails with different CNIs #3081

Closed
shettyg opened this issue Dec 2, 2021 · 17 comments
Assignees
Labels
area/OS/windows Issues or PRs related to the Windows operating system. area/transit/routing Issues or PRs related to routing. kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@shettyg
Contributor

shettyg commented Dec 2, 2021

Describe the bug

We have a setup where Windows Nodes use Antrea and Linux Nodes use a different CNI plugin. In this setup, with Antrea 1.4, connectivity between Linux Pods and Windows Pods does not work.

There is asymmetry in the packet path. A ping from a Linux Pod reaches the Windows Pod, but the response from the Windows Pod ends up being sent to antrea-gw0 and then gets SNATed to the host IP.

This behavior appears to happen because table 70 of the OpenFlow pipeline sets the destination MAC address of the packet to the antrea-gw0 MAC for the PodCIDR of a Linux Node, but for the PodCIDR of another Windows Node it sets it to the MAC address of that Node's physical interface.

e.g:

# antrea-gw0 MAC when the dst PodCIDR is on a Linux Node
 cookie=0x2020000000000, duration=750366.615s, table=L3Forwarding, n_packets=2736210, n_bytes=222875530, priority=200,ip,nw_dst=192.168.0.0/24 actions=mod_dl_dst:00:15:5d:5b:84:03,resubmit(,L2Forwarding)

# remote Windows Node MAC
 cookie=0x2020000000000, duration=750366.615s, table=L3Forwarding, n_packets=205, n_bytes=15422, priority=200,ip,nw_dst=192.168.3.0/24 actions=mod_dl_dst:00:50:56:82:64:50,resubmit(,L2Forwarding)

Relevant ofproto/trace for the failure case:

PS C:\antrea> ovs-dpctl show
2021-12-02T18:03:39Z|00001|dpif_netlink|INFO|The kernel module does not support meters.
system@ovs-system:
  lookups: hit:115763301 missed:48521682 lost:0
  flows: 148
  port 1: Ethernet0 2
  port 2: br-int (internal)
  port 3: antrea-gw0 (internal)
  port 4: gke-metr-3bc3dd (internal)
  port 5: fluent-b-0a9e18 (internal)
  port 6: anetd-wi-e20fb4 (internal)
  port 7: windows--773e64 (internal)
  port 8: windows--6a9a0f (internal)


PS C:\antrea> ovs-appctl ofproto/trace 'recirc_id(0x2d717),ct_state(est|rpl|trk),ct_tuple4(src=192.168.1.30,dst=192.168.4.8,proto=1,tp_src=8,tp_dst=0),eth(src=00:15:5d:fa:70:11,dst=00:15:5d:5b:84:03),in_port(8),eth_type(0x0800),ipv4(src=192.168.4.8,dst=192.168.1.30,proto=1,tos=0,ttl=128,frag=no),icmp(type=0,code=0)'
Flow: recirc_id=0x2d717,ct_state=est|rpl|trk,ct_nw_src=192.168.1.30,ct_nw_dst=192.168.4.8,ct_nw_proto=1,ct_tp_src=8,ct_tp_dst=0,eth,icmp,in_port=9,vlan_tci=0x0000,dl_src=00:15:5d:fa:70:11,dl_dst=00:15:5d:5b:84:03,nw_src=192.168.4.8,nw_dst=192.168.1.30,nw_tos=0,nw_ecn=0,nw_ttl=128,icmp_type=0,icmp_code=0

bridge("br-int")
----------------
    thaw
        Resuming from table 31
31. ct_state=-new+trk,ip, priority 190, cookie 0x2040000000000
    goto_table:50
50. ct_state=-new+est,ip, priority 210, cookie 0x2000000000000
    goto_table:61
61. priority 0, cookie 0x2000000000000
    goto_table:70
70. ip,nw_dst=192.168.1.0/24, priority 200, cookie 0x2020000000000
    set_field:00:15:5d:5b:84:03->eth_dst
    goto_table:80
80. dl_dst=00:15:5d:5b:84:03, priority 200, cookie 0x2000000000000
    load:0x2->NXM_NX_REG1[]
    load:0x1->NXM_NX_REG0[16]
    goto_table:101
101. priority 0, cookie 0x2000000000000
    goto_table:105
105. priority 0, cookie 0x2000000000000
    goto_table:108
108. priority 0, cookie 0x2000000000000
    goto_table:110
110. ip,reg0=0x10000/0x10000, priority 200, cookie 0x2000000000000
    output:NXM_NX_REG1[]
     -> output port is 2

Final flow: recirc_id=0x2d717,ct_state=est|rpl|trk,ct_nw_src=192.168.1.30,ct_nw_dst=192.168.4.8,ct_nw_proto=1,ct_tp_src=8,ct_tp_dst=0,eth,icmp,reg0=0x10002,reg1=0x2,in_port=9,vlan_tci=0x0000,dl_src=00:15:5d:fa:70:11,dl_dst=00:15:5d:5b:84:03,nw_src=192.168.4.8,nw_dst=192.168.1.30,nw_tos=0,nw_ecn=0,nw_ttl=128,icmp_type=0,icmp_code=0
Megaflow: recirc_id=0x2d717,ct_state=-new+est-rel+rpl+trk,ct_mark=0,eth,ip,in_port=9,dl_dst=00:15:5d:5b:84:03,nw_dst=192.168.1.0/24,nw_frag=no
Datapath actions: 3

To Reproduce

Expected
The connection should work. It worked correctly with Antrea 1.2.

Actual behavior
Description above

Versions:

Antrea 1.4
kubernetes v1.21.5
containerd

@shettyg shettyg added the kind/bug Categorizes issue or PR as related to a bug. label Dec 2, 2021
@antoninbas antoninbas added area/OS/windows Issues or PRs related to the Windows operating system. area/transit/routing Issues or PRs related to routing. labels Dec 2, 2021
@antoninbas
Contributor

I am assigning this to @tnqn because this is related to #2161 which he worked on. #2161 relies on Linux Nodes annotating their Node resource with "node.antrea.io/mac-address", which obviously is not the case if the Antrea Agent is not running on the Linux Nodes.

@shettyg if there is any possibility to annotate the Nodes with "node.antrea.io/mac-address" and the correct MAC address value, it should resolve the issue.

I understand why the path is asymmetric (this is by design when the annotation is missing). I do not understand however why the Windows host would do SNAT on the reply traffic (Pod-to-Pod traffic in encapMode...).

@shettyg
Contributor Author

shettyg commented Dec 2, 2021

Thank you for the immediate response. We will consider adding "node.antrea.io/mac-address" if there is no easy fix to avoid it. (In this case, we are using noEncap mode.)

@tnqn
Member

tnqn commented Dec 3, 2021

@antoninbas is correct. To get better dataplane performance, Antrea bypasses the Windows host network when possible, but that requires overwriting the destination MAC via OpenFlow rules. If it doesn't know the destination MAC, it falls back to the host network to forward the traffic.

I guess the Windows host didn't even know this was reply traffic, as it was the first packet it saw for this connection. Even if it could check TCP flags, that wouldn't work for UDP.

Apart from the workaround, a possible solution is to not bypass the host network for incoming traffic when we don't know the MAC of the Node it comes from, to make the path symmetric. But we need to check whether Windows will still do SNAT when it can see the request packet, and how much this will affect dataplane performance.

@antoninbas
Contributor

@tnqn why is there SNAT in this case given that it is Pod-to-Pod traffic (even if the host doesn't know this is reply traffic)?

@tnqn
Member

tnqn commented Dec 6, 2021

@antoninbas I forgot this is Pod-to-Pod. You are right, it shouldn't do SNAT at all regardless of the direction. Then we should look at the NAT configuration of the Windows host. On Linux, we use -m set ! --match-set ANTREA-POD-IP dst -j MASQUERADE to restrict SNAT to Pod-to-external traffic. There should be a similar filter on Windows. @wenyingd could you confirm whether this is an issue?
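The effect of that iptables rule can be sketched in Python (the CIDRs below stand in for the ANTREA-POD-IP ipset and are illustrative):

```python
import ipaddress

# Stand-in for the ANTREA-POD-IP ipset (illustrative CIDRs).
POD_CIDRS = [ipaddress.ip_network("192.168.1.0/24"),
             ipaddress.ip_network("192.168.3.0/24")]

def should_masquerade(dst_ip: str) -> bool:
    """SNAT only when the destination is NOT a Pod IP (Pod-to-external)."""
    dst = ipaddress.ip_address(dst_ip)
    return not any(dst in cidr for cidr in POD_CIDRS)

print(should_masquerade("192.168.3.5"))  # False: Pod-to-Pod, no SNAT
print(should_masquerade("8.8.8.8"))      # True: Pod-to-external, SNAT
```

The question in the thread is precisely whether Windows NetNat can express this kind of destination-based exclusion.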

@wenyingd
Contributor

wenyingd commented Dec 7, 2021

Windows NetNat configuration doesn't support "exclude" options on either internal addresses (PodCIDR) or external addresses (the SNATed address), so I don't think we can do on the Windows host what we have done on Linux. An alternative is that the Agent could query the peer Node's MAC (using "Get-NetNeighbor" on Windows) if the Node's annotation is not set. What do you think?

@tnqn
Member

tnqn commented Dec 7, 2021

There may be no neighbor cache entry if the two Nodes have never communicated. If we have to add another step to trigger communication anyway, it may make more sense to just send an ARP query to retrieve the MAC address. There should be some Go libraries for doing that.

And it needs to handle the cross-subnet case. Bypassing the Windows host network used to be optional and only affected performance; now it becomes necessary, as falling back to the host network causes destination Pods to not see source Pod IPs.
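For reference, the on-the-wire shape of such an ARP probe is small and easy to build; the sketch below hand-packs a who-has request (the addresses are placeholders, and a real implementation would use a raw socket or an existing library rather than this):

```python
import struct

def build_arp_request(src_mac: bytes, src_ip: bytes, target_ip: bytes) -> bytes:
    """Build a broadcast Ethernet frame carrying an ARP who-has request."""
    eth = b"\xff" * 6 + src_mac + struct.pack("!H", 0x0806)  # dst bcast, EtherType ARP
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)  # Ethernet/IPv4, hlen/plen, opcode 1
    arp += src_mac + src_ip + b"\x00" * 6 + target_ip  # sender pair + unknown target MAC
    return eth + arp

frame = build_arp_request(bytes.fromhex("00155d5b8403"),
                          bytes([192, 168, 1, 1]), bytes([192, 168, 1, 30]))
print(len(frame))  # 42: 14-byte Ethernet header + 28-byte ARP payload
```

Note that ARP resolution only works for Nodes on the same L2 segment, which is why the cross-subnet case mentioned above needs separate handling.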

@wenyingd
Contributor

If sending ARP, I have thought of 3 options for leveraging the ARP reply: 1) packet-in the reply to the Antrea Agent, so the Agent can install OpenFlow entries for the peer Node's PodCIDR; 2) use a "learn" action to dynamically use the src MAC from the reply, so the Antrea Agent does not have to wait for the ARP reply but directly installs a flow on OVS; 3) let the Windows host learn from the ARP reply (by outputting the reply packet to the OVS bridge interface), and use the "Get-NetNeighbor" command to read the MAC from the Windows host.

Options 1 and 3 are similar; the difference is whether packet-in is used, and in both cases the Antrea Agent has to wait for the ARP reply. Option 2 leverages OVS and does not need to wait in the Antrea Agent. A disadvantage is that the Antrea Agent doesn't know the MAC of the peer Node, so it cannot cache the L3 flow entry: if a disconnection from OVS happens, the Antrea Agent is not able to replay the flows.

Which option would you prefer, or do you have other suggestions? @tnqn @jianjuns @lzhecheng @XinShuYang

@jianjuns
Contributor

But Windows should have a way to skip NAT for specific IPs, no?

Using ARP to discover MACs is much more complex, and can be another source of traffic issues, especially when we have a large number of Nodes.

@wenyingd
Contributor

> But Windows should have a way to skip NAT for specific IPs, no?
>
> Using ARP to discover MACs is much more complex, and can be another source of traffic issues, especially when we have a large number of Nodes.

I haven't found a valid configuration to exclude some IPs/CIDRs in Windows NetNat yet.

@wenyingd
Contributor

Another thought: use a separate CIDR for Windows NAT, filter the traffic going to external destinations in OVS, and then perform SNAT in OVS using the CIDR configured for Windows NAT. This solution should also be suitable for the Egress feature on Windows. What do you think? @jianjuns @tnqn

@wenyingd
Contributor

> Another thought: use a separate CIDR for Windows NAT, filter the traffic going to external destinations in OVS, and then perform SNAT in OVS using the CIDR configured for Windows NAT. This solution should also be suitable for the Egress feature on Windows. What do you think? @jianjuns @tnqn

The CIDR for Windows NAT could be 169.254.0.128/25, which can be configured in Windows NetNat.
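A quick sanity check that the proposed range is a valid carve-out of the IPv4 link-local space (values as proposed above):

```python
import ipaddress

nat_cidr = ipaddress.ip_network("169.254.0.128/25")
link_local = ipaddress.ip_network("169.254.0.0/16")

print(nat_cidr.subnet_of(link_local))  # True: stays inside link-local space
print(nat_cidr.num_addresses)          # 128 addresses available for SNAT
```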

@wenyingd
Contributor

After some offline discussion and testing with @hongliangl @XinShuYang, we have another solution: add a new OVS internal port for traffic that doesn't need SNAT, enable IP forwarding on the interface, and then add OpenFlow entries in OVS to ensure the packets are output to the new interface. Back to this issue: we could add an OpenFlow entry for Pod traffic to a Node which is not annotated with a MAC, set the dst MAC of the packet to the new interface's MAC in L3ForwardTable, and set the output ofport number in L2ForwardCalcTable. Since IP forwarding is enabled on the new interface, the packet can be forwarded back to br-int from the interface, and then output to the uplink.

@wenyingd wenyingd assigned XinShuYang and wenyingd and unassigned tnqn Dec 17, 2021
@wenyingd
Contributor

> After some offline discussion and testing with @hongliangl @XinShuYang, we have another solution: add a new OVS internal port for traffic that doesn't need SNAT, enable IP forwarding on the interface, and then add OpenFlow entries in OVS to ensure the packets are output to the new interface. Back to this issue: we could add an OpenFlow entry for Pod traffic to a Node which is not annotated with a MAC, set the dst MAC of the packet to the new interface's MAC in L3ForwardTable, and set the output ofport number in L2ForwardCalcTable. Since IP forwarding is enabled on the new interface, the packet can be forwarded back to br-int from the interface, and then output to the uplink.

Any thoughts about this option? @jianjuns @tnqn @antoninbas

@jianjuns
Contributor

I feel the new proposal sounds better than the previous two.

So, there is no other way, like a packet mark, etc., to identify a packet that requires NAT?

@wenyingd
Contributor

wenyingd commented Jan 12, 2022

> I feel the new proposal sounds better than the previous two.
>
> So, there is no other way, like a packet mark, etc., to identify a packet that requires NAT?

I didn't find a valid configuration on the Windows host. Windows SNAT uses the source CIDR as the filter, and we can't use that to distinguish Pod traffic going to an external destination from Pod traffic going to a Node.

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove the stale label or add a comment, or this will be closed in 90 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 13, 2022