Enable AF_XDP for cmd-forwarder-vpp management interface #283

Closed
edwarnicke opened this issue Jul 15, 2021 · 38 comments

@edwarnicke
Member

Currently cmd-forwarder-vpp uses AF_PACKET to bind to an existing Node interface using LinkToAfPacket.

AF_XDP is faster than AF_PACKET, but AF_XDP is only usable for our purposes from kernel version 5.4 onward. The good news is that lots of places have kernel versions that new (including the more recent versions of Docker Desktop).

AF_XDP is supported in govpp.

Because AF_XDP is only supported on newer kernels, a check will need to be made and then the correct method chosen (AF_XDP if available, otherwise AF_PACKET).
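
For illustration, a minimal sketch of such a runtime check in Go (assuming the 5.4 cutoff above; this is not the actual cmd-forwarder-vpp code, and the wiring into the interface-setup logic is omitted):

// Sketch only: decide between AF_XDP and AF_PACKET based on the running kernel.
package main

import (
	"fmt"
	"strconv"
	"strings"

	"golang.org/x/sys/unix"
)

func kernelSupportsAfXDP() (bool, error) {
	var uts unix.Utsname
	if err := unix.Uname(&uts); err != nil {
		return false, err
	}
	// Release looks like "5.15.0-71-generic"; take the major.minor prefix.
	release := unix.ByteSliceToString(uts.Release[:])
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return false, fmt.Errorf("unexpected kernel release %q", release)
	}
	major, errMajor := strconv.Atoi(parts[0])
	minor, errMinor := strconv.Atoi(parts[1])
	if errMajor != nil || errMinor != nil {
		return false, fmt.Errorf("cannot parse kernel release %q", release)
	}
	return major > 5 || (major == 5 && minor >= 4), nil
}

func main() {
	if ok, err := kernelSupportsAfXDP(); err == nil && ok {
		fmt.Println("using AF_XDP for the uplink")
	} else {
		fmt.Println("falling back to AF_PACKET")
	}
}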

@glazychev-art
Contributor

glazychev-art commented Aug 9, 2021

blocked by #284

For some reason AF_XDP doesn't work correctly with vpp v20.09

@glazychev-art
Contributor

Found a problem on clusters - the forwarder just hangs during startup without any logs.
I tested it on kind and on a Packet cluster - the situation is the same.

Created a JIRA issue - https://jira.fd.io/browse/VPP-1994

@denis-tingaikin denis-tingaikin moved this from Todo to In Progress in Release v1.8.0 Dec 27, 2022
@glazychev-art
Contributor

It seems it has become clear why we see the forwarder (and node) hanging.
If I understand correctly, AF_XDP moves frames directly to VPP, bypassing the Linux network stack. But we know that the forwarder uses hostNetwork: true - https://github.com/networkservicemesh/deployments-k8s/blob/main/apps/forwarder-vpp/forwarder.yaml#L19. This is required for interdomain.

So, when VPP takes the uplink interface, it grabs the primary node interface, and traffic goes directly to VPP, bypassing Linux. Therefore, we lose the connection with the node, and to us it looks as if it hangs.

@edwarnicke
As I see it, Calico-vpp has a similar scenario - https://projectcalico.docs.tigera.io/reference/vpp/host-network
Should we take a similar approach?

@denis-tingaikin
Member

denis-tingaikin commented Jan 2, 2023

@glazychev-art

As I see it, Calico-vpp has a similar scenario - https://projectcalico.docs.tigera.io/reference/vpp/host-network
Should we take a similar approach?

Could you please say more?

Also, as far as I know, AF_XDP doesn't work with Calico. Am I wrong?

@edwarnicke
Member Author

@glazychev-art Look into AF_XDP and eBPF. You should be able to craft an eBPF program that is passed in for AF_XDP that only passes on VXLAN/Wireguard/IPSEC packets (sort of like a pinhole), and then that traffic will go to VPP while all other traffic goes to the kernel interface.

@glazychev-art
Contributor

glazychev-art commented Jan 10, 2023

Most likely the action plan will be:

  • run vpp af_xdp with a custom eBPF program (fix possible bugs)
  • figure out how to communicate with eBPF from golang
  • check kernel/libbpf/vpp-af-xdp compatibility
  • update vpp version: ~22h
    a. fix AF_PACKET ~1h
    b. fix wireguard [vpp] ~ 6h
    c. fix ACLs ~ 3h
    d. check calico-vpp ~ 4h
    e. check and fix other possible problems ~ 8h
  • implement an eBPF program ~8h
  • create a chain element to update the eBPF map (it will contain UDP ports) - see the sketch after this list
  • update cmd-forwarder-vpp: ~4h
    a. apply the new chain element ~ 1h
    b. update VppInit function (add AF_XDP) ~2h
    c. update forwarder Dockerfile to build it correctly ~ 1h
  • Risks ~12h
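
For the "communicate with eBPF from golang" and chain element items, a rough sketch using github.com/cilium/ebpf (assuming the ports map is pinned under /sys/fs/bpf, as LIBBPF_PIN_BY_NAME suggests; the actual pin path and chain element wiring may differ):

// Sketch: add a UDP destination port (e.g. a negotiated VXLAN port) to the
// pinned eBPF map so that the XDP program redirects that traffic to VPP.
package main

import (
	"log"

	"github.com/cilium/ebpf"
)

func allowUDPPort(pinPath string, port uint16) error {
	m, err := ebpf.LoadPinnedMap(pinPath, nil)
	if err != nil {
		return err
	}
	defer m.Close()
	// Key/value sizes must match the BPF side: int key, unsigned short value.
	return m.Put(uint32(port), uint16(1))
}

func main() {
	// "/sys/fs/bpf/ports_map" is an assumed pin path, for illustration only.
	if err := allowUDPPort("/sys/fs/bpf/ports_map", 4789); err != nil {
		log.Fatal(err)
	}
}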

@glazychev-art
Contributor

Current state:

  1. Prepared an eBPF program
  2. Built govpp with this patch - https://gerrit.fd.io/r/c/vpp/+/37274
  3. Ran cmd-forwarder-vpp docker tests - they work very well. They don't work without the patch from step 2
  4. Still have a problem with kubernetes - forwarders don't respond after creation

There was an idea to update VPP to the latest version.
Even with the latest VPP, docker tests don't work without https://gerrit.fd.io/r/c/vpp/+/37274, and the problem with kubernetes was not resolved.

Perhaps the patch https://gerrit.fd.io/r/c/vpp/+/37274 is not entirely correct if we run the cluster locally (kind). I continue to work in this direction.

@edwarnicke
Member Author

@glazychev-art Is calico-vpp being on an older vpp version still blocking us from updating to a more recent vpp version?

@glazychev-art
Contributor

glazychev-art commented Jan 17, 2023

@edwarnicke
Not really - it was updated recently (on the main branch) - projectcalico/vpp-dataplane@d8288e1.
I've tested this vpp revision and seen a few problems:

  1. Minor - we need to use a newer API version for AF_PACKET. For unknown reasons, our current one no longer works.
  2. More serious - with the many improvements to Wireguard in vpp, the event mechanism (which tells us when a Wireguard interface is ready) was broken. We will need to figure it out.
  3. We need to deal with ACLs, because our current usage returns an error.

Do we need to update?

@edwarnicke
Member Author

@glazychev-art It's probably a good idea to update, yes.

@edwarnicke
Member Author

@glazychev-art It might also be a good idea to put tests in VPP to prevent some of the breakage we are seeing from happening in the future.

@glazychev-art
Contributor

glazychev-art commented Jan 18, 2023

@edwarnicke
I have a question related to the eBPF program. Currently I've implemented it so that it only filters IP UDP packets based on a port.

But what do we do with ARP packets?
We definitely need ARP packets to be handled by the kernel for the pod to function properly.
On the other hand, we also need ARP in VPP so that we can find out the MAC addresses of other forwarders.

Perhaps we also need to filter frames by destination MAC, if the MACs differ between the VPP and kernel interfaces.

Do you have any thoughts?

@edwarnicke
Member Author

@glazychev-art Could you point me to your existing eBPF program?

@edwarnicke
Member Author

@glazychev-art Have you looked at bpf_clone_redirect() ?

@glazychev-art
Contributor

glazychev-art commented Jan 19, 2023

@edwarnicke
Currently the eBPF program looks like this:

/*
 * SPDX-License-Identifier: GPL-2.0 OR Apache-2.0
 * Dual-licensed under GPL version 2.0 or Apache License version 2.0
 * Copyright (c) 2020 Cisco and/or its affiliates.
 */
#include <linux/bpf.h>
#include <linux/in.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>


/*
 * when compiled, debug print can be viewed with eg.
 * sudo cat /sys/kernel/debug/tracing/trace_pipe
 */
#ifdef DEBUG
#define s__(n)   # n
#define s_(n)    s__(n)
#define x_(fmt)  __FILE__ ":" s_(__LINE__) ": " fmt "\n"
#define DEBUG_PRINT_(fmt, ...) do { \
    const char fmt__[] = fmt; \
    bpf_trace_printk(fmt__, sizeof(fmt), ## __VA_ARGS__); } while(0)
#define DEBUG_PRINT(fmt, ...)   DEBUG_PRINT_ (x_(fmt), ## __VA_ARGS__)
#else   /* DEBUG */
#define DEBUG_PRINT(fmt, ...)
#endif  /* DEBUG */

#define ntohs(x)        __constant_ntohs(x)
#define MAX_NR_PORTS 65536
			    
/* UDP destination ports that should be redirected to the AF_XDP socket;
 * pinned by name so that userspace (e.g. the forwarder) can update it. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, MAX_NR_PORTS);
    __type(key, int);
    __type(value, unsigned short int);
    __uint(pinning, LIBBPF_PIN_BY_NAME);
} ports_map SEC(".maps");

/* AF_XDP sockets, keyed by rx queue index (standard XSKMAP usage);
 * entries are filled in when the sockets are created. */
struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);
    __type(key, int);
    __type(value, int);
} xsks_map SEC(".maps");


SEC("xdp_sock")
int xdp_sock_prog(struct xdp_md *ctx) {
    const void *data = (void *)(long)ctx->data;
    const void *data_end = (void *)(long)ctx->data_end;
    int qid = ctx->rx_queue_index;
    
    DEBUG_PRINT("rx %ld bytes packet", (long)data_end - (long)data);
    
    if (data + sizeof(struct ethhdr) > data_end) {
        DEBUG_PRINT("packet too small");
        return XDP_PASS;
    }
   
    const struct ethhdr *eth = data;
    if (eth->h_proto != ntohs(ETH_P_IP) && eth->h_proto != ntohs(ETH_P_ARP)) {
          return XDP_PASS;
    }
    
    if (eth->h_proto == ntohs(ETH_P_ARP)) {
      if (!bpf_map_lookup_elem(&xsks_map, &qid))
      {
        DEBUG_PRINT("no socket found");
        return XDP_PASS;
      }

      DEBUG_PRINT("going to socket %d", qid);
      return bpf_redirect_map(&xsks_map, qid, 0);
    }

    if (data + sizeof(struct ethhdr) + sizeof(struct iphdr) + sizeof(struct udphdr) > data_end) {
        DEBUG_PRINT("packet too small");
        return XDP_PASS;
    }

    const struct iphdr *ip = (void *)(eth + 1);
    switch (ip->protocol) {
      case IPPROTO_UDP: {
        const struct udphdr *udp = (void *)(ip + 1);
        const int port = ntohs(udp->dest);
        if (!bpf_map_lookup_elem(&ports_map, &port)) {
            DEBUG_PRINT("unsupported udp dst port %x", (int)udp->dest);
            return XDP_PASS;
        }
        break;
      }
      default:
        DEBUG_PRINT("unsupported ip proto %x", (int)ip->protocol);
        return XDP_PASS;
    }

    if (!bpf_map_lookup_elem(&xsks_map, &qid))
      {
        DEBUG_PRINT("no socket found");
        return XDP_PASS;
      }

    DEBUG_PRINT("going to socket %d", qid);
    return bpf_redirect_map(&xsks_map, qid, 0);
}

/* actually Dual GPLv2/Apache2, but GPLv2 as far as kernel is concerned */
SEC("license")
char _license[] = "GPL";

In short, we pass all ARP packets to VPP and filter IP packets - if the UDP destination port belongs to VXLAN, Wireguard, and so on, we pass the packet to VPP; otherwise, to the kernel.

@glazychev-art
Contributor

@edwarnicke
Yes, I looked at long bpf_clone_redirect(struct sk_buff *skb, u32 ifindex, u64 flags).
But as you can see, it receives an sk_buff, so it seems we can only call this function after the XDP layer, once we have already chosen the kernel path (in the TC ingress layer, for example).
Otherwise we would probably need to create an sk_buff manually in the XDP function and call bpf_clone_redirect.

@edwarnicke
Member Author

@glazychev-art Trying to create an sk_buff sounds like it might be prone to error.

We may also want to think through what the problem really is. Is the problem that we are not receiving ARP packets, or is the problem how we construct our neighbor table in VPP?

@glazychev-art
Contributor

I think the problem is that we are not receiving ARP packets.
We construct the VPP neighbor table correctly - we take all ARP entries known to the kernel at start time.
Next, VPP needs to learn about other pods - for example, about another forwarder, in order to set up a tunnel.
On the other hand, we also need ARP to be processed in the kernel - for example, when passing the request forwarder --> manager.

@edwarnicke
Member Author

So, the kernel will only accept and remember the response if it sent the request itself.

Have we checked this? It might be true, but I wouldn't simply presume it.

@glazychev-art
Contributor

I think I tested something similar - without NeighSubscribeAt, but by looking at ip neigh.

But definitely, we need to double-check that.

@glazychev-art glazychev-art moved this from Todo to In Progress in Release v1.9.0 Mar 10, 2023
@glazychev-art
Contributor

@edwarnicke
It looks like NeighSubscribeAt and IPNeighborAddDel are working fine for IPv4 interfaces.

But this is not the case for IPv6. Since IPv6 has its own neighbor discovery mechanism, the Linux side doesn't save the NA (Neighbor Advertisement) if we send the NS (Neighbor Solicitation) from the VPP side. I tried changing the Solicited and Override flags in the response, but it didn't help.

Should we continue to work in this direction, or does it make sense to implement only IPv4?
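
For reference, a minimal sketch of that IPv4 flow using github.com/vishvananda/netlink; mirrorToVPP is a hypothetical placeholder for the real IPNeighborAddDel call, and the uplink ifindex is assumed:

// Sketch: subscribe to kernel neighbor updates and mirror resolved entries
// into VPP. Not the actual sdk-vpp code.
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

// mirrorToVPP stands in for the real IPNeighborAddDel binapi call.
func mirrorToVPP(ip net.IP, mac net.HardwareAddr) {
	log.Printf("would add VPP neighbor %s -> %s", ip, mac)
}

func watchNeighbors(uplinkIndex int) error {
	updates := make(chan netlink.NeighUpdate)
	done := make(chan struct{})
	if err := netlink.NeighSubscribe(updates, done); err != nil {
		return err
	}
	for u := range updates {
		if u.LinkIndex != uplinkIndex || len(u.HardwareAddr) == 0 {
			continue
		}
		// Only mirror entries the kernel considers resolved.
		if u.State&(netlink.NUD_REACHABLE|netlink.NUD_PERMANENT|netlink.NUD_STALE) != 0 {
			mirrorToVPP(u.IP, u.HardwareAddr)
		}
	}
	return nil
}

func main() {
	// 2 is an assumed ifindex of the uplink interface, for illustration.
	if err := watchNeighbors(2); err != nil {
		log.Fatal(err)
	}
}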

@denis-tingaikin denis-tingaikin moved this from In Progress to Under review in Release v1.9.0 Mar 20, 2023
@glazychev-art glazychev-art moved this from Under review to In Progress in Release v1.9.0 Mar 27, 2023
@glazychev-art
Contributor

Current state:

  • Rechecked work with the flags (including the Router flag)
  • Used Wireshark to check if the packet is valid (e.g. checksum)
  • It looks like Linux really rejects unexpected NAs:
   When a valid Neighbor Advertisement is received (either solicited or
   unsolicited), the Neighbor Cache is searched for the target's entry.
   If no entry exists, the advertisement SHOULD be silently discarded.
   There is no need to create an entry if none exists, since the
   recipient has apparently not initiated any communication with the
   target.

https://www.rfc-editor.org/rfc/rfc4861.html#section-7.2.5

@glazychev-art
Contributor

Current state:

  • double-checked the possibility of copying the incoming packet both to userspace and to kernel space - did not find a way.
  • considered using the TC egress level. We could record there whether an NS (Neighbor Solicitation) was issued from kernel space, and use this information in the XDP ingress layer. But it looks like this would bring more problems, because we don't know whether we will receive an answer at all.
  • perhaps it makes sense to send an NS during the NSM Request from a chain element. For example, we can try https://pkg.go.dev/github.com/mdlayher/ndp - see the sketch below.
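
A sketch of that last option using github.com/mdlayher/ndp (recent versions of the package use net/netip addresses; the interface name and target address here are illustrative assumptions, not values from this issue):

// Sketch: send a Neighbor Solicitation for a target so that the kernel sees
// the exchange and creates the neighbor entry. Not the actual chain element.
package main

import (
	"log"
	"net"
	"net/netip"

	"github.com/mdlayher/ndp"
)

func solicit(ifName string, target netip.Addr) error {
	ifi, err := net.InterfaceByName(ifName)
	if err != nil {
		return err
	}
	// Listen on the interface's link-local address for NDP traffic.
	c, _, err := ndp.Listen(ifi, ndp.LinkLocal)
	if err != nil {
		return err
	}
	defer c.Close()

	snm, err := ndp.SolicitedNodeMulticast(target)
	if err != nil {
		return err
	}
	m := &ndp.NeighborSolicitation{
		TargetAddress: target,
		Options: []ndp.Option{
			&ndp.LinkLayerAddress{Direction: ndp.Source, Addr: ifi.HardwareAddr},
		},
	}
	return c.WriteTo(m, nil, snm)
}

func main() {
	// "eth0" and the target address are placeholders for illustration.
	if err := solicit("eth0", netip.MustParseAddr("2001:db8::1")); err != nil {
		log.Fatal(err)
	}
}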

@glazychev-art
Contributor

I've tried to resolve IPv6 neighbors in kernel space manually.
And it works: the forwarder receives the event from netlink and adds the neighbor to VPP via IPNeighborAddDel. Ping works after that.

@edwarnicke
Member Author

Are we typically looking for anything other than the MAC address of the gateway IP for the IPv6 case?

If so, could we simply scrape the Linux neighbor table for v6?

@glazychev-art
Contributor

Current state:

  • checked this - https://insights.sei.cmu.edu/blog/ping-sweeping-in-ipv6/. It looks interesting, but pinging ff02::1 doesn't always invoke Neighbor Discovery. Looking at the Wireshark capture (ICMPv6 echo request and response screenshots), we see that Linux doesn't send a Neighbor Solicitation at all (fd00:10:244:1::1 is a gateway).

Instead, we can resolve the gateways for a given interface in a slightly different way. Before creating AF_XDP, we can use netlink.RouteList and then ping every gateway found. This will add the neighbor entries to Linux; they will later be read and added to VPP. See the sketch below.
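
A sketch of that idea with github.com/vishvananda/netlink; pingOnce is a hypothetical helper (any ICMPv6 echo implementation, or even exec'ing ping, would do):

// Sketch: collect IPv6 gateways for the uplink and ping each one so that the
// kernel resolves their neighbors before the interface is given to AF_XDP.
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

// pingOnce is a hypothetical helper: send one ICMPv6 echo request to addr.
func pingOnce(addr net.IP) error {
	log.Printf("would ping %s", addr)
	return nil
}

func pingGateways(ifName string) error {
	link, err := netlink.LinkByName(ifName)
	if err != nil {
		return err
	}
	routes, err := netlink.RouteList(link, netlink.FAMILY_V6)
	if err != nil {
		return err
	}
	for _, r := range routes {
		if r.Gw == nil {
			continue
		}
		if err := pingOnce(r.Gw); err != nil {
			log.Printf("ping %s failed: %v", r.Gw, err)
		}
	}
	return nil
}

func main() {
	// "eth0" is a placeholder for the uplink interface name.
	if err := pingGateways("eth0"); err != nil {
		log.Fatal(err)
	}
}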

@edwarnicke
What do you think?

@glazychev-art
Contributor

@edwarnicke
It seems that it is not possible to run more than one AF_XDP forwarder on one node, unlike AF_PACKET (the forwarders use hostNetwork). Logs from the second forwarder:

af_xdp               [error ]: af_xdp_create_queue: xsk_socket__create() failed (is linux netdev vpp1host up?): Device or resource busy
create interface af_xdp: xsk_socket__create() failed (is linux netdev vpp1host up?): Device or resource busy

@glazychev-art
Contributor

glazychev-art commented Apr 6, 2023

Current state:
Tested a new forwarder on public clusters:
GKE - doesn't start. Logs from forwarder:

Apr  3 05:38:16.954 [INFO] [cmd:vpp] libbpf: Kernel error message: virtio_net: XDP expects header/data in single page, any_header_sg required
Apr  3 05:38:16.954 [INFO] [cmd:vpp] vpp[10244]: af_xdp: af_xdp_load_program: bpf_set_link_xdp_fd(eth0) failed: Invalid argument
Apr  3 05:38:18.228 [ERRO] [cmd:/bin/forwarder] [duration:12.809608ms] [hostIfName:eth0] [vppapi:AfXdpCreate] VPPApiError: System call error #6 (-16)
panic: error: VPPApiError: System call error #6 (-16)

AWS - doesn't start. Logs from forwarder:

Apr  3 13:24:25.406 [INFO] [cmd:vpp] libbpf: Kernel error message: veth: Peer MTU is too large to set XDP
Apr  3 13:24:25.406 [INFO] [cmd:vpp] vpp[10508]: af_xdp: af_xdp_load_program: bpf_set_link_xdp_fd(eth0) failed: Numerical result out of range
Apr  3 13:24:26.563 [ERRO] [cmd:/bin/forwarder] [duration:18.015838ms] [hostIfName:eth0] [vppapi:AfXdpCreate] VPPApiError: System call error #6 (-16)
panic: error: VPPApiError: System call error #6 (-16)

Packet - started, but ping doesn't work. This is most likely due to the fact that the af_packet vpp plugin doesn't handle bonded interfaces (which Packet uses).
AKS - ping works only without the hostNetwork: true flag, but performance is poor (about 2 times slower than AF_PACKET).
Kind - works, but performance has not increased (it even decreased slightly).

Measurements on Kind

iperf3 TCP

Ethernet remote mechanism (VxLAN)

AF_PACKET:

Connecting to host 172.16.1.100, port 5201
[  5] local 172.16.1.101 port 43488 connected to 172.16.1.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  46.6 MBytes   391 Mbits/sec  174    969 KBytes       
[  5]   1.00-2.00   sec  48.8 MBytes   409 Mbits/sec    0   1.02 MBytes       
[  5]   2.00-3.00   sec  58.8 MBytes   493 Mbits/sec    0   1.07 MBytes       
[  5]   3.00-4.00   sec  53.8 MBytes   451 Mbits/sec    0   1.10 MBytes       
[  5]   4.00-5.00   sec  46.2 MBytes   388 Mbits/sec    0   1.12 MBytes       
[  5]   5.00-6.00   sec  62.5 MBytes   524 Mbits/sec    0   1.13 MBytes       
[  5]   6.00-7.00   sec  45.0 MBytes   377 Mbits/sec    0   1.14 MBytes       
[  5]   7.00-8.00   sec  65.0 MBytes   545 Mbits/sec    0   1.18 MBytes       
[  5]   8.00-9.00   sec  56.2 MBytes   472 Mbits/sec    0   1.22 MBytes       
[  5]   9.00-10.00  sec  45.0 MBytes   377 Mbits/sec    0   1.24 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   528 MBytes   443 Mbits/sec  174             sender
[  5]   0.00-10.08  sec   526 MBytes   438 Mbits/sec                  receiver

AF_XDP:

Connecting to host 172.16.1.100, port 5201
[  5] local 172.16.1.101 port 36586 connected to 172.16.1.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  46.9 MBytes   393 Mbits/sec  1326    113 KBytes       
[  5]   1.00-2.00   sec  41.3 MBytes   346 Mbits/sec  1114   42.2 KBytes       
[  5]   2.00-3.00   sec  36.2 MBytes   304 Mbits/sec  1058   34.0 KBytes       
[  5]   3.00-4.00   sec  54.2 MBytes   455 Mbits/sec  1560   20.4 KBytes       
[  5]   4.00-5.00   sec  36.3 MBytes   304 Mbits/sec  1149   44.9 KBytes       
[  5]   5.00-6.00   sec  27.9 MBytes   234 Mbits/sec  953   20.4 KBytes       
[  5]   6.00-7.00   sec  37.9 MBytes   318 Mbits/sec  1106   25.9 KBytes       
[  5]   7.00-8.00   sec  33.1 MBytes   278 Mbits/sec  964   25.9 KBytes       
[  5]   8.00-9.00   sec  39.2 MBytes   329 Mbits/sec  1448   32.7 KBytes       
[  5]   9.00-10.00  sec  51.1 MBytes   429 Mbits/sec  1445   23.1 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   404 MBytes   339 Mbits/sec  12123             sender
[  5]   0.00-10.00  sec   403 MBytes   338 Mbits/sec                  receiver

Note: many retransmissions (Retr)

IP remote mechanism (Wireguard)

AF_PACKET:

Connecting to host 172.16.1.100, port 5201
[  5] local 172.16.1.101 port 49978 connected to 172.16.1.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  88.3 MBytes   740 Mbits/sec    2    487 KBytes       
[  5]   1.00-2.00   sec  87.4 MBytes   733 Mbits/sec    0    606 KBytes       
[  5]   2.00-3.00   sec  76.5 MBytes   642 Mbits/sec    6    495 KBytes       
[  5]   3.00-4.00   sec  74.6 MBytes   626 Mbits/sec    0    596 KBytes       
[  5]   4.00-5.00   sec  42.3 MBytes   355 Mbits/sec    0    649 KBytes       
[  5]   5.00-6.00   sec  21.7 MBytes   182 Mbits/sec    8    473 KBytes       
[  5]   6.00-7.00   sec  36.9 MBytes   310 Mbits/sec    0    545 KBytes       
[  5]   7.00-8.00   sec  88.9 MBytes   746 Mbits/sec    0    636 KBytes       
[  5]   8.00-9.00   sec  82.4 MBytes   691 Mbits/sec    8    539 KBytes       
[  5]   9.00-10.00  sec  92.0 MBytes   772 Mbits/sec    0    664 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   691 MBytes   580 Mbits/sec   24             sender
[  5]   0.00-10.03  sec   690 MBytes   577 Mbits/sec                  receiver

AF_XDP:

Connecting to host 172.16.1.100, port 5201
[  5] local 172.16.1.101 port 46608 connected to 172.16.1.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   104 MBytes   873 Mbits/sec   47    645 KBytes       
[  5]   1.00-2.00   sec  98.7 MBytes   828 Mbits/sec   39    538 KBytes       
[  5]   2.00-3.00   sec  90.9 MBytes   763 Mbits/sec    0    655 KBytes       
[  5]   3.00-4.00   sec  65.2 MBytes   547 Mbits/sec   14    533 KBytes       
[  5]   4.00-5.00   sec  53.3 MBytes   447 Mbits/sec    7    603 KBytes       
[  5]   5.00-6.00   sec  52.4 MBytes   440 Mbits/sec    0    660 KBytes       
[  5]   6.00-7.00   sec  39.1 MBytes   328 Mbits/sec    8    526 KBytes       
[  5]   7.00-8.00   sec  38.7 MBytes   325 Mbits/sec    0    587 KBytes       
[  5]   8.00-9.00   sec  94.8 MBytes   796 Mbits/sec    0    675 KBytes       
[  5]   9.00-10.00  sec  96.0 MBytes   805 Mbits/sec    7    618 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   733 MBytes   615 Mbits/sec  122             sender
[  5]   0.00-10.05  sec   732 MBytes   611 Mbits/sec                  receiver

iperf3 UDP

AF_PACKET

Accepted connection from 172.16.1.101, port 39452
[  5] local 172.16.1.100 port 5201 connected to 172.16.1.101 port 40692
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-1.00   sec   118 MBytes   986 Mbits/sec  0.077 ms  525084/613923 (86%)  
[  5]   1.00-2.00   sec   117 MBytes   980 Mbits/sec  0.002 ms  576553/664766 (87%)  
[  5]   2.00-3.00   sec   120 MBytes  1.01 Gbits/sec  0.050 ms  576732/667716 (86%)  
[  5]   3.00-4.00   sec   120 MBytes  1.00 Gbits/sec  0.002 ms  581367/671794 (87%)  
[  5]   4.00-5.00   sec   120 MBytes  1.00 Gbits/sec  0.002 ms  612951/703307 (87%)  
[  5]   5.00-6.00   sec   122 MBytes  1.03 Gbits/sec  0.001 ms  535717/628083 (85%)  
[  5]   6.00-7.00   sec   117 MBytes   980 Mbits/sec  0.041 ms  578869/667122 (87%)  
[  5]   7.00-8.00   sec   119 MBytes  1.00 Gbits/sec  0.002 ms  577990/668247 (86%)  
[  5]   8.00-9.00   sec   116 MBytes   974 Mbits/sec  0.002 ms  582754/670426 (87%)  
[  5]   9.00-10.00  sec   120 MBytes  1.01 Gbits/sec  0.024 ms  579465/670305 (86%)  
[  5]  10.00-10.21  sec  2.50 MBytes   100 Mbits/sec  0.002 ms  38604/40489 (95%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.21  sec  1.16 GBytes   979 Mbits/sec  0.002 ms  5766086/6666178 (86%)  receiver

AF_XDP

[  5] local 172.16.1.100 port 5201 connected to 172.16.1.101 port 41437
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-1.00   sec   156 MBytes  1.31 Gbits/sec  0.001 ms  491872/609832 (81%)  
[  5]   1.00-2.00   sec   168 MBytes  1.41 Gbits/sec  0.001 ms  557337/684419 (81%)  
[  5]   2.00-3.00   sec   166 MBytes  1.39 Gbits/sec  0.001 ms  551925/677423 (81%)  
[  5]   3.00-4.00   sec   163 MBytes  1.36 Gbits/sec  0.001 ms  557553/680349 (82%)  
[  5]   4.00-5.00   sec   165 MBytes  1.38 Gbits/sec  0.001 ms  553140/677503 (82%)  
[  5]   5.00-6.00   sec   170 MBytes  1.43 Gbits/sec  0.002 ms  558848/687616 (81%)  
[  5]   6.00-7.00   sec   161 MBytes  1.35 Gbits/sec  0.001 ms  558833/680687 (82%)  
[  5]   7.00-8.00   sec   162 MBytes  1.36 Gbits/sec  0.001 ms  575608/698261 (82%)  
[  5]   8.00-9.00   sec   163 MBytes  1.36 Gbits/sec  0.001 ms  550618/673519 (82%)  
[  5]   9.00-10.00  sec   169 MBytes  1.42 Gbits/sec  0.001 ms  555133/683148 (81%)  
[  5]  10.00-11.00  sec   434 KBytes  3.55 Mbits/sec  3.840 ms  0/320 (0%)  
[  5]  11.00-12.00  sec  43.4 KBytes   355 Kbits/sec  7.520 ms  0/32 (0%)

Conclusions

Client sends UDP:
AF_XDP is faster than AF_PACKET by ~40% (1.37 Gbits/sec vs 0.98 Gbits/sec)

Client sends TCP:
Average of 10 runs
Ethernet:
AF_PACKET is faster than AF_XDP by ~13% (460.3 Mbits/sec vs 407.2 Mbits/sec)
IP:
AF_XDP is equal to AF_PACKET (372.1 Mbits/sec vs 370.2 Mbits/sec)

@glazychev-art
Contributor

Estimation

To run CI on a kind cluster with AF_XDP, we need:

  1. Prepare a PR for sdk-vpp ~ 1h
  2. Prepare a PR for cmd-forwarder-vpp ~ 1h
  3. Add a new afxdp suite to deployments-k8s ~ 2h
  4. Add and test the suite on kind ~ 2h
  5. Risks ~ 2h

@glazychev-art
Contributor

@edwarnicke
Due to problems with public clusters (see the beginning of the post), there is an option to support af_xdp only on kind in this release.
What do you think of it?

@edwarnicke
Member Author

@glazychev-art It's strange that AF_PACKET is faster for TCP but slower for UDP. Do we have any notion of why?

@glazychev-art
Contributor

glazychev-art commented Apr 11, 2023

@edwarnicke
Yes, there are a couple of guesses:

  1. If we look at the iperf3 logs from TCP mode, we will see a huge number of retransmissions:
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  46.9 MBytes   393 Mbits/sec  1326    113 KBytes       
[  5]   1.00-2.00   sec  41.3 MBytes   346 Mbits/sec  1114   42.2 KBytes
...

(we don't see them with AF_PACKET)
  2. I was able to reproduce something similar on bare vpp instances:
     https://lists.fd.io/g/vpp-dev/topic/af_xdp_performance/98105671
  3. If we look at the vpp gerrit, we can see several open af_xdp patches that the owners claim will greatly increase performance (I tried them; it didn't help for TCP):
     https://gerrit.fd.io/r/c/vpp/+/37653
     https://gerrit.fd.io/r/c/vpp/+/38135

So, I think the problem may be in the VPP plugin.

@glazychev-art
Contributor

As part of this task, we have integrated the AF_XDP interface on the kind cluster. This is working successfully:
https://github.com/networkservicemesh/integration-k8s-kind/actions/runs/4798046461/jobs/8535800517

On public clusters we ran into problems, and separate issues were created:
#859
Performance: #860

I think this issue can be closed.

@github-project-automation github-project-automation bot moved this from Under review to Done in Release v1.9.0 Apr 28, 2023