This file contains various notes and lessons learned concerning performance
of the Homa Linux kernel module. The notes are in reverse chronological
order.
* (November 2022) Software GSO is very slow (17 usec on AMD EPYC processors)
  when breaking 64K messages into 9K jumbo frames. The main problem appears
  to be sk_buff allocation, which takes multiple usecs because the packet
  buffers are too large to be cached in the slab allocator.
* (November 2022) Compared "cp_node client --workload 500000" performance
  on c6525-100g cluster (24-core AMD 7402P processors @ 2.8 GHz, 100 Gbps
  networking) vs. xl170 cluster (10-core Intel E5-2640v4 @ 2.4 GHz, 25 Gbps
  networking):
Intel/25Gbps AMD/100Gbps
-----------------------------------------------------------------------
Packet size 1500B 9000B
Overall throughput (each direction) 3.4 Gbps 6.7-7.5 Gbps
Stats from ttrpcs.py:
Xmit/receive tput 11 Gbps 30-50 Gbps
Copy to/from user space 36-54 Gbps 30-110 Gbps
RTT for first grant 28-32 us 56-70 us
Stats from ttpktdelay.py:
SoftIRQ Wakeup (P50/P90) 6/30 us 14/23 us
Minimum network RTT 5.5 us 8 us
RTT with 100B messages 17 us 28 us
* (August 2022) Found problem with Mellanox driver that explains the
page_pool_alloc_pages_slow delays in the item below.
* The driver keeps a cache of "free" pages, organized as a FIFO
queue with a size limit.
* The page for a packet buffer gets added to the queue when the
packet is received, but with a nonzero reference count.
* The reference count is decremented when the skbuff is released.
* If the page gets to the front of the queue with a nonzero reference
count, it can't be allocated. Instead, a new page is allocated,
which is slower. Furthermore, this will result in excess pages,
eventually causing the queue to overflow; at that point, the excess
pages will be freed back to Linux, which is slow.
  * Homa likes to keep large numbers of buffers around for significant
    time periods; as a result, it triggers the slow path frequently,
    especially for large messages (see the sketch below).
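  The recycling behavior can be summarized with a rough sketch (illustrative
  only; the structures and helper names below are made up, not the actual
  mlx5 driver code):

      /* Sketch of the page cache behavior described above (hypothetical
       * names, not the real mlx5 code). */
      struct cached_page *pool_get_page(struct page_cache *cache)
      {
              struct cached_page *head = fifo_peek(&cache->fifo);

              if (head && head->refs == 0) {
                      /* Fast path: the skbuff that used this page has been
                       * released, so the page can be recycled. */
                      fifo_pop(&cache->fifo);
                      return head;
              }

              /* Slow path: the page at the front of the queue is still
               * referenced (e.g. Homa is holding the buffer), so a brand-new
               * page must be allocated. The resulting excess pages eventually
               * overflow the FIFO and get freed back to Linux, which is also
               * slow. */
              return alloc_new_page(cache);
      }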
* (August 2022) 2-node performance is problematic. Ran experiments with
the following client cp_node command:
cp_node client --ports 3 --server-ports 3 --client-max 10 --workload 500000
With max_window = rtt_bytes = 60000, throughput is only about 10 Gbps
on xl170 nodes. ttpktdelay output shows one-way times commonly 30us or
more, which means Homa can't keep enough grants outstanding for full
bandwidth. The overheads are spread across many places:
IP: IP stack, from calling ip_queue_xmit to NIC wakeup
Net: Additional time until homa_gro_receive gets packet
GRO Other: Time until end of GRO batch
GRO Gap: Delay after GRO packet processing until SoftIRQ handoff
Wakeup: Delay until homa_softirq starts
SoftIRQ: Time in homa_softirq until packet is processed
Total: End-to-end time from calling ip_queue_xmit to homa_softirq
handler for packet
Data packet lifetime (us), client -> server:
Pctile IP Net GRO Other GRO Gap Wakeup SoftIRQ Total
0 0.5 4.6 0.0 0.2 1.0 0.1 7.3
10 0.6 10.3 0.0 5.7 2.0 0.2 21.0
30 0.7 12.4 0.4 6.3 2.1 1.9 27.0
50 0.7 15.3 1.0 6.6 2.2 3.3 32.2
70 0.8 18.2 2.0 8.1 2.3 3.8 45.3
90 1.0 33.9 4.9 31.3 2.5 4.8 62.8
99 1.4 56.5 20.7 48.5 17.7 17.5 85.6
100 16.0 74.3 31.0 61.9 28.3 24.4 111.0
Grant lifetime (us), client -> server:
Pctile IP Net GRO Other GRO Gap Wakeup SoftIRQ Total
0 1.7 2.6 0.0 0.3 1.0 0.0 7.6
10 2.4 5.3 0.0 0.5 1.5 0.1 12.1
30 2.5 10.3 0.0 6.1 2.1 0.1 23.3
50 2.6 12.7 0.5 6.5 2.2 0.2 28.1
70 2.8 16.5 1.1 7.2 2.3 0.3 38.1
90 3.4 31.7 3.5 22.6 2.5 3.1 56.2
99 4.6 54.1 17.7 48.4 17.5 4.3 78.5
100 54.9 67.5 28.4 61.9 28.3 21.9 98.3
Additional client-side statistics:
Pre NAPI: usecs from interrupt entry to NAPI handler
GRO Total: usecs from NAPI handler entry to last homa_gro_receive
Batch: number of packets processed in one interrupt
Gap: usecs from last homa_gro_receive call to SoftIRQ handoff
Pctile Pre NAPI GRO Batch Gap
0 0.7 0.4 0 0.2
10 0.7 0.6 0 0.3
30 0.8 0.7 1 0.4
50 0.8 1.5 2 6.6
70 1.0 2.6 3 7.0
90 2.7 4.9 4 7.5
99 6.4 8.0 7 34.2
100 21.7 23.9 12 48.2
In looking over samples of long delays, there are two common issues that
affect multiple metrics:
* page_pool_alloc_pages_slow; affects:
P90/99 Net, P90/99 GRO Gap, P99 SoftIRQ wakeup
* unidentified 14-17 us gaps in homa_xmit_data, homa_gro_receive,
homa_data_pkt, and other places:
affects P99 GRO Other, P99 SoftIRQ, P99 GRO
In addition, I found the following smaller problems:
* unknown gaps before homa_gro_complete of 20-30 us, affects:
P90 SoftIRQ wakeup
Is this related to the "unidentified 14-17 us gaps" above?
* net_rx_action sometimes slow to start; affects:
Wakeup
* large batch size affects:
P90 SoftIRQ
* (June 2022) Short-message timelines (xl170 clusters, "cp_node client
--workload 100 --port-receivers 0"). All times are ns (data excludes
client-side recv->send turnaround time). Most of the difference
seems to be in kernel call time and NIC->NIC time. Also, note that
the 5.4.80 times have improved considerably from January 2021; there
appears to be at least 1 us variation in RTT from machine to machine.
5.17.7 5.4.80
Server Client Server Client
----------------------------------------------------------
Send:
homa_send/reply 461 588 468 534
IP/Driver 514 548 508 522
Total 975 1136 1475 1056
Receive:
Interrupt->Homa GRO 923 1003 789 815
GRO 200 227 193 201
Wakeup SoftIRQ 601 480 355 347
IP SoftIRQ 361 441 400 361
Homa SoftIRQ 702 469 588 388
Wakeup App 94 106 87 53
homa_recv 447 562 441 588
Total 3328 3288 2853 2753
Recv -> send kcall 682 220
NIC->NIC (round-trip) 6361 5261
RTT Total 15770 13618
* (January 2021) Best-case short-message timelines (xl170 cluster).
Linux 4.15.18 numbers were measured in September 2020. All times are ns.
5.4.80 4.15.18 Ratio
Server Client
---------------------------------------------------------
Send:
System call 360 360 240 1.50
homa_send/reply 620 870 420 1.77
IP/Driver 495 480 420 1.16
Total 1475 1710 1080 1.47
Receive:
Interrupt->NAPI 560 500 530 1.00
NAPI 560 675 420 1.47
Wakeup SoftIRQ 480 470 360 1.32
IP SoftIRQ 305 335 320 1.00
Homa SoftIRQ 455 190 240 1.34
Wakeup App 80 100 270 0.33
homa_recv 420 450 300 1.45
System Call 360 360 240 1.50
Total 3220 3080 2680 1.18
NIC->NIC (1-way) 2805 2805 2540 1.10
RTT Total 15100 15100 12600 1.20
* (January 2021) Small-message latencies (usec) for different workloads and
protocols (xl170 cluster, 40 nodes, high load, MTU 3000, Linux 5.4.80):
W2 W3 W4 W5
Homa P50 30.9 41.9 46.8 55.4
P99 57.7 98.5 109.3 139.0
DCTCP P50 106.7 (3.5x) 160.4 (3.8x) 159.1 (3.4x) 151.8 (2.7x)
P99 4812.1 (83x) 6361.7 (65x) 881.1 (8.1x) 991.2 (7.1x)
TCP P50 108.8 (3.5x) 192.7 (4.6x) 353.1 (7.5x) 385.7 (6.9x)
P99 4151.5 (72x) 5092.7 (52x) 2113.1 (19x) 4360.7 (31x)
* (January 2021) Analyzed effects of various configuration parameters,
running on 40-node xl170 cluster with MTU 3000:
  duty_cycle: Reducing to 40% improves small message latency 25% in W4
              and 40% in W5
fifo_fraction: No impact on small message P99 except W3 (10% degradation);
previous measurements showed 2x improvement in P99 for
largest messages with modified W4 workload.
gro_policy: NORMAL always better; others 10-25% worse for short P99
max_gro_skbs: Larger is better; reducing to 5 hurts short P99 10-15%.
However, anecdotal experience suggests that very large
values can cause long delays for things like sending
grants, so perhaps 10 is best?
max_gso_size: 10K looks best; not much difference above that, 10-20%
degradation of short P99 at 5K
nic_queue_ns: 5-10x degradation in short P99 when there is no limit;
no clear winner for short P99 in 1-10 us range; however,
shorter is better for P50 (1us slightly better than 2us)
poll_usecs: 0-50us all equal for W4 and W5; 50us better for W2 and W3
(10-20% better short P99 than 0us).
ports: Not much sensitivity: 3 server and 3 client looks good.
client threads: Need 3 ports: W2 can't keep up with 1-2 ports, W3 can't
keep up with 1 port. With 3 ports, 2 receivers has 1.5-2x
lower short P99 for W2 and W3 than 4 receivers, but for
W5 3 receivers is 10% better than 2. Best choice: 3p2r?
rtt_bytes: 60K is best, but not much sensitivity: 40K is < 10% worse
throttle_bytes: Almost no noticeable difference from 100-2000; perhaps
500 or 1000?
* (October 2020) Polling performance impact. In isolation, polling saves
about 4 us RTT per RPC. In the workloads, it reduces short-message P50
up to 10 us, and P99 up to 25us (the impact is greater with light-tailed
workloads like W1 and W2). For W2, polling also improved throughput
by about 3%.
* (October 2020) Polling problem: some workloads (like W5 with 30 MB
messages) need a lot of receiving threads for occasional bursts where
several threads are tied up receiving very large messages. However,
this same number of receivers results in poor performance in W3,
because these additional threads spend a lot of time polling, which
wastes enough CPU time to impact the threads that actually have
work to do. One possibility: limit the number of polling threads per
socket? Right now it appears hard to configure polling for all
workloads.
* (October 2020) Experimented with new GRO policy HOMA_GRO_NO_TASKS,
which attempts to avoid cores with active threads when picking cores
for SoftIRQ processing. This made almost no visible difference in
performance, and also depends on modifying the Linux kernel to
export a previously unexported function, so I removed it. It's
still available in repo commits, though.
* (October 2020) Receive queue order. Experimented with ordering the
  hsk->ready_requests and hsk->ready_responses lists to return short
  messages first. It's not clear that this provided any major benefits,
  and it reduced throughput in some cases because of the overhead of
  inserting ready messages into the queues in order.
* (October 2020) NIC queue estimation. Experimented with how much to
underestimate network bandwidth. Answer: not much! The existing 5% margin
of safety leaves bandwidth on the table, which impacts tail latency for
large messages. Reduced it to 1%, which helps large messages a lot (up to
2x reduction in latency). Impact on small messages is mixed (more get worse
than better), but the impact isn't large in either case.
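  To make the trade-off concrete, here is a small stand-alone calculation
  (illustrative only, not Homa's code) showing how a safety margin inflates
  the estimated transmit time that gets charged to the NIC queue:

      /* How a bandwidth safety margin inflates estimated NIC queue time. */
      #include <stdio.h>

      static double xmit_ns(int bytes, double link_gbps, double margin_pct)
      {
              double derated = link_gbps * (100.0 - margin_pct) / 100.0;
              return bytes * 8.0 / derated;     /* bits / (Gbit/s) == ns */
      }

      int main(void)
      {
              /* 9000-byte packet, 25 Gbps link: ~2880 ns at full rate,
               * ~3032 ns with a 5% margin, ~2909 ns with a 1% margin;
               * the difference is bandwidth the pacer never uses. */
              printf("5%%: %.0f ns  1%%: %.0f ns\n",
                     xmit_ns(9000, 25.0, 5.0), xmit_ns(9000, 25.0, 1.0));
              return 0;
      }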
* (July 2020) P10 under load. Although Homa can provide 13.5 us RTTs under
best-case conditions, this almost never occurs in practice. Even at low
loads, the "best case" (P10) is more like 25-30 us. I analyzed a bunch
of 25-30 us message traces and found the following sources of additional
delay:
* Network delays (from passing packet to NIC until interrupt received)
account for 5-10 us of the additional delay (most likely packet queuing
in the NIC). There could also be delays in running the interrupt handler.
* Every stage of software runs slower, typically taking about 2x as long
(7.1 us becomes 12-23 us in my samples, with median 14.6 us)
* Occasional other glitches, such as having to wake up a receiving
user thread, or interference due to NAPI/SoftIRQ processing of other
messages.
* (July 2020) Adaptive polling. A longer polling interval (e.g. 500 usecs)
lowers tail latency for heavy-tailed workloads such as W4, but it hurts
other workloads (P999 tail latency gets much worse for W1 because polling
threads create contention for cores; P99 tail latency for large messages
suffers in W3). I attempted an adaptive approach to polling, where a thread
stops polling if it is no longer first in line, and gets woken up later to
resume polling if it becomes first in line again. The hope was that this
would allow a longer polling interval without negatively impacting other
workloads. It did help, but only a bit, and it added a lot of complexity,
so I removed it.
* (July 2020) Best-case timetraces for short messages on xl170 CloudLab cluster.
Clients: Cum.
Event Median
--------------------------------------------------------------------------
[C?] homa_ioc_send starting, target ?:?, id ?, pid ? 0
[C?] mlx nic notified 939
[C?] Entering IRQ 9589
[C?] homa_gro_receive got packet from ? id ?, offset ?, priority ? 10491
[C?] enqueue_to_backlog complete, cpu ?, id ?, peer ? 10644
[C?] homa_softirq: first packet from ?:?, id ?, type ? 11300
[C?] incoming data packet, id ?, peer ?, offset ?/? 11416
[C?] homa_rpc_ready handed off id ? 11560
[C?] received message while polling, id ? 11811
[C?] Freeing rpc id ?, socket ?, dead_skbs ? 11864
[C?] homa_ioc_recv finished, id ?, peer ?, length ?, pid ? 11987
Servers: Cum.
Event Median
--------------------------------------------------------------------------
[C?] Entering IRQ 0
[C?] homa_gro_receive got packet from ? id ?, offset ?, priority ? 762
[C?] homa_softirq: first packet from ?:?, id ?, type ? 1566
[C?] incoming data packet, id ?, peer ?, offset ?/? 1767
[C?] homa_rpc_ready handed off id ? 2012
[C?] received message while polling, id ? 2071
[C?] homa_ioc_recv finished, id ?, peer ?, length ?, pid ? 2459
[C?] homa_ioc_reply starting, id ?, port ?, pid ? 2940
[C?] mlx nic notified 3685
* (July 2020) SMI impact on tail latency. I observed gaps of 200-300 us where
a core appears to be doing nothing. These occur in a variety of places
in the code including in the middle of straight-line code or just
before an interrupt occurs. Furthermore, when these happen, *every* core
in the processor appears to stop at the same time (different cores are in
different places). The gaps do not appear to be related to interrupts (I
instrumented every __irq_entry in the Linux kernel sources), context
switches, or c-states (which I disabled). It appears that the gaps are
caused by System Management Interrupts (SMIs); they appear to account
for about half of the P99 traces I examined in W4.
* (July 2020) RSS configuration. Noticed that tail latency most often occurs
because too much work is being done by either NAPI or SoftIRQ (or both) on
a single core, which prevents application threads on that core from running.
Tried several alternative approaches to RSS to see if better load balancing
is possible, such as:
* Concentrate NAPI and SoftIRQ packet handling on a small number of cores,
and use core affinity to keep application threads off of those cores.
* Identify an unloaded core for SoftIRQ processing and steer packet batches
to these carefully chosen cores (attempted several different policies).
* Bypass the entire Linux networking stack and call homa_softirq directly
from homa_gro_receive.
* Arrange for SoftIRQ to run on the same core as NAPI (this is more efficient
because it avoids inter-processor interrupts, but can increase contention
on that core).
Most of these attempts made things worse, and none produced dramatic
benefits. In the end, I settled on the following hybrid approach:
* For single-packet batches (meaning the NAPI core is underloaded), process
SoftIRQ on the same core as NAPI. This reduces small-message RTT by about
3 us in underloaded systems.
* When there are packet batches, examine several adjacent cores, and pick
the one for SoftIRQ that has had the least recent NAPI/SoftIRQ work.
Overall, this results in a 20-35% improvement in P99 latency for small
messages under heavy-tailed workloads, in comparison to the Linux default
RSS behavior.
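  A simplified sketch of that hybrid policy (illustrative; the constants,
  load metric, and names are not Homa's actual code):

      /* Pick a core for SoftIRQ processing, given the NAPI core and the
       * size of the current GRO batch. Hypothetical sketch. */
      #define NUM_CORES   16
      #define CANDIDATES   4

      static int pick_softirq_core(int napi_core, int batch_size,
                                   const unsigned long *recent_work)
      {
              int i, best;

              if (batch_size == 1) {
                      /* NAPI core is lightly loaded: run SoftIRQ there too,
                       * saving an inter-processor interrupt (~3 us RTT). */
                      return napi_core;
              }

              /* Otherwise scan a few adjacent cores and pick the one that
               * has done the least recent NAPI/SoftIRQ work. */
              best = (napi_core + 1) % NUM_CORES;
              for (i = 2; i <= CANDIDATES; i++) {
                      int core = (napi_core + i) % NUM_CORES;
                      if (recent_work[core] < recent_work[best])
                              best = core;
              }
              return best;
      }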
* (July 2020) P999 latency for small messages. This is 5 ms or more in most
of the workloads, and it turns out to be caused by Linux SoftIRQ handling.
If __do_softirq thinks it is taking too much time, it stops processing
  all softirqs in the high-priority NAPI thread, and instead defers them
  to another thread, ksoftirqd, which intentionally runs at a low priority
  so as not to interfere with user threads. This sometimes means softirq
  processing has to wait a full time slice for other threads, which seems
  to be 5-7 ms.
I tried disabling this feature of __do_softirq, so that all requests get
processed in the high-priority thread, and the P999 latency improved by
about 10x (< 1 ms worst case).
* (July 2020) Small-message latency. The best-case RTT for small messages
is very difficult to achieve under any real-world conditions. As soon as
there is any load whatsoever, best-case latency jumps from 15 us to 25-40 us
(depending on overall load). The latency CDF for Homa is almost completely
unaffected by load (whereas it varies dramatically with TCP).
* (July 2020) Small-request optimization: if NAPI and SoftIRQ for a packet
are both done on the same core, it reduces round-trip latency by about
2 us for short messages; however, this works against the optimization below
for spreading out the load. I tried implementing it only for packets that
don't get merged for GRO, but it didn't make a noticeable difference (see
note above about best-case latency for short messages).
* (June-July 2020) Analyzing tail latency. P99 latency under W4 seems to
occur primarily because of core congestion: a core becomes completely
consumed with either NAPI or SoftIRQ processing (or both) for a long
message, which keeps it from processing a short message. For example,
the user thread that handles the message might be on the congested core,
and hence doesn't run for a long time while the core does NAPI/SoftIRQ
work. I modified Homa's GRO code to pick the SoftIRQ core for each batch
of packets intelligently (choose a core that doesn't appear to be busy
with either NAPI or SoftIRQ processing), and this helped a bit, but not
a lot (10-20% reduction in P99 for W4). Even with clever assignment of
SoftIRQ processing, the load from NAPI can be enough to monopolize a core.
* (June 2020) Cost of interrupt handler for receiving packets:
mlx5e_mpwqe_fill_rx_skb: 200 ns
napi_gro_receive: 150 ns
* (June 2020) Does instrumentation slow Homa down significantly? Modified
to run without timetraces and without any metrics except essential ones
for computing priorities:
Latency dropped from 15.3 us to 15.1 us
Small-RPC throughput increased from 1.8 Mops/sec to 1.9 Mops/sec
Large-message throughput didn't change: still about 2.7 MB/sec
Disabling timetraces while retaining metrics roughly splits the
difference. Conclusion: not worth the effort of disabling metrics,
probably not worth turning off timetracing.
* (June 2020) Implemented busy-waiting, where Homa spins for 2 RTTs
before putting a receiving thread to sleep. This reduced 100B RTT
on the xl170 cluster from 17.8 us to 15.3 us.
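  The pattern is a simple poll-then-sleep loop; a minimal userspace-style
  model (not Homa's actual homa_wait_for_message code) looks like this:

      /* Spin for roughly two RTTs waiting for a message before falling
       * back to a blocking wait (blocking path omitted). Illustrative. */
      #include <stdatomic.h>
      #include <stdint.h>
      #include <time.h>

      static uint64_t now_ns(void)
      {
              struct timespec ts;
              clock_gettime(CLOCK_MONOTONIC, &ts);
              return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
      }

      /* Returns 1 if the message arrived during the polling window. */
      int poll_before_sleep(atomic_int *msg_ready, uint64_t poll_ns)
      {
              uint64_t deadline = now_ns() + poll_ns;   /* ~2 RTTs, ~30 us */

              while (now_ns() < deadline) {
                      if (atomic_load(msg_ready))
                              return 1;   /* saved the sleep/wakeup cost */
              }
              return 0;   /* caller blocks normally */
      }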
* (May 2020) Noticed that cores can disappear for 10-12ms, during which
softirq handlers do not get invoked. Homa timetraces show no activity
of any kind during that time (e.g., no interrupts either?). Found out
later that this is Homa's fault: there is no preemption when executing
  in the kernel, and RPC reaping could potentially run for a very long
time if it gets behind. Fixed this by adding calls to schedule() so that
SoftIRQ tasks can run.
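  The fix is the standard pattern of yielding periodically inside a long
  kernel loop; a sketch (the reap helpers and batch size are illustrative,
  not Homa's exact code):

      /* Reap dead RPCs in small batches, yielding between batches so that
       * SoftIRQ handlers and other threads on this core are not starved
       * for milliseconds. Illustrative sketch. */
      void reap_dead_rpcs(struct homa_sock *hsk)
      {
              int i;

              while (has_dead_rpcs(hsk)) {
                      for (i = 0; i < REAP_BATCH && has_dead_rpcs(hsk); i++)
                              reap_one_rpc(hsk);
                      /* Kernel code is not preempted, so explicitly give
                       * other work a chance to run. */
                      schedule();
              }
      }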
* (Mar. 2020) For the slowdown tests, the --port-max value needs to be
pretty high to get true Poisson behavior. It was originally 20, but
increasing it had significant impact on performance for TCP, particularly
for short-message workloads. For example, TCP P99 slowdown for W1 increased
  from 15x to 170x when --port-max increased from 20 to 100. Performance
got even worse at --port-max=200, but I decided to stick with 100 for now.
* (Mar. 2020) Having multiple threads receiving on a single port makes a
big difference in tail latency. cperf had been using just one receiver
thread for each port (on both clients and servers); changing to
multiple threads reduced P50/P99 slowdown for small messages in W5
from 7/65 to 2.5/7.5!
* Performance suffers from a variety of load balancing problems. Here
are some examples:
* (March 2020) Throughput varies by 20% from run to run when a single client
sends 500KB messages to a single server. In this configuration, all
packets arrive through a single NAPI core, which is fully utilized.
However, if Linux also happens to place other threads on that core (such
as the pacer) it takes time away from NAPI, which reduces throughput.
* (March 2020) When analyzing tail latency for small messages in W5, I found
that user threads are occasionally delayed 100s of microseconds in waking
up to handle a message. The problem is that both the NAPI and SoftIRQ
  threads happened (randomly) to get busy on that core at the same time,
and they completely monopolized the core.
* (March 2020) Linux switches threads between cores very frequently when
threads sleep (2/3 of the time in experiments today).
* (Feb. 2020) The pacer can potentially be a severe performance bottleneck
(a single thread cannot keep the network utilized with packets that are
not huge). In a test with 2 clients bombarding a single server with
1000-byte packets, performance started off high but then suddenly dropped
by 10x. There were two contributing factors. First, once the pacer got
involved, all transmissions had to go through the pacer, and the pacer
became the bottleneck. Second, this resulted in growth of the throttle
queue (essentially all standing requests: > 300 entries in this experiment).
Since the queue is scanned from highest to lowest priority, every insertion
had to scan the entire queue, which took about 6 us. At this point the queue
  lock became the bottleneck, resulting in a 10x drop in performance.
I tried inserting RPCs from the other end of the throttle queue, but
this still left a 2x reduction in throughput because the pacer couldn't
keep up. In addition, it seems like there could potentially be situations
where inserting from the other end results in long searches. So, I backed
this out.
The solution was to allow threads other than the pacer to transmit packets
even if there are entries on the throttle queue, as long as the NIC queue
isn't long. This allows other threads besides the pacer to transmit
packets if the pacer can't keep up. In order to avoid pacer starvation,
the pacer uses a modified approach: if the NIC queue is too full for it to
transmit a packet immediately, it computes the time when it expects the
NIC queue to get below threshold, waits until that time arrives, and
then transmits; it doesn't check again to see if the NIC queue is
actually below threshold (which it may not be if other threads have
also been transmitting). This guarantees that the pacer will make progress.
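  A sketch of the pacer's modified waiting behavior (simplified; names and
  fields are illustrative, not Homa's actual pacer code):

      /* Transmit one packet from the throttle queue. nic_idle_ns is the
       * estimated time at which the NIC queue will drain; max_queue_ns is
       * the maximum queue length (in time units) the pacer tolerates. */
      void pacer_xmit_one(struct pacer *p, struct packet *pkt)
      {
              uint64_t now = time_ns();

              if (p->nic_idle_ns > now + p->max_queue_ns) {
                      /* Queue too long: wait until it is expected to drop
                       * below the threshold, then send without rechecking
                       * (other threads may have transmitted meanwhile, but
                       * this guarantees the pacer makes progress). */
                      wait_until(p->nic_idle_ns - p->max_queue_ns);
                      now = time_ns();
              }
              transmit_packet(pkt);
              if (p->nic_idle_ns < now)
                      p->nic_idle_ns = now;
              p->nic_idle_ns += packet_xmit_ns(pkt);
      }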
* The socket lock is a throughput bottleneck when a multi-threaded server
is receiving large numbers of small requests. One problem was that the
lock was being acquired twice while processing a single-packet incoming
request: once during RPC initialization to add the RPC to active_rpcs,
  and again later to dispatch the RPC to a server thread. Restructured
the code to do both of these with a single lock acquisition. Also
cleaned up homa_wait_for_message to reduce the number of times it
acquires socket locks. This produced the following improvements, measured
with one server (--port_threads 8) and 3 clients (--workload 100 --alt_client
--client_threads 20):
* Throughput increased from 650 kops/sec to 760
* socket_lock_miss_cycles dropped from 318% to 193%
* server_lock_miss_cycles dropped from 1.4% to 0.7%
* Impact of load balancing on latency (xl170, 100B RPCs, 11/2019):
1 server thread 18 threads TCP, 1 thread TCP, 18 threads
No RPS/RFS 16.0 us 16.3 us 20.0 us 25.5 us
RPS/RFS enabled 17.1 us 21.5 us 21.9 us 26.5 us
* It's better to queue a thread waiting for incoming messages at the *front*
of the list in homa_wait_for_message, rather than the rear. If there is a
pool of server threads but not enough load to keep them all busy, it's
better to reuse a few threads rather than spreading work across all of
  them; this produces better cache locality. This approach improves latency
by 200-500ns at low loads.
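  In terms of the kernel list API, the change amounts to queuing the waiter
  with list_add (head) rather than list_add_tail; a sketch with illustrative
  structures:

      #include <linux/list.h>

      struct waiter {
              struct list_head links;
              /* ... identity of the waiting thread ... */
      };

      static void enqueue_waiter(struct list_head *waiters, struct waiter *w)
      {
              /* Add at the front so the most recently active thread is
               * reused first, keeping its cache state warm; the old
               * behavior was list_add_tail(&w->links, waiters). */
              list_add(&w->links, waiters);
      }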
* Problem: large messages have no pipelining. For example, copying bytes
from user space to output buffers is not overlapped with sending packets,
and copying bytes from buffers to user space doesn't start until the
entire message has been received.
* Tried overlapping packet transmission with packet creation (7/2019) but
this made performance worse, not better. Not sure why.
* It is hard for the pacer to keep the uplink fully utilized, because it
gets descheduled for long periods of time.
* Tried disabling interrupts while the pacer is running, but this doesn't
work: if a packet gets sent with interrupts disabled, the interrupts get
reenabled someplace along the way, which can lead to deadlock. Also,
the VLAN driver uses "interrupts off" as a signal that it should enter
polling mode, which doesn't work.
* Tried calling homa_pacer_xmit from multiple places; this helps a bit
(5-10%).
* Tried making the pacer thread a high-priority real-time thread; this
actually made things a bit worse.
* There can be a long lag in sending grants. One problem is that Linux
tries to collect large numbers of buffers before invoking the softirq
handler; this causes grants to be delayed. Implemented max_gro_skbs to
limit buffering. However, varying the parameter doesn't seem to affect
throughput (11/13/2019).
* Without RPS enabled, Homa performance is limited by a single core handling
all softirq actions. In order for RPS to work well, Homa must implement
its own hash function for mapping packets to cores (the default IP hasher
  doesn't know about Homa ports, so it considers only the peer IP address).
However, with RPS, packets can get spread out over too many cores, which
causes poor latency when there is a single client and the server is
underloaded.
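  A sketch of what a Homa-aware flow hash has to do differently (illustrative
  only; the port fields and mixing function are placeholders, not Homa's
  actual header layout or hash):

      #include <stdint.h>

      /* Mix Homa's source/destination ports into the flow hash so that RPS
       * can spread traffic from a single peer across multiple cores; the
       * default IP hasher stops after the addresses. */
      static uint32_t homa_flow_hash(uint32_t saddr, uint32_t daddr,
                                     uint16_t sport, uint16_t dport)
      {
              uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);

              /* Simple integer mix; any reasonable hash works here. */
              h ^= h >> 16;
              h *= 0x7feb352d;
              h ^= h >> 15;
              h *= 0x846ca68b;
              h ^= h >> 16;
              return h;
      }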