Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Neigh refresh #1043

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Neigh refresh #1043

wants to merge 3 commits into from

Conversation

LaveenBrcm
Copy link

@LaveenBrcm LaveenBrcm commented Aug 1, 2022

Neighbor Refresh HLD.

Neighbor refresh proposal to replace existing arp_update script [ sonic-buildimage/files/scripts/arp_update ]

Repo PR Title State
sonic-swss Neighbor Cache Implementation
sonic-swss-commmon Neighbor Cache Implementation
sonic-swss Interface Cache Implementation
sonic-swss Refresh Timer

Neighbor Refresh HLD Initial Revision
Images for Neighbor Refresh HLD
@adyeung
Copy link
Collaborator

adyeung commented Aug 17, 2022

/easycla

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Aug 17, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.


Req-3:

MAC table entry in SONiC can be administratively cleared or could be removed due to control protocol handling (mclag mac delete on orphan ports during mclag peer reboot). Once the MAC is removed by external triggers then traffic could be dropped up untill the next neighbor refresh which could be in minutes. In order to avoid such traffic drop, if a MAC is removed, neighbor entry should be refreshed as early as possible to avoid blackholing of traffic .
Copy link
Collaborator

@venkatmahalingam venkatmahalingam Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When STP is enabled, there could be a possibility of seeing multiple topology changes before it settles down, how do we handle so many FDB flushes leading to adding same set of nbrs to the Q for refresh? do we hold on the refresh until the topology settles and then refresh all the nbrs or we refresh the MAC associated nbrs as and when the MAC is flushed in the HW? there is a CoPP limit for ARP/Nbr Rx in the HW, did we hit this limit in the testing i.e CoPP rule dropped the packets in the HW? if we dont receive the response, kernel takes care of re-sending the packets few times but there can be scenarios where user intervention is required to refresh the nbrs manually beyond the configured limit. Did we hit those scenarios during testing?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When STP is enabled, there could be a possibility of seeing multiple topology changes before it settles down, how do we handle so many FDB flushes leading to adding same set of nbrs to the Q for refresh?

If we can deduce STP convergence for a Vlan, the same can be used to avoid adding the corresponding nbrs to refresh Q, in the absence of which refreshing the nbrs will help in faster convergence.

do we hold on the refresh until the topology settles and then refresh all the nbrs or we refresh the MAC associated nbrs as and when the MAC is flushed in the HW?

In the current design, Nbrs are refreshed as and when the MAC is flushed.

there is a CoPP limit for ARP/Nbr Rx in the HW, did we hit this limit in the testing i.e CoPP rule dropped the packets in the HW?

Refresh timer has a Tx threshold which is hardcoded, this can be made to map the corresponding CoPP rule limit.

if we dont receive the response, kernel takes care of re-sending the packets few times but there can be scenarios where user intervention is required to refresh the nbrs manually beyond the configured limit. Did we hit those scenarios during testing?

Retry is in Nbrmgrd and is based on whether the MAC is learned or not, number of retry is 10 with 2 seconds apart. We did not hit those scenarios during testing.

SONiC depends upon the Linux kernel to manage the ARP/ND tables. SONiC then listens to ARP/ND events from the kernel and synchronizes the hardware as required. Some of the limitations with this are as follows,

- The kernel does not "see" the routed (in HW) through-traffic, and so cannot update its "hit bits" accordingly. Therefore the kernel may age out an entry that is still in use.
- The kernel does not "see" the HW MAC aging process, and so does not know that a MAC address associated with an ARP/ND entry has been aged out, and so does not refresh it. This can result in traffic black holes for a "quiet" neighbor (i.e. one that does not transmit much) / unidirectional traffic streams.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when the MAC is aged out in the kernel? do we refresh the ARP/Nbr to re-learn the MAC?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is recommended to have the same MAC age out time in kernel and the HW.

Nbrmgrd will be listing to state FDB DB changes, if there is no change in state FDB DB for kernel ageout, ARP/Nbr will not be refreshed.


- FDB Cache is to handle quick refresh/retry of neighbors whose mac address is removed admistratively or due to control protocol clear
- It has the list of neighbors(v4/v6) using the same mac on a VLAN interface. When the mac address is deleted, all the neighbors associated with the mac will be added to a fdb refresh queue.
- Entries in FDB refresh queue will be refreshed faster than the usual neighbor refresh interval and will be removed from the queue after 10 retries.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we report an error in this case i.e after 10 retries? during the STP topology settlement, 10 retries may not be sufficient, if we can present what is tested with this feature, that would act as the reference for the community.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes there will be a syslog NOTICE/ERROR msg.

Sure, will verify with STP topology and update the recommended scale / guidelines.


- Traverse the neighbor Cache entries periodically (every 30 secs)
- Check refresh timeout has elapsed for every neighbors
- If elapsed then send ARP/NS packet
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this unicast or broadcast ARP or multicast NS?

Copy link
Collaborator

@venkatmahalingam venkatmahalingam Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the MAC is not refreshed in the kernel, there can be a possibility flooding the packets on all members of the VLAN, how are we avoiding this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the MAC is known then the Neighbor refresh will be unicast, if the MAC is not know then the neighbor refresh will be broadcast ARP /multicast NS.

As long as we have same MAC aging time in both the Kernel and the HW, we should not have this issue.


# 6 Unit Test

1. Verify ARP/ND entry is refreshed periodically before the ARP/ND entry agesout in kernel (or neighbor entries state changes from REACHABLE to STALE)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have the kernel patch to refresh during reachable to stale transition or nbr refresh module takes care of this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nbr refresh module takes care of this, expectation is that refresh should happen even before the kernel state transition from reachable to stale

# 6 Unit Test

1. Verify ARP/ND entry is refreshed periodically before the ARP/ND entry agesout in kernel (or neighbor entries state changes from REACHABLE to STALE)
2. Verify ARP/ND entry learned on VLAN interfaces is refreshed before their corresponding MAC ageout in HW
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering how are we making sure that ARP refresh is triggered before MAC ageout in the HW? when the traffic is active in the HW, we may not receive age-out notification to the CPU, clarify if we have anything in place to map HW MAC aliveness with Nbr refresh time in the kernel.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refresh timeout is calculated as a MIN (NBR_AGEOUT, FDB_AGEOUT), this makes sure that the refresh pkt is sent before the HW MAC aging (if MAC age < ARP/ND age).


1. Verify ARP/ND entry is refreshed periodically before the ARP/ND entry agesout in kernel (or neighbor entries state changes from REACHABLE to STALE)
2. Verify ARP/ND entry learned on VLAN interfaces is refreshed before their corresponding MAC ageout in HW
3. Verify ARP/ND entry is refreshed within few seconds if their corresponding MAC in FDB table is deleted.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there is a CoPP limit for ARP/Nbr RX from HW, what if we send more than the allowed Rx limit? how do we handle the retries?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nbr refresh module has a Tx threshold, ideally this tx threshold should be lower than the CoPP Rx limit.

- All Static neighbor entries (MAC can be dynamic)
- Following Entries will not be added to the neighbor cache
- Neighbors learned from “eth0” interface
- Neighbors learned from BGP/EVPN MAC/IP type-2 route
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hope when the MAC becomes remote to local, we add the associated ARP/nbrs into the refresh Q.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remote EVPN MAC is a static / PERMANENT entry.

When the Host moves / MAC becomes local, MAC is learned local and the Neighbors will start using this MAC, So these nbrs are not added to refresh Q.


- Add neighbor entries to cache when the entry is added to NEIGH_REFRESH_TABLE
- All Dynamically learned neighbor entries [ARP, ND (Global, LinkLocal)]
- All Static neighbor entries (MAC can be dynamic)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why do we need to support this case, what if we get different than configured MAC from the peer as part of the refresh? cant we mandate MAC also be static, any use-case?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a special case and not a regular use-case scenario.

Static Neigh and Static MAC are independent, hence it is possible to have one without the other.

Nbr refresh module will look for the MAC in the static neighbor config, hence even if the ARP response is received with different MAC (misconfig), kernel will not accept the pkt as the neighbor is PERMANENT and nbr refresh module will retry until the MAC given in the static neighbor config is learned.

- During Warmboot, MAC aging will be disabled and will be enabled back after the system is restared. Hence traffic blackholing due to MAC ageout will not happen. When the system boots up, ARP/ND refresh cache table will be populated again and then the neighbor entries will be refreshed as normal.

## 1.5 Limitiations

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when the ARP/Nbr is flushed in the kernel by the user? if any ARP/Nbr associated with the route, do we leave those routes to drop the packets in the HW or refresh the those nbrs alone?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Orchagent will not delete the neighbor if there is a route associated, hence the HW will continue to forward the packet until the route is removed and then the neighbor will be removed from HW.


Req-2:

In case of Vlan interfaces, for an neighbor entry, in addition to the neighbor and nexthop entry programmed in ASIC, FDB entry corresponding to the resolved MAC to outgoing interface will be programmed. Without the FDB entry packets may not be forwarded in HW. Since the FDB resources are limited its quite normal to ageout an unsed FDB entry in HW. If the FDB entry ages out before the neighbor ageout then packets will be dropped in ASIC. So the neighbor refresh should ideally happen before the MAC/FDB ageout.
Copy link
Collaborator

@venkatmahalingam venkatmahalingam Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the FDB entry packets may not be forwarded in HW.

I believe when the FDB entry is not there, we expect to flood the IP packets in the HW i.e when nbr and FDB are clearly separated, is this not true as per SAI?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, For routed traffic over Vlan interface, without FDB, pkts will not be forwarded.

; New table
; Has the ARP/ND entries that needs to be refreshed periodically

key = NEIGH_REFRESH_TABLE:ifname:IPprefix
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IPPrefix or full IPv4/IPv6 address?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full IPv4/IPv6 address

3) "family"
4) "IPv4"
5) "state"
6) "2"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why cant we have the meaningful state names instead of numbers?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The state number directly map with kernel states like NUD_REACHABLE, else there will be string validation rather than int. If for any reason if the state has bitwise ORed values it will be directly validated in corresponding subscriber modules.

- Interface IP address will be used in building the ARP (Sender IP address) and NS (Soure IP) packets
- Subscribe to redis-db tables
- IP address
- CONFIG_DB: INTERFACE, VLAN_INTERFACE, SAG_INTERFACE, PORTCHANNEL_INTERFACE
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the regular L3 interfaces (non-VLAN interfaces), why cant we patch the kernel to refresh the nbr when the state is transitioned from reachable to stale? do we need to add those nbrs as well into the refresh Q. Kernel level refresh may be faster and efficient. any thoughts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way is to change the state transition from reachable to stale to reachable to delay/probe, so that the kernel state machine takes care of pkt tx, retry etc. However this introduces state transition which further have to be processed by neighsyncd.

With scale, if kernel does the refresh there could be possible large number of tx and corresponding copp rx limit mismatch.

Building a local neighbor cache in one of the swss modules will help add additional functionalities other than neighbor refresh in future.

@zhangyanzhao
Copy link
Collaborator

From SONiC community meeting, team wants to understand deeper on the motivation to introduce this feature (why part). @LaveenBrcm will check with @prsunny offline.

@zhangyanzhao
Copy link
Collaborator

added to 202305 release plan

@vboykox
Copy link
Member

vboykox commented Aug 12, 2024

using NTF_MANAGED instead of NTF_USE might help with refreshing

commit 7482e3841d520a368426ac196720601687e2dc47
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Mon Oct 11 14:12:38 2021 +0200

    net, neigh: Add NTF_MANAGED flag for managed neighbor entries

    Allow a user space control plane to insert entries with a new NTF_EXT_MANAGED
    flag. The flag then indicates to the kernel that the neighbor entry should be
    periodically probed for keeping the entry in NUD_REACHABLE state iff possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: MovedToBacklog
Development

Successfully merging this pull request may close these issues.

5 participants