-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Neigh refresh #1043
base: master
Are you sure you want to change the base?
Neigh refresh #1043
Conversation
Neighbor Refresh HLD Initial Revision
Images for Neighbor Refresh HLD
/easycla |
|
||
Req-3: | ||
|
||
MAC table entry in SONiC can be administratively cleared or could be removed due to control protocol handling (mclag mac delete on orphan ports during mclag peer reboot). Once the MAC is removed by external triggers then traffic could be dropped up untill the next neighbor refresh which could be in minutes. In order to avoid such traffic drop, if a MAC is removed, neighbor entry should be refreshed as early as possible to avoid blackholing of traffic . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When STP is enabled, there could be a possibility of seeing multiple topology changes before it settles down, how do we handle so many FDB flushes leading to adding same set of nbrs to the Q for refresh? do we hold on the refresh until the topology settles and then refresh all the nbrs or we refresh the MAC associated nbrs as and when the MAC is flushed in the HW? there is a CoPP limit for ARP/Nbr Rx in the HW, did we hit this limit in the testing i.e CoPP rule dropped the packets in the HW? if we dont receive the response, kernel takes care of re-sending the packets few times but there can be scenarios where user intervention is required to refresh the nbrs manually beyond the configured limit. Did we hit those scenarios during testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When STP is enabled, there could be a possibility of seeing multiple topology changes before it settles down, how do we handle so many FDB flushes leading to adding same set of nbrs to the Q for refresh?
If we can deduce STP convergence for a Vlan, the same can be used to avoid adding the corresponding nbrs to refresh Q, in the absence of which refreshing the nbrs will help in faster convergence.
do we hold on the refresh until the topology settles and then refresh all the nbrs or we refresh the MAC associated nbrs as and when the MAC is flushed in the HW?
In the current design, Nbrs are refreshed as and when the MAC is flushed.
there is a CoPP limit for ARP/Nbr Rx in the HW, did we hit this limit in the testing i.e CoPP rule dropped the packets in the HW?
Refresh timer has a Tx threshold which is hardcoded, this can be made to map the corresponding CoPP rule limit.
if we dont receive the response, kernel takes care of re-sending the packets few times but there can be scenarios where user intervention is required to refresh the nbrs manually beyond the configured limit. Did we hit those scenarios during testing?
Retry is in Nbrmgrd and is based on whether the MAC is learned or not, number of retry is 10 with 2 seconds apart. We did not hit those scenarios during testing.
SONiC depends upon the Linux kernel to manage the ARP/ND tables. SONiC then listens to ARP/ND events from the kernel and synchronizes the hardware as required. Some of the limitations with this are as follows, | ||
|
||
- The kernel does not "see" the routed (in HW) through-traffic, and so cannot update its "hit bits" accordingly. Therefore the kernel may age out an entry that is still in use. | ||
- The kernel does not "see" the HW MAC aging process, and so does not know that a MAC address associated with an ARP/ND entry has been aged out, and so does not refresh it. This can result in traffic black holes for a "quiet" neighbor (i.e. one that does not transmit much) / unidirectional traffic streams. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when the MAC is aged out in the kernel? do we refresh the ARP/Nbr to re-learn the MAC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is recommended to have the same MAC age out time in kernel and the HW.
Nbrmgrd will be listing to state FDB DB changes, if there is no change in state FDB DB for kernel ageout, ARP/Nbr will not be refreshed.
|
||
- FDB Cache is to handle quick refresh/retry of neighbors whose mac address is removed admistratively or due to control protocol clear | ||
- It has the list of neighbors(v4/v6) using the same mac on a VLAN interface. When the mac address is deleted, all the neighbors associated with the mac will be added to a fdb refresh queue. | ||
- Entries in FDB refresh queue will be refreshed faster than the usual neighbor refresh interval and will be removed from the queue after 10 retries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we report an error in this case i.e after 10 retries? during the STP topology settlement, 10 retries may not be sufficient, if we can present what is tested with this feature, that would act as the reference for the community.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes there will be a syslog NOTICE/ERROR msg.
Sure, will verify with STP topology and update the recommended scale / guidelines.
|
||
- Traverse the neighbor Cache entries periodically (every 30 secs) | ||
- Check refresh timeout has elapsed for every neighbors | ||
- If elapsed then send ARP/NS packet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this unicast or broadcast ARP or multicast NS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the MAC is not refreshed in the kernel, there can be a possibility flooding the packets on all members of the VLAN, how are we avoiding this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the MAC is known then the Neighbor refresh will be unicast, if the MAC is not know then the neighbor refresh will be broadcast ARP /multicast NS.
As long as we have same MAC aging time in both the Kernel and the HW, we should not have this issue.
|
||
# 6 Unit Test | ||
|
||
1. Verify ARP/ND entry is refreshed periodically before the ARP/ND entry agesout in kernel (or neighbor entries state changes from REACHABLE to STALE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have the kernel patch to refresh during reachable to stale transition or nbr refresh module takes care of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nbr refresh module takes care of this, expectation is that refresh should happen even before the kernel state transition from reachable to stale
# 6 Unit Test | ||
|
||
1. Verify ARP/ND entry is refreshed periodically before the ARP/ND entry agesout in kernel (or neighbor entries state changes from REACHABLE to STALE) | ||
2. Verify ARP/ND entry learned on VLAN interfaces is refreshed before their corresponding MAC ageout in HW |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering how are we making sure that ARP refresh is triggered before MAC ageout in the HW? when the traffic is active in the HW, we may not receive age-out notification to the CPU, clarify if we have anything in place to map HW MAC aliveness with Nbr refresh time in the kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The refresh timeout is calculated as a MIN (NBR_AGEOUT, FDB_AGEOUT), this makes sure that the refresh pkt is sent before the HW MAC aging (if MAC age < ARP/ND age).
|
||
1. Verify ARP/ND entry is refreshed periodically before the ARP/ND entry agesout in kernel (or neighbor entries state changes from REACHABLE to STALE) | ||
2. Verify ARP/ND entry learned on VLAN interfaces is refreshed before their corresponding MAC ageout in HW | ||
3. Verify ARP/ND entry is refreshed within few seconds if their corresponding MAC in FDB table is deleted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there is a CoPP limit for ARP/Nbr RX from HW, what if we send more than the allowed Rx limit? how do we handle the retries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nbr refresh module has a Tx threshold, ideally this tx threshold should be lower than the CoPP Rx limit.
- All Static neighbor entries (MAC can be dynamic) | ||
- Following Entries will not be added to the neighbor cache | ||
- Neighbors learned from “eth0” interface | ||
- Neighbors learned from BGP/EVPN MAC/IP type-2 route |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hope when the MAC becomes remote to local, we add the associated ARP/nbrs into the refresh Q.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remote EVPN MAC is a static / PERMANENT entry.
When the Host moves / MAC becomes local, MAC is learned local and the Neighbors will start using this MAC, So these nbrs are not added to refresh Q.
|
||
- Add neighbor entries to cache when the entry is added to NEIGH_REFRESH_TABLE | ||
- All Dynamically learned neighbor entries [ARP, ND (Global, LinkLocal)] | ||
- All Static neighbor entries (MAC can be dynamic) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why do we need to support this case, what if we get different than configured MAC from the peer as part of the refresh? cant we mandate MAC also be static, any use-case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a special case and not a regular use-case scenario.
Static Neigh and Static MAC are independent, hence it is possible to have one without the other.
Nbr refresh module will look for the MAC in the static neighbor config, hence even if the ARP response is received with different MAC (misconfig), kernel will not accept the pkt as the neighbor is PERMANENT and nbr refresh module will retry until the MAC given in the static neighbor config is learned.
- During Warmboot, MAC aging will be disabled and will be enabled back after the system is restared. Hence traffic blackholing due to MAC ageout will not happen. When the system boots up, ARP/ND refresh cache table will be populated again and then the neighbor entries will be refreshed as normal. | ||
|
||
## 1.5 Limitiations | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when the ARP/Nbr is flushed in the kernel by the user? if any ARP/Nbr associated with the route, do we leave those routes to drop the packets in the HW or refresh the those nbrs alone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Orchagent will not delete the neighbor if there is a route associated, hence the HW will continue to forward the packet until the route is removed and then the neighbor will be removed from HW.
|
||
Req-2: | ||
|
||
In case of Vlan interfaces, for an neighbor entry, in addition to the neighbor and nexthop entry programmed in ASIC, FDB entry corresponding to the resolved MAC to outgoing interface will be programmed. Without the FDB entry packets may not be forwarded in HW. Since the FDB resources are limited its quite normal to ageout an unsed FDB entry in HW. If the FDB entry ages out before the neighbor ageout then packets will be dropped in ASIC. So the neighbor refresh should ideally happen before the MAC/FDB ageout. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the FDB entry packets may not be forwarded in HW.
I believe when the FDB entry is not there, we expect to flood the IP packets in the HW i.e when nbr and FDB are clearly separated, is this not true as per SAI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, For routed traffic over Vlan interface, without FDB, pkts will not be forwarded.
; New table | ||
; Has the ARP/ND entries that needs to be refreshed periodically | ||
|
||
key = NEIGH_REFRESH_TABLE:ifname:IPprefix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IPPrefix or full IPv4/IPv6 address?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Full IPv4/IPv6 address
3) "family" | ||
4) "IPv4" | ||
5) "state" | ||
6) "2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why cant we have the meaningful state names instead of numbers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The state number directly map with kernel states like NUD_REACHABLE, else there will be string validation rather than int. If for any reason if the state has bitwise ORed values it will be directly validated in corresponding subscriber modules.
- Interface IP address will be used in building the ARP (Sender IP address) and NS (Soure IP) packets | ||
- Subscribe to redis-db tables | ||
- IP address | ||
- CONFIG_DB: INTERFACE, VLAN_INTERFACE, SAG_INTERFACE, PORTCHANNEL_INTERFACE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the regular L3 interfaces (non-VLAN interfaces), why cant we patch the kernel to refresh the nbr when the state is transitioned from reachable to stale? do we need to add those nbrs as well into the refresh Q. Kernel level refresh may be faster and efficient. any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way is to change the state transition from reachable to stale to reachable to delay/probe, so that the kernel state machine takes care of pkt tx, retry etc. However this introduces state transition which further have to be processed by neighsyncd.
With scale, if kernel does the refresh there could be possible large number of tx and corresponding copp rx limit mismatch.
Building a local neighbor cache in one of the swss modules will help add additional functionalities other than neighbor refresh in future.
From SONiC community meeting, team wants to understand deeper on the motivation to introduce this feature (why part). @LaveenBrcm will check with @prsunny offline. |
added to 202305 release plan |
using
|
Neighbor Refresh HLD.
Neighbor refresh proposal to replace existing arp_update script [ sonic-buildimage/files/scripts/arp_update ]