-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIB Suppress Announcements of Routes Not Installed in HW #1103
FIB Suppress Announcements of Routes Not Installed in HW #1103
Conversation
Signed-off-by: stepanblyschak <stepanb@nvidia.com>
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@StormLiangMS could you please help to review and approve? |
```abnf | ||
; DEVICE_METADATA table | ||
key = DEVICE_METADATA|localhost ; Device metadata configuration table | ||
suppress-fib-pending = "enabled"/"disabled" ; Globally enable/disable routes announcement suppression feature, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this controls only for BGP routes, please be explicit in the field name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is implemented in fpmsyncd/zebra which does not care if it receives BGP or other protocol routes, so, if SONiC supports, the same feature could be applied to other routing protocols.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we talk only about BGP routes in the HLD, I thought it is better to have a name like "suppress-bgp-routes-fib-pending" and handle only BGP routes case. If you think, that's not the case and you'll enable support for all route types (BGP, Static, OSPF..etc), we can have the generic field name (suppress-fib-pending)
|
||
Example data: | ||
``` | ||
127.0.0.1:6379[14]> hgetall "ROUTE_TABLE:193.5.96.0/25" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why dont we have 'vrf' as the key in the ROUTE_TABLE? we could be supporting only 'default' VRF now, but it's important to have the VRF as the key for future enhancements with data VRF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any plan for bulking the routes in the APPL STATE_DB ROUTE_TABLE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why dont we have 'vrf' as the key in the ROUTE_TABLE? we could be supporting only 'default' VRF now, but it's important to have the VRF as the key for future enhancements with data VRF.
This example is for default VRF. The key in APPL_STATE_DB is the same as in APPL_DB, with VRF included (if not default), so VRF is supported.
Any plan for bulking the routes in the APPL STATE_DB ROUTE_TABLE?
The bulking is done by using redis pipeline for ResponsePublisher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ROUTE_TABLE in APPL_DB is legacy and we did not have VRF support originally and when we added the VRF support, we did not want to disrupt the existing schema and added the VRF key only for user VRF cases but APPL STATE_DB ROUTE_TABLE is a new schema, not sure why we still need to follow ROUTE_TABLE (APPL_DB) approach rather than having the VRF as the key by default even for 'default' VRF case as well. IMO. having VRF as one of the key by default is cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why ROUTE_TABLE in APPL_DB is legacy? Is there a new table with a new schema?
Here's the latest master and ROUTE_TABLE entry in APPL_DB:
127.0.0.1:6379> keys "ROUTE_TABLE:193.8.224.0/25"
1) "ROUTE_TABLE:193.8.224.0/25"
If it were a user created VRF route the key would be in a format :.
The idea of APPL_STATE_DB is to reflect the same schema that APPL_DB has.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea of APPL_STATE_DB is to reflect the same schema that APPL_DB has.
APPL_DB:
------------
Default VRF: ROUTE_TABLE:<prefix>
Non-default VRF: ROUTE_TABLE:<vrf>:<prefix>
IMO, having the below schema helps to cleanly get the VRF (e.g default) for every route instead of assuming if the VRF is not there in the key, the route is meant for default route, we don't need to follow APPL_DB ROUTE_TABLE schema for new tables but I leave it to you to decide on the final schema.
APPL_STATE_DB:
---------------------
All VRFs including default VRF: ROUTE_TABLE:<vrf>:<prefix>
|
||
### 3. Overview | ||
|
||
As of today, SONiC BGP advertises learnt prefixes regardless of whether these prefixes were successfully programmed into ASIC. While route programming failure is followed by orchagent crash and all services restart, even for successfully created routes there is a short period of time when the peer will be black holing traffic. Also, in the following scenario, a credit loop occurs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When there are multiple updates for the given route (there are NH changes) from fpmsyncd to route-orch, route-orch will send the response to every route changes after ASIC programming? or it will send the response to the latest route-update to fpmsyncd? please clarify.
Also, what happens in the case where we get the route-del from zebra before updating the RTM_NEWROUTE message with RTM_F_OFFLOAD flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Orchagent will send a response for every route update it gets from APPL_DB. As far as I know there is a logic inside ProducerStateTable/ConsumerStateTable that will consolidate multiple SET/DEL operations for a single key which makes orchagent get the most up to date requested field values for a route entry. FRR, on the other hand, currently is only interested in the first route creation status. It does not care about the installation status after the route next hop or other attributes have changed.
If zebra sends RTM_DELROUTE before fpmsyncd sends RTM_NEWROUTE with RTM_F_OFFLOAD, zebra will ignore it, considering it as a stale update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to know, thanks for the clarification.
2. Withdraw 10 prefix to DUT through *exabgp* from T0 Arista VM | ||
3. Once reached the required number of cycles the loop breaks after first step | ||
4. Consistency check is applied, there are no withdrawn routes and announced routes were successfully installed and correctly marked as offloaded in zebra (In case a race condition happens or a notification is missed for some reason this test will try to catch it) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we have any test case to simulate orchagent crash and delay the response to fpmsyncd from route-orch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case orchagent crash there will be no response from routeorch. Which tests case do you propose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will happen if fpmsyncd doesnot notify RTM_F_OFFLOAD to Zebra due to orchagent crash? Is there any timeout in Zebra to declare FIB programming failed? what would happen for those routes in Zebra that are waiting for the FIB programming after OrchAgent back up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no logic in zebra that considers a route beeing failed in HW if it did not receive RTM_F_OFFLOAD flag within some time interval. These routes will stay in queue-ed state.
Orchagent crash itself triggers BGP container(zebra + bgpd) restart. So in that case, zebra starts fresh with no routes.
|
||
Routing is a crucial part of a network switch. This feature adds additional flows in existing route processing pipeline and so along the way there might be unexpected failures leading to routes being not marked as offloaded in zebra - missed notification, race conditions leading to in-consistency, etc. It is required to monitor the consistency of routes state periodically and mitigate problems. | ||
|
||
At the moment *route_check.py* verifies routes between APPL_DB & ASIC_DB and makes sure they are in sync. In addition to that it is required to extend this script to check for consistency APPL_DB, APPL_STATE_DB and zebra by calling ```show ip route json``` and verify every route installed in ASIC_DB is marked as offloaded. The script will retry the check for 3 times every 15 sec and if zebra FIB is not in sync with ASIC_DB a mitigation action is performed. The mitigation action publishes required notifications to trigger fpmsyncd flows to send RTM_NEWROUTE message to zebra and log alerts in the syslog. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There can be Static/OSPF/ISIS route updates apart from BGP routes as well, hope you'll allow offloaded/not-offloaded routes to be present in the DB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please clarify your comment? What do you mean by allowing offloaded/not-offloaded routes to be present in the DB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, this HLD talks about adding "suppress-fib-pending" support only for BGP routes, what happens if we have other protocol routes (e.g OSPF) in the switch, what would be the behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would assume OSPF to implement a similar logic. We dropped the "bgp" word in configuration flag because from our point of view there's nothing related to BGP in these new flows. It is on the level of zebra/fpmsyncd/orchagent and they do not care which protocol the route they are working with is originated.
|
||
During fast reboot BGP session is closed by SONiC device without the notification. BGP session is preserved in graceful restart mode. BGP routes on the peer are still active, because nexthop interfaces are up. Once interfaces go down, received BGP routes on the peer are removed from the routing table. Nothing is sent to SONiC device since then. After interfaces go up and BGP sessions re-establish the peer's BGP re-learns advertised routes. | ||
|
||
Due to additional response publishing in orchagent there might be a slight delay in fast reboot reconfiguration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the fast boot time with this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Must be less than 30 sec down time
@StormLiangMS @prsunny please provide your inputs. If no additional feedback please approve. |
The feature and its current content is expected to be in 202305 and under discussion if to take into 202211 as originally aimed. |
@venkatmahalingam , could you please sign off? |
@stepanblyschak , can you please update the Sonic-mgmt PR to the description? |
@abdosi , @arlakshm , @judyjoseph for viz |
Two test PRs are still open. Need be merged @echuawu |
@zhangyanzhao , we are handling the comments from @StormLiangMS, and investigating the possiblity of implementing performance test in sonic-mgmt. |
For stress and performance test implementation, the implementation plan is under discussion, add a PR to track this requirement: sonic-net/sonic-mgmt#8409 |
@echuawu could you put this to the description of this PR. |
@stepanblyschak , please help do that as @StormLiangMS suggested. |
@echuawu @StormLiangMS Added |
This document describes a feedback mechanism that allows BGP not to advertise routes that haven't been programmed yet.
For stress and performance test implementation, the implementation plan is under discussion, add a PR to track this requirement: