-
Notifications
You must be signed in to change notification settings - Fork 525
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[EVPN]Handling error scenarios during route programming and IMR add #2670
Conversation
@srj102 Can you please review? |
@prsunny can you please help to review/approve? |
@@ -2376,6 +2376,13 @@ bool EvpnRemoteVnip2pOrch::addOperation(const Request& request) | |||
return false; | |||
} | |||
|
|||
VRFOrch* vrf_orch = gDirectory.get<VRFOrch*>(); | |||
if (vrf_orch->isL3VniVlan(vni_id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- wouldn't this check is applicable for fdborch also ?
- If the vni became an L3VNI after IMR routes were installed then OA will remove VLAN-VNI mappings and add the VRF-VNI mapping, leaving us with a scenario where the SAI has MAC/IMR routes with no corresponding vlan vni mappings.
- What happens without this fix ? Are there SAI errors we see ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- In case of FDB orch we have check to ensure Vlan membership which will prevent FDB getting added since the IMR is prevented
- Yes that's true. But as per the current behavior we will fail at removing VLAN VNI mapping at SAI level due to references. Do you think we should add a check at orchagent. I am not sure if this check is valid for other platforms and so I didn't prefer to add
- Without the fix SAI returns error and orchagent crashes.
if (vrf_orch->isL3VniVlan(vni_id)) | ||
{ | ||
SWSS_LOG_WARN("Ignoring remote VNI add for L3 VNI:%d, remote:%s", vni_id, remote_vtep.c_str()); | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be return true instead of return false ?
The log says "Ignore" but in reality it is being deferred.
Also a large number of IMR routes being held in the m_tosync queue and being processed in a continous loop will load the OA unnecessarily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can correct the log. I prefer not to ignore but defer. If we ignore we might miss those remote. Since this is a corner error scenario I don't think it will overwhelm orchagent during regular processing
|
||
if (!l3Vni) | ||
{ | ||
it++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment as for IMR handling. Not Removing this from the m_tosyncqueue in a scale scenario and in a misconfiguration case will unnecessarily load the OA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, completely ignoring will lead to the message being lost. Since this is a error scenario, I would rather prefer retrying here. It won't affect regular use case scenarios
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we see this in startup scenarios as well ? In which case the retry becomes necessary.
For example in an L3VNI scenario, FRR is configured correctly and SWSS is configured for both L2 and L3VNIs.
The vrforch processes the L3VNI configuration after receiving the update from fpmsyncd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srj102 It is quite possible but I haven't encountered it. I believe my change would handle that scenario as well
for (auto &vni_str: vni_labelv) | ||
{ | ||
vni = static_cast<uint32_t>(std::stoul(vni_str)); | ||
if (!m_vrfOrch->isL3VniVlan(vni)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the vrf-vni configuration is removed from OA and not from FRR after routes are installed then the routes still remain in the HW.
Is this scenario handled ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our case the VNI removal leads to error logs in SAI since it has references. This scenario again depends on references to maintain which I believe we don't do it. IMO I am fixing the scenarios which are easier to fix with the current implementation and then document the ones which are harder to fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes thats correct we don't maintain the routes in the OA and handling this will be harder. If this leads to a SAI error and an OA crash we will need to see whether this can be handled at the SAI level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently SAI error is not creating orchagent crash since we are not handling status for this call. So we will get syslog error. For now I consider this a lower priority and document this.
Update sonic-swss submodule pointer to include the following: * a2c9a61 [EVPN]Handling error scenarios during route programming and IMR add ([sonic-net#2670](sonic-net/sonic-swss#2670)) * 115efe8 [bfdorch] add default TOS value for BFD session ([sonic-net#2689](sonic-net/sonic-swss#2689)) * a198289 [orchagent, SRv6]: create seglist support to set sid list type ([sonic-net#2406](sonic-net/sonic-swss#2406)) Signed-off-by: AntonHryshchuk <antonh@nvidia.com>
Update sonic-swss submodule pointer to include the following: * 98a16cf [ACL] Write ACL table/rule creation status into STATE_DB ([sonic-net#2662](sonic-net/sonic-swss#2662)) * a2c9a61 [EVPN]Handling error scenarios during route programming and IMR add ([sonic-net#2670](sonic-net/sonic-swss#2670)) * 115efe8 [bfdorch] add default TOS value for BFD session ([sonic-net#2689](sonic-net/sonic-swss#2689)) * a198289 [orchagent, SRv6]: create seglist support to set sid list type ([sonic-net#2406](sonic-net/sonic-swss#2406)) Signed-off-by: dgsudharsan <sudharsand@nvidia.com>
Update sonic-swss submodule pointer to include the following: * 98a16cf [ACL] Write ACL table/rule creation status into STATE_DB ([#2662](sonic-net/sonic-swss#2662)) * a2c9a61 [EVPN]Handling error scenarios during route programming and IMR add ([#2670](sonic-net/sonic-swss#2670)) * 115efe8 [bfdorch] add default TOS value for BFD session ([#2689](sonic-net/sonic-swss#2689)) * a198289 [orchagent, SRv6]: create seglist support to set sid list type ([#2406](sonic-net/sonic-swss#2406))
What I did
Handling error scenarios for the following cases
Why I did it
These scenarios will lead to issues and result in SAI failures which needs to be avoided
How I verified it
Added UT to verify.
Details if related