Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EVPN]Handling race condition when remote VNI arrives before tunnel map entry #2642

Merged
merged 1 commit into from
Feb 2, 2023

Conversation

dgsudharsan
Copy link
Collaborator

What I did
Added check in remote VNI add to ensure vxlan tunnel map is created before adding the remote end point.

Why I did it
In some race conditions, the VXLAN_REMOTE_VNI_TABLE arrives before VXLAN_TUNNEL_MAP_TABLE. The below log shows race condition as for VNI 500101 is set seen before vxlan tunnel map entry.

2023-01-18.09:35:21.223861|VXLAN_TUNNEL_MAP_TABLE:vtep101032:map_500100_Vlan100|SET|vlan:Vlan100|vni:500100
2023-01-18.09:35:21.326760|VXLAN_FDB_TABLE:Vlan101:aa:e9:21:5a:48:3b|SET|remote_vtep:30.0.0.2|type:dynamic|vni:500101
2023-01-18.09:35:21.327053|VXLAN_FDB_TABLE:Vlan100:5e:3d:f7:1e:93:f6|SET|remote_vtep:30.0.0.2|type:dynamic|vni:500100
2023-01-18.09:35:21.327079|VXLAN_FDB_TABLE:Vlan100:2e:a1:55:f6:59:2e|SET|remote_vtep:40.0.0.3|type:dynamic|vni:500100
2023-01-18.09:35:21.328590|VXLAN_REMOTE_VNI_TABLE:Vlan100:30.0.0.2|SET|vni:500100
2023-01-18.09:35:21.328643|VXLAN_REMOTE_VNI_TABLE:Vlan100:40.0.0.3|SET|vni:500100
2023-01-18.09:35:21.328768|VXLAN_REMOTE_VNI_TABLE:Vlan101:30.0.0.2|SET|vni:500101

This results in SAI failure as below
Jan 18 11:35:21.450681 r-alligator-04 ERR syncd#SDK: [TUNNEL.ERR] Could not find the entry in the tunnel entries map
Jan 18 11:35:21.450801 r-alligator-04 ERR syncd#SDK: [TUNNEL.ERR] Failed to get from tunnel[0x08c00000] entry with FID 101 , err = Entry Not Found
Jan 18 11:35:21.450873 r-alligator-04 ERR syncd#SDK: [FDB_FLOOD.ERR] Failed to get tunnel mapping from fid (101)
Jan 18 11:35:21.450943 r-alligator-04 ERR syncd#SDK: [FDB_FLOOD.ERR] Failed to set flood_vector to fid (101)
Jan 18 11:35:21.451108 r-alligator-04 ERR syncd#SDK: [CORE_API.ERR] Failed to set fdb flood params, FDB module, return message: [Entry Not Found]
Jan 18 11:35:21.451182 r-alligator-04 ERR syncd#SDK: [SAI_L2MC_GROUP.ERR] mlnx_sai_l2mcgroup.c[853]- mlnx_l2mc_group_fid_uc_bc_flood_ctrl_update: Failed to SET ports to fid 101 flood set - Entry Not Found.
Jan 18 11:35:21.451367 r-alligator-04 ERR swss#orchagent: :- create: create status: SAI_STATUS_ITEM_NOT_FOUND
Jan 18 11:35:21.451501 r-alligator-04 ERR swss#orchagent: :- addVlanFloodGroups: Failed to create l2mc group member for adding tunnel 30.0.0.2 to vlan 101
Jan 18 11:35:21.451656 r-alligator-04 ERR swss#orchagent: :- handleSaiCreateStatus: Encountered failure in create operation, exiting orchagent, SAI API: SAI_API_L2MC_GROUP, status: SAI_STATUS_ITEM_NOT_FOUND
Jan 18 11:35:21.451863 r-alligator-04 ERR syncd#SDK: :- sendApiResponse: api SAI_COMMON_API_CREATE failed in syncd mode: SAI_STATUS_ITEM_NOT_FOUND

How I verified it
Added UT to verify it.

Details if related

@dgsudharsan dgsudharsan requested a review from prsunny as a code owner January 26, 2023 00:30
@dgsudharsan
Copy link
Collaborator Author

@srj102 Can you please review? This is to fix a bug we are hitting around 20% of time during testing.


if (!gPortsOrch->getVlanByVlanId(vlan_id, vlanPort))
{
SWSS_LOG_WARN("Vxlan tunnel map vlan id doesn't exist: %d", vlan_id);
return false;
}

/* Remote end point can be added only after local VLAN to VNI map gets created */
Copy link
Contributor

@srj102 srj102 Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why we need this change.
Is there any constraint imposed by the SAI API for this ?
What happens without this change ?

The errors listed seem specific to the implementation. I think this should be handled within the SAI driver implementation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the ordering is correct logically. You cannot add the remote end point without the respective maps being created. We following the same approach as linux and expect the same ordering. In linux we first add the vlan to vni map and them advertise it after which we receive the remote end point which we program. The same should be the behavior in orchagent. I don't think there is anything wrong in adding this check.

@prsunny prsunny merged commit 8de52bf into sonic-net:master Feb 2, 2023
dprital added a commit to dprital/sonic-buildimage that referenced this pull request Feb 7, 2023
Update sonic-swss submodule pointer to include the following:
* 7d223d3 Remove TODO comment which is no longer relevant ([sonic-net#2645](sonic-net/sonic-swss#2645))
* 02c2267 [test_mux] add sleep in test_NH ([sonic-net#2648](sonic-net/sonic-swss#2648))
* 8de52bf [EVPN]Handling race condition when remote VNI arrives before tunnel map entry ([sonic-net#2642](sonic-net/sonic-swss#2642))
* e99e2e4 [voq][chassis] Remove created ports from the default vlan. ([sonic-net#2607](sonic-net/sonic-swss#2607))

Signed-off-by: dprital <drorp@nvidia.com>
liat-grozovik pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Feb 8, 2023
Update sonic-swss submodule pointer to include the following:
* 7d223d3 Remove TODO comment which is no longer relevant ([#2645](sonic-net/sonic-swss#2645))
* 02c2267 [test_mux] add sleep in test_NH ([#2648](sonic-net/sonic-swss#2648))
* 8de52bf [EVPN]Handling race condition when remote VNI arrives before tunnel map entry ([#2642](sonic-net/sonic-swss#2642))
* e99e2e4 [voq][chassis] Remove created ports from the default vlan. ([#2607](sonic-net/sonic-swss#2607))

Signed-off-by: dprital <drorp@nvidia.com>
StormLiangMS pushed a commit that referenced this pull request Feb 10, 2023
…ap entry (#2642)

*Added check in remote VNI add to ensure vxlan tunnel map is created before adding the remote end point.
prsunny pushed a commit that referenced this pull request Feb 16, 2023
*Merge remote-tracking branch 'upstream/master' into dash (#2663)
* Modify coppmgr mergeConfig to support preserving copp tables through reboot. (#2548)
* Avoid aborting orchagent when setting TUNNEL attributes (#2591)
* Handle Mac address 'none' (#2593)
* Increase diff coverage to 80% (#2599)
* Add missing parameter to on_switch_shutdown_request method. (#2567)
* Add ZMQ based ProducerStateTable and CustomerStateTable.
* Revert "[voq][chassis]Add show fabric counters port/queue commands (#2522)" (#2611)
* Added new attributes for Vnet and Vxlan ecmp configurations. (#2584)
* added support for monitoring, primary and adv_prefix and overlay_dmac.
* [routesync] Fix for stale dynamic neighbor (#2553)
* [MuxOrch] Enabling neighbor when adding in active state (#2601)
* Changed the BFD default detect multiplier to 10x (#2614)
* Remove TODO comments that are no longer relevant (#2622)
* Fix issue #13341 ARP entry can be out of sync between kernel and APPL_DB if multiple updates are received from RTNL (#2619)
* [refactor]Refactoring sai handle status (#2621)
* Vxlan tunnel endpoint custom monitoring APPL DB table. (#2589)
* added support for monitoring, primary and adv_prefix. changed filter_mac to overlay_dmac
* Data Structures and code to write APP_DB VNET_MONITOR table entries for custom monitoring of Vxlan tunnel endpoints.
* [bfdorch] add local discriminator to state DB (#2629)
* [acl] Add new ACL key BTH_OPCODE and AETH_SYNDROME  (#2617)
* [voq][chassis] Remove created ports from the default vlan. (#2607)
* [EVPN]Handling race condition when remote VNI arrives before tunnel map entry (#2642)
*Added check in remote VNI add to ensure vxlan tunnel map is created before adding the remote end point.
* [test_mux] add sleep in test_NH (#2648)
* [autoneg]Fixing adv interface types to be set when AN is disabled (#2638)
* [hash]: Add UT infra. (#2660)
*Added UT infra for Generic Hash feature
*Aligned PBH tests with Generic Hash UT infra
* [sai_failure_dump]Invoking dump during SAI failure (#2644)
* [ResponsePublisher] add pipeline support  (#2511)
* [dash] Fix compilation issue caused by missing include.
@dgsudharsan dgsudharsan deleted the vxlan_race_fix branch March 9, 2023 02:02
lukasstockner pushed a commit to genesiscloud/sonic-swss that referenced this pull request Mar 14, 2023
…ap entry (sonic-net#2642)

*Added check in remote VNI add to ensure vxlan tunnel map is created before adding the remote end point.
lukasstockner pushed a commit to genesiscloud/sonic-swss that referenced this pull request Apr 2, 2023
…ap entry (sonic-net#2642)

*Added check in remote VNI add to ensure vxlan tunnel map is created before adding the remote end point.
theasianpianist pushed a commit to theasianpianist/sonic-swss that referenced this pull request Jul 21, 2023
)

*Merge remote-tracking branch 'upstream/master' into dash (sonic-net#2663)
* Modify coppmgr mergeConfig to support preserving copp tables through reboot. (sonic-net#2548)
* Avoid aborting orchagent when setting TUNNEL attributes (sonic-net#2591)
* Handle Mac address 'none' (sonic-net#2593)
* Increase diff coverage to 80% (sonic-net#2599)
* Add missing parameter to on_switch_shutdown_request method. (sonic-net#2567)
* Add ZMQ based ProducerStateTable and CustomerStateTable.
* Revert "[voq][chassis]Add show fabric counters port/queue commands (sonic-net#2522)" (sonic-net#2611)
* Added new attributes for Vnet and Vxlan ecmp configurations. (sonic-net#2584)
* added support for monitoring, primary and adv_prefix and overlay_dmac.
* [routesync] Fix for stale dynamic neighbor (sonic-net#2553)
* [MuxOrch] Enabling neighbor when adding in active state (sonic-net#2601)
* Changed the BFD default detect multiplier to 10x (sonic-net#2614)
* Remove TODO comments that are no longer relevant (sonic-net#2622)
* Fix issue #13341 ARP entry can be out of sync between kernel and APPL_DB if multiple updates are received from RTNL (sonic-net#2619)
* [refactor]Refactoring sai handle status (sonic-net#2621)
* Vxlan tunnel endpoint custom monitoring APPL DB table. (sonic-net#2589)
* added support for monitoring, primary and adv_prefix. changed filter_mac to overlay_dmac
* Data Structures and code to write APP_DB VNET_MONITOR table entries for custom monitoring of Vxlan tunnel endpoints.
* [bfdorch] add local discriminator to state DB (sonic-net#2629)
* [acl] Add new ACL key BTH_OPCODE and AETH_SYNDROME  (sonic-net#2617)
* [voq][chassis] Remove created ports from the default vlan. (sonic-net#2607)
* [EVPN]Handling race condition when remote VNI arrives before tunnel map entry (sonic-net#2642)
*Added check in remote VNI add to ensure vxlan tunnel map is created before adding the remote end point.
* [test_mux] add sleep in test_NH (sonic-net#2648)
* [autoneg]Fixing adv interface types to be set when AN is disabled (sonic-net#2638)
* [hash]: Add UT infra. (sonic-net#2660)
*Added UT infra for Generic Hash feature
*Aligned PBH tests with Generic Hash UT infra
* [sai_failure_dump]Invoking dump during SAI failure (sonic-net#2644)
* [ResponsePublisher] add pipeline support  (sonic-net#2511)
* [dash] Fix compilation issue caused by missing include.
theasianpianist pushed a commit to theasianpianist/sonic-swss that referenced this pull request Jul 25, 2023
)

*Merge remote-tracking branch 'upstream/master' into dash (sonic-net#2663)
* Modify coppmgr mergeConfig to support preserving copp tables through reboot. (sonic-net#2548)
* Avoid aborting orchagent when setting TUNNEL attributes (sonic-net#2591)
* Handle Mac address 'none' (sonic-net#2593)
* Increase diff coverage to 80% (sonic-net#2599)
* Add missing parameter to on_switch_shutdown_request method. (sonic-net#2567)
* Add ZMQ based ProducerStateTable and CustomerStateTable.
* Revert "[voq][chassis]Add show fabric counters port/queue commands (sonic-net#2522)" (sonic-net#2611)
* Added new attributes for Vnet and Vxlan ecmp configurations. (sonic-net#2584)
* added support for monitoring, primary and adv_prefix and overlay_dmac.
* [routesync] Fix for stale dynamic neighbor (sonic-net#2553)
* [MuxOrch] Enabling neighbor when adding in active state (sonic-net#2601)
* Changed the BFD default detect multiplier to 10x (sonic-net#2614)
* Remove TODO comments that are no longer relevant (sonic-net#2622)
* Fix issue #13341 ARP entry can be out of sync between kernel and APPL_DB if multiple updates are received from RTNL (sonic-net#2619)
* [refactor]Refactoring sai handle status (sonic-net#2621)
* Vxlan tunnel endpoint custom monitoring APPL DB table. (sonic-net#2589)
* added support for monitoring, primary and adv_prefix. changed filter_mac to overlay_dmac
* Data Structures and code to write APP_DB VNET_MONITOR table entries for custom monitoring of Vxlan tunnel endpoints.
* [bfdorch] add local discriminator to state DB (sonic-net#2629)
* [acl] Add new ACL key BTH_OPCODE and AETH_SYNDROME  (sonic-net#2617)
* [voq][chassis] Remove created ports from the default vlan. (sonic-net#2607)
* [EVPN]Handling race condition when remote VNI arrives before tunnel map entry (sonic-net#2642)
*Added check in remote VNI add to ensure vxlan tunnel map is created before adding the remote end point.
* [test_mux] add sleep in test_NH (sonic-net#2648)
* [autoneg]Fixing adv interface types to be set when AN is disabled (sonic-net#2638)
* [hash]: Add UT infra. (sonic-net#2660)
*Added UT infra for Generic Hash feature
*Aligned PBH tests with Generic Hash UT infra
* [sai_failure_dump]Invoking dump during SAI failure (sonic-net#2644)
* [ResponsePublisher] add pipeline support  (sonic-net#2511)
* [dash] Fix compilation issue caused by missing include.
theasianpianist pushed a commit to theasianpianist/sonic-swss that referenced this pull request Jul 25, 2023
)

*Merge remote-tracking branch 'upstream/master' into dash (sonic-net#2663)
* Modify coppmgr mergeConfig to support preserving copp tables through reboot. (sonic-net#2548)
* Avoid aborting orchagent when setting TUNNEL attributes (sonic-net#2591)
* Handle Mac address 'none' (sonic-net#2593)
* Increase diff coverage to 80% (sonic-net#2599)
* Add missing parameter to on_switch_shutdown_request method. (sonic-net#2567)
* Add ZMQ based ProducerStateTable and CustomerStateTable.
* Revert "[voq][chassis]Add show fabric counters port/queue commands (sonic-net#2522)" (sonic-net#2611)
* Added new attributes for Vnet and Vxlan ecmp configurations. (sonic-net#2584)
* added support for monitoring, primary and adv_prefix and overlay_dmac.
* [routesync] Fix for stale dynamic neighbor (sonic-net#2553)
* [MuxOrch] Enabling neighbor when adding in active state (sonic-net#2601)
* Changed the BFD default detect multiplier to 10x (sonic-net#2614)
* Remove TODO comments that are no longer relevant (sonic-net#2622)
* Fix issue #13341 ARP entry can be out of sync between kernel and APPL_DB if multiple updates are received from RTNL (sonic-net#2619)
* [refactor]Refactoring sai handle status (sonic-net#2621)
* Vxlan tunnel endpoint custom monitoring APPL DB table. (sonic-net#2589)
* added support for monitoring, primary and adv_prefix. changed filter_mac to overlay_dmac
* Data Structures and code to write APP_DB VNET_MONITOR table entries for custom monitoring of Vxlan tunnel endpoints.
* [bfdorch] add local discriminator to state DB (sonic-net#2629)
* [acl] Add new ACL key BTH_OPCODE and AETH_SYNDROME  (sonic-net#2617)
* [voq][chassis] Remove created ports from the default vlan. (sonic-net#2607)
* [EVPN]Handling race condition when remote VNI arrives before tunnel map entry (sonic-net#2642)
*Added check in remote VNI add to ensure vxlan tunnel map is created before adding the remote end point.
* [test_mux] add sleep in test_NH (sonic-net#2648)
* [autoneg]Fixing adv interface types to be set when AN is disabled (sonic-net#2638)
* [hash]: Add UT infra. (sonic-net#2660)
*Added UT infra for Generic Hash feature
*Aligned PBH tests with Generic Hash UT infra
* [sai_failure_dump]Invoking dump during SAI failure (sonic-net#2644)
* [ResponsePublisher] add pipeline support  (sonic-net#2511)
* [dash] Fix compilation issue caused by missing include.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants