Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FDB] [202012] Fix fbdorch to properly handle syncd FDB FLUSH Notif #2401

Merged
merged 3 commits into from
Aug 16, 2022

Conversation

vivekrnv
Copy link
Contributor

@vivekrnv vivekrnv commented Jul 29, 2022

What I did

Backport the fdborch fix to properly handle syncd FDB flush notifications.

Why I did it

How I verified it

1. config vlan add 40
2. config vlan member add 40 Ethernet0
On the other side. 
3. sudo ip link add link enp4s0f0 name enp4s0f0.40 type vlan id 40
4. sudo ip link set enp4s0f0.40 up
5. Make sure a fdb entry is learnt. 
root@sonic:/home/admin# fdbshow
  No.    Vlan  MacAddress         Port       Type
-----  ------  -----------------  ---------  -------
    1      40  7C:FE:90:12:22:EC  Ethernet0  Dynamic
root@sonic:/home/admin# redis-cli -n 6 keys "*FDB*"
"FDB_TABLE|Vlan40:7c:fe:90:12:22:ec"
6. config vlan member del 40 Ethernet0
root@sonic:/home/admin# redis-cli -n 6 keys "*FDB*"
(empty array)
8. config vlan del 40
vkarri@836abffb42fb:/sonic/src/sonic-swss/tests/mock_tests$ ./tests --gtest_filter=FdbOrchTest*
Running main() from /usr/src/gtest/src/gtest_main.cc
Note: Google Test filter = FdbOrchTest*
[==========] Running 6 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 6 tests from FdbOrchTest
[ RUN      ] FdbOrchTest.ConsolidatedFlushVlanandPort
[       OK ] FdbOrchTest.ConsolidatedFlushVlanandPort (16 ms)
[ RUN      ] FdbOrchTest.ConsolidatedFlushAll
[       OK ] FdbOrchTest.ConsolidatedFlushAll (16 ms)
[ RUN      ] FdbOrchTest.ConsolidatedFlushVlan
[       OK ] FdbOrchTest.ConsolidatedFlushVlan (15 ms)
[ RUN      ] FdbOrchTest.ConsolidatedFlushPort
[       OK ] FdbOrchTest.ConsolidatedFlushPort (14 ms)
[ RUN      ] FdbOrchTest.ConsolidatedFlushVlanandPortBridgeportDeleted
[       OK ] FdbOrchTest.ConsolidatedFlushVlanandPortBridgeportDeleted (14 ms)
[ RUN      ] FdbOrchTest.NonConsolidatedFlushVlanandPort
[       OK ] FdbOrchTest.NonConsolidatedFlushVlanandPort (15 ms)
[----------] 6 tests from FdbOrchTest (90 ms total)

[----------] Global test environment tear-down
[==========] 6 tests from 1 test case ran. (90 ms total)
[  PASSED  ] 6 tests.

Details if related

vivekrnv and others added 2 commits July 29, 2022 00:05
…#2254)

Vlan delete couldn't be handled by OA when there is fdb learnt on the member and when the member is deleted
This inability of handling APPL_DB notif is affecting warm-restart. FDB Entry from State DB is not removed.
OA doesn't have the logic to handle consolidate flush notif coming from syncd
FdbOrch doesn't have logic to clear internal cache and decrement corresponding fdb counters during a flush notification

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
- What I did
using a copy of FDBEntry fields (stored in FDBUpdate) instead of a reference since the reference gets invalidated in the storeFdbEntryState()
simplified clearFdbEntry() interface

- Why I did it
To fix the memory usage issue
The issue is that the SWSS_LOG_INFO() uses the mac&, port_alias&, and bv_id& which are invalidated in the storeFdbEntryState().

- How I verified it
Run the tests that were used to find the issues and checked the ASAN report

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
@volodymyrsamotiy
Copy link
Collaborator

This is backport of PR #2254 from master to 202012

@prsunny
Copy link
Collaborator

prsunny commented Aug 1, 2022

fdb flush behavior for 202012 has been set for some SAI version and backporting may have side effects. Since this is more than a bug fix, I think we may not need to backport this immediately.

@vivekrnv
Copy link
Contributor Author

vivekrnv commented Aug 1, 2022

fdb flush behavior for 202012 has been set for some SAI version and backporting may have side effects. Since this is more than a bug fix, I think we may not need to backport this immediately.

The only flow this changes affects is when there is a notification of type SAI_FDB_EVENT_FLUSHED coming from syncd. And as i understand, the SAI used in 202012 also suggests that there might be a consolidated flush notification which can be received from syncd. https://github.com/opencomputeproject/SAI/blob/v1.7/inc/saifdb.h#L292 and we've seen it on the latest 202012 image.

And currently in 202012, this seems unsupported (flush per port and per vlan) https://github.com/sonic-net/sonic-swss/blob/202012/orchagent/fdborch.cpp#L448 and thus the corresponding state_db entries are not removed.

syslog.30.gz:Jul 25 18:26:32.071296 r-anaconda-51 ERR swss#orchagent: :- update: Unsupported FDB Flush: [ 00:00:00:00:00:00 , Vlan121 ] = { port: Ethernet0 }
syslog.30.gz:Jul 25 18:26:32.902270 r-anaconda-51 ERR swss#orchagent: :- update: Unsupported FDB Flush: [ 00:00:00:00:00:00 , Vlan122 ] = { port: Ethernet124 }
syslog.30.gz:Jul 25 18:26:33.704752 r-anaconda-51 ERR swss#orchagent: :- update: Unsupported FDB Flush: [ 00:00:00:00:00:00 , Vlan123 ] = { port: Ethernet0 }
syslog.30.gz:Jul 25 18:26:34.547895 r-anaconda-51 ERR swss#orchagent: :- update: Unsupported FDB Flush: [ 00:00:00:00:00:00 , Vlan124 ] = { port: Ethernet124 }
syslog.30.gz:Jul 25 18:26:35.418237 r-anaconda-51 ERR swss#orchagent: :- update: Unsupported FDB Flush: [ 00:00:00:00:00:00 , Vlan125 ] = { port: Ethernet0 }

This is causing problems during warm-reboot:
fdborch refills these notifications from state_db FDB_TABLE during restore phase of warm-reboot https://github.com/sonic-net/sonic-swss/blob/202012/orchagent/fdborch.cpp#L59, this affects warm-reboot restore validation.

Jul 25 18:59:36.511246 r-anaconda-51 NOTICE swss#orchagent: :- warmRestoreValidation: There are pending consumer tasks after restore:
Jul 25 18:59:36.511312 r-anaconda-51 NOTICE swss#orchagent: :- warmRestoreValidation: FDB_TABLE:Vlan120:0c:42:a1:17:e6:fd|SET|port:Ethernet124|type:dynamic
Jul 25 18:59:36.511312 r-anaconda-51 NOTICE swss#orchagent: :- warmRestoreValidation: FDB_TABLE:Vlan121:0c:42:a1:4b:0b:6c|SET|port:Ethernet0|type:dynamic
Jul 25 18:59:36.511312 r-anaconda-51 NOTICE swss#orchagent: :- warmRestoreValidation: FDB_TABLE:Vlan122:0c:42:a1:17:e6:fd|SET|port:Ethernet124|type:dynamic
Jul 25 18:59:36.511353 r-anaconda-51 NOTICE swss#orchagent: :- warmRestoreValidation: FDB_TABLE:Vlan123:0c:42:a1:4b:0b:6c|SET|port:Ethernet0|type:dynamic
Jul 25 18:59:36.511353 r-anaconda-51 NOTICE swss#orchagent: :- warmRestoreValidation: FDB_TABLE:Vlan124:0c:42:a1:17:e6:fd|SET|port:Ethernet124|type:dynamic
Jul 25 18:59:36.511378 r-anaconda-51 NOTICE swss#orchagent: :- warmRestoreValidation: FDB_TABLE:Vlan125:0c:42:a1:4b:0b:6c|SET|port:Ethernet0|type:dynamic
Jul 25 18:59:36.511453 r-anaconda-51 NOTICE swss#orchagent: :- warmRestoreValidation: FDB_TABLE:Vlan30:0c:42:a1:17:e6:fc|SET|port:PortChannel0001|type:dynamic
Jul 25 18:59:36.511453 r-anaconda-51 NOTICE swss#orchagent: :- warmRestoreValidation: FDB_TABLE:Vlan30:0c:42:a1:4b:0b:6c|SET|port:Ethernet0|type:dynamic
Jul 25 18:59:36.511453 r-anaconda-51 NOTICE swss#orchagent: :- warmRestoreValidation: FDB_TABLE:Vlan30:1c:34:da:16:68:00|SET|port:Ethernet12|type:dynamic
Jul 25 18:59:36.511453 r-anaconda-51 NOTICE swss#orchagent: :- warmRestoreValidation: FDB_TABLE:Vlan4094:0c:42:a1:17:e6:fc|SET|port:PortChannel0001|type:dynamic
Jul 25 18:59:36.511505 r-anaconda-51 NOTICE swss#orchagent: :- warmRestoreValidation: FDB_TABLE:Vlan800:0c:42:a1:17:e6:fc|SET|port:Ethernet16|type:dynamic

so, i believe this fix is required.

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Aug 8, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: vivekrnv / name: Vivek (7d8af09)
  • ✅ login: Yakiv-Huryk / name: Yakiv Huryk (ed50d20)
  • ✅ login: liat-grozovik (0de0d1c)

@vivekrnv
Copy link
Contributor Author

vivekrnv commented Aug 8, 2022

/easycla

@Blueve
Copy link

Blueve commented Aug 9, 2022

@prsunny do we have plan to merge this PR?
I observed the error log in Nokia-7215 testbed even the SAI returns flushed

@@ -5355,3 +5355,17 @@ std::unordered_set<std::string> PortsOrch::generateCounterStats(const string& ty
}
return counter_stats;
}

bool PortsOrch::decrFdbCount(const std::string& alias, int count)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont see it being done in the previous flow. Why is it newly done?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the m_fdb_count is decremented during the SAI_FDB_EVENT_AGED notification. So, for the SAI_FDB_EVENT_FLUSHED i think the same is required. https://github.com/sonic-net/sonic-swss/blob/202012/orchagent/fdborch.cpp#L227

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you confirm if this is tested on 202012? I see the ref count as a new change and not part of the FLUSH handling itself for 202012.

Copy link
Contributor Author

@vivekrnv vivekrnv Aug 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, when the corresponding fdb entry expires/removed, some vendor SDK's generate AGED notification and some vendors generate FLUSH notification.

In 202012, we have the code which decrements the m_fdb_count when there is an AGED notification https://github.com/sonic-net/sonic-swss/blob/202012/orchagent/fdborch.cpp#L311, so when the capability to handle FLUSH notification is added, i see no issues in decrementing m_fdb_count as this keeps the implementation consistent.

As for testing this change, do you mean for me to check if m_fdb_count is decremented when there is an AGED notification?

@prsunny prsunny merged commit aa7b546 into sonic-net:202012 Aug 16, 2022
dprital added a commit to dprital/sonic-buildimage that referenced this pull request Aug 29, 2022
Update sonic-swss submodule pointer to include the following:
* [FDB] [202012] Fix fbdorch to properly handle syncd FDB FLUSH Notif ([sonic-net#2401](sonic-net/sonic-swss#2401))
* Support for platforms based on Clounix Networks' device ([sonic-net#2399](sonic-net/sonic-swss#2399))

Signed-off-by: dprital <drorp@nvidia.com>
vivekrnv added a commit to vivekrnv/sonic-swss that referenced this pull request Aug 30, 2022
…onic-net#2401)

* [FDB] Fix fbdorch to properly handle syncd FDB FLUSH Notif (sonic-net#2254)

Vlan delete couldn't be handled by OA when there is fdb learnt on the member and when the member is deleted
This inability of handling APPL_DB notif is affecting warm-restart. FDB Entry from State DB is not removed.
OA doesn't have the logic to handle consolidate flush notif coming from syncd
FdbOrch doesn't have logic to clear internal cache and decrement corresponding fdb counters during a flush notification

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants