Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimizing teamd cleanup #6

Closed
wants to merge 1 commit into from
Closed

Optimizing teamd cleanup #6

wants to merge 1 commit into from

Conversation

dgsudharsan
Copy link
Owner

What I did
Optimized teamd cleanup to handle scaled portchannel configuration scenarios.

Why I did it
A process per portchannel exists in teamd which needs to be cleaned up on exit so remove port channel flow is handled. In the current logic this is done one by one. So in case of scaled configurations not all port channels get cleaned up resulting in some portchannel entries left stale.

Sep 24 14:37:18.181202 r-panther-23 NOTICE teamd#teamsyncd: :- cleanTeamSync: Cleaning up LAG teamd resources ...
Sep 24 14:37:20.448177 r-panther-23 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Cleaning up LAGs during shutdown...
Sep 24 14:38:14.612553 r-panther-23 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel58
Sep 24 14:38:15.140316 r-panther-23 INFO dockerd[782]: time="2021-09-24T14:38:15.138013633Z" level=info msg="Container 59cf5a968e3f38f0cfdf9b9e4799c707fb2421554fe54bc0f98703c1fdc5f724 failed to exit within 60 seconds of signal 15 - using the force"
Sep 24 14:38:15.374049 r-panther-23 INFO /container: docker cmd: wait for teamd
Sep 24 14:38:15.375029 r-panther-23 INFO /container: docker cmd: stop for teamd
Sep 24 14:38:15.375203 r-panther-23 DEBUG /container: container_stop: END
Sep 24 14:38:15.419638 r-panther-23 NOTICE admin: Stopped teamd service...
Sep 24 14:38:15.423805 r-panther-23 INFO systemd[1]: teamd.service: Succeeded.

In the above example port channels after 58 (from the map) are not cleaned up even after 60 seconds. Each system command called from the process is costly and it takes a lot of time when the scale is high.
The netlink messages for stale portchannel come up during the initialization which if ordered differently might result in STATE_DB entry being removed
Sep 24 14:39:07.533092 r-panther-23 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel7 admin:0 oper:0 addr:98:03:9b:f6:00:00 ifindex:85 master:0 type:team
Sep 24 14:39:07.559340 r-panther-23 NOTICE swss#portsyncd: :- onMsg: nlmsg type:17 key:PortChannel7 admin:0 oper:0 addr:98:03:9b:f6:00:00 ifindex:85 master:0 type:te
Sep 24 14:39:10.350253 r-panther-23 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel7' has been removed.

Added logic to reduce the number of calls executed via system command.

How I verified it
Oct 8 05:13:22.074152 mts-sonic-dut NOTICE teamd#teammgrd: :- cleanTeamProcesses: Cleaning up LAGs during shutdown...
Oct 8 05:13:22.088962 mts-sonic-dut NOTICE teamd#teammgrd: :- cleanTeamProcesses: Executed the cleanup command. Waiting for cleanup complete
Oct 8 05:13:22.131145 mts-sonic-dut NOTICE teamd#teammgrd: :- cleanTeamProcesses: Waiting for 64 processes to be cleaned up
Oct 8 05:13:23.141405 mts-sonic-dut NOTICE teamd#teammgrd: :- cleanTeamProcesses: LAGs cleanup is done
Oct 8 05:13:23.142193 mts-sonic-dut NOTICE teamd#teammgrd: :- main: Exiting
Oct 8 05:13:23.666751 mts-sonic-dut INFO teamd#supervisord 2021-10-08 05:13:23,665 INFO stopped: teammgrd (exit status 0)

The entire teamd cleanup with this optimization takes just one second instead of unable to clean up even after 1 minute.
Details if related

@nazariig
Copy link

nazariig commented Oct 8, 2021

@dgsudharsan please have a look at these PRs:

1. https://github.com/Azure/sonic-buildimage/pull/8045
2. https://github.com/Azure/sonic-buildimage/pull/6537
3. https://github.com/Azure/sonic-swss/pull/1841
4. https://github.com/Azure/sonic-swss/pull/1934

Also, the original code is working and very likely that you are using old codebase:

Sep 24 14:37:18.181202 r-panther-23 NOTICE teamd#teamsyncd: :- cleanTeamSync: Cleaning up LAG teamd resources ...
Sep 24 14:37:20.448177 r-panther-23 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Cleaning up LAGs during shutdown...
Sep 24 14:38:14.612553 r-panther-23 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel58

P.S: the suggested fix doesn't assume warm/fast boot scenario

@dgsudharsan dgsudharsan closed this Oct 8, 2021
@dgsudharsan dgsudharsan deleted the teamd_fix branch March 9, 2023 02:00
dgsudharsan pushed a commit that referenced this pull request Aug 9, 2023
**What I did**

Fix the Mem Leak by moving the raw pointers in type_maps to use smart pointers

**Why I did it**

```
Indirect leak of 83776 byte(s) in 476 object(s) allocated from:
    #0 0x7f0a2a414647 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x5555590cc923 in __gnu_cxx::new_allocator, std::allocator > const, referenced_object> > >::allocate(unsigned long, void const*) /usr/include/c++/10/ext/new_allocator.h:115
    #2 0x5555590cc923 in std::allocator_traits, std::allocator > const, referenced_object> > > >::allocate(std::allocator, std::allocator > const, referenced_object> > >&, unsigned long) /usr/include/c++/10/bits/alloc_traits.h:460
    #3 0x5555590cc923 in std::_Rb_tree, std::allocator >, std::pair, std::allocator > const, referenced_object>, std::_Select1st, std::allocator > const, referenced_object> >, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::_M_get_node() /usr/include/c++/10/bits/stl_tree.h:584
    #4 0x5555590cc923 in std::_Rb_tree_node, std::allocator > const, referenced_object> >* std::_Rb_tree, std::allocator >, std::pair, std::allocator > const, referenced_object>, std::_Select1st, std::allocator > const, referenced_object> >, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::_M_create_node, std::allocator > const&>, std::tuple<> >(std::piecewise_construct_t const&, std::tuple, std::allocator > const&>&&, std::tuple<>&&) /usr/include/c++/10/bits/stl_tree.h:634
    #5 0x5555590cc923 in std::_Rb_tree_iterator, std::allocator > const, referenced_object> > std::_Rb_tree, std::allocator >, std::pair, std::allocator > const, referenced_object>, std::_Select1st, std::allocator > const, referenced_object> >, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::_M_emplace_hint_unique, std::allocator > const&>, std::tuple<> >(std::_Rb_tree_const_iterator, std::allocator > const, referenced_object> >, std::piecewise_construct_t const&, std::tuple, std::allocator > const&>&&, std::tuple<>&&) /usr/include/c++/10/bits/stl_tree.h:2461
    #6 0x5555590e8757 in std::map, std::allocator >, referenced_object, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::operator[](std::__cxx11::basic_string, std::allocator > const&) /usr/include/c++/10/bits/stl_map.h:501
    #7 0x5555590d48b0 in Orch::setObjectReference(std::map, std::allocator >, std::map, std::allocator >, referenced_object, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >*, std::less, std::allocator > >, std::allocator, std::allocator > const, std::map, std::allocator >, referenced_object, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >*> > >&, std::__cxx11::basic_string, std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&) orchagent/orch.cpp:450
    #8 0x5555594ff66b in QosOrch::handleQueueTable(Consumer&, std::tuple, std::allocator >, std::__cxx11::basic_string, std::allocator >, std::vector, std::allocator >, std::__cxx11::basic_string, std::allocator > >, std::allocator, std::allocator >, std::__cxx11::basic_string, std::allocator > > > > >&) orchagent/qosorch.cpp:1763
    #9 0x5555594edbd6 in QosOrch::doTask(Consumer&) orchagent/qosorch.cpp:2179
    #10 0x5555590c8743 in Consumer::drain() orchagent/orch.cpp:241
    #11 0x5555590c8743 in Consumer::drain() orchagent/orch.cpp:238
    #12 0x5555590c8743 in Consumer::execute() orchagent/orch.cpp:235
    #13 0x555559090dad in OrchDaemon::start() orchagent/orchdaemon.cpp:755
    #14 0x555558e9be25 in main orchagent/main.cpp:766
    #15 0x7f0a299b6d09 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x23d09)
```
dgsudharsan pushed a commit that referenced this pull request Aug 18, 2023
**What I did**

Fix the Mem Leak by moving the raw pointers in type_maps to use smart pointers

**Why I did it**

```
Indirect leak of 83776 byte(s) in 476 object(s) allocated from:
    #0 0x7f0a2a414647 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x5555590cc923 in __gnu_cxx::new_allocator, std::allocator > const, referenced_object> > >::allocate(unsigned long, void const*) /usr/include/c++/10/ext/new_allocator.h:115
    #2 0x5555590cc923 in std::allocator_traits, std::allocator > const, referenced_object> > > >::allocate(std::allocator, std::allocator > const, referenced_object> > >&, unsigned long) /usr/include/c++/10/bits/alloc_traits.h:460
    #3 0x5555590cc923 in std::_Rb_tree, std::allocator >, std::pair, std::allocator > const, referenced_object>, std::_Select1st, std::allocator > const, referenced_object> >, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::_M_get_node() /usr/include/c++/10/bits/stl_tree.h:584
    #4 0x5555590cc923 in std::_Rb_tree_node, std::allocator > const, referenced_object> >* std::_Rb_tree, std::allocator >, std::pair, std::allocator > const, referenced_object>, std::_Select1st, std::allocator > const, referenced_object> >, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::_M_create_node, std::allocator > const&>, std::tuple<> >(std::piecewise_construct_t const&, std::tuple, std::allocator > const&>&&, std::tuple<>&&) /usr/include/c++/10/bits/stl_tree.h:634
    #5 0x5555590cc923 in std::_Rb_tree_iterator, std::allocator > const, referenced_object> > std::_Rb_tree, std::allocator >, std::pair, std::allocator > const, referenced_object>, std::_Select1st, std::allocator > const, referenced_object> >, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::_M_emplace_hint_unique, std::allocator > const&>, std::tuple<> >(std::_Rb_tree_const_iterator, std::allocator > const, referenced_object> >, std::piecewise_construct_t const&, std::tuple, std::allocator > const&>&&, std::tuple<>&&) /usr/include/c++/10/bits/stl_tree.h:2461
    #6 0x5555590e8757 in std::map, std::allocator >, referenced_object, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::operator[](std::__cxx11::basic_string, std::allocator > const&) /usr/include/c++/10/bits/stl_map.h:501
    #7 0x5555590d48b0 in Orch::setObjectReference(std::map, std::allocator >, std::map, std::allocator >, referenced_object, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >*, std::less, std::allocator > >, std::allocator, std::allocator > const, std::map, std::allocator >, referenced_object, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >*> > >&, std::__cxx11::basic_string, std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&) orchagent/orch.cpp:450
    #8 0x5555594ff66b in QosOrch::handleQueueTable(Consumer&, std::tuple, std::allocator >, std::__cxx11::basic_string, std::allocator >, std::vector, std::allocator >, std::__cxx11::basic_string, std::allocator > >, std::allocator, std::allocator >, std::__cxx11::basic_string, std::allocator > > > > >&) orchagent/qosorch.cpp:1763
    #9 0x5555594edbd6 in QosOrch::doTask(Consumer&) orchagent/qosorch.cpp:2179
    #10 0x5555590c8743 in Consumer::drain() orchagent/orch.cpp:241
    #11 0x5555590c8743 in Consumer::drain() orchagent/orch.cpp:238
    #12 0x5555590c8743 in Consumer::execute() orchagent/orch.cpp:235
    #13 0x555559090dad in OrchDaemon::start() orchagent/orchdaemon.cpp:755
    #14 0x555558e9be25 in main orchagent/main.cpp:766
    #15 0x7f0a299b6d09 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x23d09)
```
dgsudharsan pushed a commit that referenced this pull request Sep 22, 2023
**What I did**

Fix the Mem Leak by moving the raw pointers in type_maps to use smart pointers

**Why I did it**

```
Indirect leak of 83776 byte(s) in 476 object(s) allocated from:
    #0 0x7f0a2a414647 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x5555590cc923 in __gnu_cxx::new_allocator, std::allocator > const, referenced_object> > >::allocate(unsigned long, void const*) /usr/include/c++/10/ext/new_allocator.h:115
    #2 0x5555590cc923 in std::allocator_traits, std::allocator > const, referenced_object> > > >::allocate(std::allocator, std::allocator > const, referenced_object> > >&, unsigned long) /usr/include/c++/10/bits/alloc_traits.h:460
    #3 0x5555590cc923 in std::_Rb_tree, std::allocator >, std::pair, std::allocator > const, referenced_object>, std::_Select1st, std::allocator > const, referenced_object> >, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::_M_get_node() /usr/include/c++/10/bits/stl_tree.h:584
    #4 0x5555590cc923 in std::_Rb_tree_node, std::allocator > const, referenced_object> >* std::_Rb_tree, std::allocator >, std::pair, std::allocator > const, referenced_object>, std::_Select1st, std::allocator > const, referenced_object> >, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::_M_create_node, std::allocator > const&>, std::tuple<> >(std::piecewise_construct_t const&, std::tuple, std::allocator > const&>&&, std::tuple<>&&) /usr/include/c++/10/bits/stl_tree.h:634
    #5 0x5555590cc923 in std::_Rb_tree_iterator, std::allocator > const, referenced_object> > std::_Rb_tree, std::allocator >, std::pair, std::allocator > const, referenced_object>, std::_Select1st, std::allocator > const, referenced_object> >, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::_M_emplace_hint_unique, std::allocator > const&>, std::tuple<> >(std::_Rb_tree_const_iterator, std::allocator > const, referenced_object> >, std::piecewise_construct_t const&, std::tuple, std::allocator > const&>&&, std::tuple<>&&) /usr/include/c++/10/bits/stl_tree.h:2461
    #6 0x5555590e8757 in std::map, std::allocator >, referenced_object, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::operator[](std::__cxx11::basic_string, std::allocator > const&) /usr/include/c++/10/bits/stl_map.h:501
    #7 0x5555590d48b0 in Orch::setObjectReference(std::map, std::allocator >, std::map, std::allocator >, referenced_object, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >*, std::less, std::allocator > >, std::allocator, std::allocator > const, std::map, std::allocator >, referenced_object, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >*> > >&, std::__cxx11::basic_string, std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&) orchagent/orch.cpp:450
    #8 0x5555594ff66b in QosOrch::handleQueueTable(Consumer&, std::tuple, std::allocator >, std::__cxx11::basic_string, std::allocator >, std::vector, std::allocator >, std::__cxx11::basic_string, std::allocator > >, std::allocator, std::allocator >, std::__cxx11::basic_string, std::allocator > > > > >&) orchagent/qosorch.cpp:1763
    #9 0x5555594edbd6 in QosOrch::doTask(Consumer&) orchagent/qosorch.cpp:2179
    #10 0x5555590c8743 in Consumer::drain() orchagent/orch.cpp:241
    #11 0x5555590c8743 in Consumer::drain() orchagent/orch.cpp:238
    #12 0x5555590c8743 in Consumer::execute() orchagent/orch.cpp:235
    #13 0x555559090dad in OrchDaemon::start() orchagent/orchdaemon.cpp:755
    #14 0x555558e9be25 in main orchagent/main.cpp:766
    #15 0x7f0a299b6d09 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x23d09)
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants