Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[teamd]: Increase wait timeout for teamd docker stop to clean Port channels. #6537

Merged
merged 2 commits into from
Jan 24, 2021

Conversation

judyjoseph
Copy link
Contributor

@judyjoseph judyjoseph commented Jan 23, 2021

- Why I did it
The Portchannels were not getting cleaned up as the cleanup activity was taking more than 10 secs which is default docker timeout after which a SIGKILL will be send.
Fixes #6199
To check if it works out for this issue in 201911 ? #6503

This issue is significantly seen in master branch compared to 201911 because the Portchannel cleanup takes more time in master. Test on a DUT with 8 Port Channels.

master

    admin@str-s6000-acs-8:~$ time sudo systemctl stop teamd
    real    0m15.599s
    user    0m0.061s
    sys     0m0.038s

Sonic 201911.v58

    admin@str-s6000-acs-8:~$ time sudo systemctl stop teamd
    real    0m5.541s
    user    0m0.020s
    sys     0m0.028s

- How I did it
Increased the timeout, by passing the timeout argument of 60sec now to container stop in case of teamd container.

- How to verify it
Checked the following cases and made sure the Portchannels are cleanup properly. Checked that in case of Warm restart of teamd, it doesn't use more time.

Case 1:
Normal stop of teamd with the fix, used teh script below to try multilple times.

admin@str-s6000-acs-8:~$ cat a.sh 
#!/bin/bash

for i in {1..7}
do
  echo "Running $i times"
  sudo systemctl reset-failed swss.service
  sudo systemctl reset-failed syncd.service
  sudo systemctl reset-failed teamd.service
  time sudo systemctl stop teamd.service
  a=`ip link | grep PortChannel | wc -l`
  if [ $a -ne 0 ]
  then
    echo "FAILED"
  fi
  sleep 2m
done

admin@str-s6000-acs-8:~$ ./a.sh 
Running 1 times
real    0m17.692s
user    0m0.033s
sys     0m0.047s

Running 2 times
real    0m19.542s
user    0m0.059s
sys     0m0.039s

Running 3 times
real    0m19.171s
user    0m0.047s
sys     0m0.057s

Running 4 times
real    0m17.320s
user    0m0.061s
sys     0m0.038s

Running 5 times
real    0m16.903s
user    0m0.039s
sys     0m0.058s

Running 6 times
real    0m20.344s
user    0m0.042s
sys     0m0.040s

Running 7 times
real    0m19.468s
user    0m0.052s
sys     0m0.051s



Jan 22 23:29:59.452700 str-s6000-acs-8 NOTICE admin: Stopping teamd service...
Jan 22 23:30:00.545542 str-s6000-acs-8 NOTICE admin: Warm boot flag: teamd false.
Jan 22 23:30:00.568256 str-s6000-acs-8 NOTICE admin: Fast boot flag: teamd false.
Jan 22 23:30:02.066838 str-s6000-acs-8 DEBUG /container: read_data: config:True feature:teamd fields:[('set_owner', 'local'), ('no_fallback_to_local', False)] val:['local', False]
Jan 22 23:30:02.070179 str-s6000-acs-8 DEBUG /container: read_data: config:False feature:teamd fields:[('current_owner', 'none'), ('remote_state', 'none'), ('container_id', '')] val:['none', 'none', '']
Jan 22 23:30:02.073550 str-s6000-acs-8 DEBUG /container: container_stop: teamd: set_owner:local current_owner:none remote_state:none docker_id:teamd
Jan 22 23:30:03.048436 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:30:03,046 WARN received SIGTERM indicating exit request
Jan 22 23:30:03.050080 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:30:03,048 INFO waiting for supervisor-proc-exit-listener, rsyslogd, teammgrd, teamsyncd, tlm_teamd to die
Jan 22 23:30:04.072925 str-s6000-acs-8 NOTICE teamd#tlm_teamd: :- main: Exiting
Jan 22 23:30:04.073820 str-s6000-acs-8 NOTICE teamd#tlm_teamd: :- ~TeamdCtlMgr: Exiting. Disconnecting from teamd. LAG 'PortChannel0020'
Jan 22 23:30:04.074460 str-s6000-acs-8 NOTICE teamd#tlm_teamd: :- ~TeamdCtlMgr: Exiting. Disconnecting from teamd. LAG 'PortChannel0005'
Jan 22 23:30:04.075040 str-s6000-acs-8 NOTICE teamd#tlm_teamd: :- ~TeamdCtlMgr: Exiting. Disconnecting from teamd. LAG 'PortChannel0002'
Jan 22 23:30:04.075589 str-s6000-acs-8 NOTICE teamd#tlm_teamd: :- ~TeamdCtlMgr: Exiting. Disconnecting from teamd. LAG 'PortChannel0008'
Jan 22 23:30:04.076819 str-s6000-acs-8 NOTICE teamd#tlm_teamd: :- ~TeamdCtlMgr: Exiting. Disconnecting from teamd. LAG 'PortChannel0011'
Jan 22 23:30:04.078096 str-s6000-acs-8 NOTICE teamd#tlm_teamd: :- ~TeamdCtlMgr: Exiting. Disconnecting from teamd. LAG 'PortChannel0023'
Jan 22 23:30:04.080976 str-s6000-acs-8 NOTICE teamd#tlm_teamd: :- ~TeamdCtlMgr: Exiting. Disconnecting from teamd. LAG 'PortChannel0014'
Jan 22 23:30:04.082519 str-s6000-acs-8 NOTICE teamd#tlm_teamd: :- ~TeamdCtlMgr: Exiting. Disconnecting from teamd. LAG 'PortChannel0017'
Jan 22 23:30:04.407708 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:30:04,405 INFO stopped: tlm_teamd (exit status 0)
Jan 22 23:30:05.408798 str-s6000-acs-8 NOTICE teamd#teamsyncd: :- cleanTeamSync: Cleaning up LAG teamd resources ...
Jan 22 23:30:05.436476 str-s6000-acs-8 NOTICE teamd#teamsyncd: :- main: Received SIGTERM Exiting
Jan 22 23:30:06.034828 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:30:06,032 INFO stopped: teamsyncd (exit status 0)
Jan 22 23:30:07.036965 str-s6000-acs-8 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Cleaning up LAGs during shutdown...
Jan 22 23:30:07.039261 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:30:07,036 INFO waiting for supervisor-proc-exit-listener, rsyslogd, teammgrd to die
Jan 22 23:30:08.042911 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:30:08,041 INFO reaped unknown pid 26 (exit status 0)
Jan 22 23:30:08.068857 str-s6000-acs-8 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel0002
Jan 22 23:30:09.046723 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:30:09,045 INFO reaped unknown pid 34 (exit status 0)
Jan 22 23:30:09.098420 str-s6000-acs-8 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel0005
Jan 22 23:30:10.049789 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:30:10,048 INFO reaped unknown pid 42 (exit status 0)
Jan 22 23:30:10.051718 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:30:10,050 INFO waiting for supervisor-proc-exit-listener, rsyslogd, teammgrd to die
Jan 22 23:30:10.126647 str-s6000-acs-8 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel0008
Jan 22 23:30:11.054045 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:30:11,052 INFO reaped unknown pid 50 (exit status 0)
Jan 22 23:30:11.063259 str-s6000-acs-8 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel0011
Jan 22 23:30:12.057358 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:30:12,055 INFO reaped unknown pid 59 (exit status 0)
Jan 22 23:30:12.090348 str-s6000-acs-8 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel0014
Jan 22 23:30:13.063335 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:30:13,060 INFO reaped unknown pid 68 (exit status 0)
Jan 22 23:30:13.065219 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:30:13,062 INFO waiting for supervisor-proc-exit-listener, rsyslogd, teammgrd to die
Jan 22 23:30:13.123586 str-s6000-acs-8 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel0017
Jan 22 23:30:14.068622 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:30:14,066 INFO reaped unknown pid 76 (exit status 0)
Jan 22 23:30:14.155669 str-s6000-acs-8 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel0020
Jan 22 23:30:15.071850 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:30:15,070 INFO reaped unknown pid 84 (exit status 0)
Jan 22 23:30:15.107814 str-s6000-acs-8 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel0023
Jan 22 23:30:15.107814 str-s6000-acs-8 NOTICE teamd#teammgrd: :- main: Exiting
Jan 22 23:30:15.345921 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:30:15,344 INFO stopped: teammgrd (exit status 255)
Jan 22 23:30:16.867661 str-s6000-acs-8 INFO /container: docker cmd: wait for teamd
Jan 22 23:30:16.869580 str-s6000-acs-8 INFO /container: docker cmd: stop for teamd
Jan 22 23:30:17.047694 str-s6000-acs-8 NOTICE admin: Stopped teamd service...
Jan 22 23:30:17.060336 str-s6000-acs-8 INFO systemd[1]: teamd.service: Succeeded.

Case2:
Time it takes for swss restart

admin@str-s6000-acs-8:~$ time sudo systemctl restart swss

real    1m6.353s
user    0m0.062s
sys     0m0.061s

Case3:
With warm restart enabled, the stop of teamd still takes around 8 secs same as before this fix.
NOTE: There is a runtime error Runtime error: ( check below logs ) which needs to be fixed. It is an existing error.
Doesn't cause any functional issue,

admin@str-s6000-acs-8:~$ time sudo systemctl stop teamd

real    0m8.014s
user    0m0.047s
sys     0m0.028s


admin@str-s6000-acs-8:~$ sudo cat /var/log/syslog | grep teamd
Jan 22 23:57:11.365000 str-s6000-acs-8 NOTICE admin: Stopping teamd service...
Jan 22 23:57:12.309346 str-s6000-acs-8 NOTICE admin: Warm boot flag: teamd true.
Jan 22 23:57:12.329015 str-s6000-acs-8 NOTICE admin: Fast boot flag: teamd false.
Jan 22 23:57:12.797953 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:57:12,788 INFO exited: tlm_teamd (terminated by SIGUSR1; not expected)
Jan 22 23:57:13.794267 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:57:13,790 INFO reaped unknown pid 27 (exit status 0)
Jan 22 23:57:13.796290 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:57:13,792 INFO reaped unknown pid 35 (exit status 0)
Jan 22 23:57:13.798770 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:57:13,794 INFO reaped unknown pid 43 (exit status 0)
Jan 22 23:57:13.801216 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:57:13,796 INFO reaped unknown pid 51 (exit status 0)
Jan 22 23:57:13.806513 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:57:13,799 INFO reaped unknown pid 59 (exit status 0)
Jan 22 23:57:13.807219 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:57:13,801 INFO reaped unknown pid 67 (exit status 0)
Jan 22 23:57:13.809515 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:57:13,804 INFO reaped unknown pid 77 (exit status 0)
Jan 22 23:57:13.809515 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:57:13,805 INFO reaped unknown pid 85 (exit status 0)
Jan 22 23:57:13.840782 str-s6000-acs-8 INFO teamd#/supervisor-proc-exit-listener: Process tlm_teamd exited unxepectedly. Terminating supervisor...
Jan 22 23:57:13.844419 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:57:13,841 WARN received SIGTERM indicating exit request
Jan 22 23:57:13.846787 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:57:13,843 INFO waiting for supervisor-proc-exit-listener, rsyslogd, teammgrd, teamsyncd to die
Jan 22 23:57:14.316749 str-s6000-acs-8 DEBUG /container: read_data: config:True feature:teamd fields:[('set_owner', 'local'), ('no_fallback_to_local', False)] val:['local', False]
Jan 22 23:57:14.321106 str-s6000-acs-8 DEBUG /container: read_data: config:False feature:teamd fields:[('current_owner', 'none'), ('remote_state', 'none'), ('container_id', '')] val:['none', 'none', '']
Jan 22 23:57:14.326712 str-s6000-acs-8 DEBUG /container: container_stop: teamd: set_owner:local current_owner:none remote_state:none docker_id:teamd
Jan 22 23:57:14.851174 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:57:14,847 WARN received SIGTERM indicating exit request
Jan 22 23:57:15.851679 str-s6000-acs-8 NOTICE teamd#teamsyncd: :- cleanTeamSync: Cleaning up LAG teamd resources ...
Jan 22 23:57:15.881216 str-s6000-acs-8 NOTICE teamd#teamsyncd: :- main: Received SIGTERM Exiting
Jan 22 23:57:16.051175 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:57:16,046 INFO stopped: teamsyncd (exit status 0)
Jan 22 23:57:17.052926 str-s6000-acs-8 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Cleaning up LAGs during shutdown...
Jan 22 23:57:17.055018 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:57:17,051 INFO waiting for supervisor-proc-exit-listener, rsyslogd, teammgrd to die
Jan 22 23:57:17.079633 str-s6000-acs-8 INFO teamd#/supervisord: teammgrd Daemon not running
Jan 22 23:57:17.081212 str-s6000-acs-8 INFO teamd#/supervisord: teammgrd 
Jan 22 23:57:17.085532 str-s6000-acs-8 ERR teamd#teammgrd: :- main: Runtime error: /usr/bin/teamd -k -t "PortChannel0002" : 
Jan 22 23:57:17.507577 str-s6000-acs-8 INFO teamd#supervisord 2021-01-22 23:57:17,505 INFO stopped: teammgrd (exit status 255)
Jan 22 23:57:19.019533 str-s6000-acs-8 INFO /container: docker cmd: stop for teamd
Jan 22 23:57:19.024002 str-s6000-acs-8 INFO /container: docker cmd: wait for teamd
Jan 22 23:57:19.246157 str-s6000-acs-8 NOTICE admin: Stopped teamd service...
Jan 22 23:57:19.272515 str-s6000-acs-8 INFO systemd[1]: teamd.service: Succeeded.

- Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

@lguohan lguohan changed the title Increase wait timeout for teamd docker stop to clean Port channels. [teamd]: Increase wait timeout for teamd docker stop to clean Port channels. Jan 23, 2021
@judyjoseph
Copy link
Contributor Author

retest vsimage please

@lguohan lguohan merged commit 46b3bd5 into sonic-net:master Jan 24, 2021
@liat-grozovik
Copy link
Collaborator

@lguohan as we can see the same problem on 201911, will you consider taking the fix to 201911 as well?
I also think we need to set the label as request for 202012.

@liat-grozovik
Copy link
Collaborator

liat-grozovik commented Feb 2, 2021

@judyjoseph and @dallan I believe this one should be cherry picked to 202012. there is no label . Can you please add?
@abdosi will you take it for 201911 as well?

daall pushed a commit that referenced this pull request Feb 6, 2021
…annels. (#6537)

The Portchannels were not getting cleaned up as the cleanup activity was taking more than 10 secs which is default docker timeout after which a SIGKILL will be send.
Fixes #6199
To check if it works out for this issue in 201911 ? #6503

This issue is significantly seen in master branch compared to 201911 because the Portchannel cleanup takes more time in master. Test on a DUT with 8 Port Channels.

master

    admin@str-s6000-acs-8:~$ time sudo systemctl stop teamd
    real    0m15.599s
    user    0m0.061s
    sys     0m0.038s
Sonic 201911.v58

    admin@str-s6000-acs-8:~$ time sudo systemctl stop teamd
    real    0m5.541s
    user    0m0.020s
    sys     0m0.028s
judyjoseph added a commit to judyjoseph/sonic-buildimage that referenced this pull request Feb 19, 2021
abdosi pushed a commit that referenced this pull request Feb 27, 2021
The Portchannels were not getting cleaned up as the cleanup activity was taking more than 10 secs which is default docker timeout after which a SIGKILL will be send.

Fix Issue #6537
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PortChannel is not fully cleared when teamd is stopped
4 participants