[201911][teamd] Interface PortChannel not getting IP assigned when large number of PortChannel RIF are defined #6503

Closed
nazariig opened this issue Jan 20, 2021 · 5 comments

nazariig (Collaborator) commented Jan 20, 2021

Description

PortChannel netdevs are left behind in the kernel after SWSS is stopped:

root@sonic:/home/admin# service swss stop

root@sonic:/home/admin# docker ps
CONTAINER ID        IMAGE                           COMMAND                  CREATED             STATUS              PORTS               NAMES
38e18bfc2863        docker-sonic-telemetry:latest   "/usr/bin/supervisord"   20 hours ago        Up 20 hours                             telemetry
cca41e39c4bd        docker-lldp-sv2:latest          "/usr/bin/docker-lld…"   20 hours ago        Up 20 hours                             lldp
b69a0e0d0b3b        docker-fpm-frr:latest           "/usr/bin/supervisord"   20 hours ago        Up 20 hours                             bgp
30a550725a7a        docker-database:latest          "/usr/local/bin/dock…"   20 hours ago        Up 20 hours  

root@sonic:/home/admin# ip link | grep PortC
1024: PortChannel79: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9178 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
1025: PortChannel8: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9178 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
1026: PortChannel80: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9178 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
1027: PortChannel9: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9178 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
1023: PortChannel78: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9178 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
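A quick way to confirm the symptom after `service swss stop` is to count the PortChannel netdevs the kernel still holds. This is a minimal sketch: the function name `count_leftover_portchannels` is hypothetical, and it assumes the `N: PortChannelX:` line prefix shown in the `ip link` output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count PortChannel netdevs in "ip link" output read
# from stdin. Matches lines such as "1024: PortChannel79: <...>".
count_leftover_portchannels() {
    # grep -c prints 0 (and exits 1) when nothing matches; "|| true" keeps
    # the function's exit status at 0 in that case.
    grep -cE '^[0-9]+: PortChannel[0-9]+:' || true
}

# Usage on the DUT, after "service swss stop":
#   ip -o link show | count_leftover_portchannels
# A non-zero count reproduces the leftover state shown above.
```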

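The issue is caused by the docker stop timeout. teammgrd is still removing LAGs (removeLag) when dockerd's 10-second grace period after SIGTERM (signal 15) expires, so the container is force-killed (exit status 137 = 128 + SIGKILL) and the remaining PortChannel netdevs are left in the kernel: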

Jan 19 12:50:14.880443 sonic DEBUG teamd#teammgrd: :< removeLag: exit
Jan 19 12:50:14.880443 sonic DEBUG teamd#teammgrd: :> removeLag: enter
Jan 19 12:50:14.987525 sonic DEBUG teamd#teammgrd: :- exec: /usr/bin/teamd -k -t "PortChannel76" :  : Exited with rc=0
Jan 19 12:50:14.987525 sonic NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel76
Jan 19 12:50:14.987525 sonic DEBUG teamd#teammgrd: :< removeLag: exit
Jan 19 12:50:14.987525 sonic DEBUG teamd#teammgrd: :> removeLag: enter
Jan 19 12:50:15.037884 sonic INFO dockerd[1201]: time="2021-01-19T12:50:15.037610320Z" level=info msg="Container 53023ce4f432583df5894f862b99b9b69ce1f3e8f3fa30e0814b5e8ba5e45b1f failed to exit within 10 seconds of signal 15 - using the force" 
Jan 19 12:50:15.186297 sonic INFO containerd[717]: time="2021-01-19T12:50:15.186100778Z" level=info msg="shim reaped" id=53023ce4f432583df5894f862b99b9b69ce1f3e8f3fa30e0814b5e8ba5e45b1f
Jan 19 12:50:15.196669 sonic INFO dockerd[1201]: time="2021-01-19T12:50:15.196514937Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete" 
Jan 19 12:50:15.272599 sonic INFO teamd.sh[23176]: teamd
Jan 19 12:50:15.273546 sonic INFO teamd.sh[19107]: 137
Jan 19 12:50:15.278155 sonic INFO systemd[1]: Stopped TEAMD container.
Jan 19 12:50:15.284218 sonic INFO systemd[1]: Stopping SNMP container...
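To see how close a given number of LAGs gets to the 10-second budget, the `removeLag` enter/exit DEBUG lines above can be paired to time each call. A minimal sketch, assuming teammgrd DEBUG logging is enabled as in this capture, the syslog layout shown (time of day in the third field), and that no call spans midnight; `lag_remove_durations` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the duration (seconds) of each teammgrd
# removeLag call found in syslog text read from stdin.
lag_remove_durations() {
    awk '
    # "HH:MM:SS.ffffff" -> seconds since midnight
    function secs(t,    p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
    /teammgrd: :> removeLag: enter/ { start = secs($3); have = 1 }
    # An "exit" with no preceding "enter" (e.g. the first line of a capture) is ignored.
    /teammgrd: :< removeLag: exit/ && have {
        printf "%.3f\n", secs($3) - start
        have = 0
    }'
}

# Usage on the DUT:
#   grep removeLag /var/log/syslog | lag_remove_durations
```

Since the calls in the log above run one after another, summing the printed values gives a rough lower bound on how long teammgrd needs to remove all LAGs.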

Steps to reproduce the issue:

  1. Configure 80 or more LAG RIFs (PortChannel router interfaces)
  2. config reload -y
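Step 1 can be scripted. This sketch only prints SONiC CLI commands so they can be reviewed before running. The function name, the `PortChannel<N>` numbering, and the 10.0.x.y/31 addressing are hypothetical placeholders. LAG members (`config portchannel member add`) are omitted for brevity; add them if the reproduction needs the LAGs to come up.

```shell
#!/usr/bin/env bash
# Hypothetical generator: print SONiC CLI commands that create COUNT
# PortChannel RIFs, each with a /31 IPv4 address, then save and reload.
gen_lag_rif_cmds() {
    local count=${1:-80} i
    for (( i = 1; i <= count; i++ )); do
        echo "config portchannel add PortChannel${i}"
        # Placeholder addressing: 10.0.<hi>.<lo>/31, stepping by 2 per LAG.
        printf 'config interface ip add PortChannel%d 10.0.%d.%d/31\n' \
            "$i" $(( (2 * i) / 256 )) $(( (2 * i) % 256 ))
    done
    echo "config save -y"
    echo "config reload -y"
}

# Review the output, then run it on the DUT, e.g.:
#   gen_lag_rif_cmds 80 | sudo bash
```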

Describe the results you received:
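LAG RIF init flow. Note the ordering: intfmgrd adds the IPv4/IPv6 addresses to PortChannel78 at 16:31:26.108-.110, before teammgrd starts teamd for it (16:31:26.569-.627):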

Jan 18 16:31:26.106976 sonic DEBUG bgp#bgpcfgd: Received message : '('PortChannel78', 'SET', (('vrf', ''),))'
Jan 18 16:31:26.108598 sonic DEBUG swss#intfmgrd: :- exec: /sbin/ip address "add" "10.210.10.122/31" dev "PortChannel78" :  : Exited with rc=0
Jan 18 16:31:26.108977 sonic DEBUG bgp#bgpcfgd: Received message : '('PortChannel78|10.210.10.122/31', 'SET', (('state', 'ok'),))'
Jan 18 16:31:26.109089 sonic DEBUG swss#intfmgrd: :- isIntfStateOk: Lag PortChannel78 is ready
Jan 18 16:31:26.109104 sonic DEBUG swss#intfmgrd: :- isIntfCreated: Intf PortChannel78 is ready
Jan 18 16:31:26.110793 sonic DEBUG swss#intfmgrd: :- exec: /sbin/ip -6 address "add" "2001:558:1b0:8182::a7a/127" dev "PortChannel78" :  : Exited with rc=0
Jan 18 16:31:26.111217 sonic DEBUG bgp#bgpcfgd: Received message : '('PortChannel78|2001:558:1b0:8182::a7a/127', 'SET', (('state', 'ok'),))'
Jan 18 16:31:26.569054 sonic INFO teamd#supervisord: teammgrd Using team device "PortChannel78".
Jan 18 16:31:26.569054 sonic INFO teamd#supervisord: teammgrd Using PID file "/var/run/teamd/PortChannel78.pid" 
Jan 18 16:31:26.611803 sonic WARNING systemd-udevd[21687]: Could not generate persistent MAC address for PortChannel78: No such file or directory
Jan 18 16:31:26.615219 sonic INFO kernel: [72536.142794] PortChannel78: Mode changed to "loadbalance" 
Jan 18 16:31:26.627550 sonic NOTICE teamd#teammgrd: :- addLag: Start port channel PortChannel78 with teamd
Jan 18 16:31:26.633172 sonic NOTICE teamd#teammgrd: :- setLagAdminStatus: Set port channel PortChannel78 admin status to up
Jan 18 16:31:26.635222 sonic INFO kernel: [72536.160119] IPv6: ADDRCONF(NETDEV_UP): PortChannel78: link is not ready
Jan 18 16:31:26.635229 sonic INFO kernel: [72536.160123] 8021q: adding VLAN 0 to HW filter on device PortChannel78
Jan 18 16:31:26.640565 sonic NOTICE teamd#teammgrd: :- setLagMtu: Set port channel PortChannel78 MTU to 9178
Jan 18 16:31:43.775267 sonic INFO kernel: [72553.303776] PortChannel78: Port device Ethernet122 added
Jan 18 16:31:43.780827 sonic NOTICE teamd#teammgrd: :- addLagMember: Add Ethernet122 to port channel PortChannel78
Jan 18 16:31:44.743635 sonic NOTICE swss#orchagent: :- addLag: Create an empty LAG PortChannel78 lid:2000000001048
Jan 18 16:31:44.788708 sonic NOTICE swss#orchagent: :- addLagMember: Add member Ethernet122 to LAG PortChannel78 lid:2000000001048 pid:1000000000f6f
Jan 18 16:31:44.841060 sonic NOTICE swss#orchagent: :- addRouterIntfs: Create router interface PortChannel78 MTU 9178

Describe the results you expected:
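LAG RIF init flow. Here teammgrd starts teamd for PortChannel72 first (16:31:26.387-.405), and intfmgrd adds the addresses afterwards (16:31:27.223-.225):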

Jan 18 16:31:26.387373 sonic INFO teamd#supervisord: teammgrd Using team device "PortChannel72".
Jan 18 16:31:26.387399 sonic INFO teamd#supervisord: teammgrd Using PID file "/var/run/teamd/PortChannel72.pid" 
Jan 18 16:31:26.394357 sonic WARNING systemd-udevd[21482]: Could not generate persistent MAC address for PortChannel72: No such file or directory
Jan 18 16:31:26.399225 sonic INFO kernel: [72535.926678] PortChannel72: Mode changed to "loadbalance" 
Jan 18 16:31:26.405070 sonic NOTICE teamd#teammgrd: :- addLag: Start port channel PortChannel72 with teamd
Jan 18 16:31:26.408981 sonic NOTICE teamd#teammgrd: :- setLagAdminStatus: Set port channel PortChannel72 admin status to up
Jan 18 16:31:26.411223 sonic INFO kernel: [72535.936546] IPv6: ADDRCONF(NETDEV_UP): PortChannel72: link is not ready
Jan 18 16:31:26.411231 sonic INFO kernel: [72535.936550] 8021q: adding VLAN 0 to HW filter on device PortChannel72
Jan 18 16:31:26.411375 sonic NOTICE teamd#teammgrd: :- setLagMtu: Set port channel PortChannel72 MTU to 9178
Jan 18 16:31:27.222360 sonic DEBUG swss#intfmgrd: :- isIntfStateOk: Lag PortChannel72 is ready
Jan 18 16:31:27.222578 sonic DEBUG bgp#bgpcfgd: Received message : '('PortChannel72', 'SET', (('vrf', ''),))'
Jan 18 16:31:27.222606 sonic DEBUG swss#intfmgrd: :- isIntfStateOk: Lag PortChannel72 is ready
Jan 18 16:31:27.222606 sonic DEBUG swss#intfmgrd: :- isIntfCreated: Intf PortChannel72 is ready
Jan 18 16:31:27.223826 sonic DEBUG swss#intfmgrd: :- exec: /sbin/ip address "add" "10.210.10.110/31" dev "PortChannel72" :  : Exited with rc=0
Jan 18 16:31:27.224032 sonic DEBUG swss#intfmgrd: :- isIntfStateOk: Lag PortChannel72 is ready
Jan 18 16:31:27.224066 sonic DEBUG bgp#bgpcfgd: Received message : '('PortChannel72|10.210.10.110/31', 'SET', (('state', 'ok'),))'
Jan 18 16:31:27.224106 sonic DEBUG swss#intfmgrd: :- isIntfCreated: Intf PortChannel72 is ready
Jan 18 16:31:27.225363 sonic DEBUG swss#intfmgrd: :- exec: /sbin/ip -6 address "add" "2001:558:1b0:8182::a6e/127" dev "PortChannel72" :  : Exited with rc=0
Jan 18 16:31:27.225586 sonic DEBUG bgp#bgpcfgd: Received message : '('PortChannel72|2001:558:1b0:8182::a6e/127', 'SET', (('state', 'ok'),))'
Jan 18 16:31:42.599252 sonic INFO kernel: [72552.126121] PortChannel72: Port device Ethernet110 added
Jan 18 16:31:42.601621 sonic NOTICE teamd#teammgrd: :- addLagMember: Add Ethernet110 to port channel PortChannel72
Jan 18 16:31:44.740789 sonic NOTICE swss#orchagent: :- addLag: Create an empty LAG PortChannel72 lid:2000000001042
Jan 18 16:31:44.786850 sonic NOTICE swss#orchagent: :- addLagMember: Add member Ethernet110 to LAG PortChannel72 lid:2000000001042 pid:1000000000e73
Jan 18 16:31:44.838628 sonic NOTICE swss#orchagent: :- addRouterIntfs: Create router interface PortChannel72 MTU 9178

Additional information you deem important (e.g. issue happens only occasionally):
SONiC systemd default timeout:
https://www.freedesktop.org/software/systemd/man/systemd.service.html

root@sonic:/home/admin# cat /etc/systemd/system.conf | grep Timeout
#DefaultTimeoutStartSec=90s
#DefaultTimeoutStopSec=90s

SONiC docker default timeout:
https://docs.docker.com/engine/reference/commandline/stop/

Name, shorthand | Default | Description
--time , -t     | 10      | Seconds to wait for stop before killing it
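The two defaults interact. Assuming the teamd unit runs `teamd.sh stop` from its `ExecStop=` line (an assumption; check with `systemctl cat teamd`), systemd's `TimeoutStopSec` bounds that whole stop command, so any `docker stop -t` value chosen for teamd must leave headroom below it, otherwise systemd terminates the stop script instead. The helper below is a hypothetical sanity check on the two numbers, in seconds; the 10-second default margin is an illustrative assumption.

```shell
#!/usr/bin/env bash
# Hypothetical check: does "docker stop -t DOCKER_T" fit inside systemd's
# TimeoutStopSec=SYSTEMD_T, leaving at least MARGIN seconds for the SIGKILL
# and container teardown that follow the docker grace period?
stop_budget_ok() {
    local docker_t=$1 systemd_t=$2 margin=${3:-10}
    if (( docker_t + margin <= systemd_t )); then
        echo "ok: docker stop -t ${docker_t} fits in TimeoutStopSec=${systemd_t}s"
        return 0
    fi
    echo "too tight: docker stop -t ${docker_t} + ${margin}s margin > ${systemd_t}s" >&2
    return 1
}

# With the systemd default shown above (90 s):
#   stop_budget_ok 60 90   # ok
#   stop_budget_ok 90 90   # too tight
```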

Output of show version:

(paste your output here)

Attach debug file sudo generate_dump:

(paste your output here)
nazariig (Collaborator, Author) commented:

@abdosi, please have a look.

nazariig changed the title [teamd] Interface PortChannel not getting IP assigned when large number of PortChannel RIF are defined [201911][teamd] Interface PortChannel not getting IP assigned when large number of PortChannel RIF are defined Jan 20, 2021
anshuv-mfst commented:

@abdosi @judyjoseph - could you please look into this issue, thanks.

lguohan pushed a commit that referenced this issue Jan 24, 2021
…annels. (#6537)

The PortChannels were not getting cleaned up because the cleanup took more than 10 seconds, the default docker stop timeout, after which docker sends SIGKILL.
Fixes #6199
To check whether it also fixes this issue in 201911: #6503

This issue is much more visible on the master branch than on 201911 because PortChannel cleanup takes longer on master. Tested on a DUT with 8 PortChannels:

master

    admin@str-s6000-acs-8:~$ time sudo systemctl stop teamd
    real    0m15.599s
    user    0m0.061s
    sys     0m0.038s
Sonic 201911.v58

    admin@str-s6000-acs-8:~$ time sudo systemctl stop teamd
    real    0m5.541s
    user    0m0.020s
    sys     0m0.028s
judyjoseph (Contributor) commented Jan 29, 2021

@nazariig can you please check whether the fix in #6537 solves the issue for you? I am not able to bring up a 201911 testbed with a large number of PortChannels. The change can be tried quickly on the DUT by editing the stop() function in the file
/usr/bin/teamd.sh

    stop() {
        # Wait up to 60 s (docker default: 10 s) before docker sends SIGKILL
        docker stop -t 60 teamd$DEV
    }

Thanks !
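For a quick trial on a DUT, the edit can also be applied with sed instead of by hand. A minimal sketch, assuming the `docker stop` call in `/usr/bin/teamd.sh` currently passes no `-t`/`--time` option; the function name `add_docker_stop_timeout` is hypothetical, and the original file is kept as `teamd.sh.bak`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add "-t SECS" to every "docker stop " call in FILE
# that does not already pass -t/--time. The original is saved as FILE.bak.
add_docker_stop_timeout() {
    local file=$1 secs=${2:-60}
    # Lines that already carry a timeout are skipped (the "!" negates the
    # address), so running the helper twice does not stack options.
    sed -i.bak -E \
        '/docker stop +(-t|--time)/!s/docker stop /docker stop -t '"$secs"' /' \
        "$file"
}

# Usage as root on the DUT, then review the change:
#   add_docker_stop_timeout /usr/bin/teamd.sh 60
#   diff /usr/bin/teamd.sh.bak /usr/bin/teamd.sh
```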

nazariig (Collaborator, Author) commented:

@judyjoseph do we have a PR for 201911?

judyjoseph (Contributor) commented Feb 2, 2021

Yes, I have created one now: #6648. In my observation, PortChannel interface cleanup is faster on 201911 than on the master image.

daall pushed a commit that referenced this issue Feb 6, 2021