[orchagent] Orchagent keeps flooding ERROR log after teamd restarted #5971

Closed
bingwang-ms opened this issue Nov 19, 2020 · 9 comments
@bingwang-ms
Contributor

bingwang-ms commented Nov 19, 2020

Description
While debugging, I noticed that orchagent kept flooding the ERROR log after teamd was restarted.

...
Nov 19 08:32:06.695097 str-dx010-acs-4 ERR swss#orchagent: :- removeLag: Failed to remove ref count 1 LAG PortChannel0001
Nov 19 08:32:06.695097 str-dx010-acs-4 ERR swss#orchagent: :- removeLag: Failed to remove ref count 1 LAG PortChannel0002
Nov 19 08:32:06.695097 str-dx010-acs-4 ERR swss#orchagent: :- removeLag: Failed to remove ref count 1 LAG PortChannel0003
Nov 19 08:32:06.695097 str-dx010-acs-4 ERR swss#orchagent: :- removeLag: Failed to remove ref count 1 LAG PortChannel0004
Nov 19 08:32:06.695097 str-dx010-acs-4 ERR swss#orchagent: :- removeLag: Failed to remove ref count 1 LAG PortChannel0001
Nov 19 08:32:06.695247 str-dx010-acs-4 ERR swss#orchagent: :- removeLag: Failed to remove ref count 1 LAG PortChannel0002
Nov 19 08:32:06.695247 str-dx010-acs-4 ERR swss#orchagent: :- removeLag: Failed to remove ref count 1 LAG PortChannel0003
Nov 19 08:32:06.695378 str-dx010-acs-4 ERR swss#orchagent: :- removeLag: Failed to remove ref count 1 LAG PortChannel0004
...

And the port channels were unable to recover.

admin@str-dx010-acs-4:~$ netstat -apn|grep 179
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 0.0.0.0:179             0.0.0.0:*               LISTEN      -                   
tcp6       0      0 :::179                  :::*                    LISTEN      -                   
admin@str-dx010-acs-4:~$ show ip bgp sum

IPv4 Unicast Summary:
BGP router identifier 10.1.0.32, local AS number 65100 vrf-id 0
BGP table version 31996
RIB entries 3, using 552 bytes of memory
Peers 4, using 83680 KiB of memory
Peer groups 4, using 256 bytes of memory


Neighbhor      V     AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down    State/PfxRcd    NeighborName
-----------  ---  -----  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
10.0.0.57      4  64600       3270       4344         0      0       0  00:04:36   Active          ARISTA01T1
10.0.0.59      4  64600       3271       4387         0      0       0  00:04:34   Active          ARISTA02T1
10.0.0.61      4  64600       3269       4285         0      0       0  00:04:33   Active          ARISTA03T1
10.0.0.63      4  64600       3270       5355         0      0       0  00:04:31   Active          ARISTA04T1

Total number of neighbors 4

Steps to reproduce the issue:

  1. Issue `config reload` to initialize the DUT
  2. Kill a critical process in the teamd container, say teammgrd.
  3. Container teamd will be restarted, and the ERROR log starts flooding (a command sketch for these steps follows below).
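For reference, one way to run these steps from the host looks roughly like the following (a sketch; it assumes `pkill` is available inside the teamd container, and teammgrd is just the example process from step 2):

```
# Step 1: bring the DUT back to a clean state
sudo config reload -y

# Step 2: kill a critical process inside the teamd container
docker exec teamd pkill teammgrd

# Step 3: watch syslog on the host for the flooding removeLag errors
sudo tail -f /var/log/syslog | grep removeLag
```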

Describe the results you received:
Orchagent kept flooding the ERROR log, and the port channels were not recovered.

Describe the results you expected:
No error should be observed, and the port channels should recover.

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**
SONiC Software Version: SONiC.master.491-af654944
Distribution: Debian 10.6
Kernel: 4.19.0-9-2-amd64
Build commit: af654944
Build date: Sun Nov 15 11:20:25 UTC 2020
Built by: johnar@jenkins-worker-8

Platform: x86_64-cel_seastone-r0
HwSKU: Celestica-DX010-C32
ASIC: broadcom
Serial Number: DX010F2B118711MS100007
Uptime: 08:41:23 up 13:10,  2 users,  load average: 3.29, 2.95, 2.93

Docker images:
REPOSITORY                    TAG                   IMAGE ID            SIZE
docker-snmp                   latest                32e259e905f1        458MB
docker-snmp                   master.491-af654944   32e259e905f1        458MB
docker-teamd                  latest                7b81e6696fda        454MB
docker-teamd                  master.491-af654944   7b81e6696fda        454MB
docker-sonic-mgmt-framework   latest                5c1935209d08        578MB
docker-sonic-mgmt-framework   master.491-af654944   5c1935209d08        578MB
docker-router-advertiser      latest                35f218401a81        421MB
docker-router-advertiser      master.491-af654944   35f218401a81        421MB
docker-platform-monitor       latest                0c0fc1571140        534MB
docker-platform-monitor       master.491-af654944   0c0fc1571140        534MB
docker-lldp                   latest                4caf7e3adfbb        485MB
docker-lldp                   master.491-af654944   4caf7e3adfbb        485MB
docker-dhcp-relay             latest                bfa4574a3e32        428MB
docker-dhcp-relay             master.491-af654944   bfa4574a3e32        428MB
docker-database               latest                6f12088a213d        421MB
docker-database               master.491-af654944   6f12088a213d        421MB
docker-orchagent              latest                0afefebd7999        468MB
docker-orchagent              master.491-af654944   0afefebd7999        468MB
docker-nat                    latest                7a7557403aba        457MB
docker-nat                    master.491-af654944   7a7557403aba        457MB
docker-sonic-telemetry        latest                42b255c3f740        491MB
docker-sonic-telemetry        master.491-af654944   42b255c3f740        491MB
docker-fpm-frr                latest                80872db697c9        471MB
docker-fpm-frr                master.491-af654944   80872db697c9        471MB
docker-sflow                  latest                db4bcb44cb19        455MB
docker-sflow                  master.491-af654944   db4bcb44cb19        455MB
docker-syncd-brcm             latest                c4cd5010b4d2        536MB
docker-syncd-brcm             master.491-af654944   c4cd5010b4d2        536MB
**Attach debug file `sudo generate_dump`:**

syslog.tar.gz

@judyjoseph
Contributor

@bingwang-ms when the teamd docker goes down, it brings down both the syncd and swss dockers as well.
Please check whether the swss and syncd dockers also go down and get restarted during this time.

The error message is expected and should stop once swss goes down.

Later, when swss, syncd and teamd are all back up, the port channels should return to normal.
Could you give more details on that?
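For example, the restart behaviour can be checked with something like this (a rough sketch):

```
# Do swss, syncd and teamd all show a recent restart?
docker ps --format 'table {{.Names}}\t{{.Status}}' | grep -E 'swss|syncd|teamd'

# Is the removeLag error count still growing once swss has gone down?
sudo grep -c 'removeLag: Failed to remove ref count' /var/log/syslog
```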

@bingwang-ms
Contributor Author

@judyjoseph
I attempted to reproduce the issue on the latest image SONiC-OS-HEAD.85-61419f54 but failed. It looks like the issue has been fixed somehow.
Then I reinstalled SONiC.master.491-af654944 and the issue could be reproduced.
I saw that swss and syncd were not restarted when teamd went down, and the ERR messages kept flooding. The port channels still did not work after several minutes.

admin@str-dx010-acs-4:~$ show ip bgp sum

IPv4 Unicast Summary:
BGP router identifier 10.1.0.32, local AS number 65100 vrf-id 0
BGP table version 38376
RIB entries 3, using 552 bytes of memory
Peers 4, using 83680 KiB of memory
Peer groups 4, using 256 bytes of memory


Neighbhor      V     AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down    State/PfxRcd    NeighborName
-----------  ---  -----  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
10.0.0.57      4  64600       3240       6406         0      0       0  00:08:12   Active          ARISTA01T1
10.0.0.59      4  64600       3240       3458         0      0       0  00:08:10   Active          ARISTA02T1
10.0.0.61      4  64600       3240       7417         0      0       0  00:08:09   Active          ARISTA03T1
10.0.0.63      4  64600       3241       4269         0      0       0  00:08:08   Active          ARISTA04T1
admin@str-dx010-acs-4:~$ docker ps
CONTAINER ID        IMAGE                                COMMAND                  CREATED             STATUS              PORTS               NAMES
2f68ca33f04a        docker-snmp:latest                   "/usr/bin/supervisord"   14 minutes ago      Up 10 minutes                           snmp
b41302930f42        docker-sonic-mgmt-framework:latest   "/usr/bin/supervisord"   14 minutes ago      Up 14 minutes                           mgmt-framework
117612f72ca8        docker-router-advertiser:latest      "/usr/bin/docker-ini…"   17 minutes ago      Up 10 minutes                           radv
bf28ad6cffb6        docker-dhcp-relay:latest             "/usr/bin/docker_ini…"   17 minutes ago      Up 10 minutes                           dhcp_relay
34da36b66a6c        docker-lldp:latest                   "/usr/bin/docker-lld…"   17 minutes ago      Up 10 minutes                           lldp
8606aa356d20        docker-syncd-brcm:latest             "/usr/bin/supervisord"   17 minutes ago      Up 10 minutes                           syncd
f5a06dfa1516        docker-teamd:latest                  "/usr/bin/supervisord"   17 minutes ago      Up 8 minutes                            teamd
6aa0bac0dd0b        docker-orchagent:latest              "/usr/bin/docker-ini…"   17 minutes ago      Up 10 minutes                           swss
a47239ba31a9        docker-fpm-frr:latest                "/usr/bin/docker_ini…"   17 minutes ago      Up 10 minutes                           bgp
4dec40936e40        docker-platform-monitor:latest       "/usr/bin/docker_ini…"   17 minutes ago      Up 10 minutes                           pmon
82df8a5fa6eb        docker-database:latest               "/usr/local/bin/dock…"   17 minutes ago      Up 17 minutes                           database

Please refer to the attached log.

syslog.tar.gz

@daall
Contributor

daall commented Dec 23, 2020

@judyjoseph pls update

@chenkelly
Contributor

Hi @judyjoseph
According to https://github.com/Azure/sonic-buildimage/blob/master/files/scripts/swss.sh#L4,
when the syncd docker goes down, swss and teamd also go down. We encounter this LLDP issue (#6164) after the syncd, swss and teamd containers finish restarting.
Should we add lldp to MULTI_INST_DEPENDENT to fix the issue (see the sketch below)?
Do you have any suggestions?
Thanks.
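For context, the dependency lists near the top of swss.sh look roughly like the sketch below (the exact values vary by branch, so treat this only as an illustration of where lldp would be added):

```
# files/scripts/swss.sh (sketch only; exact contents differ per branch)
DEPENDENT="radv"
MULTI_INST_DEPENDENT="teamd"

# The change being asked about: restart lldp together with swss/syncd/teamd
MULTI_INST_DEPENDENT="teamd lldp"
```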

@yxieca
Contributor

yxieca commented Jan 6, 2021

@judyjoseph please update.

@judyjoseph
Contributor

judyjoseph commented Jan 6, 2021

@bingwang-ms, @yxieca I tried this earlier today.

In this old version, SONiC.master.491-af654944, I too see the problem of swss/syncd not restarting when the teamd docker goes away; the "python3 /usr/bin/docker-wait-any -s swss -d syncd teamd" wait process does not exit either. Could this be some intermediate build from when we were converting from python2 to python3?
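For readers unfamiliar with that wait process: docker-wait-any is supposed to block until any of the listed containers stops and then exit, so that the supervising service script can tear down and restart the group. A minimal shell illustration of the same wait-on-any pattern (not the actual script):

```
#!/bin/bash
# Illustration only: block until any one of these containers exits.
containers="swss syncd teamd"

for c in $containers; do
    docker wait "$c" > /dev/null &   # 'docker wait' blocks until the container stops
done

wait -n   # bash >= 4.3: return as soon as the first background job finishes
echo "one of [$containers] has stopped; the service script would now restart the group"
```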

But if I move to the latest sonic master buildimage, I don't see this problem anymore: teamd, syncd and swss all go down, and the error message "swss#orchagent: :- removeLag: Failed to remove ref count" stops when the swss docker goes away. This is the correct behaviour.

Note: I also observed a different behavior in the latest master build, where port channels are not cleaned up correctly when the teamd docker goes away. I will follow up on that to see what changed, as part of the other issue #6199.

@judyjoseph
Contributor

@bingwang-ms, if you could confirm similar behavior with the latest master image, we could close this case and follow up on issue #6199.

@bingwang-ms
Contributor Author

> @bingwang-ms, if you could confirm similar behavior with the latest master image, we could close this case and follow up on issue #6199.

I tried to reproduce the issue on the latest image (SONiC.HEAD.215-c4156b87), and it seems that the issue has been addressed. The error message was gone after swss was restarted, so I think we can close this issue now. But do you have any idea which PR fixed it?

@judyjoseph
Contributor

Thanks @bingwang-ms. This ERROR message was verified to be fixed with #5628. I feel the image with the issue was an intermediate image made during the py2 to py3 conversion, when the docker-wait script had to be fixed.
