syncd crash and hung seen with warm-reboot and fast-reboot on T0 topology- HEAD.253-2872d802 #3934

Closed
mini-nair-dell opened this issue Dec 20, 2019 · 12 comments

mini-nair-dell commented Dec 20, 2019

Description
+++++++++++++++

  • orchagent and syncd crash while performing warm-reboot on master image 157.
  • The issue is seen only with the T0 topology, not with the T1-lag-64 topology.
  • After the crash, many Docker containers are not running.
  • Traces from the core files are attached.

Please find the logs below. The issue is not seen in master image 154.

Syslog snippet:

Dec 20 06:15:56.828794 sonic-s6100-07 ERR swss#orchagent: :- sai_redis_internal_notify_syncd: notify syncd failed to get response result from select: 2
Dec 20 06:15:56.828794 sonic-s6100-07 ERR swss#orchagent: :- sai_redis_internal_notify_syncd: notify syncd failed to get response
Dec 20 06:15:56.828894 sonic-s6100-07 ERR swss#orchagent: :- sai_redis_notify_syncd: notify syncd failed: SAI_STATUS_FAILURE
Dec 20 06:15:56.828894 sonic-s6100-07 ERR swss#orchagent: :- initSaiRedis: Failed to notify syncd INIT_VIEW, rv:-1
Dec 20 06:15:56.829618 sonic-s6100-07 INFO swss#supervisord: orchagent terminate called without an active exception
Dec 20 06:15:58.010736 sonic-s6100-07 INFO swss#supervisor-proc-exit-listener: Process orchagent exited unxepectedly. Terminating supervisor...
Dec 20 06:15:58.571107 sonic-s6100-07 INFO swss.sh[1708]: No longer waiting on container 'syncd'
Dec 20 06:15:58.604890 sonic-s6100-07 NOTICE root: Stopping swss service...
Dec 20 06:15:58.612537 sonic-s6100-07 NOTICE root: Locking /tmp/swss-syncd-lock from swss service

root@sonic-s6100-07:/var/core# warm-reboot -vvv
Fri Dec 20 06:12:23 UTC 2019 Pausing orchagent ...
Fri Dec 20 06:12:23 UTC 2019 Stopping radv ...
Fri Dec 20 06:12:24 UTC 2019 Stopping bgp ...
Fri Dec 20 06:12:24 UTC 2019 Stopped bgp ...
Fri Dec 20 06:12:27 UTC 2019 Initialize pre-shutdown ...
Fri Dec 20 06:12:28 UTC 2019 Requesting pre-shutdown ...
Fri Dec 20 06:12:29 UTC 2019 Waiting for pre-shutdown ...
Fri Dec 20 06:16:20 UTC 2019 Syncd pre-shutdown failed: requesting ...
Fri Dec 20 06:16:20 UTC 2019 warm-reboot failure (11) cleanup ...
Fri Dec 20 06:16:21 UTC 2019 Cancel warm-reboot: code (1)
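
To make the stall in the output above easier to quantify, here is a minimal Python sketch that measures the gap between the "Waiting for pre-shutdown" line and the next logged step. The helper names (`parse_line`, `pre_shutdown_wait_seconds`) are hypothetical and not part of any SONiC tool; the timestamp layout is assumed from the lines shown in this report.

```python
from datetime import datetime

# Timestamp layout as printed in the log above, e.g. "Fri Dec 20 06:12:29 UTC 2019".
# Assumption: every step line starts with these six whitespace-separated words.
STAMP_FORMAT = "%a %b %d %H:%M:%S UTC %Y"
STAMP_WORDS = 6


def parse_line(line):
    """Return (timestamp, message) for a step line, or None if it has no timestamp."""
    words = line.split()
    try:
        stamp = datetime.strptime(" ".join(words[:STAMP_WORDS]), STAMP_FORMAT)
    except ValueError:
        return None  # e.g. systemd warnings or kernel console lines
    return stamp, " ".join(words[STAMP_WORDS:])


def pre_shutdown_wait_seconds(lines):
    """Seconds between 'Waiting for pre-shutdown' and the next timestamped step."""
    events = [e for e in (parse_line(l) for l in lines) if e is not None]
    for (start, message), (end, _) in zip(events, events[1:]):
        if message.startswith("Waiting for pre-shutdown"):
            return (end - start).total_seconds()
    return None


sample_lines = [
    "Fri Dec 20 06:12:28 UTC 2019 Requesting pre-shutdown ...",
    "Fri Dec 20 06:12:29 UTC 2019 Waiting for pre-shutdown ...",
    "Fri Dec 20 06:16:20 UTC 2019 Syncd pre-shutdown failed: requesting ...",
]
print(pre_shutdown_wait_seconds(sample_lines))  # 231.0
```

For the run above this gives 231 seconds (06:12:29 to 06:16:20) before the pre-shutdown request was reported as failed.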

Core files :

root@sonic-s6100-07:/var/core# ls -ltr
total 10568
-rw-rw-rw- 1 root root 10261200 Dec 20 08:54 syncd.1576832093.28.core.gz
-rw-rw-rw- 1 root root 278329 Dec 20 08:56 orchagent.1576832194.45.core.gz
-rw-rw-rw- 1 root root 278347 Dec 20 08:58 orchagent.1576832301.47.core.gz
root@sonic-s6100-07:/var/core#
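
The core file names in the listing above appear to follow a `<process>.<epoch seconds>.<pid>.core.gz` layout. That layout is an assumption inferred from the names in this report, and `parse_core_name` is a hypothetical helper, but under that assumption a short Python sketch can recover when each process crashed:

```python
from datetime import datetime, timezone


def parse_core_name(filename):
    """Split a core file name into (process, crash time in UTC, pid).

    Assumed layout: <process>.<epoch seconds>.<pid>.core.gz
    """
    parts = filename.split(".")
    if len(parts) != 5 or parts[3:] != ["core", "gz"]:
        raise ValueError(f"unexpected core file name: {filename}")
    process, epoch, pid = parts[0], int(parts[1]), int(parts[2])
    return process, datetime.fromtimestamp(epoch, tz=timezone.utc), pid


for name in (
    "syncd.1576832093.28.core.gz",
    "orchagent.1576832194.45.core.gz",
    "orchagent.1576832301.47.core.gz",
):
    process, crashed_at, pid = parse_core_name(name)
    print(process, crashed_at.isoformat(), pid)
```

The first name decodes to 2019-12-20 08:54:53 UTC, which is consistent with the 08:54 modification time shown by `ls -ltr`.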

root@sonic-s6100-07:/var/core# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7b13c13d2fe1 docker-dhcp-relay-dbg:latest "/usr/bin/docker_ini…" 3 hours ago Up 3 hours dhcp_relay
6ef8beec5762 docker-syncd-brcm-dbg:latest "/usr/bin/supervisord" 3 hours ago Up 3 hours syncd
fcedb3fa4cf6 docker-teamd-dbg:latest "/usr/bin/supervisord" 3 hours ago Up 3 hours teamd
689537cc97d1 docker-platform-monitor-dbg:latest "/usr/bin/docker_ini…" 3 hours ago Up 3 hours pmon
8cb6929f9659 docker-fpm-frr-dbg:latest "/usr/bin/supervisord" 3 hours ago Up 3 hours bgp
8934c8414ccd docker-database-dbg:latest "/usr/local/bin/dock…" 3 hours ago Up 3 hours database
root@sonic-s6100-07:/var/core#
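
A quick way to see which containers went missing is to compare the running names against the set expected on the box. The sketch below is illustrative only: the `EXPECTED` set is an assumption about what a T0 image would normally run and should be adjusted to the actual image, and `missing_containers` is a hypothetical helper. It takes the text produced by `docker ps --format '{{.Names}}'`, which prints one container name per line.

```python
# Assumption: containers expected on a healthy T0 switch; adjust for the real image.
EXPECTED = {
    "database", "swss", "syncd", "teamd", "bgp",
    "pmon", "dhcp_relay", "snmp", "lldp",
}


def missing_containers(names_output, expected=EXPECTED):
    """Return expected container names absent from `docker ps` name output."""
    running = {line.strip() for line in names_output.splitlines() if line.strip()}
    return sorted(set(expected) - running)


# Names taken from the NAMES column of the `docker ps` output above.
running_now = """dhcp_relay
syncd
teamd
pmon
bgp
database
"""
print(missing_containers(running_now))  # ['lldp', 'snmp', 'swss']
```

With the names from the listing above, swss is among the containers reported as missing, which matches the "Stopping swss service" message in the syslog snippet.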

Attached:

  • Syslog
  • Core traces

Fast-reboot
+++++++++

  • Fast-reboot gets stuck as well.

root@sonic-s6100-07:~# fast-reboot -vvv
Fri Dec 20 12:08:14 UTC 2019 Stopping radv ...
Fri Dec 20 12:08:15 UTC 2019 Stopping bgp ...
Fri Dec 20 12:08:16 UTC 2019 Stopped bgp ...
Fri Dec 20 12:08:17 UTC 2019 Stopping teamd ...
Fri Dec 20 12:08:18 UTC 2019 Stopped teamd ...
Fri Dec 20 12:08:29 UTC 2019 Stopping syncd ...
Fri Dec 20 12:08:29 UTC 2019 Stopped syncd ...
Fri Dec 20 12:08:29 UTC 2019 Stopping all remaining containers ...
Fri Dec 20 12:08:30 UTC 2019 Stopped all remaining containers ...
Fri Dec 20 12:08:32 UTC 2019 Rebooting with /sbin/kexec -e to SONiC-OS-HEAD.157-dirty-20191219.005759 ...

Thanks
Mini

@mini-nair-dell
Author

Core-Traces.txt

@mini-nair-dell
Author

syslogs.syncd.docx

@mini-nair-dell
Author

The issue is seen with the latest Jenkins master image: #202

@mini-nair-dell
Author

Both fast-reboot and warm-reboot get stuck with the latest master, and the cores below are seen as well.

Image - HEAD.209-b8561545
root@sonic-s6100-07:/var/core# ls -ltr
total 34636
-rw-rw-rw- 1 root root 10441474 Feb 24 07:12 syncd.1582528323.69.core.gz
-rw-rw-rw- 1 root root 10483127 Feb 24 08:10 syncd.1582531855.92.core.gz
-rw-rw-rw- 1 root root 278605 Feb 24 08:12 orchagent.1582531955.131.core.gz
-rw-rw-rw- 1 root root 10533948 Feb 24 08:34 syncd.1582533274.102.core.gz
-rw-rw-rw- 1 root root 3715232 Feb 24 08:34 zebra.1582533284.119.core.gz
root@sonic-s6100-07:/var/core# date

Thanks
Mini

@mini-nair-dell
Author

Couldn't verify this with the latest master because of an orchagent crash.

@mini-nair-dell mini-nair-dell changed the title syncd crash with warm-reboot on T0 topology syncd crash and hung seen with warm-reboot and fast-reboot on T0 topology- HEAD.253-2872d802 Apr 15, 2020
@mini-nair-dell
Author

The issue is still seen with the latest master, HEAD.253-2872d802.
Both warm-reboot and fast-reboot hang the switch, which can then be recovered only with a power cycle.
A syncd crash is also seen once the switch is back up after the power cycle.
Please find the logs below.

Logs:
++++
root@sonic-s6100-07:/var/core# warm-reboot -vvv
Wed Apr 15 05:39:26 UTC 2020 Pausing orchagent ...
Wed Apr 15 05:39:27 UTC 2020 Stopping nat ...
Wed Apr 15 05:39:27 UTC 2020 Stopped nat ...
Wed Apr 15 05:39:27 UTC 2020 Stopping radv ...
Wed Apr 15 05:39:28 UTC 2020 Stopping bgp ...
Wed Apr 15 05:39:28 UTC 2020 Stopped bgp ...
Wed Apr 15 05:39:32 UTC 2020 Initialize pre-shutdown ...
Wed Apr 15 05:39:32 UTC 2020 Requesting pre-shutdown ...
Wed Apr 15 05:39:32 UTC 2020 Waiting for pre-shutdown ...
Wed Apr 15 05:41:45 UTC 2020 Syncd pre-shutdown failed: requesting ...
Wed Apr 15 05:41:45 UTC 2020 Backing up database ...
Wed Apr 15 05:41:46 UTC 2020 Stopping teamd ...
Wed Apr 15 05:41:46 UTC 2020 Stopped teamd ...
Wed Apr 15 05:41:47 UTC 2020 Stopping syncd ...
Wed Apr 15 05:41:51 UTC 2020 Stopped syncd ...
Wed Apr 15 05:41:51 UTC 2020 Stopping all remaining containers ...
Warning: Stopping telemetry.service, but it can still be activated by:
telemetry.timer
Wed Apr 15 05:41:56 UTC 2020 Stopped all remaining containers ...
Wed Apr 15 05:41:58 UTC 2020 Running x86_64-dell_s6100_c2538-r0 specific plugin...
Wed Apr 15 05:41:58 UTC 2020 Rebooting with /sbin/kexec -e to SONiC-OS-HEAD.253-2872d802 ...
[ 389.666031] kexec_core: Starting new kernel
[

Hung

root@sonic-s6100-07:~# cd /var/core
root@sonic-s6100-07:/var/core# ls -l
total 20424
-rw-rw-rw- 1 root root 10455628 Apr 15 05:30 syncd.1586928610.29.core.gz
-rw-rw-rw- 1 root root 10455365 Apr 15 05:39 syncd.1586929172.30.core.gz
root@sonic-s6100-07:/var/core# date
Wed Apr 15 05:47:01 UTC 2020
root@sonic-s6100-07:/var/core#

Thanks
Mini

@yxieca
Contributor

yxieca commented Apr 22, 2020

There is a Broadcom SAI issue with 3.5.3.3, causing syncd to crash during fast/warm reboot.

Please move to the latest build from master (I tried 259).

With this build, I can see that the fast-reboot shutdown no longer generates syncd cores.

With this build, warm-reboot still generates a core during pre-shutdown. This still looks like a SAI issue. Will follow up with Broadcom.

@mini-nair-dell
Author

I tested this on image 259 and could still see syncd crash on both warm-reboot and fast-reboot. The switch gets stuck on both reboots and can be recovered only by a power cycle.

root@sonic-s6100-07:/var/core# ls -l
total 20124
-rw-rw-rw- 1 root root 10239276 Apr 24 06:52 syncd.1587711123.30.core.gz
-rw-rw-rw- 1 root root 10366558 Apr 24 07:10 syncd.1587712223.29.core.gz

root@sonic-s6100-07:/var/core# show ver
SONiC Software Version: SONiC.master.259-583bfde4

Logs:
++++

root@sonic-s6100-07:/var/core# warm-reboot -vvv
Fri Apr 24 07:10:12 UTC 2020 Pausing orchagent ...
Fri Apr 24 07:10:17 UTC 2020 Stopping nat ...
Fri Apr 24 07:10:18 UTC 2020 Stopped nat ...
Fri Apr 24 07:10:18 UTC 2020 Stopping radv ...
Fri Apr 24 07:10:18 UTC 2020 Stopping bgp ...
Fri Apr 24 07:10:19 UTC 2020 Stopped bgp ...
Fri Apr 24 07:10:22 UTC 2020 Initialize pre-shutdown ...
Fri Apr 24 07:10:22 UTC 2020 Requesting pre-shutdown ...
Fri Apr 24 07:10:23 UTC 2020 Waiting for pre-shutdown ...
Fri Apr 24 07:12:36 UTC 2020 Syncd pre-shutdown failed: requesting ...
Fri Apr 24 07:12:36 UTC 2020 Backing up database ...
Fri Apr 24 07:12:37 UTC 2020 Stopping teamd ...
Fri Apr 24 07:12:37 UTC 2020 Stopped teamd ...
Fri Apr 24 07:12:37 UTC 2020 Stopping syncd ...
Fri Apr 24 07:12:42 UTC 2020 Stopped syncd ...
Fri Apr 24 07:12:42 UTC 2020 Stopping all remaining containers ...
Warning: Stopping telemetry.service, but it can still be activated by:
telemetry.timer
Fri Apr 24 07:12:47 UTC 2020 Stopped all remaining containers ...
Fri Apr 24 07:12:49 UTC 2020 Running x86_64-dell_s6100_c2538-r0 specific plugin...
Fri Apr 24 07:12:49 UTC 2020 Rebooting with /sbin/kexec -e to SONiC-OS-master.259-583bfde4 ...
[ 395.106522] kexec_core: Starting new kernel

Stuck here

root@sonic-s6100-07:/var/core# fast-reboot
[ 143.884817] kexec_core: Starting new kernel

Stuck

Thanks
Mini

@yxieca yxieca assigned rlhui and unassigned yxieca May 20, 2020
@yxieca
Contributor

yxieca commented May 20, 2020

@rlhui please arrange to update the Broadcom SAI in the master branch.

@mini-nair-dell
Author

We tested build 300, which includes the SAI merge, but found that the orchagent process doesn't run.

  • S6100 T0 testbed
  • No crash seen
  • orchagent doesn't run
  • show interface status returns empty output
  • warm-reboot and fast-reboot fail

From syslogs :

Jun 2 05:28:20.780311 sonic-s6100-07 ERR monit[499]: 'orchagent' process is not running
Jun 2 05:28:20.799408 sonic-s6100-07 ERR monit[499]: 'snmp_subagent' process is not running

root@sonic-s6100-07:~# show logging|grep -B 10 libprotobuf|tail -2

Jun 2 05:26:18.048129 sonic-s6100-07 INFO syncd#supervisord: syncd /usr/bin/syncd:
Jun 2 05:26:18.053412 sonic-s6100-07 INFO syncd#supervisord: syncd error while loading shared libraries: libprotobuf.so.10: cannot open shared object file: No such file or directory#015
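
When scanning a longer syslog for this kind of failure, the missing library name can be pulled out of the dynamic loader's message. This is a small Python sketch, not an existing SONiC utility; `missing_libraries` is a hypothetical helper, and the pattern is written against the message text shown in the lines above.

```python
import re

# Matches the dynamic loader message seen in the syslog lines above.
MISSING_LIB = re.compile(
    r"error while loading shared libraries: (\S+): cannot open shared object file"
)


def missing_libraries(lines):
    """Return the sorted, de-duplicated library names reported as missing."""
    found = set()
    for line in lines:
        match = MISSING_LIB.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)


syslog_lines = [
    "Jun 2 05:26:18.048129 sonic-s6100-07 INFO syncd#supervisord: syncd /usr/bin/syncd:",
    "Jun 2 05:26:18.053412 sonic-s6100-07 INFO syncd#supervisord: syncd error while "
    "loading shared libraries: libprotobuf.so.10: cannot open shared object file: "
    "No such file or directory#015",
]
print(missing_libraries(syslog_lines))  # ['libprotobuf.so.10']
```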

Attaching the syslogs

Thanks
Mini

@mini-nair-dell
Author

The issue is fixed in the 201911 build 88 and in master image 306. Both warm-reboot and fast-reboot work fine.

Thanks
Mini
