[DellEmc S5232]: Orchagent crash is seen in latest master (#238) #4339

Closed
chitra-raghavan opened this issue Mar 30, 2020 · 10 comments · Fixed by #4360

Comments

@chitra-raghavan
Contributor

Description

Orchagent crashes on the S5232 with the latest master Jenkins image (#238).

Steps to reproduce the issue:

  1. Load T1 config and check for crash

Describe the results you received:
syslog:

Mar 30 14:45:46.463461 sonic-s5232-01 ERR swss#orchagent: :- wait: failed to get response for getresponse
Mar 30 14:45:46.463598 sonic-s5232-01 ERR swss#orchagent: :- PortsOrch: Failed to get CPU port, rv:-1
Mar 30 14:45:46.464020 sonic-s5232-01 INFO swss#supervisord: orchagent terminate called after throwing an instance of 'std::runtime_error'
Mar 30 14:45:46.464886 sonic-s5232-01 INFO swss#supervisord: orchagent   what():  PortsOrch initialization failure
Mar 30 14:45:47.242247 sonic-s5232-01 INFO pmon#thermalctld: Starting up…
root@sonic-s5232-01:/var/core# ls -l
total 296
-rw-rw-rw- 1 root root 301367 Mar 30 14:45 orchagent.1585579546.44.core.gz
root@sonic-s5232-01:/var/core#

Describe the results you expected:

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**
root@sonic-s5232-01:/var/log# show ver

SONiC Software Version: SONiC.HEAD.238-84256314
Distribution: Debian 9.12
Kernel: 4.9.0-11-2-amd64
Build commit: 84256314
Build date: Sun Mar 29 06:23:05 UTC 2020
Built by: johnar@jenkins-worker-11

Platform: x86_64-dellemc_s5232f_c3538-r0
HwSKU: DellEMC-S5232f-C32
ASIC: broadcom
Serial Number: CN01WJVTCES0094Q0019
Uptime: 14:48:06 up 5 min,  2 users,  load average: 1.71, 1.34, 0.64
@lguohan
Collaborator

lguohan commented Mar 30, 2020

Does this affect other Broadcom platforms? Can you get the show tech output so that we can do some analysis?

@chitra-raghavan
Contributor Author

chitra-raghavan commented Apr 1, 2020

syslog-crash.gz

@lguohan, the show tech file is about 250 MB, so I couldn't upload it. Please find the syslogs from that time attached.
So far, this error has been seen on the S5232.

@ciju-juniper
Contributor

@lguohan

Orchagent is crashing in the latest master on the Juniper QFX5210 platform as well. We don't see any cores yet, so we are not sure whether the two issues are related.

We started seeing crashes after the migration to libsaibcm_3.7.3.3-3. Opening a new issue.

@ciju-juniper
Contributor

@lguohan Opened #4347

@lguohan
Collaborator

lguohan commented Apr 1, 2020

Mar 30 14:43:46.399300 sonic-s5232-01 NOTICE syncd#syncd: :- helperLoadColdVids: read 1503 COLD VIDS
Mar 30 14:43:46.399376 sonic-s5232-01 NOTICE syncd#syncd: :- SaiSwitch: constructor took 0.702813 sec
Mar 30 14:43:46.399414 sonic-s5232-01 NOTICE syncd#syncd: :- startDiagShell: starting diag shell thread for switch RID oid:0x0
Mar 30 14:43:46.399884 sonic-s5232-01 INFO syncd#supervisord: syncd Hit enter to get drivshell prompt..#015
Mar 30 14:43:46.399928 sonic-s5232-01 INFO syncd#supervisord: syncd Enter 'quit' to exit the application.#015
Mar 30 14:43:46.400374 sonic-s5232-01 INFO syncd#supervisord: syncd drivshell>#015#015
Mar 30 14:43:46.400417 sonic-s5232-01 INFO syncd#supervisord: syncd drivshell>#015#015
Mar 30 14:43:46.400460 sonic-s5232-01 INFO syncd#supervisord: syncd drivshell>
Mar 30 14:43:47.359132 sonic-s5232-01 INFO syncd#supervisord: start.sh ledinit: started
Mar 30 14:43:47.364709 sonic-s5232-01 INFO syncd#supervisord: ledinit rcload /usr/share/sonic/plat
Mar 30 14:43:47.364936 sonic-s5232-01 INFO syncd#supervisord: ledinit form/led_proc_init.soc#015#015
Mar 30 14:43:47.365174 sonic-s5232-01 INFO syncd#supervisord: ledinit Loading M0 Firmware located at /usr/share/sonic/hwsku/linkscan_led_fw.bin#015
Mar 30 14:43:47.376466 sonic-s5232-01 INFO syncd#supervisord: ledinit Loading M0 Firmware located at /usr/share/sonic/hwsku/custom_led.bin#015
Mar 30 14:43:48.676135 sonic-s5232-01 ERR syncd#syncd: [none] sai_driver_process_command:302 Command "rcload /usr/share/sonic/platform/led_proc_init.soc" failed, rc = -1.
Mar 30 14:43:48.676728 sonic-s5232-01 INFO syncd#supervisord: ledinit 0:soc_iproc_data_send_wait: No response for msg 2#015
Mar 30 14:43:48.676728 sonic-s5232-01 INFO syncd#supervisord: ledinit 0:soc_cmicx_led_enable: Led msg id 0x2 send failed, Error Code -3#015
Mar 30 14:43:48.676728 sonic-s5232-01 INFO syncd#supervisord: ledinit Error:Unable to start LED FW#015
Mar 30 14:43:48.676749 sonic-s5232-01 INFO syncd#supervisord: ledinit Error: file /usr/share/sonic/platform/led_proc_init.soc: line 9 (error code -1): script terminated#015
Mar 30 14:43:48.676749 sonic-s5232-01 INFO syncd#supervisord: ledinit #015
Mar 30 14:43:48.676772 sonic-s5232-01 INFO syncd#supervisord: ledinit Failed to execute the diagnostic command. Error: Internal error.#015
Mar 30 14:43:48.676772 sonic-s5232-01 INFO syncd#supervisord: ledinit drivshell>

@lguohan
Collaborator

lguohan commented Apr 1, 2020

Looks like the failure is related to the LED proc init.
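
A minimal sketch of how the ERR entries can be pulled out of a syslog excerpt like the ones quoted in this thread, to spot which component fails first. The regular expression and the `error_lines` helper are illustrative assumptions based on the line layout seen above (timestamp, host, level, `container#process:`); they are not part of any SONiC tooling.

```python
import re

# Assumed layout: "Mon DD HH:MM:SS.ffffff host LEVEL container#process: message"
LINE_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<host>\S+)\s+(?P<level>[A-Z]+)\s+(?P<source>[^\s:]+):\s?(?P<msg>.*)$"
)


def error_lines(lines):
    """Return (time, source, message) for every line logged at ERR level."""
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "ERR":
            found.append(
                (match.group("time"), match.group("source"), match.group("msg"))
            )
    return found


# Lines copied from the syslog excerpts in this issue.
SAMPLE = [
    "Mar 30 14:43:47.359132 sonic-s5232-01 INFO syncd#supervisord: start.sh ledinit: started",
    'Mar 30 14:43:48.676135 sonic-s5232-01 ERR syncd#syncd: [none] sai_driver_process_command:302 '
    'Command "rcload /usr/share/sonic/platform/led_proc_init.soc" failed, rc = -1.',
    "Mar 30 14:45:46.463598 sonic-s5232-01 ERR swss#orchagent: :- PortsOrch: Failed to get CPU port, rv:-1",
]

for time, source, msg in error_lines(SAMPLE):
    print(time, source, msg)
```

On this sample the syncd error from the LED script is reported about two minutes before the orchagent error, which matches the ordering in the logs above.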

@xinliu-seattle
Contributor

Dell team: this seems to be an LED issue. Can the Dell team triage this?

@srideepDell
Contributor

srideepDell commented Apr 1, 2020

@lguohan, in the same log we can see the same crash as in #4347. We commented out the led.soc file to bypass the syncd error, and the issue below is then seen, which results in no ports being created.

We have seen the issue on the 9264 (TH3) platform too. Both the TH3 and TD3 platforms were working with builds from January. Even the 201911 branch works on these platforms.

Apr 2 12:26:38.509278 sonic NOTICE swss#orchagent: :- set: setting attribute 0x10000001 status: SAI_STATUS_SUCCESS
Apr 2 12:26:38.509278 sonic NOTICE swss#orchagent: :- initSaiRedis: Notify syncd INIT_VIEW
Apr 2 12:26:38.510098 sonic NOTICE swss#orchagent: :- create: request switch create with context 0
Apr 2 12:26:38.518734 sonic NOTICE swss#orchagent: :- allocateNewSwitchObjectId: created SWITCH VID oid:0x21000000000000 for hwinfo: ''
Apr 2 12:26:38.518960 sonic NOTICE swss#orchagent: :- Switch: created switch with hwinfo = ''
Apr 2 12:26:38.519124 sonic NOTICE swss#orchagent: :- main: Create a switch
Apr 2 12:26:38.632494 sonic INFO swss#supervisord: start.sh orchagent: started
Apr 2 12:26:45.829567 sonic INFO swss#supervisord 2020-04-02 12:26:37,619 INFO spawned: 'orchagent' with pid 42
Apr 2 12:26:45.829567 sonic INFO swss#supervisord 2020-04-02 12:26:38,622 INFO success: orchagent entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Apr 2 12:27:38.562941 sonic ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse
Apr 2 12:27:38.563202 sonic ERR swss#orchagent: :- wait: failed to get response for getresponse
Apr 2 12:27:38.563258 sonic ERR swss#orchagent: :- main: Fail to get switch virtual router ID -1
Apr 2 12:27:38.563306 sonic NOTICE swss#orchagent: :- uninitialize: begin
Apr 2 12:27:38.564599 sonic NOTICE swss#orchagent: :- uninitialize: begin
Apr 2 12:27:38.564667 sonic NOTICE swss#orchagent: :- clear_local_state: clearing local state
Apr 2 12:27:38.564776 sonic NOTICE swss#orchagent: :- meta_init_db: begin
Apr 2 12:27:38.564842 sonic NOTICE swss#orchagent: :- meta_init_db: end
Apr 2 12:27:38.564906 sonic NOTICE swss#orchagent: :- uninitialize: end
Apr 2 12:27:38.564969 sonic NOTICE swss#orchagent: :- stopRecording: stopped recording
Apr 2 12:27:38.565032 sonic NOTICE swss#orchagent: :- stopRecording: closed recording file: sairedis.rec
Apr 2 12:27:38.565094 sonic NOTICE swss#orchagent: :- uninitialize: end
Apr 2 12:27:45.880077 sonic INFO swss#supervisord 2020-04-02 12:27:38,565 INFO exited: orchagent (exit status 1; not expected)
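
The timestamps in the log above indicate how long orchagent waited before giving up. A small sketch, assuming both entries were logged on the same day; `elapsed_seconds` is a hypothetical helper, and the two timestamps are copied from the "main: Create a switch" and "SELECT operation result: TIMEOUT" lines:

```python
from datetime import datetime


def elapsed_seconds(start, end):
    """Seconds between two HH:MM:SS.ffffff syslog timestamps on the same day."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


create_switch = "12:26:38.519124"  # "main: Create a switch"
timeout = "12:27:38.562941"        # "wait: SELECT operation result: TIMEOUT on getresponse"

waited = elapsed_seconds(create_switch, timeout)
print(f"{waited:.3f}")  # prints 60.044
```

The gap is almost exactly one minute, which is consistent with orchagent timing out while waiting for a response rather than receiving an explicit error.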

@lguohan
Collaborator

lguohan commented Apr 2, 2020

What about the TH and TD2 platforms? Do you observe the same problem as #4347?

@ciju-juniper
Contributor

@lguohan #4347 is seen on the TH1 platform as well. I am updating #4347 with more details regarding the point of failure and the suspected commits.

@lguohan lguohan linked a pull request Apr 8, 2020 that will close this issue