
Orchagent crashes after moving to libsaibcm_3.7.3.3-3 #4347

Closed
ciju-juniper opened this issue Apr 1, 2020 · 38 comments · Fixed by #4379
@ciju-juniper
Contributor

The orchagent process is crashing on the Juniper QFX5210 platform after the SAI library was moved to libsaibcm_3.7.3.3-3.

Prior to this commit, interfaces used to come up with libsaibcm_3.7.3.3-2:
590caaf

sonic_dump_sonic_20200401_165619.tar.gz

root@sonic:/var/dump# show version

SONiC Software Version: SONiC.HEAD.220-590caaf5
Distribution: Debian 9.12
Kernel: 4.9.0-11-2-amd64
Build commit: 590caaf
Build date: Sun Mar 8 05:59:38 UTC 2020
Built by: johnar@jenkins-worker-8

Platform: x86_64-juniper_qfx5210-r0
HwSKU: Juniper-QFX5210-64C
ASIC: broadcom

@ciju-juniper
Contributor Author

@daall Please help to resolve this issue. Let me know if you need any further details.

We had verified the image prior to this commit and it was fine.

@xinliu-seattle
Contributor

@gechiang, can you please take a look first.

@lguohan
Collaborator

lguohan commented Apr 1, 2020

Apr  1 16:34:01.963292 sonic ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse
Apr  1 16:34:01.963292 sonic ERR swss#orchagent: :- wait: failed to get response for getresponse
Apr  1 16:34:01.963292 sonic ERR swss#orchagent: :- main: Fail to get switch virtual router ID -1
Apr  1 16:34:01.963668 sonic NOTICE swss#orchagent: :- uninitialize: begin
Apr  1 16:34:01.963884 sonic NOTICE swss#orchagent: :- uninitialize: begin
Apr  1 16:34:01.963884 sonic NOTICE swss#orchagent: :- clear_local_state: clearing local state
Apr  1 16:34:01.963884 sonic NOTICE swss#orchagent: :- meta_init_db: begin
Apr  1 16:34:01.963903 sonic NOTICE swss#orchagent: :- meta_init_db: end
Apr  1 16:34:01.963925 sonic NOTICE swss#orchagent: :- uninitialize: end
Apr  1 16:34:01.964048 sonic NOTICE swss#orchagent: :- stopRecording: stopped recording
Apr  1 16:34:01.964048 sonic NOTICE swss#orchagent: :- stopRecording: closed recording file: sairedis.rec
Apr  1 16:34:01.964048 sonic NOTICE swss#orchagent: :- uninitialize: end
Apr  1 16:34:09.982390 sonic INFO swss#supervisord 2020-04-01 16:34:01,964 INFO exited: orchagent (exit status 1; not expected)

Looks like orchagent crashed due to a TIMEOUT.

Seems syncd is also fine.

Apr  1 16:33:10.184770 sonic INFO syncd#supervisord 2020-04-01 16:33:01,420 INFO spawned: 'syncd' with pid 21
Apr  1 16:33:10.184770 sonic INFO syncd#supervisord 2020-04-01 16:33:02,932 INFO success: syncd entered RUNNING state, process has stayed up for
 > than 1 seconds (startsecs)

I do not see that syncd crashed, so I am not sure what the root cause is. It seems to me that syncd just did not respond to the SAI query.

@gechiang
Collaborator

gechiang commented Apr 1, 2020

Here is some information that I gathered from the compressed dump file:

From syslog.1:
Apr 1 16:13:56.774737 sonic NOTICE swss#orchagent: :- main: Get switch virtual router ID 3000000000042
...
Apr 1 16:34:01.963292 sonic ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse
Apr 1 16:34:01.963292 sonic ERR swss#orchagent: :- wait: failed to get response for getresponse
Apr 1 16:34:01.963292 sonic ERR swss#orchagent: :- main: Fail to get switch virtual router ID -1
...
From syslog:
Apr 1 16:45:08.714550 sonic ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse
Apr 1 16:45:08.714550 sonic ERR swss#orchagent: :- wait: failed to get response for getresponse
Apr 1 16:45:08.714550 sonic ERR swss#orchagent: :- main: Fail to get switch virtual router ID -1
Apr 1 16:45:08.715011 sonic NOTICE swss#orchagent: :- uninitialize: begin
Apr 1 16:45:08.715247 sonic NOTICE swss#orchagent: :- uninitialize: begin
Apr 1 16:45:08.715247 sonic NOTICE swss#orchagent: :- clear_local_state: clearing local state
Apr 1 16:45:08.715284 sonic NOTICE swss#orchagent: :- meta_init_db: begin
Apr 1 16:45:08.715304 sonic NOTICE swss#orchagent: :- meta_init_db: end
Apr 1 16:45:08.715342 sonic NOTICE swss#orchagent: :- uninitialize: end
Apr 1 16:45:08.715365 sonic NOTICE swss#orchagent: :- stopRecording: stopped recording
Apr 1 16:45:08.715384 sonic NOTICE swss#orchagent: :- stopRecording: closed recording file: sairedis.rec
Apr 1 16:45:08.715409 sonic NOTICE swss#orchagent: :- uninitialize: end

Perhaps the second syslog is capturing the 2nd attempt, which also failed the same way?

From sairedis.rec I observed the following in regard to the virtual router ID access.
It seems that most of the time it failed, except once when it was successful at time 16:13:56.774567:

2020-04-01.14:59:00.857756|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_INIT_SWITCH=true|SAI_SWITCH_ATTR_FDB_EVENT_NOTIFY=0x558f81c8ccc0|SAI_SWITCH_ATTR_PORT_STATE_CHANGE_NOTIFY=0x558f81c8ccd0|SAI_SWITCH_ATTR_SWITCH_SHUTDOWN_REQUEST_NOTIFY=0x558f81c8cce0|SAI_SWITCH_ATTR_SRC_MAC_ADDRESS=78:4F:9B:65:6D:88
2020-04-01.14:59:00.857913|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_DEFAULT_VIRTUAL_ROUTER_ID=oid:0x0
2020-04-01.15:00:00.915051|G|SAI_STATUS_FAILURE|
...
2020-04-01.15:06:19.151220|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_INIT_SWITCH=true|SAI_SWITCH_ATTR_FDB_EVENT_NOTIFY=0x55f294dc2cc0|SAI_SWITCH_ATTR_PORT_STATE_CHANGE_NOTIFY=0x55f294dc2cd0|SAI_SWITCH_ATTR_SWITCH_SHUTDOWN_REQUEST_NOTIFY=0x55f294dc2ce0|SAI_SWITCH_ATTR_SRC_MAC_ADDRESS=78:4F:9B:65:6D:88
2020-04-01.15:06:19.151469|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_DEFAULT_VIRTUAL_ROUTER_ID=oid:0x0
2020-04-01.15:07:19.210983|G|SAI_STATUS_FAILURE|
...
2020-04-01.16:03:41.184078|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_INIT_SWITCH=true|SAI_SWITCH_ATTR_FDB_EVENT_NOTIFY=0x5609ffc58cc0|SAI_SWITCH_ATTR_PORT_STATE_CHANGE_NOTIFY=0x5609ffc58cd0|SAI_SWITCH_ATTR_SWITCH_SHUTDOWN_REQUEST_NOTIFY=0x5609ffc58ce0|SAI_SWITCH_ATTR_SRC_MAC_ADDRESS=78:4F:9B:65:6D:88
2020-04-01.16:03:41.184319|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_DEFAULT_VIRTUAL_ROUTER_ID=oid:0x0
2020-04-01.16:04:41.244588|G|SAI_STATUS_FAILURE|
...
2020-04-01.16:13:53.684487|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_INIT_SWITCH=true|SAI_SWITCH_ATTR_FDB_EVENT_NOTIFY=0x558751c6fee0|SAI_SWITCH_ATTR_PORT_STATE_CHANGE_NOTIFY=0x558751c6fef0|SAI_SWITCH_ATTR_SWITCH_SHUTDOWN_REQUEST_NOTIFY=0x558751c6ff00|SAI_SWITCH_ATTR_SRC_MAC_ADDRESS=78:4F:9B:65:6D:88
2020-04-01.16:13:53.684692|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_DEFAULT_VIRTUAL_ROUTER_ID=oid:0x0
2020-04-01.16:13:56.774567|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_DEFAULT_VIRTUAL_ROUTER_ID=oid:0x3000000000042
2020-04-01.16:13:56.775085|c|SAI_OBJECT_TYPE_ROUTER_INTERFACE:oid:0x6000000000acb|SAI_ROUTER_INTERFACE_ATTR_VIRTUAL_ROUTER_ID=oid:0x3000000000042|SAI_ROUTER_INTERFACE_ATTR_TYPE=SAI_ROUTER_INTERFACE_TYPE_LOOPBACK
2020-04-01.16:13:56.787123|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_CPU_PORT=oid:0x0
2020-04-01.16:13:56.788042|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_CPU_PORT=oid:0x1000000000001
...
and then 64 router interface entries that referred to the VIRTUAL_ROUTER_ID as follows:
2020-04-01.16:13:59.825266|c|SAI_OBJECT_TYPE_ROUTER_INTERFACE:oid:0x6000000000b63|SAI_ROUTER_INTERFACE_ATTR_VIRTUAL_ROUTER_ID=oid:0x3000000000042|SAI_ROUTER_INTERFACE_ATTR_SRC_MAC_ADDRESS=78:4F:9B:65:6D:88|SAI_ROUTER_INTERFACE_ATTR_TYPE=SAI_ROUTER_INTERFACE_TYPE_PORT|SAI_ROUTER_INTERFACE_ATTR_PORT_ID=oid:0x1000000000014|SAI_ROUTER_INTERFACE_ATTR_MTU=9100
2020-04-01.16:13:59.827070|c|SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"10.0.0.0/32","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000042"}|SAI_ROUTE_ENTRY_ATTR_PACKET_ACTION=SAI_PACKET_ACTION_FORWARD|SAI_ROUTE_ENTRY_ATTR_NEXT_HOP_ID=oid:0x1000000000001
2020-04-01.16:13:59.827589|c|SAI_OBJECT_TYPE_ROUTER_INTERFACE:oid:0x6000000000b64|SAI_ROUTER_INTERFACE_ATTR_VIRTUAL_ROUTER_ID=oid:0x3000000000042|SAI_ROUTER_INTERFACE_ATTR_SRC_MAC_ADDRESS=78:4F:9B:65:6D:88|SAI_ROUTER_INTERFACE_ATTR_TYPE=SAI_ROUTER_INTERFACE_TYPE_PORT|SAI_ROUTER_INTERFACE_ATTR_PORT_ID=oid:0x1000000000033|SAI_ROUTER_INTERFACE_ATTR_MTU=9100
2020-04-01.16:13:59.829635|c|SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"10.0.0.50/32","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000042"}|SAI_ROUTE_ENTRY_ATTR_PACKET_ACTION=SAI_PACKET_ACTION_FORWARD|SAI_ROUTE_ENTRY_ATTR_NEXT_HOP_ID=oid:0x1000000000001
2020-04-01.16:13:59.830143|c|SAI_OBJECT_TYPE_ROUTER_INTERFACE:oid:0x6000000000b65|SAI_ROUTER_INTERFACE_ATTR_VIRTUAL_ROUTER_ID=oid:0x3000000000042|SAI_ROUTER_INTERFACE_ATTR_SRC_MAC_ADDRESS=78:4F:9B:65:6D:88|SAI_ROUTER_INTERFACE_ATTR_TYPE=SAI_ROUTER_INTERFACE_TYPE_PORT|SAI_ROUTER_INTERFACE_ATTR_PORT_ID=oid:0x1000000000037|SAI_ROUTER_INTERFACE_ATTR_MTU=9100
2020-04-01.16:13:59.831626|c|SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"10.0.0.52/32","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000042"}|SAI_ROUTE_ENTRY_ATTR_PACKET_ACTION=SAI_PACKET_ACTION_FORWARD|SAI_ROUTE_ENTRY_ATTR_NEXT_HOP_ID=oid:0x1000000000001
...

2020-04-01.16:33:01.940696|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_INIT_SWITCH=true|SAI_SWITCH_ATTR_FDB_EVENT_NOTIFY=0x560645a1acc0|SAI_SWITCH_ATTR_PORT_STATE_CHANGE_NOTIFY=0x560645a1acd0|SAI_SWITCH_ATTR_SWITCH_SHUTDOWN_REQUEST_NOTIFY=0x560645a1ace0|SAI_SWITCH_ATTR_SRC_MAC_ADDRESS=78:4F:9B:65:6D:88
2020-04-01.16:33:01.940969|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_DEFAULT_VIRTUAL_ROUTER_ID=oid:0x0
2020-04-01.16:34:01.963149|G|SAI_STATUS_FAILURE|
...
2020-04-01.16:44:08.697339|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_INIT_SWITCH=true|SAI_SWITCH_ATTR_FDB_EVENT_NOTIFY=0x55c0016fbcc0|SAI_SWITCH_ATTR_PORT_STATE_CHANGE_NOTIFY=0x55c0016fbcd0|SAI_SWITCH_ATTR_SWITCH_SHUTDOWN_REQUEST_NOTIFY=0x55c0016fbce0|SAI_SWITCH_ATTR_SRC_MAC_ADDRESS=78:4F:9B:65:6D:88
2020-04-01.16:44:08.697547|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_DEFAULT_VIRTUAL_ROUTER_ID=oid:0x0
2020-04-01.16:45:08.714385|G|SAI_STATUS_FAILURE|
...

Here is what I can see from the ASIC_DB.json file:

"COLDVIDS": {
"type": "hash",
"value": {
...
"oid:0x3000000000042": "SAI_OBJECT_TYPE_VIRTUAL_ROUTER",
...
}
},
...
"HIDDEN": {
"type": "hash",
"value": {
"SAI_SWITCH_ATTR_CPU_PORT": "oid:0x100000000",
"SAI_SWITCH_ATTR_DEFAULT_1Q_BRIDGE_ID": "oid:0x3900000000",
"SAI_SWITCH_ATTR_DEFAULT_STP_INST_ID": "oid:0x1000000001",
"SAI_SWITCH_ATTR_DEFAULT_TRAP_GROUP": "oid:0x1100000000",
"SAI_SWITCH_ATTR_DEFAULT_VIRTUAL_ROUTER_ID": "oid:0x300000000",
"SAI_SWITCH_ATTR_DEFAULT_VLAN_ID": "oid:0x2600000001"
}
},
...
"ASIC_STATE:SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000": {
"type": "hash",
"value": {
"NULL": "NULL",
"SAI_SWITCH_ATTR_FDB_EVENT_NOTIFY": "0x55c0016fbcc0",
"SAI_SWITCH_ATTR_INIT_SWITCH": "true",
"SAI_SWITCH_ATTR_PORT_STATE_CHANGE_NOTIFY": "0x55c0016fbcd0",
"SAI_SWITCH_ATTR_SRC_MAC_ADDRESS": "78:4F:9B:65:6D:88",
"SAI_SWITCH_ATTR_SWITCH_SHUTDOWN_REQUEST_NOTIFY": "0x55c0016fbce0"
}
},
"ASIC_STATE:SAI_OBJECT_TYPE_VIRTUAL_ROUTER:oid:0x3000000000042": {
"type": "hash",
"value": {
"NULL": "NULL"
}
},
...

Here is what is shown under saidump:

SAI_OBJECT_TYPE_SWITCH oid:0x21000000000000
SAI_SWITCH_ATTR_FDB_EVENT_NOTIFY : 0x55c0016fbcc0
SAI_SWITCH_ATTR_INIT_SWITCH : true
SAI_SWITCH_ATTR_PORT_STATE_CHANGE_NOTIFY : 0x55c0016fbcd0
SAI_SWITCH_ATTR_SRC_MAC_ADDRESS : 78:4F:9B:65:6D:88
SAI_SWITCH_ATTR_SWITCH_SHUTDOWN_REQUEST_NOTIFY : 0x55c0016fbce0

SAI_OBJECT_TYPE_VIRTUAL_ROUTER oid:0x3000000000042

@lguohan
Collaborator

lguohan commented Apr 2, 2020

@gechiang, it looks like this is a universal problem on Broadcom; we got reports that this problem happens on TD3, TH2, and TH3 platforms, and it is likely to happen on TH and TD2 as well.

Thanks for your logs, but I cannot get much useful information here. Can you load the master image on a Broadcom switch and see if you can repro this problem?

@ciju-juniper
Contributor Author

We have further narrowed down this issue between two images.

Build #216 (Mar 4, 2020 12:22:35 AM) is the last good label on which things are fine.
march_4_sonic_dump_sonic_20200402_122706.tar.gz

Build #217 (Mar 5, 2020 12:22:33 AM) is the build which has the problem.
march_5_sonic_dump_sonic_20200402_123352.tar.gz

These are the commits that went into Build #217:

  1. [Services] Restart BGP service upon unexpected critical process exit.
  2. [sairedis] Advance pointer to origin/master (bigrebase) (#4217)
  3. [Mellanox] Integrate hw-mgmt 7.0000.3012 and advance the linux kernel
  4. [spytest] Create new sonic-mgmt docker for running spytests (#4212)

@kcudnik @lguohan @gechiang Could we analyze these commits, especially the sairedis changes?

This issue is observed on a TH1-based platform as well.

@lguohan
Collaborator

lguohan commented Apr 2, 2020

Then I think the leading suspect is commit #4217.

@kcudnik , what further information do you need to debug this #4217?

@lguohan
Collaborator

lguohan commented Apr 2, 2020

@ciju-juniper, here is what I found from the show tech log.

It looks like syncd is running at 100% CPU.

root      2989  107  2.5 798140 411332 pts/0   Sl   16:44  13:14 /usr/bin/syncd --diag -u -p /etc/sai.d/sai.profile

@kcudnik, I suspect there is a loop in syncd preventing it from responding to the orchagent query. Can you check what is causing syncd to run at 100%?
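
A rough sketch of how one might check this with standard procps tools (run from the SONiC host, where syncd's threads are visible; top's -H flag lists one line per thread/LWP):

# show per-thread CPU usage for syncd; the busiest LWP id can later be
# matched against "info threads" in gdb
top -b -H -n 1 -p "$(pidof syncd)" | head -n 40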

@ciju-juniper
Contributor Author

@lguohan That's right. I have the system running with the problematic image. Here is the top output showing the same:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3061 root 20 0 798136 412380 62820 S 108.9 2.5 40:14.36 syncd

@lguohan Please let me know if you need any further details. If needed, we can have a meeting for live debugging.

@lguohan
Collaborator

lguohan commented Apr 2, 2020

I'll let @kcudnik chime in on this issue. BTW, I do not see this issue on the vs platform; therefore, I suspect the issue is related to some interaction between the Broadcom SAI and syncd.

@kcudnik
Contributor

kcudnik commented Apr 2, 2020

@lguohan If syncd is running at 100%, that's interesting. I see that it was started with the "--diag" shell; does this syncd already contain the updated patch that unlocks the mutex before starting the diag shell (sonic-net/sonic-sairedis#571)? Can we get gdb connected to see all threads' stack traces?
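
A minimal sketch of how the all-threads backtrace could be captured (assumes gdb and pidof are available inside the syncd container; if gdb is missing, run "apt-get update && apt-get install -y gdb" inside the container first):

# run gdb in batch mode against syncd inside its container and dump every thread's stack
docker exec syncd sh -c 'gdb -p "$(pidof syncd)" --batch -ex "thread apply all bt"' > syncd_threads.txt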

@kcudnik
Contributor

kcudnik commented Apr 2, 2020

This line from syslog is interesting:
Apr 1 16:44:11.954834 sonic NOTICE syncd#syncd: :- startDiagShell: starting diag shell thread for switch RID oid:0x0

Not sure why the switch RID is 0x0; it should be a real object ID, so something went wrong there.
OK, I found the issue: sonic-net/sonic-sairedis#587

My suspicion is that when the RID is zero, as in this syslog dump, it causes the vendor SAI to hang or spin forever, since that is the last line in syslog from syncd; after that, orchagent timed out waiting for a response from syncd.

@gechiang
Collaborator

gechiang commented Apr 3, 2020

@ciju-juniper Looks like Kamil has root caused this issue and provided a fix for this. Please wait for the next image that contains the fix for sonic-net/sonic-sairedis#587 and validate it.
Thanks!

@ciju-juniper
Contributor Author

@gechiang The patch is not yet available in sonic-buildimage.git. I manually applied the patch and started a build. I will update with the results.

@kcudnik
Contributor

kcudnik commented Apr 3, 2020

My fix just ensures the correct RID is passed. I don't know if this is what is causing 100% CPU in the vendor SAI; if it is, then this will be the fix. Let's wait for @ciju-juniper to confirm.

@ciju-juniper
Contributor Author

@gechiang I still don't see sonic-net/sonic-sairedis#587 in the master of sonic-buildimage.git. The Jenkins build has also not started for this commit. Am I missing anything?

I tried manually building an image with the sonic-net/sonic-sairedis#587 patch but hit #4366. Totally stuck at the moment.

@ciju-juniper
Contributor Author

@lguohan @kcudnik @gechiang I built an image with the sonic-net/sonic-sairedis#587 patch. Still no luck: orchagent has stopped after hitting a timeout and 'syncd' is running at 100%.

Attached the generate_dump archive.
sonic_dump_sonic_20200404_122907.tar.gz

@kcudnik
Contributor

kcudnik commented Apr 4, 2020

Could you run one more test, but with the diag shell disabled? I took a look at your logs and I don't see anything suspicious on the syncd side :/
Also, it would be nice to have gdb attached to syncd while it's running to see which thread is taking that 100%.

@ciju-juniper
Contributor Author

@kcudnik Could you tell me how to disable diag shell?

I will initiate another build with debugging tools installed.

@kcudnik
Contributor

kcudnik commented Apr 4, 2020

That will require modifying the startup script for syncd:
src/sonic-sairedis/syncd/scripts/syncd_init_common.sh:CMD_DSSERVE_ARGS="$CMD_SYNCD --diag"

@ciju-juniper
Contributor Author

@kcudnik I did the following in /usr/bin/syncd_init_common.sh

#if [ -x $CMD_DSSERVE ]; then
#    CMD=$CMD_DSSERVE
#    CMD_ARGS=$CMD_DSSERVE_ARGS
#else
    CMD=$CMD_SYNCD
    CMD_ARGS=
#fi

After a reboot of the system, interfaces are showing up, so disabling the diag shell seems to help.

I will post the gdb output shortly.

@ciju-juniper
Contributor Author

ciju-juniper commented Apr 4, 2020

@kcudnik Here is the stack trace after attaching gdb to syncd. This is from the faulty scenario where the interfaces are not up.
(gdb) bt
#0 0x00007fbafd04ef7c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007fbafd048bb5 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x0000564ad5b7954a in __gthread_mutex_lock (__mutex=0x564ad7817790) at /usr/include/x86_64-linux-gnu/c++/6/bits/gthr-default.h:748
#3 std::mutex::lock (this=0x564ad7817790) at /usr/include/c++/6/bits/std_mutex.h:103
#4 std::lock_guard<std::mutex>::lock_guard (__m=..., this=) at /usr/include/c++/6/bits/std_mutex.h:162
#5 syncd::VendorSai::get (this=0x564ad7817780, objectType=SAI_OBJECT_TYPE_SWITCH, objectId=13362198927839854592, attr_count=1, attr_list=0x564ad7825b20)
at VendorSai.cpp:234
#6 0x0000564ad5af7a0e in syncd::Syncd::processOidGet (this=this@entry=0x564ad7817940, objectType=objectType@entry=SAI_OBJECT_TYPE_SWITCH,
strObjectId="oid:0x21", '0' <repeats 12 times>, attr_count=attr_count@entry=1, attr_list=attr_list@entry=0x564ad7825b20) at Syncd.cpp:1607
#7 0x0000564ad5afe9c3 in syncd::Syncd::processOid (this=this@entry=0x564ad7817940, objectType=SAI_OBJECT_TYPE_SWITCH,
strObjectId="oid:0x21", '0' <repeats 12 times>, api=api@entry=SAI_COMMON_API_GET, attr_count=attr_count@entry=1,
attr_list=attr_list@entry=0x564ad7825b20) at Syncd.cpp:1412
#8 0x0000564ad5b04a9c in syncd::Syncd::processQuadEvent (this=this@entry=0x564ad7817940, api=api@entry=SAI_COMMON_API_GET,
kco=std::tuple containing = {...}) at Syncd.cpp:1320
#9 0x0000564ad5b05f50 in syncd::Syncd::processSingleEvent (this=this@entry=0x564ad7817940, kco=std::tuple containing = {...}) at Syncd.cpp:278
#10 0x0000564ad5b06241 in syncd::Syncd::processEvent (this=this@entry=0x564ad7817940, consumer=...) at Syncd.cpp:246
#11 0x0000564ad5b07a38 in syncd::Syncd::run (this=0x564ad7817940) at Syncd.cpp:3417
#12 0x0000564ad5af50fa in syncd_main (argc=, argv=) at syncd.cpp:71
#13 0x0000564ad5af479e in main (argc=5, argv=0x7ffeb70ac238) at main.cpp:9

Let me know if you need any further details.

@ciju-juniper
Contributor Author

@kcudnik Attaching the back trace for all the threads of syncd
gdb_4347.txt

@kcudnik
Contributor

kcudnik commented Apr 5, 2020

Thanks for the debug thread dump. From it I don't know why there would be 100% CPU usage, but I see another thing: the main thread is stuck waiting for a mutex on the "GET" API querying a switch attribute while the diag shell is started. I don't see the diag shell grabbing that mutex, which is good, but I would still like to try this without starting the diag shell.

I don't see any other thread blocking on the same mutex as the main thread, which is interesting here.

Can you confirm that your code has this patch: https://github.com/Azure/sonic-sairedis/pull/571/files ?
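
A small sketch of one way to confirm this (assumes the submodule's origin is the GitHub repo, so the PR head ref can be fetched; a squash-merge would not be detected this way and the file would have to be inspected instead):

cd src/sonic-sairedis
git log -1 --format='%h %ad %s' --date=short            # commit the submodule is checked out at
git fetch origin pull/571/head                          # fetch the head of PR #571 from GitHub
git merge-base --is-ancestor FETCH_HEAD HEAD && echo "PR 571 present" || echo "PR 571 missing"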

@ciju-juniper
Contributor Author

@kcudnik I don't see this patch (https://github.com/Azure/sonic-sairedis/pull/571/files) in the 'syncd/VendorSai.cpp'

I'm at commit fe94170 of the sonic-sairedis repo. This commit was made on March 11 (way too old).

I'm at commit ea38d06 of sonic-buildimage.git. This commit is the latest (April 3rd).

Is the 'sonic-sairedis' repo properly synced up with sonic-buildimage.git? Are there any breakages?
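
A quick sketch of how to check whether the checkout matches what sonic-buildimage pins (plain git submodule commands, run from the sonic-buildimage root):

git submodule status src/sonic-sairedis         # a '+' prefix means the checkout differs from the pinned commit
git submodule sync src/sonic-sairedis           # refresh the submodule URL if it changed
git submodule update --init src/sonic-sairedis  # move the checkout back to the pinned commit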

@kcudnik
Contributor

kcudnik commented Apr 5, 2020

Can you cherry-pick that commit?

@ciju-juniper
Contributor Author

@kcudnik With this patch (https://github.com/Azure/sonic-sairedis/pull/571/files) in the 'syncd/VendorSai.cpp', interfaces are coming up with the diag shell.

Syncd is still running at >100% CPU.

Attached the syncd back traces from gdb:
gdb_4347_with_571.txt

@gechiang
Collaborator

gechiang commented Apr 6, 2020

I took a quick look at the syncd back traces you collected.
It seems to me that there is one thread that may be contributing to the "busy" state; the following looks like some counter-statistics collection running:

Thread 8 (Thread 0x7fe710d56700 (LWP 40)):
#0 0x00007fe720c2ecbb in pthread_spin_trylock () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007fe723a45394 in sal_spinlock_lock () from /usr/lib/libsai.so.1
#2 0x00007fe72325ee81 in soc_counter_collect_th_non_dma_entries () from /usr/lib/libsai.so.1
#3 0x00007fe723266937 in soc_counter_thread () from /usr/lib/libsai.so.1
#4 0x00007fe723a458bd in ?? () from /usr/lib/libsai.so.1
#5 0x00007fe720c284a4 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6 0x00007fe71f8ded0f in clone () from /lib/x86_64-linux-gnu/libc.so.6

@kcudnik
Contributor

kcudnik commented Apr 6, 2020

@ciju-juniper The patch unlocked the deadlock happening when starting the diag shell.
And from the new dump, as @gechiang noticed, thread 8 looks suspicious: it is waiting in a spin lock trying to take the lock, which is probably what causes the 100% CPU (not sure why it was 100% CPU before the patch, since the flex counter collection was not running yet at that point). But from this new dump it seems like maybe the flex counter collection is causing the 100%, or something in the SAI internals is causing it :/

@ciju-juniper Is there a way to test the previous (working) SAI from your side on the sonic-buildimage setup that you have now? If the previous build runs fine, then that would indicate that something is wrong with the new SAI.
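
A rough sketch of one way to try that (the package file name and flow here are assumptions based on this thread, not a verified procedure): copy the previously working Broadcom SAI deb into the syncd container, install it over the current one, and restart.

docker cp libsaibcm_3.7.3.3-2_amd64.deb syncd:/tmp/           # assumed file name of the older SAI package
docker exec syncd dpkg -i /tmp/libsaibcm_3.7.3.3-2_amd64.deb
sudo reboot                                                    # simplest way to make syncd reload the replaced libsai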

@ciju-juniper
Contributor Author

@kcudnik In the previous working image, syncd also occupies >100% CPU. I don't have the debug output from that image; I can try rebuilding it if you need it. Let me know.

@kcudnik How can we get the https://github.com/Azure/sonic-sairedis/pull/571/files patch into the sonic-buildimage master branch?

@kcudnik
Contributor

kcudnik commented Apr 6, 2020

You can cherry-pick https://github.com/Azure/sonic-sairedis/pull/571/files onto your build image in the src/sonic-sairedis/ directory.

As for the 100% usage, do you have any version of libsai that was previously built and working fine?
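
For the cherry-pick above, a minimal sketch (assuming the submodule's origin points at the GitHub repo, so the PR head ref can be fetched directly):

cd src/sonic-sairedis
git fetch origin pull/571/head     # fetch the head commit of PR #571
git cherry-pick FETCH_HEAD         # apply it on top of the current submodule checkout
cd ../..                           # then rebuild the image as usual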

@ciju-juniper
Contributor Author

@kcudnik We need the fix in sonic-buildimage master, as we/customers will take the image from Jenkins directly. Also, it's impacting all the Broadcom platforms. When will the fix be available in the master branch? What's the process involved?

@ciju-juniper
Contributor Author

@kcudnik We will dig through the various images to see if any one of them had Syncd working fine.

@kcudnik
Contributor

kcudnik commented Apr 6, 2020

Advancing the pointer: #4379

@kcudnik
Contributor

kcudnik commented Apr 6, 2020

But it also needs to be updated to fix the MTU issue; I'm working on it and will post updates.

@gechiang
Collaborator

gechiang commented Apr 8, 2020

@ciju-juniper Wondering if you were able to find a previous image with Syncd not hitting 100% usage?

@lguohan lguohan linked a pull request Apr 8, 2020 that will close this issue
@lguohan
Collaborator

lguohan commented Apr 9, 2020

@ciju-juniper , please reopen the ticket if you still see this issue.

@ciju-juniper
Contributor Author

@lguohan I will verify again once you accept #4376.
