Orchagent crashes after moving to libsaibcm_3.7.3.3-3 #4347
Comments
@daall Please help resolve this issue. Let me know if you need any further details. We verified the image prior to this commit and it was fine.
@gechiang, can you please take a look first?
It looks like orchagent crashed due to a TIMEOUT. syncd itself seems fine.
I do not see that syncd crashed, so I am not sure what the root cause is. It seems to me that syncd just did not respond to the SAI query.
Here is some information that I gathered from the compressed dump file:
From syslog.1:
Perhaps the second syslog is capturing the 2nd attempt, which also failed the same way?
From sairedis.rec I observed the following in regard to the Virtual Router ID access:
2020-04-01.14:59:00.857756|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_INIT_SWITCH=true|SAI_SWITCH_ATTR_FDB_EVENT_NOTIFY=0x558f81c8ccc0|SAI_SWITCH_ATTR_PORT_STATE_CHANGE_NOTIFY=0x558f81c8ccd0|SAI_SWITCH_ATTR_SWITCH_SHUTDOWN_REQUEST_NOTIFY=0x558f81c8cce0|SAI_SWITCH_ATTR_SRC_MAC_ADDRESS=78:4F:9B:65:6D:88
2020-04-01.16:33:01.940696|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_INIT_SWITCH=true|SAI_SWITCH_ATTR_FDB_EVENT_NOTIFY=0x560645a1acc0|SAI_SWITCH_ATTR_PORT_STATE_CHANGE_NOTIFY=0x560645a1acd0|SAI_SWITCH_ATTR_SWITCH_SHUTDOWN_REQUEST_NOTIFY=0x560645a1ace0|SAI_SWITCH_ATTR_SRC_MAC_ADDRESS=78:4F:9B:65:6D:88
Here is what I can see from the ASIC_DB.json file:
"COLDVIDS": {
Here is what is shown under saidump:
SAI_OBJECT_TYPE_SWITCH oid:0x21000000000000
SAI_OBJECT_TYPE_VIRTUAL_ROUTER oid:0x3000000000042
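For anyone comparing recordings the same way, here is a minimal Python sketch that splits a sairedis.rec line into its fields. The layout (timestamp, one-letter operation, object key, then `NAME=value` attributes, all separated by `|`) is inferred only from the two lines quoted above, and `c` is assumed to mean "create". `RecEntry`, `parse_rec_line`, and `switch_creates` are hypothetical helper names, not part of sairedis.

```python
from dataclasses import dataclass, field


@dataclass
class RecEntry:
    """One parsed recording line (field names are illustrative)."""
    timestamp: str
    op: str
    object_key: str
    attrs: dict = field(default_factory=dict)


def parse_rec_line(line: str) -> RecEntry:
    """Split 'timestamp|op|object_key|ATTR=value|...' into a RecEntry."""
    parts = line.strip().split("|")
    if len(parts) < 3:
        raise ValueError(f"not a recording line: {line!r}")
    timestamp, op, object_key = parts[:3]
    attrs = {}
    for item in parts[3:]:
        # Split on the first '=' only, so values such as MAC addresses
        # or pointers are kept intact.
        name, _, value = item.partition("=")
        attrs[name] = value
    return RecEntry(timestamp, op, object_key, attrs)


def switch_creates(lines):
    """Return entries that look like switch-create operations.

    Assumes op 'c' means create; blank lines are skipped.
    """
    entries = (parse_rec_line(ln) for ln in lines if ln.strip())
    return [
        e for e in entries
        if e.op == "c" and e.object_key.startswith("SAI_OBJECT_TYPE_SWITCH:")
    ]
```

Running `switch_creates` over the recording would list each switch-create attempt with its timestamp, which makes it easier to line up the first and second attempts against syslog.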
@gechiang, it looks like this is a universal problem on Broadcom. We got reports that this problem happens on TD3, TH2, and TH3 platforms, and it is likely to happen on TH and TD2 as well. Thanks for your logs, but I cannot get much useful information from them. Can you load the master image on a Broadcom switch and see if you can repro this problem?
We have further narrowed down this issue to between two images. Build #216 (Mar 4, 2020 12:22:35 AM) is the last good build, on which things are fine. Build #217 (Mar 5, 2020 12:22:33 AM) is the build which has the problem. These are the commits that went into Build #217:
@kcudnik @lguohan @gechiang Could we analyze these commits, especially the sairedis changes? This issue is also observed on a TH1-based platform.
@ciju-juniper, here is what I found from the show tech log. It looks like syncd is running at 100% CPU.
@kcudnik, I suspect there is a loop in syncd that prevents it from responding to the orchagent query. Can you check what is causing syncd to run at 100%?
@lguohan That's right. I have the system running with the problematic image. Here is the top output indicating the same:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
@lguohan Please let me know if you need any further details. If needed, we can have a meeting for live debugging.
I'll let @kcudnik chime in on this issue. By the way, I do not see this issue on the vs platform. Therefore, I suspect the issue is related to some interaction between the Broadcom SAI and syncd.
@lguohan if syncd is running at 100%, that's interesting. I see that it was started with the "--diag" shell. Does this syncd already contain the updated patch that unlocks the mutex before starting the diag shell (sonic-net/sonic-sairedis#571)? Can we get gdb connected to see the stack traces of all threads?
This line from syslog is interesting: Not sure why the switch RID is 0x0; it should be a real object id, so something went wrong there. My suspicion is that when the RID is zero, as in this syslog dump, it causes the vendor SAI to hang or spin forever, since that is the last line in syslog from syncd. After that, orchagent timed out waiting for a response from syncd.
@ciju-juniper Looks like Kamil has root-caused this issue and provided a fix for it. Please wait for the next image that contains the fix from sonic-net/sonic-sairedis#587 and validate it.
@gechiang The patch is not yet available in sonic-buildimage.git. I manually applied the patch and started a build. I will update with the results.
My fix just makes sure the correct RID is passed. I don't know if this is what causes the 100% CPU in the vendor SAI; if yes, then this will be the fix. Let's wait for @ciju-juniper to confirm.
@gechiang I still don't see sonic-net/sonic-sairedis#587 in master of sonic-buildimage.git. The Jenkins build has also not started for this commit. Am I missing anything? I tried manually building an image with the sonic-net/sonic-sairedis#587 patch but hit #4366. Totally stuck at this moment.
@lguohan @kcudnik @gechiang I built an image with the patch sonic-net/sonic-sairedis#587. Still no luck. Orchagent stopped after hitting a timeout and 'syncd' is running at 100%. Attached the generate_dump archive.
Could you run one more test with the diag shell disabled? I took a look at your logs and I don't see anything suspicious on the syncd side :/
@kcudnik Could you tell me how to disable the diag shell? I will initiate another build with debugging tools installed.
That will require modifying the startup script for syncd:
@kcudnik I did the following in /usr/bin/syncd_init_common.sh:
After a 'reboot' of the system, interfaces are showing up. So disabling the diag shell seems to help. I will post the gdb output shortly.
@kcudnik Here is the stack trace after attaching gdb to syncd. This is from the faulty scenario where interfaces are not up. Let me know if you need any further details.
@kcudnik Attaching the backtrace for all the threads of syncd.
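The thread does not show the exact command used to collect these traces. As a hedged sketch, one common way to capture backtraces for every thread of a running process is gdb in batch mode with `thread apply all bt`. The helper names below are hypothetical, and the PID of syncd has to be looked up separately (in the environment where syncd actually runs).

```python
import subprocess


def gdb_backtrace_argv(pid: int) -> list:
    """Build a gdb command line that dumps all thread backtraces.

    Uses standard gdb options: -p attaches to a PID, -batch exits after
    running the commands, -ex runs one gdb command.
    """
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_all_threads(pid: int) -> str:
    """Run gdb against the process and return its textual output.

    Requires gdb to be installed and permission to attach to the process.
    """
    result = subprocess.run(
        gdb_backtrace_argv(pid),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Calling `dump_all_threads(pid)` a few seconds apart and comparing the output can help spot which thread keeps changing, which is useful when chasing a busy loop.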
Thanks for the gdb thread dump. From it I can't tell why there would be 100% CPU usage, but I see another thing: the main thread is stuck waiting for a mutex in the "GET" API querying a switch attribute while the diag shell is started. I don't see the diag shell thread grabbing the mutex, which is good, but I would still like to try this without starting the diag shell. I don't see any other thread blocking on the same mutex as the main thread, which is interesting. Can you confirm that your code has this patch: https://github.com/Azure/sonic-sairedis/pull/571/files ?
@kcudnik I don't see this patch (https://github.com/Azure/sonic-sairedis/pull/571/files) in 'syncd/VendorSai.cpp'. I'm at commit fe94170 of the sonic-sairedis repo. This commit is from March 11 (way too old). I'm at commit ea38d06 of sonic-buildimage.git, which is the latest (April 3rd). Is the 'sonic-sairedis' repo properly synced up with sonic-buildimage.git? Are there any breakages?
Can you cherry-pick that commit?
@kcudnik With this patch (https://github.com/Azure/sonic-sairedis/pull/571/files) in 'syncd/VendorSai.cpp', interfaces are coming up with the diag shell enabled. Syncd is still running at >100% CPU. Attached the syncd backtraces from gdb:
I took a quick look at the syncd backtraces you collected.
Thread 8 (Thread 0x7fe710d56700 (LWP 40)):
@ciju-juniper the patch fixed the deadlock that happened when starting the diag shell.
@ciju-juniper is there a way to test the previous (working) SAI from your side on the sonic-buildimage that you have set up now? If the previous build runs fine, that would indicate that something is wrong with the new SAI.
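To illustrate the kind of deadlock described here, the following Python sketch models a single API mutex that is held while a long-running, blocking "diag shell" call starts, so a later GET from another thread can never acquire it. This is only an analogy under stated assumptions: it is not the sairedis code, and `api_lock`, `start_diag_shell`, `get_switch_attribute`, and `run_demo` are hypothetical names. The `unlock_first=True` path mirrors the idea of releasing the mutex before entering the blocking call.

```python
import threading

api_lock = threading.Lock()  # stands in for a global API mutex


def diag_shell(entered, stop):
    """Simulated blocking shell: runs until `stop` is set."""
    entered.set()
    stop.wait()


def start_diag_shell(entered, stop, unlock_first):
    with api_lock:
        if not unlock_first:
            # Lock stays held for the whole lifetime of the shell.
            diag_shell(entered, stop)
            return
        # Release the lock before blocking, re-acquire afterwards so the
        # enclosing `with` can release it normally.
        api_lock.release()
        try:
            diag_shell(entered, stop)
        finally:
            api_lock.acquire()


def get_switch_attribute(timeout):
    """Simulated GET: returns None if the lock cannot be taken in time."""
    if not api_lock.acquire(timeout=timeout):
        return None
    try:
        return "attribute-value"
    finally:
        api_lock.release()


def run_demo(unlock_first, timeout=0.2):
    entered, stop = threading.Event(), threading.Event()
    shell = threading.Thread(
        target=start_diag_shell, args=(entered, stop, unlock_first)
    )
    shell.start()
    entered.wait()  # make sure the shell is already blocking
    result = get_switch_attribute(timeout)
    stop.set()
    shell.join()
    return result
```

With `unlock_first=False` the simulated GET times out, which corresponds to a caller stuck waiting on the mutex; with `unlock_first=True` it returns normally.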
@kcudnik In the previous working image, syncd is also occupying >100% CPU. I don't have the debug output from this image. I can try rebuilding it, if you need it. Let me know.
@kcudnik How can we get the https://github.com/Azure/sonic-sairedis/pull/571/files patch into the sonic-buildimage master branch?
You can cherry-pick https://github.com/Azure/sonic-sairedis/pull/571/files onto your build image in the directory src/sonic-sairedis/. As for the 100% usage, do you have any version of libsai that was previously built and working fine?
@kcudnik We need the fix in sonic-buildimage master, as we/customers will take the image from Jenkins directly. Also, it's impacting all the Broadcom platforms. When will the fix be available in the master branch? What's the process involved?
@kcudnik We will dig through the various images to see if any one of them had syncd working fine.
advancing pointer #4379
but it also needs to be updated to fix the MTU issue; I'm working on it and will post updates.
@ciju-juniper Wondering if you were able to find a previous image with syncd not hitting 100% usage?
@ciju-juniper, please reopen the ticket if you still see this issue.
The orchagent process crashes on the Juniper QFX5210 platform after the SAI library was moved to libsaibcm_3.7.3.3-3.
Prior to this commit, interfaces used to come up with libsaibcm_3.7.3.3-2:
590caaf
sonic_dump_sonic_20200401_165619.tar.gz
root@sonic:/var/dump# show version
SONiC Software Version: SONiC.HEAD.220-590caaf5
Distribution: Debian 9.12
Kernel: 4.9.0-11-2-amd64
Build commit: 590caaf
Build date: Sun Mar 8 05:59:38 UTC 2020
Built by: johnar@jenkins-worker-8
Platform: x86_64-juniper_qfx5210-r0
HwSKU: Juniper-QFX5210-64C
ASIC: broadcom