Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[warm-reboot] APPLY_VIEW failure in set operation "SAI_API_FDB:brcm_sai_set_fdb_entry_attribute" #8367

Closed
vaibhavhd opened this issue Aug 6, 2021 · 3 comments · Fixed by sonic-net/sonic-swss#1866

Comments

@vaibhavhd
Copy link
Contributor

Description

SYNCD and OA failures seen during APPLY_VIEW. Two issues that are brought up here:

  1. As part of APPLY_VIEW, SET operation brcm_sai_set_fdb_entry_attribute failed during asic_process_event: field: SAI_FDB_ENTRY_ATTR_ALLOW_MAC_MOVE, value: true
  2. As a result, syncd sent sendShutdownRequest to OA, OA also crashed because of kernel GP in free call with error corrupted unsorted chunks

Steps to reproduce the issue:

  1. Run test_advanced_reboot::test_warm_reboot
  2. The test would fail with LAG flap errors.
  3. Check syslog to find that there were syncd and Orchagent failures.

Describe the results you received:

Describe the results you expected:

SAI REDIS:

2021-08-06.13:25:47.038362|a|APPLY_VIEW
2021-08-06.13:25:49.669961|A|SAI_STATUS_FAILURE
2021-08-06.13:25:49.670077|n|switch_shutdown_request|{"switch_id":"oid:0x21000000000000"}|

Sylog:

Aug  6 13:25:49.431633 str2-7050cx3-acs-01 NOTICE syncd#syncd: :- asicGetWithOptimizedRemoveOperations: moved 12 REMOVE operations upper in stack from total 545 operations
Aug  6 13:25:49.431670 str2-7050cx3-acs-01 NOTICE syncd#syncd: :- asicGetWithOptimizedRemoveOperations: optimizing asic remove operations took 0.000348 sec
Aug  6 13:25:49.431670 str2-7050cx3-acs-01 WARNING syncd#syncd: [none] SAI_API_HOSTIF:_brcm_sai_set_cpu_queue_shaper:13005 Set CPU Queue 0 shaping: cir 0, cbs 0
Aug  6 13:25:49.431670 str2-7050cx3-acs-01 WARNING syncd#syncd: [none] SAI_API_HOSTIF:_brcm_sai_set_cpu_queue_shaper:13005 Set CPU Queue 1 shaping: cir 0, cbs 0
Aug  6 13:25:49.434215 str2-7050cx3-acs-01 INFO syncd#syncd: [none] SAI_API_FDB:brcm_sai_set_fdb_entry_attribute:965 FDB MAC:00-06-07-08-09-0A port_tid:0xb0000006 port_type:Port vlan_id:1000 vid:0x73e8

>>>> Aug  6 13:25:49.434215 str2-7050cx3-acs-01 ERR syncd#syncd: :- asic_process_event: field: SAI_FDB_ENTRY_ATTR_ALLOW_MAC_MOVE, value: true
>>>> Aug  6 13:25:49.434215 str2-7050cx3-acs-01 ERR syncd#syncd: :- asic_process_event: failed to execute api: set, key: SAI_OBJECT_TYPE_FDB_ENTRY:{"bvid":"oid:0x2600000000063d","mac":"00:06:07:08:09:0A","switch_id":"oid:0x21000000000000"}, status: SAI_STATUS_INVALID_ATTRIBUTE_0
Aug  6 13:25:49.434344 str2-7050cx3-acs-01 NOTICE syncd#syncd: :- executeOperationsOnAsic: asic apply took 0.034392 sec
Aug  6 13:25:49.434344 str2-7050cx3-acs-01 ERR syncd#syncd: :- executeOperationsOnAsic: Error while executing asic operations, ASIC is in inconsistent state: :- asic_process_event: failed to execute api: set, key: SAI_OBJECT_TYPE_FDB_ENTRY:{"bvid":"oid:0x2600000000063d","mac":"00:06:07:08:09:0A","switch_id":"oid:0x21000000000000"}, status: SAI_STATUS_INVALID_ATTRIBUTE_0
Aug  6 13:25:49.669511 str2-7050cx3-acs-01 NOTICE syncd#syncd: :- applyView: apply took 2.625379 sec
Aug  6 13:25:49.669801 str2-7050cx3-acs-01 ERR syncd#syncd: :- run: Runtime error: :- asic_process_event: failed to execute api: set, key: SAI_OBJECT_TYPE_FDB_ENTRY:{"bvid":"oid:0x2600000000063d","mac":"00:06:07:08:09:0A","switch_id":"oid:0x21000000000000"}, status: SAI_STATUS_INVALID_ATTRIBUTE_0
Aug  6 13:25:49.669933 str2-7050cx3-acs-01 NOTICE syncd#syncd: :- sendShutdownRequest: sending switch_shutdown_request notification to OA for switch: oid:0x21000000000000
Aug  6 13:25:49.670140 str2-7050cx3-acs-01 ERR swss#orchagent: :- syncd_apply_view: Failed to notify syncd APPLY_VIEW -1
Aug  6 13:25:49.670347 str2-7050cx3-acs-01 ERR swss#orchagent: :- on_switch_shutdown_request: Syncd stopped
Aug  6 13:25:49.670537 str2-7050cx3-acs-01 NOTICE syncd#syncd: :- sendShutdownRequestAfterException: notification send successfully
Aug  6 13:25:49.670653 str2-7050cx3-acs-01 INFO swss#/supervisord: orchagent free(): corrupted unsorted chunks

>>>>  Aug  6 13:25:49.672697 str2-7050cx3-acs-01 INFO kernel: [   56.028221] traps: orchagent[3039] general protection ip:7fbf45905d57 sp:7fbf44d18710 error:0 in libc-2.28.so[7fbf458ee000+148000]

Aug  6 13:25:50.773457 str2-7050cx3-acs-01 INFO swss#supervisord 2021-08-06 13:25:50,771 INFO exited: orchagent (terminated by SIGSEGV (core dumped); not expected)
Aug  6 13:25:51.786777 str2-7050cx3-acs-01 INFO swss#/supervisor-proc-exit-listener: Process 'orchagent' exited unexpectedly. Terminating supervisor 'swss'
Aug  6 13:25:56.326707 str2-7050cx3-acs-01 NOTICE root: Stopping swss service...

BT from core:

(gdb) info threads
  Id   Target Id                       Frame 
* 1    Thread 0x7fbf44d19700 (LWP 409) 0x00007fbf45905d57 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
  2    Thread 0x7fbf4551dbc0 (LWP 40)  0x00007fbf459453e1 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
  3    Thread 0x7fbf3ffff700 (LWP 57)  0x00007fbf459c57ef in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  4    Thread 0x7fbf3f7fe700 (LWP 58)  0x00007fbf459c57ef in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  5    Thread 0x7fbf3e7fc700 (LWP 60)  0x00007fbf459c57ef in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  6    Thread 0x7fbf4551a700 (LWP 55)  0x00007fbf459c57ef in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  7    Thread 0x7fbf3effd700 (LWP 59)  0x00007fbf459ba819 in poll () from /lib/x86_64-linux-gnu/libc.so.6

  
  (gdb) t 1
[Switching to thread 1 (Thread 0x7fbf44d19700 (LWP 409))]
#0  0x00007fbf45905d57 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007fbf45905d57 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fbf45905eba in exit () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00005636d6bae709 in on_switch_shutdown_request () at notifications.cpp:27
#3  0x00007fbf462d92d5 in sairedis::NotificationSwitchShutdownRequest::executeCallback (this=0x7fbf38000ea0, switchNotifications=...) at NotificationSwitchShutdownRequest.cpp:49
#4  0x00007fbf462b77cf in sairedis::RedisRemoteSaiInterface::handleNotification (this=0x5636d707a560, name=..., serializedNotification="{\"switch_id\":\"oid:0x21", '0' <repeats 12 times>, "\"}", 
    values=std::vector of length 0, capacity 1) at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#5  0x00007fbf462d7a21 in std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)>::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&) const (__args#2=std::vector of length 0, capacity 1, __args#1="{\"switch_id\":\"oid:0x21", '0' <repeats 12 times>, "\"}", __args#0="switch_shutdown_request", this=0x5636d7078128)
    at /usr/include/c++/8/bits/std_function.h:682
#6  sairedis::RedisChannel::notificationThreadFunction (this=0x5636d7078120) at RedisChannel.cpp:118
#7  0x00007fbf45ce7b2f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007fbf4632ffa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#9  0x00007fbf459c54cf in clone () from /lib/x86_64-linux-gnu/libc.so.6


(gdb) t 2
[Switching to thread 2 (Thread 0x7fbf4551dbc0 (LWP 40))]
#0  0x00007fbf459453e1 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007fbf459453e1 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fbf4594bc1a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fbf4594d8a2 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007fbf45db5beb in redisFree (c=0x5636d7063ef0) at hiredis.c:610
#4  0x00007fbf45f8a7ad in swss::Table::del (this=0x5636d7069740, key=...) at /usr/include/c++/8/bits/char_traits.h:352
#5  0x00007fbf45905eba in exit () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00005636d6b90f6b in syncd_apply_view () at main.cpp:117
#7  0x00005636d6b9bae1 in OrchDaemon::warmRestoreAndSyncUp (this=0x5636d708ed70) at orchdaemon.cpp:707
#8  0x00005636d6b9f244 in OrchDaemon::init (this=<optimized out>) at orchdaemon.cpp:529
#9  0x00005636d6b76b11 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:646
(gdb)

Output of show version:

admin@str2-7050cx3-acs-01:~$ show ver

SONiC Software Version: SONiC.master.27379-1bc3fdbf2
Distribution: Debian 10.10
Kernel: 4.19.0-12-2-amd64
Build commit: 1bc3fdbf2
Build date: Thu Aug  5 13:52:30 UTC 2021
Built by: AzDevOps@sonic-build-workers-000KJ9

Platform: x86_64-arista_7050cx3_32s
HwSKU: Arista-7050CX3-32S-C32
ASIC: broadcom
ASIC Count: 1
Serial Number: JPE20432334
Model Number: DCS-7050CX3-32S-SSD
Hardware Revision: N/A
Uptime: 22:41:09 up  4:57,  1 user,  load average: 0.37, 0.27, 0.27

SAI version:

$ docker exec -it syncd dpkg -s libsaibcm | head
Package: libsaibcm
Status: install ok installed
Priority: extra
Section: libs
Installed-Size: 245525
Maintainer: Guohan Lu <gulv@microsoft.com>
Architecture: amd64
Source: saibcm
Version: 5.0.0.6-1
Provides: libsai

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

@yxieca
Copy link
Contributor

yxieca commented Aug 10, 2021

We need to revert the follow PRs until the regression is addressed.

sonic-net/sonic-swss#1331
#4819

@Praveen-Brcm
Copy link
Contributor

@vaibhavhd : Can you please share the full syslog files, also the TechSupport file collected after the failure.? - Thanks.

@vaibhavhd
Copy link
Contributor Author

@Praveen-Brcm here is the syslog file. Techsupport I can pull up the next time I repro this.
syslog.107.log

vaibhavhd added a commit that referenced this issue Aug 17, 2021
Update sonic-swss submodule to include the latest changes:

Add autoconf-archive package to lgtm sonic-net/sonic-swss#1868
[cfgmgr] Update Makefile.am to consume lib zmq sonic-net/sonic-swss#1864
Fix warmboot issue PR##8367 sonic-net/sonic-swss#1866
judyjoseph added a commit that referenced this issue Aug 20, 2021
sonic-swss

e892dda Fix warmboot issue PR##8367 (#1866)
9c6023d Mclag enhacements support code changes. (#1331)

sonic-utilities

5465ea0 [MPLS][CLI] added config/show CLI for MPLS interface, MPLS CRM threshold config, updated CLI reference manual
3bac779  mclag enhancements as per HLD at sonic-net/SONiC#596 (#1138)
theasianpianist pushed a commit to theasianpianist/sonic-buildimage that referenced this issue Feb 5, 2022
During the WB when the MAC addresses are added via refillToSync(&m_fdbStateTable) , the attribute SAI_FDB_ENTRY_ATTR_ALLOW_MAC_MOVE is set wrongly for dynamic MAC as well.
Fix is to set SAI_FDB_ENTRY_ATTR_ALLOW_MAC_MOVE only for the case of VXLAN or MCLAG for fdb_type dyamic.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants