orchagent crashes due to transfer_attributes: src vs dst attr id don't match GET mismatch #3832

Open
stephenxs opened this issue Nov 30, 2019 · 24 comments

Comments

@stephenxs
Collaborator

stephenxs commented Nov 30, 2019

Description
On SONiC, orchagent crashes sporadically with the error message "orchagent failed due to transfer_attributes: src vs dst attr id don't match GET mismatch".
It looks like a sai-redis issue. My guess at the flow: some caller tried to fetch the attribute SAI_SWITCH_ATTR_EGRESS_BUFFER_POOL_NUM via sai-redis, but sai-redis was busy handling another request (or stuck on this one) and therefore blocked all other requests; later, when crm tried to fetch some counters, the SAI_SWITCH_ATTR_EGRESS_BUFFER_POOL_NUM reply finally arrived, causing the mismatch error.

Steps to reproduce the issue:
With crm and counters enabled, this issue can occur sporadically.

Describe the results you received:
Error logs

Nov  4 10:58:49.887605 dev-r-vrt-232-010 INFO syncd#supervisord: syncd Nov 04 10:58:49 NOTICE  SAI_NEXT_HOP_GROUP: mlnx_sai_nexthopgroup.c[688]- mlnx_create_next_hop_group_member: Add next hop 1 hops : [ 2] to next hop group id 11
Nov  4 10:58:49.895194 dev-r-vrt-232-010 INFO syncd#supervisord: syncd Nov 04 10:58:49 NOTICE  SAI_NEXT_HOP_GROUP: mlnx_sai_nexthopgroup.c[658]- mlnx_create_next_hop_group_member: Create next hop group member, #0 NEXT_HOP_GROUP_ID=NEXT_HOP_GROUP,(0:0),b,0000,0 #1 NEXT_HOP_ID=NEXT_HOP,(0:0),4,0000,0
Nov  4 10:58:49.895349 dev-r-vrt-232-010 INFO syncd#supervisord: syncd Nov 04 10:58:49 NOTICE  SAI_NEXT_HOP_GROUP: mlnx_sai_nexthopgroup.c[688]- mlnx_create_next_hop_group_member: Add next hop 1 hops : [ 4] to next hop group id 11
Nov  4 10:58:49.966342 dev-r-vrt-232-010 INFO syncd#supervisord: syncd Nov 04 10:58:49 NOTICE  SAI_NEXT_HOP_GROUP: mlnx_sai_nexthopgroup.c[751]- mlnx_remove_next_hop_group_member: Remove next hop 2 from next hop group 11
Nov  4 10:58:49.975166 dev-r-vrt-232-010 INFO syncd#supervisord: syncd Nov 04 10:58:49 NOTICE  SAI_NEXT_HOP_GROUP: mlnx_sai_nexthopgroup.c[751]- mlnx_remove_next_hop_group_member: Remove next hop 4 from next hop group 11
Nov  4 10:58:49.976791 dev-r-vrt-232-010 INFO syncd#supervisord: syncd Nov 04 10:58:49 NOTICE  SAI_NEXT_HOP_GROUP: mlnx_sai_nexthopgroup.c[299]- mlnx_remove_next_hop_group: Remove next hop group next hop group id 11
Nov  4 10:58:50.817594 dev-r-vrt-232-010 NOTICE syncd#syncd: :- threadFunction:  span < 0 = -7 at 1572865130817223
Nov  4 10:58:50.817594 dev-r-vrt-232-010 NOTICE syncd#syncd: :- threadFunction:  new span  = 815949
Nov  4 10:58:50.921746 dev-r-vrt-232-010 NOTICE syncd#syncd: :- saiUpdateSupportedQueueCounters: QUEUE_STAT_COUNTER: counter SAI_QUEUE_STAT_DROPPED_BYTES is not supported on queue oid:0x1590000060015, rv: SAI_STATUS_ATTR_NOT_SUPPORTED_0
Nov  4 10:58:51.817785 dev-r-vrt-232-010 NOTICE syncd#syncd: :- threadFunction:  span < 0 = -7 at 1572865131817415
Nov  4 10:58:51.818103 dev-r-vrt-232-010 NOTICE syncd#syncd: :- threadFunction:  new span  = 891991
Nov  4 10:58:51.976938 dev-r-vrt-232-010 INFO syncd#supervisord: syncd Nov 04 10:58:51 NOTICE  SAI_NEXT_HOP_GROUP: mlnx_sai_nexthopgroup.c[751]- mlnx_remove_next_hop_group_member: Remove next hop 2 from next hop group 13
Nov  4 10:58:52.001444 dev-r-vrt-232-010 INFO syncd#supervisord: syncd Nov 04 10:58:52 NOTICE  SAI_NEXT_HOP_GROUP: mlnx_sai_nexthopgroup.c[751]- mlnx_remove_next_hop_group_member: Remove next hop 4 from next hop group 13
Nov  4 10:58:52.013450 dev-r-vrt-232-010 INFO syncd#supervisord: syncd Nov 04 10:58:52 NOTICE  SAI_NEXT_HOP_GROUP: mlnx_sai_nexthopgroup.c[751]- mlnx_remove_next_hop_group_member: Remove next hop 8 from next hop group 13
Nov  4 10:58:52.016801 dev-r-vrt-232-010 INFO syncd#supervisord: syncd Nov 04 10:58:52 NOTICE  SAI_NEXT_HOP_GROUP: mlnx_sai_nexthopgroup.c[299]- mlnx_remove_next_hop_group: Remove next hop group next hop group id 13
Nov  4 10:58:52.021945 dev-r-vrt-232-010 ERR swss#orchagent: :- transfer_attributes: src vs dst attr id don't match GET mismatch
Nov  4 10:58:52.023227 dev-r-vrt-232-010 INFO swss#supervisord: orchagent terminate called after throwing an instance of 'std::runtime_error'
Nov  4 10:58:52.023461 dev-r-vrt-232-010 INFO swss#supervisord: orchagent   what():  :- transfer_attributes: src vs dst attr id don't match GET mismatch
Nov  4 10:58:52.034393 dev-r-vrt-232-010 NOTICE syncd#syncd: :- saiUpdateSupportedQueueCounters: QUEUE_STAT_COUNTER: counter SAI_QUEUE_STAT_DROPPED_BYTES is not supported on queue oid:0x1590000070015, rv: SAI_STATUS_ATTR_NOT_SUPPORTED_0
Nov  4 10:58:52.163989 dev-r-vrt-232-010 NOTICE syncd#syncd: :- saiUpdateSupportedQueueCounters: QUEUE_STAT_COUNTER: counter SAI_QUEUE_STAT_BYTES is not supported on queue oid:0x1590000080015, rv: SAI_STATUS_ATTR_NOT_SUPPORTED_0
Nov  4 10:58:52.185425 dev-r-vrt-232-010 NOTICE syncd#syncd: :- saiUpdateSupportedQueueCounters: QUEUE_STAT_COUNTER: counter SAI_QUEUE_STAT_DROPPED_PACKETS is not supported on queue oid:0x1590000080015, rv: SAI_STATUS_ATTR_NOT_SUPPOR

Coredump analysis

Most of the coredumps share the same characteristics:

  1. The backtraces are the same: CrmOrch called getResAvailableCounters but received a result whose attribute id does not match the one it requested.
  2. In all cases the src id returned by sairedis is the same (49, SAI_SWITCH_ATTR_EGRESS_BUFFER_POOL_NUM), while the dst id queried by CrmOrch differs.
  3. There are a lot of "G|SAI_STATUS_FAILURE" entries in the sairedis.rec* files around the time of failure. For each failure, the interval between the timestamps of any two consecutive "G|SAI_STATUS_FAILURE" entries is 1 minute, which is the timeout used in internal_redis_generic_get (see the sketch after this list).
  4. There are a lot of logs like "NOTICE syncd#syncd: :- threadFunction: span < 0 = -472 at" ahead of orchagent's error log.
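
To make the one-minute spacing in point 3 concrete, here is a minimal sketch of a GET reply wait with a fixed 60-second timeout; it is not the actual sairedis implementation, and the `ReplyChannel`/`waitForGetResponse` names are hypothetical. If syncd is still busy with earlier work, the wait expires after exactly one minute and the GET is reported as a failure, even though the real reply may still arrive later:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// Simplified stand-ins for the real SAI status codes.
enum sketch_status_t { STATUS_SUCCESS = 0, STATUS_FAILURE = -1 };

// Hypothetical reply channel: the requester blocks here for at most 60
// seconds; a timeout maps to the G|SAI_STATUS_FAILURE entries that show up
// exactly one minute after each GET in sairedis.rec.
struct ReplyChannel
{
    static constexpr std::chrono::seconds GET_RESPONSE_TIMEOUT{60};

    std::mutex mtx;
    std::condition_variable cv;
    std::queue<std::string> replies;   // serialized attribute payloads

    sketch_status_t waitForGetResponse(std::string &out)
    {
        std::unique_lock<std::mutex> lock(mtx);
        if (!cv.wait_for(lock, GET_RESPONSE_TIMEOUT,
                         [this] { return !replies.empty(); }))
        {
            // Timed out: the GET is reported as failed, but syncd may still
            // deliver the reply later, after the requester has moved on.
            return STATUS_FAILURE;
        }
        out = std::move(replies.front());
        replies.pop();
        return STATUS_SUCCESS;
    }
};
```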

Assumptions

The following is my assumption (which is not yet proven).
Someone tried to fetch the attribute SAI_SWITCH_ATTR_EGRESS_BUFFER_POOL_NUM, but for some reason it took sairedis a very long time to get the result, so the fetch timed out and returned.
Right after that, CrmOrch tried to fetch counters via redis. The first few fetches also failed, and then the reply to the previously issued SAI_SWITCH_ATTR_EGRESS_BUFFER_POOL_NUM request arrived while CrmOrch was fetching some other counter, causing it to fail.
I think it goes something like the following sequence chart (a sketch of the failing check follows the chart):

sequence    orchagent side                                      sai side
0           sb fetching SAI_SWITCH_ATTR_EGRESS_BUFFER_POOL_NUM  
                                                                handling or blocked for other reason
            timeout
1           CrmOrch fetching 1st counters                       
                                                                still handling SAI_SWITCH_ATTR_EGRESS_BUFFER_POOL_NUM
            timeout
2           CrmOrch fetching 2nd counters                       
                                                                still handling SAI_SWITCH_ATTR_EGRESS_BUFFER_POOL_NUM
            timeout
            ... ...
n           CrmOrch fetching n-th counters                      
                                                                SAI_SWITCH_ATTR_EGRESS_BUFFER_POOL_NUM returned
            dst attr_id mismatched with src attr_id, causing orchagent abort
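
For reference, the abort comes from the attribute-id comparison inside transfer_attributes. A simplified sketch of that check (not the exact sairedis source; the value union and metadata handling are omitted) shows why a late reply to an earlier GET kills the process:

```cpp
#include <stdexcept>

// Simplified attribute: the real sai_attribute_t also carries a value union.
struct attr_sketch_t { int id; };

// For each attribute, the id in the reply (src) must equal the id that was
// requested (dst); otherwise the GET answer cannot belong to this request.
void transfer_attributes_sketch(unsigned attr_count,
                                const attr_sketch_t *src_attr_list,
                                attr_sketch_t *dst_attr_list)
{
    for (unsigned i = 0; i < attr_count; ++i)
    {
        if (src_attr_list[i].id != dst_attr_list[i].id)
        {
            // In the coredumps the src id is always 49
            // (SAI_SWITCH_ATTR_EGRESS_BUFFER_POOL_NUM) while the dst id is
            // whatever CrmOrch is querying at that moment, so a late reply
            // to the earlier GET lands here and orchagent aborts.
            throw std::runtime_error("src vs dst attr id don't match GET mismatch");
        }
        dst_attr_list[i] = src_attr_list[i];   // copy value once ids agree
    }
}
```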

Coredump details
I analyzed 3 coredumps; they are similar.
1.

Program terminated with signal SIGABRT, Aborted.
#0  0x00007fa9c4b59fff in raise () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7fa9c6b4fb80 (LWP 82))]
(gdb) bt
#0  0x00007fa9c4b59fff in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fa9c4b5b42a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fa9c54720ad in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fa9c5470066 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fa9c54700b1 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fa9c54702c9 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fa9c5dec916 in swss::Logger::wthrow(swss::Logger::Priority, char const*, ...) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#7  0x00007fa9c5ba36c2 in transfer_attributes (object_type=object_type@entry=SAI_OBJECT_TYPE_SWITCH, attr_count=attr_count@entry=1, src_attr_list=<optimized out>, dst_attr_list=dst_attr_list@entry=0x7ffdb1b0cbd0, 
    countOnly=countOnly@entry=false) at saiserialize.cpp:429
#8  0x00007fa9c6075f1d in internal_redis_get_process (object_type=object_type@entry=SAI_OBJECT_TYPE_SWITCH, attr_count=attr_count@entry=1, attr_list=attr_list@entry=0x7ffdb1b0cbd0, kco=...) at sai_redis_generic_get.cpp:31
#9  0x00007fa9c6076b39 in internal_redis_generic_get (object_type=object_type@entry=SAI_OBJECT_TYPE_SWITCH, serialized_object_id=..., attr_count=attr_count@entry=1, attr_list=attr_list@entry=0x7ffdb1b0cbd0)
    at sai_redis_generic_get.cpp:219
#10 0x00007fa9c60771f0 in redis_generic_get (object_type=object_type@entry=SAI_OBJECT_TYPE_SWITCH, object_id=9288674231451648, attr_count=attr_count@entry=1, attr_list=attr_list@entry=0x7ffdb1b0cbd0) at sai_redis_generic_get.cpp:263
#11 0x00007fa9c5b96bce in meta_sai_get_oid (object_type=object_type@entry=SAI_OBJECT_TYPE_SWITCH, object_id=<optimized out>, object_id@entry=9288674231451648, attr_count=attr_count@entry=1, attr_list=attr_list@entry=0x7ffdb1b0cbd0, 
    get=0x7fa9c6077190 <redis_generic_get(_sai_object_type_t, unsigned long, unsigned int, _sai_attribute_t*)>) at sai_meta.cpp:5814
#12 0x00007fa9c6068bb2 in redis_get_switch_attribute (switch_id=9288674231451648, attr_count=1, attr_list=0x7ffdb1b0cbd0) at sai_redis_switch.cpp:342
#13 0x0000564d1297c139 in CrmOrch::getResAvailableCounters (this=this@entry=0x564d13561c50) at crmorch.cpp:432
#14 0x0000564d1297c858 in CrmOrch::doTask (this=0x564d13561c50, timer=...) at crmorch.cpp:406
#15 0x0000564d128c37a2 in OrchDaemon::start (this=0x564d13558a40) at orchdaemon.cpp:403
#16 0x0000564d128b32c6 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:315
(gdb) frame 12
#12 0x00007fa9c6068bb2 in redis_get_switch_attribute (switch_id=9288674231451648, attr_count=1, attr_list=0x7ffdb1b0cbd0) at sai_redis_switch.cpp:342
342     sai_redis_switch.cpp: No such file or directory.
(gdb) info args
switch_id = 9288674231451648
attr_count = 1
attr_list = 0x7ffdb1b0cbd0
(gdb) x/16x 0x7ffdb1b0cbd0
0x7ffdb1b0cbd0: 0x00000039      0x0000564d      0x00000100      0x00007ffd
0x7ffdb1b0cbe0: 0x1432cd50      0x0000564d      0xb1b0cc20      0x00000020
0x7ffdb1b0cbf0: 0x13564da0      0x0000564d      0x4cf49100      0x84466228
0x7ffdb1b0cc00: 0x135e6eb8      0x0000564d      0x4cf49100      0x84466228
(gdb) frame 7
#7  0x00007fa9c5ba36c2 in transfer_attributes (object_type=object_type@entry=SAI_OBJECT_TYPE_SWITCH, attr_count=attr_count@entry=1, src_attr_list=<optimized out>, dst_attr_list=dst_attr_list@entry=0x7ffdb1b0cbd0, 
    countOnly=countOnly@entry=false) at saiserialize.cpp:429
429     saiserialize.cpp: No such file or directory.
(gdb) info local
src_attr = @0x564d14072bd0: {id = 49, value = {booldata = 45, chardata = "-\322\005", '\000' <repeats 28 times>, u8 = 45 '-', s8 = 45 '-', u16 = 53805, s16 = -11731, u32 = 381485, s32 = 381485, u64 = 381485, s64 = 381485, 
    ptr = 0x5d22d, mac = "-\322\005\000\000", ip4 = 381485, ip6 = "-\322\005", '\000' <repeats 12 times>, ipaddr = {addr_family = (SAI_IP_ADDR_FAMILY_IPV6 | unknown: 381484), addr = {ip4 = 0, ip6 = '\000' <repeats 15 times>}}, 
    ipprefix = {addr_family = (SAI_IP_ADDR_FAMILY_IPV6 | unknown: 381484), addr = {ip4 = 0, ip6 = '\000' <repeats 15 times>}, mask = {ip4 = 0, ip6 = '\000' <repeats 15 times>}}, oid = 381485, objlist = {count = 381485, list = 0x0}, 
    u8list = {count = 381485, list = 0x0}, s8list = {count = 381485, list = 0x0}, u16list = {count = 381485, list = 0x0}, s16list = {count = 381485, list = 0x0}, u32list = {count = 381485, list = 0x0}, s32list = {count = 381485, 
    list = 0x0}, u32range = {min = 381485, max = 0}, s32range = {min = 381485, max = 0}, vlanlist = {count = 381485, list = 0x0}, qosmap = {count = 381485, list = 0x0}, maplist = {count = 381485, list = 0x0}, aclfield = {
    enable = 45, mask = {u8 = 0 '\000', s8 = 0 '\000', u16 = 0, s16 = 0, u32 = 0, s32 = 0, mac = "\000\000\000\000\000", ip4 = 0, ip6 = '\000' <repeats 15 times>, u8list = {count = 0, list = 0x0}}, data = {booldata = false, 
        u8 = 0 '\000', s8 = 0 '\000', u16 = 0, s16 = 0, u32 = 0, s32 = 0, mac = "\000\000\000\000\000", ip4 = 0, ip6 = '\000' <repeats 15 times>, oid = 0, objlist = {count = 0, list = 0x0}, u8list = {count = 0, list = 0x0}}}, 
    aclaction = {enable = 45, parameter = {booldata = false, u8 = 0 '\000', s8 = 0 '\000', u16 = 0, s16 = 0, u32 = 0, s32 = 0, mac = "\000\000\000\000\000", ip4 = 0, ip6 = '\000' <repeats 15 times>, oid = 0, objlist = {count = 0, 
        list = 0x0}, ipaddr = {addr_family = SAI_IP_ADDR_FAMILY_IPV4, addr = {ip4 = 0, ip6 = '\000' <repeats 15 times>}}}}, aclcapability = {is_action_list_mandatory = 45, action_list = {count = 0, list = 0x0}}, aclresource = {
    count = 381485, list = 0x0}, tlvlist = {count = 381485, list = 0x0}, segmentlist = {count = 381485, list = 0x0}, ipaddrlist = {count = 381485, list = 0x0}, porteyevalues = {count = 381485, list = 0x0}, timespec = {
    tv_sec = 381485, tv_nsec = 0}}}
dst_attr = @0x7ffdb1b0cbd0: {id = 57, value = {booldata = false, chardata = "\000\001\000\000\375\177\000\000P\315\062\024MV\000\000 \314\260\261 \000\000\000\240MV\023MV\000", u8 = 0 '\000', s8 = 0 '\000', u16 = 256, s16 = 256, 
    u32 = 256, s32 = 256, u64 = 140724603453696, s64 = 140724603453696, ptr = 0x7ffd00000100, mac = "\000\001\000\000\375\177", ip4 = 256, ip6 = "\000\001\000\000\375\177\000\000P\315\062\024MV\000", ipaddr = {
    addr_family = (unknown: 256), addr = {ip4 = 32765, ip6 = "\375\177\000\000P\315\062\024MV\000\000 \314\260\261"}}, ipprefix = {addr_family = (unknown: 256), addr = {ip4 = 32765, 
        ip6 = "\375\177\000\000P\315\062\024MV\000\000 \314\260\261"}, mask = {ip4 = 32, ip6 = " \000\000\000\240MV\023MV\000\000\000\221\364L"}}, oid = 140724603453696, objlist = {count = 256, list = 0x564d1432cd50}, u8list = {
    count = 256, list = 0x564d1432cd50 "x\021\354\304\251\177"}, s8list = {count = 256, list = 0x564d1432cd50 "x\021\354\304\251\177"}, u16list = {count = 256, list = 0x564d1432cd50}, s16list = {count = 256, 
    list = 0x564d1432cd50}, u32list = {count = 256, list = 0x564d1432cd50}, s32list = {count = 256, list = 0x564d1432cd50}, u32range = {min = 256, max = 32765}, s32range = {min = 256, max = 32765}, vlanlist = {count = 256, 
    list = 0x564d1432cd50}, qosmap = {count = 256, list = 0x564d1432cd50}, maplist = {count = 256, list = 0x564d1432cd50}, aclfield = {enable = false, mask = {u8 = 80 'P', s8 = 80 'P', u16 = 52560, s16 = -12976, u32 = 338873680, 
        s32 = 338873680, mac = "P\315\062\024MV", ip4 = 338873680, ip6 = "P\315\062\024MV\000\000 \314\260\261 \000\000", u8list = {count = 338873680, list = 0x20b1b0cc20 <error: Cannot access memory at address 0x20b1b0cc20>}}, 
    data = {booldata = 160, u8 = 160 '\240', s8 = -96 '\240', u16 = 19872, s16 = 19872, u32 = 324423072, s32 = 324423072, mac = "\240MV\023MV", ip4 = 324423072, ip6 = "\240MV\023MV\000\000\000\221\364L(bF\204", 
        oid = 94889036893600, objlist = {count = 324423072, list = 0x844662284cf49100}, u8list = {count = 324423072, list = 0x844662284cf49100 <error: Cannot access memory at address 0x844662284cf49100>}}}, aclaction = {
    enable = false, parameter = {booldata = 80, u8 = 80 'P', s8 = 80 'P', u16 = 52560, s16 = -12976, u32 = 338873680, s32 = 338873680, mac = "P\315\062\024MV", ip4 = 338873680, 
        ip6 = "P\315\062\024MV\000\000 \314\260\261 \000\000", oid = 94889051344208, objlist = {count = 338873680, list = 0x20b1b0cc20}, ipaddr = {addr_family = (unknown: 338873680), addr = {ip4 = 22093, 
            ip6 = "MV\000\000 \314\260\261 \000\000\000\240MV\023"}}}}, aclcapability = {is_action_list_mandatory = false, action_list = {count = 338873680, list = 0x20b1b0cc20}}, aclresource = {count = 256, list = 0x564d1432cd50}, 
    tlvlist = {count = 256, list = 0x564d1432cd50}, segmentlist = {count = 256, list = 0x564d1432cd50}, ipaddrlist = {count = 256, list = 0x564d1432cd50}, porteyevalues = {count = 256, list = 0x564d1432cd50}, timespec = {
    tv_sec = 140724603453696, tv_nsec = 338873680}}}
meta = <optimized out>
i = <optimized out>
logger__LINE__ = {m_line = 418, m_fun = 0x7fa9c5bc3c00 <transfer_attributes(_sai_object_type_t, unsigned int, _sai_attribute_t const*, _sai_attribute_t*, bool)::__FUNCTION__> "transfer_attributes"}
__FUNCTION__ = "transfer_attributes" 
(gdb) bt
#0  0x00007f03e9e35fff in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f03e9e3742a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f03ea74e0ad in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f03ea74c066 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f03ea74c0b1 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f03ea74c2c9 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f03eb0c8916 in swss::Logger::wthrow(swss::Logger::Priority, char const*, ...) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#7  0x00007f03eae7f6c2 in transfer_attributes (object_type=object_type@entry=SAI_OBJECT_TYPE_SWITCH, attr_count=attr_count@entry=1, src_attr_list=<optimized out>, dst_attr_list=dst_attr_list@entry=0x7fff33e93d30, 
    countOnly=countOnly@entry=false) at saiserialize.cpp:429
#8  0x00007f03eb351f1d in internal_redis_get_process (object_type=object_type@entry=SAI_OBJECT_TYPE_SWITCH, attr_count=attr_count@entry=1, attr_list=attr_list@entry=0x7fff33e93d30, kco=...) at sai_redis_generic_get.cpp:31
#9  0x00007f03eb352b39 in internal_redis_generic_get (object_type=object_type@entry=SAI_OBJECT_TYPE_SWITCH, serialized_object_id=..., attr_count=attr_count@entry=1, attr_list=attr_list@entry=0x7fff33e93d30)
    at sai_redis_generic_get.cpp:219
#10 0x00007f03eb3531f0 in redis_generic_get (object_type=object_type@entry=SAI_OBJECT_TYPE_SWITCH, object_id=9288674231451648, attr_count=attr_count@entry=1, attr_list=attr_list@entry=0x7fff33e93d30) at sai_redis_generic_get.cpp:263
#11 0x00007f03eae72bce in meta_sai_get_oid (object_type=object_type@entry=SAI_OBJECT_TYPE_SWITCH, object_id=<optimized out>, object_id@entry=9288674231451648, attr_count=attr_count@entry=1, attr_list=attr_list@entry=0x7fff33e93d30, 
    get=0x7f03eb353190 <redis_generic_get(_sai_object_type_t, unsigned long, unsigned int, _sai_attribute_t*)>) at sai_meta.cpp:5814
#12 0x00007f03eb344bb2 in redis_get_switch_attribute (switch_id=9288674231451648, attr_count=1, attr_list=0x7fff33e93d30) at sai_redis_switch.cpp:342
#13 0x000055b89440e3d8 in CrmOrch::getResAvailableCounters (this=this@entry=0x55b8960b6c70) at crmorch.cpp:451
#14 0x000055b89440e858 in CrmOrch::doTask (this=0x55b8960b6c70, timer=...) at crmorch.cpp:406
#15 0x000055b8943557a2 in OrchDaemon::start (this=0x55b8960ada00) at orchdaemon.cpp:403
#16 0x000055b8943452c6 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:315
(gdb) x/16x 0x7fff33e93d30
0x7fff33e93d30: 0x0000003d      0x000055b8      0x00000100      0x00007fff
0x7fff33e93d40: 0x96e7dcf0      0x000055b8      0x33e93d80      0x00000020
0x7fff33e93d50: 0x960b9dc0      0x000055b8      0xa573ba00      0xe58dc8c5
0x7fff33e93d60: 0x9613bf28      0x000055b8      0xa573ba00      0xe58dc8c5
(gdb) frame 7
#7  0x00007f03eae7f6c2 in transfer_attributes (object_type=object_type@entry=SAI_OBJECT_TYPE_SWITCH, attr_count=attr_count@entry=1, src_attr_list=<optimized out>, dst_attr_list=dst_attr_list@entry=0x7fff33e93d30, 
    countOnly=countOnly@entry=false) at saiserialize.cpp:429
429     saiserialize.cpp: No such file or directory.
(gdb) info local
src_attr = @0x55b896e757a0: {id = 49, value = {booldata = 45, chardata = "-\322\005", '\000' <repeats 28 times>, u8 = 45 '-', s8 = 45 '-', u16 = 53805, s16 = -11731, u32 = 381485, s32 = 381485, u64 = 381485, s64 = 381485, 
    ptr = 0x5d22d, mac = "-\322\005\000\000", ip4 = 381485, ip6 = "-\322\005", '\000' <repeats 12 times>, ipaddr = {addr_family = (SAI_IP_ADDR_FAMILY_IPV6 | unknown: 381484), addr = {ip4 = 0, ip6 = '\000' <repeats 15 times>}}, 
    ipprefix = {addr_family = (SAI_IP_ADDR_FAMILY_IPV6 | unknown: 381484), addr = {ip4 = 0, ip6 = '\000' <repeats 15 times>}, mask = {ip4 = 0, ip6 = '\000' <repeats 15 times>}}, oid = 381485, objlist = {count = 381485, list = 0x0}, 
    u8list = {count = 381485, list = 0x0}, s8list = {count = 381485, list = 0x0}, u16list = {count = 381485, list = 0x0}, s16list = {count = 381485, list = 0x0}, u32list = {count = 381485, list = 0x0}, s32list = {count = 381485, 
    list = 0x0}, u32range = {min = 381485, max = 0}, s32range = {min = 381485, max = 0}, vlanlist = {count = 381485, list = 0x0}, qosmap = {count = 381485, list = 0x0}, maplist = {count = 381485, list = 0x0}, aclfield = {
    enable = 45, mask = {u8 = 0 '\000', s8 = 0 '\000', u16 = 0, s16 = 0, u32 = 0, s32 = 0, mac = "\000\000\000\000\000", ip4 = 0, ip6 = '\000' <repeats 15 times>, u8list = {count = 0, list = 0x0}}, data = {booldata = false, 
        u8 = 0 '\000', s8 = 0 '\000', u16 = 0, s16 = 0, u32 = 0, s32 = 0, mac = "\000\000\000\000\000", ip4 = 0, ip6 = '\000' <repeats 15 times>, oid = 0, objlist = {count = 0, list = 0x0}, u8list = {count = 0, list = 0x0}}}, 
    aclaction = {enable = 45, parameter = {booldata = false, u8 = 0 '\000', s8 = 0 '\000', u16 = 0, s16 = 0, u32 = 0, s32 = 0, mac = "\000\000\000\000\000", ip4 = 0, ip6 = '\000' <repeats 15 times>, oid = 0, objlist = {count = 0, 
        list = 0x0}, ipaddr = {addr_family = SAI_IP_ADDR_FAMILY_IPV4, addr = {ip4 = 0, ip6 = '\000' <repeats 15 times>}}}}, aclcapability = {is_action_list_mandatory = 45, action_list = {count = 0, list = 0x0}}, aclresource = {
    count = 381485, list = 0x0}, tlvlist = {count = 381485, list = 0x0}, segmentlist = {count = 381485, list = 0x0}, ipaddrlist = {count = 381485, list = 0x0}, porteyevalues = {count = 381485, list = 0x0}, timespec = {
    tv_sec = 381485, tv_nsec = 0}}}
dst_attr = @0x7fff33e93d30: {id = 61, value = {booldata = false, chardata = "\000\001\000\000\377\177\000\000\360\334\347\226\270U\000\000\200=\351\063 \000\000\000\300\235\v\226\270U\000", u8 = 0 '\000', s8 = 0 '\000', u16 = 256, 
    s16 = 256, u32 = 256, s32 = 256, u64 = 140733193388288, s64 = 140733193388288, ptr = 0x7fff00000100, mac = "\000\001\000\000\377\177", ip4 = 256, ip6 = "\000\001\000\000\377\177\000\000\360\334\347\226\270U\000", ipaddr = {
    addr_family = (unknown: 256), addr = {ip4 = 32767, ip6 = "\377\177\000\000\360\334\347\226\270U\000\000\200=\351\063"}}, ipprefix = {addr_family = (unknown: 256), addr = {ip4 = 32767, 
        ip6 = "\377\177\000\000\360\334\347\226\270U\000\000\200=\351\063"}, mask = {ip4 = 32, ip6 = " \000\000\000\300\235\v\226\270U\000\000\000\272s\245"}}, oid = 140733193388288, objlist = {count = 256, list = 0x55b896e7dcf0}, 
    u8list = {count = 256, list = 0x55b896e7dcf0 ""}, s8list = {count = 256, list = 0x55b896e7dcf0 ""}, u16list = {count = 256, list = 0x55b896e7dcf0}, s16list = {count = 256, list = 0x55b896e7dcf0}, u32list = {count = 256, 
    list = 0x55b896e7dcf0}, s32list = {count = 256, list = 0x55b896e7dcf0}, u32range = {min = 256, max = 32767}, s32range = {min = 256, max = 32767}, vlanlist = {count = 256, list = 0x55b896e7dcf0}, qosmap = {count = 256, 
    list = 0x55b896e7dcf0}, maplist = {count = 256, list = 0x55b896e7dcf0}, aclfield = {enable = false, mask = {u8 = 240 '\360', s8 = -16 '\360', u16 = 56560, s16 = -8976, u32 = 2531777776, s32 = -1763189520, 
        mac = "\360\334\347\226\270U", ip4 = 2531777776, ip6 = "\360\334\347\226\270U\000\000\200=\351\063 \000\000", u8list = {count = 2531777776, list = 0x2033e93d80 <error: Cannot access memory at address 0x2033e93d80>}}, 
    data = {booldata = 192, u8 = 192 '\300', s8 = -64 '\300', u16 = 40384, s16 = -25152, u32 = 2517343680, s32 = -1777623616, mac = "\300\235\v\226\270U", ip4 = 2517343680, 
        ip6 = "\300\235\v\226\270U\000\000\000\272s\245\305\310\215\345", oid = 94251279687104, objlist = {count = 2517343680, list = 0xe58dc8c5a573ba00}, u8list = {count = 2517343680, 
        list = 0xe58dc8c5a573ba00 <error: Cannot access memory at address 0xe58dc8c5a573ba00>}}}, aclaction = {enable = false, parameter = {booldata = 240, u8 = 240 '\360', s8 = -16 '\360', u16 = 56560, s16 = -8976, 
        u32 = 2531777776, s32 = -1763189520, mac = "\360\334\347\226\270U", ip4 = 2531777776, ip6 = "\360\334\347\226\270U\000\000\200=\351\063 \000\000", oid = 94251294121200, objlist = {count = 2531777776, list = 0x2033e93d80}, 
        ipaddr = {addr_family = (unknown: 2531777776), addr = {ip4 = 21944, ip6 = "\270U\000\000\200=\351\063 \000\000\000\300\235\v\226"}}}}, aclcapability = {is_action_list_mandatory = false, action_list = {count = 2531777776, 
        list = 0x2033e93d80}}, aclresource = {count = 256, list = 0x55b896e7dcf0}, tlvlist = {count = 256, list = 0x55b896e7dcf0}, segmentlist = {count = 256, list = 0x55b896e7dcf0}, ipaddrlist = {count = 256, 
    list = 0x55b896e7dcf0}, porteyevalues = {count = 256, list = 0x55b896e7dcf0}, timespec = {tv_sec = 140733193388288, tv_nsec = 2531777776}}}
meta = <optimized out>
i = <optimized out>
logger__LINE__ = {m_line = 418, m_fun = 0x7f03eae9fc00 <transfer_attributes(_sai_object_type_t, unsigned int, _sai_attribute_t const*, _sai_attribute_t*, bool)::__FUNCTION__> "transfer_attributes"}
__FUNCTION__ = "transfer_attributes" 
(gdb) bt
#0  0x00007fefe3833fff in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fefe383542a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fefe414c0ad in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fefe414a066 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fefe414a0b1 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fefe414a2c9 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fefe4ac6916 in swss::Logger::wthrow(swss::Logger::Priority, char const*, ...) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#7  0x00007fefe487d6c2 in transfer_attributes (object_type=object_type@entry=SAI_OBJECT_TYPE_SWITCH, attr_count=attr_count@entry=1, src_attr_list=<optimized out>, dst_attr_list=dst_attr_list@entry=0x7fff38800d80, 
    countOnly=countOnly@entry=false) at saiserialize.cpp:429
#8  0x00007fefe4d4ff1d in internal_redis_get_process (object_type=object_type@entry=SAI_OBJECT_TYPE_SWITCH, attr_count=attr_count@entry=1, attr_list=attr_list@entry=0x7fff38800d80, kco=...) at sai_redis_generic_get.cpp:31
#9  0x00007fefe4d50b39 in internal_redis_generic_get (object_type=object_type@entry=SAI_OBJECT_TYPE_SWITCH, serialized_object_id=..., attr_count=attr_count@entry=1, attr_list=attr_list@entry=0x7fff38800d80)
    at sai_redis_generic_get.cpp:219
#10 0x00007fefe4d511f0 in redis_generic_get (object_type=object_type@entry=SAI_OBJECT_TYPE_SWITCH, object_id=9288674231451648, attr_count=attr_count@entry=1, attr_list=attr_list@entry=0x7fff38800d80) at sai_redis_generic_get.cpp:263
#11 0x00007fefe4870bce in meta_sai_get_oid (object_type=object_type@entry=SAI_OBJECT_TYPE_SWITCH, object_id=<optimized out>, object_id@entry=9288674231451648, attr_count=attr_count@entry=1, attr_list=attr_list@entry=0x7fff38800d80, 
    get=0x7fefe4d51190 <redis_generic_get(_sai_object_type_t, unsigned long, unsigned int, _sai_attribute_t*)>) at sai_meta.cpp:5814
#12 0x00007fefe4d42bb2 in redis_get_switch_attribute (switch_id=9288674231451648, attr_count=1, attr_list=0x7fff38800d80) at sai_redis_switch.cpp:342
#13 0x00005573ed258139 in CrmOrch::getResAvailableCounters (this=this@entry=0x5573ee4499e0) at crmorch.cpp:432
#14 0x00005573ed258858 in CrmOrch::doTask (this=0x5573ee4499e0, timer=...) at crmorch.cpp:406
#15 0x00005573ed19f7a2 in OrchDaemon::start (this=0x5573ee445330) at orchdaemon.cpp:403
#16 0x00005573ed18f2c6 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:315
(gdb) frame 7
#7  0x00007fefe487d6c2 in transfer_attributes (object_type=object_type@entry=SAI_OBJECT_TYPE_SWITCH, attr_count=attr_count@entry=1, src_attr_list=<optimized out>, dst_attr_list=dst_attr_list@entry=0x7fff38800d80, 
    countOnly=countOnly@entry=false) at saiserialize.cpp:429
429     saiserialize.cpp: No such file or directory.
(gdb) info local
src_attr = @0x5573ee665260: {id = 49, value = {booldata = 48, chardata = "0\322\005", '\000' <repeats 28 times>, u8 = 48 '0', s8 = 48 '0', u16 = 53808, s16 = -11728, u32 = 381488, s32 = 381488, u64 = 381488, s64 = 381488, 
    ptr = 0x5d230, mac = "0\322\005\000\000", ip4 = 381488, ip6 = "0\322\005", '\000' <repeats 12 times>, ipaddr = {addr_family = (unknown: 381488), addr = {ip4 = 0, ip6 = '\000' <repeats 15 times>}}, ipprefix = {
    addr_family = (unknown: 381488), addr = {ip4 = 0, ip6 = '\000' <repeats 15 times>}, mask = {ip4 = 0, ip6 = '\000' <repeats 15 times>}}, oid = 381488, objlist = {count = 381488, list = 0x0}, u8list = {count = 381488, 
    list = 0x0}, s8list = {count = 381488, list = 0x0}, u16list = {count = 381488, list = 0x0}, s16list = {count = 381488, list = 0x0}, u32list = {count = 381488, list = 0x0}, s32list = {count = 381488, list = 0x0}, u32range = {
    min = 381488, max = 0}, s32range = {min = 381488, max = 0}, vlanlist = {count = 381488, list = 0x0}, qosmap = {count = 381488, list = 0x0}, maplist = {count = 381488, list = 0x0}, aclfield = {enable = 48, mask = {
        u8 = 0 '\000', s8 = 0 '\000', u16 = 0, s16 = 0, u32 = 0, s32 = 0, mac = "\000\000\000\000\000", ip4 = 0, ip6 = '\000' <repeats 15 times>, u8list = {count = 0, list = 0x0}}, data = {booldata = false, u8 = 0 '\000', 
        s8 = 0 '\000', u16 = 0, s16 = 0, u32 = 0, s32 = 0, mac = "\000\000\000\000\000", ip4 = 0, ip6 = '\000' <repeats 15 times>, oid = 0, objlist = {count = 0, list = 0x0}, u8list = {count = 0, list = 0x0}}}, aclaction = {
    enable = 48, parameter = {booldata = false, u8 = 0 '\000', s8 = 0 '\000', u16 = 0, s16 = 0, u32 = 0, s32 = 0, mac = "\000\000\000\000\000", ip4 = 0, ip6 = '\000' <repeats 15 times>, oid = 0, objlist = {count = 0, list = 0x0}, 
        ipaddr = {addr_family = SAI_IP_ADDR_FAMILY_IPV4, addr = {ip4 = 0, ip6 = '\000' <repeats 15 times>}}}}, aclcapability = {is_action_list_mandatory = 48, action_list = {count = 0, list = 0x0}}, aclresource = {count = 381488, 
    list = 0x0}, tlvlist = {count = 381488, list = 0x0}, segmentlist = {count = 381488, list = 0x0}, ipaddrlist = {count = 381488, list = 0x0}, porteyevalues = {count = 381488, list = 0x0}, timespec = {tv_sec = 381488, 
    tv_nsec = 0}}}
dst_attr = @0x7fff38800d80: {id = 51, value = {booldata = 96, chardata = "`\016\200\070\377\177\000\000\200\260L\356sU\000\000\320\r\200\070P\000\000\000\300\rE\356sU\000", u8 = 96 '`', s8 = 96 '`', u16 = 3680, s16 = 3680, 
    u32 = 947916384, s32 = 947916384, u64 = 140734141304416, s64 = 140734141304416, ptr = 0x7fff38800e60, mac = "`\016\200\070\377\177", ip4 = 947916384, ip6 = "`\016\200\070\377\177\000\000\200\260L\356sU\000", ipaddr = {
    addr_family = (unknown: 947916384), addr = {ip4 = 32767, ip6 = "\377\177\000\000\200\260L\356sU\000\000\320\r\200\070"}}, ipprefix = {addr_family = (unknown: 947916384), addr = {ip4 = 32767, 
        ip6 = "\377\177\000\000\200\260L\356sU\000\000\320\r\200\070"}, mask = {ip4 = 80, ip6 = "P\000\000\000\300\rE\356sU\000\000\000\351%u"}}, oid = 140734141304416, objlist = {count = 947916384, list = 0x5573ee4cb080}, 
    u8list = {count = 947916384, list = 0x5573ee4cb080 "\310\221M\355sU"}, s8list = {count = 947916384, list = 0x5573ee4cb080 "\310\221M\355sU"}, u16list = {count = 947916384, list = 0x5573ee4cb080}, s16list = {count = 947916384, 
    list = 0x5573ee4cb080}, u32list = {count = 947916384, list = 0x5573ee4cb080}, s32list = {count = 947916384, list = 0x5573ee4cb080}, u32range = {min = 947916384, max = 32767}, s32range = {min = 947916384, max = 32767}, 
    vlanlist = {count = 947916384, list = 0x5573ee4cb080}, qosmap = {count = 947916384, list = 0x5573ee4cb080}, maplist = {count = 947916384, list = 0x5573ee4cb080}, aclfield = {enable = 96, mask = {u8 = 128 '\200', 
        s8 = -128 '\200', u16 = 45184, s16 = -20352, u32 = 3998003328, s32 = -296963968, mac = "\200\260L\356sU", ip4 = 3998003328, ip6 = "\200\260L\356sU\000\000\320\r\200\070P\000\000", u8list = {count = 3998003328, 
        list = 0x5038800dd0 <error: Cannot access memory at address 0x5038800dd0>}}, data = {booldata = 192, u8 = 192 '\300', s8 = -64 '\300', u16 = 3520, s16 = 3520, u32 = 3997502912, s32 = -297464384, mac = "\300\rE\356sU", 
        ip4 = 3997502912, ip6 = "\300\rE\356sU\000\000\000\351%uZ\304,\364", oid = 93956407102912, objlist = {count = 3997502912, list = 0xf42cc45a7525e900}, u8list = {count = 3997502912, 
        list = 0xf42cc45a7525e900 <error: Cannot access memory at address 0xf42cc45a7525e900>}}}, aclaction = {enable = 96, parameter = {booldata = 128, u8 = 128 '\200', s8 = -128 '\200', u16 = 45184, s16 = -20352, 
        u32 = 3998003328, s32 = -296963968, mac = "\200\260L\356sU", ip4 = 3998003328, ip6 = "\200\260L\356sU\000\000\320\r\200\070P\000\000", oid = 93956407603328, objlist = {count = 3998003328, list = 0x5038800dd0}, ipaddr = {
        addr_family = (unknown: 3998003328), addr = {ip4 = 21875, ip6 = "sU\000\000\320\r\200\070P\000\000\000\300\rE\356"}}}}, aclcapability = {is_action_list_mandatory = 96, action_list = {count = 3998003328, 
        list = 0x5038800dd0}}, aclresource = {count = 947916384, list = 0x5573ee4cb080}, tlvlist = {count = 947916384, list = 0x5573ee4cb080}, segmentlist = {count = 947916384, list = 0x5573ee4cb080}, ipaddrlist = {
    count = 947916384, list = 0x5573ee4cb080}, porteyevalues = {count = 947916384, list = 0x5573ee4cb080}, timespec = {tv_sec = 140734141304416, tv_nsec = 3998003328}}}
meta = <optimized out>
i = <optimized out>
logger__LINE__ = {m_line = 418, m_fun = 0x7fefe489dc00 <transfer_attributes(_sai_object_type_t, unsigned int, _sai_attribute_t const*, _sai_attribute_t*, bool)::__FUNCTION__> "transfer_attributes"}
__FUNCTION__ = "transfer_attributes" 

Describe the results you expected:

No errors in log and no crash in orchagent.

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**

```

SONiC Software Version: SONiC.HEAD.135-5e6f8adb
Distribution: Debian 9.11
Kernel: 4.9.0-9-2-amd64
Build commit: 5e6f8ad
Build date: Wed Nov 27 08:28:41 UTC 2019
Built by: johnar@jenkins-worker-4

Platform: x86_64-mlnx_msn2410-r0
HwSKU: ACS-MSN2410
ASIC: mellanox
Serial Number: MT1848K10623
Uptime: 17:50:24 up 6 min, 0 users, load average: 4.90, 4.84, 2.44

Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-syncd-mlnx HEAD.135-5e6f8adb d366b0df8d28 373MB
docker-syncd-mlnx latest d366b0df8d28 373MB
docker-fpm-frr HEAD.135-5e6f8adb b33bc6d11daa 321MB
docker-fpm-frr latest b33bc6d11daa 321MB
docker-sflow HEAD.135-5e6f8adb 1b7f547ef006 305MB
docker-sflow latest 1b7f547ef006 305MB
docker-lldp-sv2 HEAD.135-5e6f8adb 3336bf112187 299MB
docker-lldp-sv2 latest 3336bf112187 299MB
docker-dhcp-relay HEAD.135-5e6f8adb 509dccee71ec 289MB
docker-dhcp-relay latest 509dccee71ec 289MB
docker-database HEAD.135-5e6f8adb 8fa2a47bab7a 281MB
docker-database latest 8fa2a47bab7a 281MB
docker-snmp-sv2 HEAD.135-5e6f8adb 86b9afca830b 335MB
docker-snmp-sv2 latest 86b9afca830b 335MB
docker-orchagent HEAD.135-5e6f8adb 6d6a3c9344a0 322MB
docker-orchagent latest 6d6a3c9344a0 322MB
docker-teamd HEAD.135-5e6f8adb 1ec23d161745 304MB
docker-teamd latest 1ec23d161745 304MB
docker-sonic-telemetry HEAD.135-5e6f8adb c1d93fb4f5c6 304MB
docker-sonic-telemetry latest c1d93fb4f5c6 304MB
docker-router-advertiser HEAD.135-5e6f8adb 1cbfbdce9acd 281MB
docker-router-advertiser latest 1cbfbdce9acd 281MB
docker-platform-monitor HEAD.135-5e6f8adb 3fb6f774b03c 565MB
docker-platform-monitor latest 3fb6f774b03c 565

```

**Attach debug file `sudo generate_dump`:**


sairedis.rec.3.gz
sairedis.rec.2.gz
logs.tar.gz
syslog.2.gz

@lguohan
Collaborator

lguohan commented Dec 2, 2019

SAI header has been updated to 1.5.1. is that the reason?

@stephenxs
Collaborator Author

stephenxs commented Dec 2, 2019

SAI header has been updated to 1.5.1. is that the reason?

I checked the log and found that the first occurrence of this error was at the beginning of November. When was the SAI header update merged?

@kcudnik
Contributor

kcudnik commented Dec 26, 2019

From sairedis.rec.3:
2019-11-14.12:05:11.162217|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1233539504
2019-11-14.12:06:11.165038|G|SAI_STATUS_FAILURE
2019-11-14.12:06:11.165926|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV6_ROUTE_ENTRY=1233539504
2019-11-14.12:07:11.226903|G|SAI_STATUS_FAILURE
2019-11-14.12:07:11.227339|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_NEXTHOP_ENTRY=1233539504
2019-11-14.12:08:11.288964|G|SAI_STATUS_FAILURE
2019-11-14.12:08:11.289995|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV6_NEXTHOP_ENTRY=1233539504
2019-11-14.12:09:11.296534|G|SAI_STATUS_FAILURE
2019-11-14.12:09:11.296936|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_NEIGHBOR_ENTRY=1233539504
2019-11-14.12:10:11.358113|G|SAI_STATUS_FAILURE
2019-11-14.12:10:11.358908|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV6_NEIGHBOR_ENTRY=1233539504
2019-11-14.12:11:11.416487|G|SAI_STATUS_FAILURE
2019-11-14.12:11:11.417122|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_NEXT_HOP_GROUP_MEMBER_ENTRY=1233539504
2019-11-14.12:12:11.478108|G|SAI_STATUS_FAILURE
2019-11-14.12:12:11.478817|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_NEXT_HOP_GROUP_ENTRY=1233539504
2019-11-14.12:13:11.540204|G|SAI_STATUS_FAILURE
2019-11-14.12:13:11.545293|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_ACL_TABLE={"count":256,"list":[{"avail_num":"0","bind_point":"SAI_ACL_BIND_POINT_TYPE_PORT","stage":"SAI_ACL_STAGE_INGRESS"},...]}
2019-11-14.12:14:11.605157|G|SAI_STATUS_FAILURE
2019-11-14.12:14:11.608540|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_ACL_TABLE_GROUP={"count":256,"li...
2019-11-14.12:16:25.465786|#|recording on: /var/log/swss/sairedis.rec
2019-11-14.12:16:25.465931|#|logrotate on: /var/log/swss/sairedis.rec
2019-11-14.12:16:25.466054|a|INIT_VIEW
2019-11-14.12:17:00.463547|A|SAI_STATUS_SUCCESS

Since 2019-11-14.12:05:11.162217 every get started to time out; we can see that in the syslog, and also that the FAILURE comes exactly 60 seconds after the request, so syncd was not responding to those queries.

transfer_attributes is only called when the response is SUCCESS or BUFFER_OVERFLOW; it does not happen on FAILURE.
Now look at the last line: 2019-11-14.12:14:11.608540

and syslog:
Nov 14 12:15:04.356776 mtbc-r730-05-vm01 ERR swss#orchagent: :- transfer_attributes: src vs dst attr id don't match GET mismatch
Nov 14 12:15:04.358064 mtbc-r730-05-vm01 INFO swss#supervisord: orchagent terminate called after throwing an instance of 'std::runtime_error'
Nov 14 12:15:04.358209 mtbc-r730-05-vm01 INFO swss#supervisord: orchagent what(): :- transfer_attributes: src vs dst attr id don't match GET mismatch

We got the failure at Nov 14 12:15:04.356776, which is less than 60 seconds after 12:14:11.608540, so it seems some response with SUCCESS arrived.
Unfortunately it is not logged; look here:
https://github.com/Azure/sonic-sairedis/blob/master/lib/src/sai_redis_generic_get.cpp#L223
and the actual logging happens AFTER transfer_attributes, on this line:
https://github.com/Azure/sonic-sairedis/blob/master/lib/src/sai_redis_generic_get.cpp#L235

So this is one thing to fix. If we look at the syslog, we see a lot of "threadFunction: new span =" entries, from which we can see how many seconds each SAI API call takes; some of them exceed 30 seconds: "syncd: :- timerWatchdogCallback: main loop execution exceeded 30634836 ms" (this is actually usec, not ms).

My guess here is that some previous response arrived while OA was already listening for one of the next ones, since the select timed out after 60 seconds. A timeout doesn't mean the response will never arrive: it can still arrive, because syncd doesn't know that OA timed out, so it sends the response it was already processing.

@stephenxs
Collaborator Author

Got it.
I agree with your guess. It seems that:

1. OA sent 10 messages,
2. messages 0~8 timed out, and
3. while it was waiting for the reply to the 9th message, the reply to the 1st one arrived, causing the mismatch.

One question: does sairedis handle OA's requests in one thread or in multiple threads? If it is handled in one thread, would it be possible to log all the requests along with the time spent handling each one? By doing so we could probably catch the chief culprit.

@kcudnik
Contributor

kcudnik commented Dec 27, 2019

Every GET message is synchronous and is processed in a single thread; each API call is under a mutex, so only one access happens at a given time. Messages don't have any IDs, so if any message in the communication times out and its reply arrives later, every subsequent message's ids will be mismatched and an exception will be thrown.

@stephenxs
Collaborator Author

Is it possible to turn on logging of all the messages?
Maybe sairedis.rec can help, but it seems it can also miss messages sometimes. Is there any debugging option that can be turned on from the CLI?

@kcudnik
Contributor

kcudnik commented Dec 27, 2019

You can enable INFO or DEBUG logging for sairedis/orchagent, but it will NOT log a message that arrived late; take a look here: https://github.com/Azure/sonic-sairedis/blob/master/lib/src/sai_redis_generic_get.cpp#L216
Values are not logged there, only the operand and key; values are logged here https://github.com/Azure/sonic-sairedis/blob/master/lib/src/sai_redis_generic_get.cpp#L232, which, as I mentioned before, happens after translation. The current code doesn't have that logging capability.

This was done on purpose, to log correct data after deserialization: if a message arrives late, you may try to deserialize an int while the message is actually a list of OIDs, depending on what was queried. Moving that log line earlier would potentially log invalid data (late-arriving messages), but it could be done as a debug message.
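
A hedged sketch of that idea: record the still-serialized reply at debug level before any deserialization, so a late reply is at least visible in the logs even when transfer_attributes throws afterwards. The RawGetResponse type and logRawGetResponse helper below are hypothetical, not the actual sairedis API:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical view of a reply pulled off the redis notification channel:
// a status string plus the serialized "attr=value" fields, still untouched.
struct RawGetResponse
{
    std::string status;                 // e.g. "SAI_STATUS_SUCCESS"
    std::vector<std::string> fields;    // e.g. {"SAI_SWITCH_ATTR_..._NUM=13"}
};

// Called before transfer_attributes: logs the reply exactly as received, so
// even a stale reply to an earlier GET shows up in the debug output.
void logRawGetResponse(const RawGetResponse &resp)
{
    std::fprintf(stderr, "DEBUG raw GET response: status=%s", resp.status.c_str());
    for (const std::string &f : resp.fields)
        std::fprintf(stderr, " %s", f.c_str());
    std::fprintf(stderr, "\n");
}
```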

@Junchao-Mellanox
Collaborator

I reproduced this issue recently. In my test, no SAI API call takes more than 30 seconds (I don't see the log "main loop execution exceeded"). Instead, my test contains a lot of short SAI API calls, and those calls are async operations, such as remove/create route entry. My test flow is like below:

  1. Orchagent sent many "create route entry" requests to the redis database.
  2. Syncd got those requests from redis and handled them one by one.
  3. Orchagent continued doing tasks and sent a request, let's say request1. Request1 requires a response from syncd.
  4. Syncd was busy creating route entries and request1 timed out.
  5. Orchagent ignored the timeout and sent request2. Request2 also requires a response from syncd.
  6. Now syncd finished creating route entries, handled request1 and sent response1.
  7. Orchagent got response1 while it expected response2, and it crashed.

IMO, there are two things to fix:

  1. Improve the performance of creating/removing route entries.
  2. Improve orchagent to handle such a situation better. If orchagent is designed to handle async and sync operations at the same time, we may need to consider the following: (a) for a critical request that times out, orchagent should crash right after the timeout; (b) for a non-critical request that times out, orchagent should ignore the outdated response when it comes late; (c) orchagent should be able to tell an outdated response apart from an invalid response.

@kcudnik
Contributor

kcudnik commented Apr 26, 2020

To handle a synchronous request in such a manner that a late response1 arriving while OA already expects response2 can be detected, we need to add an ID to each request/response pair and accept only the expected one. Currently no ID is required, since all messages are processed in a single thread, so the order of processing is guaranteed. Processing many route entries on the OA side is being addressed by Qi, who is adding bulk request/create for routes and next hop groups.
Also, this is interesting: we have tests (on virtual switch) that populate the ASIC with a full configuration and many routes (about 10k), and it never took over 30 seconds to do so. Maybe an actual ASIC takes a lot longer.
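
A rough sketch of the request/response pairing described above, assuming each GET carries a monotonically increasing id; the GetChannel/GetRequest/GetResponse names are hypothetical, and nothing like this exists in sairedis today:

```cpp
#include <cstdint>
#include <optional>
#include <string>

struct GetRequest  { uint64_t id; std::string serializedAttr; };
struct GetResponse { uint64_t id; std::string serializedAttr; int status; };

class GetChannel
{
public:
    // Each outgoing GET gets a fresh id.
    GetRequest makeRequest(std::string attr)
    {
        return GetRequest{ ++m_nextId, std::move(attr) };
    }

    // Accept a reply only if its id matches the request we are waiting for;
    // a stale reply to a GET that already timed out is simply discarded
    // instead of being deserialized into the wrong attribute.
    std::optional<std::string> accept(const GetRequest &req, const GetResponse &resp)
    {
        if (resp.id != req.id)
            return std::nullopt;   // late reply to an earlier request: ignore
        return resp.serializedAttr;
    }

private:
    uint64_t m_nextId = 0;
};
```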

@Junchao-Mellanox
Collaborator

Thanks for your quick reply. I agree that the ASIC might need to improve its performance for certain operations, and adding bulk request/create will definitely help the route entry issue. But consider the following case:

  1. Orchagent sends a big number of async requests to syncd
  2. Orchagent sends a sync request right after that

Adding entries to redis is relatively fast, but processing those entries in syncd is probably not that fast. In this situation, step 2 is likely to time out. And syncd doesn't know that orchagent timed out, so it will probably still send the response and crash orchagent. So I am not sure the current design can cover all cases, now and in the future.

@rlhui
Contributor

rlhui commented May 27, 2020

@volodymyrsamotiy will sync with @Junchao-Mellanox . thx.

@lguohan
Collaborator

lguohan commented May 27, 2020

@kcudnik , if it is a timeout, then why won't we get a get timeout message?

@stephenxs
Collaborator Author

Now that we know almost all reproductions of the "transfer_attributes: src vs dst attr id don't match" error are caused by a timeout, can we simply treat the "src vs dst attr id don't match" case as a timeout and return something like a timeout status?
Compared to the approach of adding a sequence number, this solution would be much easier and require less effort.
The downside is that it could hide a real issue of a genuinely mismatched attribute being returned; then again, that is very unlikely to be hit.

@kcudnik
Contributor

kcudnik commented Jun 16, 2020

If you still see this issue from time to time in regression, please attach the sairedis.rec.* and syslog.* logs from a recent failure, so I can investigate and compare.

@liat-grozovik
Collaborator

We have monitored this issue for quite some time and found that it is not reproducible any more.
The issue can be considered closed.

@Junchao-Mellanox
Collaborator

Junchao-Mellanox commented Aug 25, 2020

I reproduced this issue again. I saved the dump file with the command

show techsupport --since "1 days"

But it is too big to upload (40 MB), so I only uploaded part of it. If you need the whole dump file for debugging, I can send it via email.

FW version: 13.2008.1310
SDK version: 4.4.1306

SONiC version:

SONiC Software Version: SONiC.201911.170-f6a8678d
Distribution: Debian 9.13
Kernel: 4.9.0-11-2-amd64
Build commit: f6a8678d
Build date: Sun Aug 16 03:36:48 UTC 2020
Built by: johnar@jenkins-worker-8

Platform: x86_64-mlnx_msn2700-r0
HwSKU: Mellanox-SN2700-D48C8
ASIC: mellanox
Serial Number: MT1822K07823
Uptime: 07:38:19 up 18 min,  1 user,  load average: 0.63, 1.64, 2.11

Docker images:
REPOSITORY                    TAG                                     IMAGE ID            SIZE
docker-wjh                    201911.master.0-dirty-20200824.164253   153a1db9aea2        327MB
docker-wjh                    latest                                  153a1db9aea2        327MB
docker-syncd-mlnx             201911.170-f6a8678d                     6e1f49b38ff8        391MB
docker-syncd-mlnx             latest                                  6e1f49b38ff8        391MB
docker-sonic-telemetry        201911.170-f6a8678d                     526baec08532        352MB
docker-sonic-telemetry        latest                                  526baec08532        352MB
docker-router-advertiser      201911.170-f6a8678d                     0f6b50f47a47        289MB
docker-router-advertiser      latest                                  0f6b50f47a47        289MB
docker-sonic-mgmt-framework   201911.170-f6a8678d                     9969f82ffad1        429MB
docker-sonic-mgmt-framework   latest                                  9969f82ffad1        429MB
docker-platform-monitor       201911.170-f6a8678d                     12d62d9eefff        658MB
docker-platform-monitor       latest                                  12d62d9eefff        658MB
docker-fpm-frr                201911.170-f6a8678d                     200befdf8b6e        333MB
docker-fpm-frr                latest                                  200befdf8b6e        333MB
docker-sflow                  201911.170-f6a8678d                     f3317a4a5cf5        313MB
docker-sflow                  latest                                  f3317a4a5cf5        313MB
docker-lldp-sv2               201911.170-f6a8678d                     46510db3ecba        310MB
docker-lldp-sv2               latest                                  46510db3ecba        310MB
docker-dhcp-relay             201911.170-f6a8678d                     0394eb9ad13b        299MB
docker-dhcp-relay             latest                                  0394eb9ad13b        299MB
docker-database               201911.170-f6a8678d                     a79c4d4fc5b7        289MB
docker-database               latest                                  a79c4d4fc5b7        289MB
docker-teamd                  201911.170-f6a8678d                     993f864427c5        313MB
docker-teamd                  latest                                  993f864427c5        313MB
docker-snmp-sv2               201911.170-f6a8678d                     3754f52c25c4        347MB
docker-snmp-sv2               latest                                  3754f52c25c4        347MB
docker-orchagent              201911.170-f6a8678d                     c31765fdce31        331MB
docker-orchagent              latest                                  c31765fdce31        331MB
docker-nat                    201911.170-f6a8678d                     7fef4c03b9ca        315MB
docker-nat                    latest                                  7fef4c03b9ca        315MB

log.zip

@keboliu keboliu reopened this Aug 25, 2020
@zhenggen-xu
Collaborator

zhenggen-xu commented Aug 31, 2020

I hit the same issue on 201811 when I used gdb to attach to syncd for debugging for a few minutes and exited gdb later.

Aug 31 23:14:02.648696 lnos-x1-a-csw03 ERR swss#orchagent: :- internal_redis_generic_get: generic get failed due to SELECT operation result: TIMEOUT
Aug 31 23:14:02.649089 lnos-x1-a-csw03 ERR swss#orchagent: :- internal_redis_generic_get: generic get failed to get response
Aug 31 23:14:02.649089 lnos-x1-a-csw03 ERR swss#orchagent: :- meta_sai_get_oid: get status: SAI_STATUS_FAILURE
Aug 31 23:14:02.649089 lnos-x1-a-csw03 ERR swss#orchagent: :- getResAvailableCounters: Failed to get switch attribute 49 , rv:-1
Aug 31 23:18:42.821272 lnos-x1-a-csw03 WARNING kernel: [59400.006777] linux-bcm-knet (4715): bkn_get_next_dma_event dev 0 evt_idx 0
Aug 31 23:18:42.821315 lnos-x1-a-csw03 WARNING kernel: [59400.006782] linux-bcm-knet (4715): wait queue index 0
Aug 31 23:18:42.821840 lnos-x1-a-csw03 ERR swss#orchagent: :- transfer_attributes: src vs dst attr id don't match GET mismatch
Aug 31 23:18:42.823480 lnos-x1-a-csw03 INFO swss#supervisord: orchagent terminate called after throwing an instance of 'std::runtime_error'
Aug 31 23:18:42.823480 lnos-x1-a-csw03 INFO swss#supervisord: orchagent   what():  :- transfer_attributes: src vs dst attr id don't match GET mismatch
Aug 31 23:18:48.934744 lnos-x1-a-csw03 INFO swss#supervisord 2020-08-31 23:18:43,288 INFO exited: orchagent (terminated by SIGABRT (core dumped); not expected)

This might be useful for reproducing the issue easily. I don't expect orchagent to crash in this situation; it should handle it in a more graceful way.

@zhenggen-xu
Collaborator

@kcudnik

@stephenxs
Collaborator Author

@kcudnik probably the fastest way of avoiding this issue is the following (a rough sketch follows the list):

  • in transfer_attributes, throw a specific exception rather than a generic one when src_attr.id != dst_attr.id
  • in the main loop of internal_redis_generic_get,
    • catch that exception and handle it by skipping the response that caused it;
    • apply an additional timeout for the case where a response is skipped
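
A minimal sketch of that proposal, assuming a dedicated AttrIdMismatch exception and a receive-and-transfer callback; both names are hypothetical and this is not the actual sairedis code:

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <string>

// Dedicated exception so the GET loop can tell a stale reply apart from a
// genuinely broken one.
struct AttrIdMismatch : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

// 'receiveAndTransfer' stands in for one receive + transfer_attributes pass;
// it is assumed to throw AttrIdMismatch when the reply belongs to an earlier,
// already timed-out GET.
int getWithStaleRepliesSkipped(
    const std::function<bool(std::string &)> &receiveAndTransfer,
    std::chrono::steady_clock::time_point deadline,
    std::string &outPayload)
{
    while (std::chrono::steady_clock::now() < deadline)
    {
        try
        {
            if (receiveAndTransfer(outPayload))
                return 0;          // matching reply received
        }
        catch (const AttrIdMismatch &)
        {
            // Stale reply to a previous request: drop it and keep waiting
            // within the same overall deadline instead of aborting.
        }
    }
    return -1;                     // timed out, caller reports failure
}
```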

@kcudnik
Contributor

kcudnik commented Sep 1, 2020

No, that's the wrong approach. If the attribute id is wrong, then something is wrong with the reply; that's why the exception is thrown. There is no way to just skip it and continue, since something went wrong and we don't know why.
We need to get to the root cause of why this is happening.
@zhenggen-xu do you have any logs or recordings from that crash? syslog/sairedis rec

@zhenggen-xu
Collaborator

zhenggen-xu commented Sep 1, 2020

@kcudnik Another case with syslog

Sep  1 22:54:48.996414 lnos-x1-a-csw03 ERR swss#orchagent: :- internal_redis_generic_get: generic get failed due to SELECT operation result: TIMEOUT
Sep  1 22:54:48.996899 lnos-x1-a-csw03 ERR swss#orchagent: :- internal_redis_generic_get: generic get failed to get response
Sep  1 22:54:48.996899 lnos-x1-a-csw03 ERR swss#orchagent: :- meta_sai_get_oid: get status: SAI_STATUS_FAILURE
Sep  1 22:54:48.996899 lnos-x1-a-csw03 ERR swss#orchagent: :- getResAvailableCounters: Failed to get switch attribute 51 , rv:-1
Sep  1 22:58:49.200110 lnos-x1-a-csw03 WARNING kernel: [70730.925550] linux-bcm-knet (4582): bkn_get_next_dma_event dev 0 evt_idx 0
Sep  1 22:58:49.200146 lnos-x1-a-csw03 WARNING kernel: [70730.925554] linux-bcm-knet (4582): wait queue index 0
Sep  1 22:58:49.205306 lnos-x1-a-csw03 ERR swss#orchagent: :- transfer_attributes: src vs dst attr id don't match GET mismatch
Sep  1 22:58:49.206836 lnos-x1-a-csw03 INFO swss#supervisord: orchagent terminate called after throwing an instance of 'std::runtime_error'
Sep  1 22:58:49.210643 lnos-x1-a-csw03 INFO swss#supervisord: orchagent   what():  :- transfer_attributes: src vs dst attr id don't match GET mismatch
Sep  1 22:58:50.890427 lnos-x1-a-csw03 INFO swss#supervisord 2020-09-01 22:58:49,503 INFO exited: orchagent (terminated by SIGABRT (core dumped); not expected)
Sep  1 22:59:31.603422 lnos-x1-a-csw03 ERR monit[560]: 'orchagent' process is not running

The sairedis recording was shared separately. Also, if you want to debug this live: use gdb to attach to syncd, keep it idle for a few minutes, and then exit; that should reproduce the issue.

@kcudnik
Contributor

kcudnik commented Sep 2, 2020

@zhenggen-xu I analysed this case (thanks for the sairedis recording).

internal_redis_generic_get: generic get failed due to SELECT operation result: TIMEOUT
This is a timeout, that's interesting:
2020-09-01.22:36:48.722458|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=3551813968
2020-09-01.22:42:48.806119|G|SAI_STATUS_FAILURE
This failed with a timeout after 6 minutes.

22:58:49.210643 lnos-x1-a-csw03 INFO swss#supervisord: orchagent what(): :- transfer_attributes: src vs dst attr id don't match GET mismatch

But this is at 22:58:49, 10 minutes later, so I think I know what happened: the select operation timed out while the GET was still in progress in syncd; then orchagent sent another get with a new attribute and expected an answer; at that point syncd finished processing the previous request and sent the response for the previous query, and that's why the attribute ids don't match. Those syslogs are from orchagent only; do you also have the syslog from syncd? There should be an information log saying that an API execution took too long.

@kcudnik
Contributor

kcudnik commented Sep 2, 2020

I correlated the attached syslog.2 and sairedis.3, and this is what I found:

...
Nov 14 12:05:06.028330 mtbc-r730-05-vm01 NOTICE syncd#syncd: :- threadFunction: new span = 5754247
Nov 14 12:05:07.028532 mtbc-r730-05-vm01 NOTICE syncd#syncd: :- threadFunction: new span = 110170
Nov 14 12:05:08.028699 mtbc-r730-05-vm01 NOTICE syncd#syncd: :- threadFunction: new span = 1110344
Nov 14 12:05:09.028858 mtbc-r730-05-vm01 NOTICE syncd#syncd: :- threadFunction: new span = 414520
Nov 14 12:05:10.029036 mtbc-r730-05-vm01 NOTICE syncd#syncd: :- threadFunction: new span = 402724

2019-11-14.12:05:11.162217|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1233539504

Nov 14 12:05:11.029231 mtbc-r730-05-vm01 NOTICE syncd#syncd: :- threadFunction: new span = 718414
Nov 14 12:05:12.029436 mtbc-r730-05-vm01 NOTICE syncd#syncd: :- threadFunction: new span = 643710
...
Nov 14 12:06:10.059420 mtbc-r730-05-vm01 NOTICE syncd#syncd: :- threadFunction: new span = 508898
Nov 14 12:06:11.059628 mtbc-r730-05-vm01 NOTICE syncd#syncd: :- threadFunction: new span = 547318

2019-11-14.12:06:11.165038|G|SAI_STATUS_FAILURE

Nov 14 12:06:11.165659 mtbc-r730-05-vm01 ERR swss#orchagent: :- internal_redis_generic_get: generic get failed due to SELECT operation result: TIMEOUT
Nov 14 12:06:11.165878 mtbc-r730-05-vm01 ERR swss#orchagent: :- internal_redis_generic_get: generic get failed to get response
Nov 14 12:06:11.166116 mtbc-r730-05-vm01 ERR swss#orchagent: :- meta_sai_get_oid: get status: SAI_STATUS_FAILURE
Nov 14 12:06:11.166239 mtbc-r730-05-vm01 ERR swss#orchagent: :- getResAvailableCounters: Failed to get switch attribute 49 , rv:-1

Nov 14 12:06:12.059877 mtbc-r730-05-vm01 NOTICE syncd#syncd: :- threadFunction: new span = 1547518
Nov 14 12:06:13.060072 mtbc-r730-05-vm01 NOTICE syncd#syncd: :- threadFunction: new span = 2547745
Nov 14 12:06:14.060333 mtbc-r730-05-vm01 NOTICE syncd#syncd: :- threadFunction: new span = 3547920
Nov 14 12:06:15.060499 mtbc-r730-05-vm01 NOTICE syncd#syncd: :- threadFunction: new span = 4548106
Nov 14 12:06:16.060690 mtbc-r730-05-vm01 NOTICE syncd#syncd: :- threadFunction: new span = 5548339
...

and here:

2019-11-14.12:00:54.703482|g|SAI_OBJECT_TYPE_SCHEDULER_GROUP:oid:0x1700000000043e|SAI_SCHEDULER_GROUP_ATTR_CHILD_LIST=2:oid:0x0,oid:0x0
2019-11-14.12:00:54.704707|G|SAI_STATUS_SUCCESS|SAI_SCHEDULER_GROUP_ATTR_CHILD_LIST=2:oid:0x15000000000429,oid:0x15000000000431
50 K ROUTE_ENTRY create/set
2019-11-14.12:05:11.162217|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1233539504
2019-11-14.12:06:11.165038|G|SAI_STATUS_FAILURE

so in that time window syncd is processing those 50k create/set operations and is very busy; by the time it gets around to taking the GET issued at 2019-11-14.12:05:11 off the redis queue, that GET has already timed out on the orchagent side. You can see a lot of "new span" logs, which count each api execution, and some single api calls take even more than 5 seconds.

so the conclusion here is that the vendor SAI is not fast enough to apply those changes.
Enabling synchronous mode will solve this issue, since OA will wait for the response of each create/remove/set. As long as all api calls from OA are made from a single thread, then when OA issues a GET operation, syncd will not be processing any other apis, since there will be nothing left in the queue.
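
The race can be pictured with a small, purely illustrative sketch (not the sai-redis implementation): in asynchronous mode the client gives up waiting after the timeout, but the response still arrives later and sits in the channel, so the next GET pops a reply that belongs to the previous request. In synchronous mode every create/set/remove already blocks until syncd answers, so by the time a GET is issued the queue is empty and this interleaving cannot happen.

```cpp
#include <deque>
#include <iostream>
#include <optional>
#include <string>

// Hypothetical model of the response channel between syncd and orchagent.
static std::deque<std::string> responseChannel;

// Client side: in this toy model a "timeout" simply means the channel is
// still empty when the client checks it.
static std::optional<std::string> waitForResponse()
{
    if (responseChannel.empty())
        return std::nullopt;          // SELECT timeout -> caller reports SAI_STATUS_FAILURE
    std::string r = responseChannel.front();
    responseChannel.pop_front();
    return r;
}

int main()
{
    // 1. First GET times out: syncd is still busy with the 50k create/set batch.
    auto first = waitForResponse();   // nullopt -> "generic get failed ... TIMEOUT"

    // 2. syncd eventually finishes and publishes the late reply.
    responseChannel.push_back("SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52530");

    // 3. The next GET (for a different attribute) consumes the stale reply;
    //    the attr id comparison then throws and orchagent crashes.
    auto second = waitForResponse();
    std::cout << "second GET received: " << *second << "\n";
    (void)first;
    return 0;
}
```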

@zhenggen-xu
Copy link
Collaborator

In general, we would need a transaction ID for GET operations so we could discard the mismatched responses caused by a timeout, but that probably requires quite some work. For the time being, I feel we should actually increase this 1-minute timeout in sairedis to a bigger value (it was reduced by sonic-net/sonic-sairedis#472) on branches that do not support synchronous calls yet.
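
A minimal sketch of that transaction-ID idea (the names and framing below are hypothetical, not an existing sai-redis API): every GET carries a monotonically increasing sequence number, the response echoes it back, and any response whose sequence number does not match the outstanding request is dropped instead of being fed into transfer_attributes.

```cpp
#include <cstdint>
#include <string>

// Hypothetical request/response framing carrying a transaction id.
struct GetRequest  { uint64_t txnId; std::string attr;    };
struct GetResponse { uint64_t txnId; std::string payload; };

class GetClient
{
    uint64_t m_nextTxn = 0;     // monotonically increasing sequence number
    uint64_t m_pendingTxn = 0;  // id of the GET we are currently waiting for

public:
    GetRequest makeRequest(const std::string& attr)
    {
        m_pendingTxn = ++m_nextTxn;          // remember which reply we expect
        return { m_pendingTxn, attr };
    }

    // Accept only the reply that answers the outstanding request; stale
    // replies (e.g. for a GET that already timed out) are silently dropped.
    bool accept(const GetResponse& resp) const
    {
        return resp.txnId == m_pendingTxn;
    }
};
```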

lguohan pushed a commit to sonic-net/sonic-sairedis that referenced this issue May 5, 2021
What I did:
Increase SAI Sync operation timeout from 1 min to 6 min by reverting #472

Why I did it:
The issue mentioned in sonic-net/sonic-buildimage#3832 is seen when there are link/port-channel flaps towards T2, which results in the following sequence of operations on a T1 topology:

Link Going down
- Next Hop Member removal from Nexthop Group
- Creation of New Nexthop Group
- All the Routes towards T2 are given SET Operation with new Nexthop Group (6000+ routes)

Link Going up
- Creation of New Nexthop Group
- All the Routes towards T2 are given SET Operation with new Nexthop Group (6000+ routes)

The above sequence, repeated across flaps, creates many CREATE/SET SAI operations. If SAI is slow to process them, any intermediate GET operation (the common case being CRM polling at its default 5 minutes) times out when there is no response within 1 minute, which causes the next GET operation to fail with an attribute mismatch because request and response are now out of sync.

To fix this:

- Increased the timeout to 6 min. The 201811 image/branch has the same timeout value.

- It is hard to predict a good timeout value, but based on the experiment below, and since 201811 already uses 6 min, that value was chosen.

How I verified it:

- Ran the script below for 800 iterations; orchagent did not hit the timeout error and there was no crash. Without the fix, OA used to crash consistently, most of the time within fewer than 100 iterations of flaps.
- Based on sairedis.rec, for the first CRM `GET` operation landing between `Create/Set` operations in these 800 iterations, the worst observed GET response time was 4+ minutes.
```
#!/bin/bash

j=0
while [ True ]; do
        echo "iteration $j"
        config interface shutdown PortChannel0002
        sleep 2
        config interface startup PortChannel0002
        sleep 2
        j=$((j+1))
done
```


```
admin@str-xxxx-06:/var/log/swss$ sudo zgrep -A 1 "|g|.*AVAILABLE_IPV4_ROUTE" sairedis.rec*
sairedis.rec:2021-05-04.17:40:13.431643|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec:2021-05-04.17:46:18.132115|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52530
sairedis.rec:--
sairedis.rec:2021-05-04.17:46:21.128261|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec:2021-05-04.17:46:31.258628|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52530
sairedis.rec.1:2021-05-04.17:30:13.900135|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec.1:2021-05-04.17:34:47.448092|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52531
sairedis.rec.1:--
sairedis.rec.1:2021-05-04.17:35:12.827247|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec.1:2021-05-04.17:35:32.939798|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52533
sairedis.rec.2.gz:2021-05-04.17:20:13.103322|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec.2.gz:2021-05-04.17:24:47.796720|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52531
sairedis.rec.2.gz:--
sairedis.rec.2.gz:2021-05-04.17:25:15.073981|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec.2.gz:2021-05-04.17:26:08.525235|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52532
```


Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
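
For reference, the mechanism of the fix is simply a longer wait for the GET reply. A minimal sketch, assuming the wait is governed by a single duration constant (the names below are hypothetical, not the actual sai-redis identifiers):

```cpp
#include <chrono>

// Hypothetical constants illustrating the change described in the commit
// message: the reply wait grows from 1 minute to 6 minutes so that a GET
// queued behind a large batch of create/set operations can still be answered.
constexpr std::chrono::minutes kOldGetResponseTimeout{1};
constexpr std::chrono::minutes kNewGetResponseTimeout{6};
```

The worst-case GET response of 4+ minutes measured in the experiment above fits comfortably under the new 6-minute bound.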
kktheballer pushed a commit to kktheballer/sonic-sairedis that referenced this issue Jul 21, 2021