Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DUT takes more than 7 seconds to finish update ip v6 neighbor #2414

Closed
keboliu opened this issue Jan 4, 2019 · 10 comments
Closed

DUT takes more than 7 seconds to finish update ip v6 neighbor #2414

keboliu opened this issue Jan 4, 2019 · 10 comments
Assignees

Comments

@keboliu
Copy link
Collaborator

keboliu commented Jan 4, 2019

Description

Steps to reproduce the issue:
run sonic-mgmt test "neighbour_mac_noptf", in this test it use command "ip -6 neigh change xxxxxx" to change the ipv6 neighbor mac address and then verify the ASIC DB to see whether the change is success or not.

Describe the results you received:
sometimes the test failed due to the mac address in the ASIC DB table "KEYS ASIC_STATE:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY*" not change yet, ASIC DB only been updated after about 7 seconds after issue the ip neigh change command.

Describe the results you expected:
the ansible test expecting DB will be updated in 2 seconds, 7 seconds is too long.

Additional information you deem important (e.g. issue happens only occasionally):
when the issue happened, there was no time consuming task running on the DUT, only ansible test is running on it.

Output of show version:

```
admin@sonic:/var/log# show version
SONiC Software Version: SONiC.HEAD.6-6f37181
Distribution: Debian 9.6
Kernel: 4.9.0-8-amd64
Build commit: 6f37181
Build date: Thu Dec 27 09:23:55 UTC 2018
Built by: johnar@jenkins-worker-3

Docker images:
REPOSITORY                 TAG                 IMAGE ID            SIZE
docker-orchagent-mlnx      HEAD.6-6f37181      f7b8bc6db7cd        287.6 MB
docker-orchagent-mlnx      latest              f7b8bc6db7cd        287.6 MB
docker-syncd-mlnx          HEAD.6-6f37181      6e354eba2db0        366 MB
docker-syncd-mlnx          latest              6e354eba2db0        366 MB
docker-lldp-sv2            HEAD.6-6f37181      07af73531c86        275.8 MB
docker-lldp-sv2            latest              07af73531c86        275.8 MB
docker-dhcp-relay          HEAD.6-6f37181      484d63492a27        258 MB
docker-dhcp-relay          latest              484d63492a27        258 MB
docker-database            HEAD.6-6f37181      a946b91f1948        256.7 MB
docker-database            latest              a946b91f1948        256.7 MB
docker-teamd               HEAD.6-6f37181      5b7a15682af1        275.8 MB
docker-teamd               latest              5b7a15682af1        275.8 MB
docker-snmp-sv2            HEAD.6-6f37181      166e335a3a8a        296.9 MB
docker-snmp-sv2            latest              166e335a3a8a        296.9 MB
docker-router-advertiser   HEAD.6-6f37181      d12c47bb6e4e        254.3 MB
docker-router-advertiser   latest              d12c47bb6e4e        254.3 MB
docker-platform-monitor    HEAD.6-6f37181      9027de550971        288.3 MB
docker-platform-monitor    latest              9027de550971        288.3 MB
docker-fpm-quagga          HEAD.6-6f37181      67148e4ecb4b        282.7 MB
docker-fpm-quagga          latest              67148e4ecb4b        282.7 MB
```

syslog for this issue, can see that ip neighbor change command was issued at 06:20:05, ASIC DB was updated at 06:20:11 and 06:20:12, in total it took around 7 seconds:

```
Jan  4 06:20:05.618410 arc-mtbc-1001 INFO ansible-<stdin>: Invoked with warn=True executable=None chdir=None _raw_params=ip -6 neigh change fc00::62 lladdr 08:bc:27:af:cc:65 dev Ethernet96 removes=None creates=None _uses_shell=False
Jan  4 06:20:05.659794 arc-mtbc-1001 NOTICE swss#orchagent: :- addNeighbor: Updated neighbor 08:bc:27:af:cc:65 on Ethernet96
Jan  4 06:20:05.899714 arc-mtbc-1001 INFO ansible-<stdin>: Invoked with warn=True executable=None chdir=None _raw_params=ip -6 neigh change fc00::62 lladdr 08:bc:27:af:cc:67 dev Ethernet96 removes=None creates=None _uses_shell=False
Jan  4 06:20:05.920981 arc-mtbc-1001 NOTICE swss#orchagent: :- addNeighbor: Updated neighbor 08:bc:27:af:cc:67 on Ethernet96
Jan  4 06:20:07.987604 arc-mtbc-1001 NOTICE swss#orchagent: :- removeNextHopGroup: Delete next hop group 10.0.0.1,10.0.0.5,10.0.0.9,10.0.0.13,10.0.0.17,10.0.0.21,10.0.0.25,10.0.0.29
Jan  4 06:20:08.327570 arc-mtbc-1001 INFO ansible-<stdin>: Invoked with warn=True executable=None chdir=None _raw_params=pgrep orchagent removes=None creates=None _uses_shell=False
Jan  4 06:20:08.853737 arc-mtbc-1001 INFO ansible-<stdin>: Invoked
Jan  4 06:20:09.320123 arc-mtbc-1001 INFO ansible-<stdin>: Invoked with warn=True executable=None chdir=None _raw_params=docker exec database redis-cli -n 1 KEYS ASIC_STATE:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY* | grep -Fe '"fc00::62"' removes=None creates=None _uses_shell=True
Jan  4 06:20:09.707769 arc-mtbc-1001 INFO ansible-<stdin>: Invoked with warn=True executable=None chdir=None _raw_params=docker exec database redis-cli -n 1 HGETALL 'ASIC_STATE:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fc00::62","rif":"oid:0x60000000009d4","switch_id":"oid:0x21000000000000"}' | grep "_ATTR_DST_MAC_ADDRESS" -A 1 | tail -1 removes=None creates=None _uses_shell=True
Jan  4 06:20:11.979943 arc-mtbc-1001 INFO syncd#supervisord: syncd Jan 04 06:20:11 NOTICE  SAI_UTILS: mlnx_sai_utils.c[2290]- set_dispatch_attrib_handler: Set DST_MAC_ADDRESS, key:neighbor ip fc00::62 rif 16, val:[08:bc:27:af:cc:65]
Jan  4 06:20:12.157586 arc-mtbc-1001 INFO syncd#supervisord: syncd Jan 04 06:20:12 NOTICE  SAI_UTILS: mlnx_sai_utils.c[2290]- set_dispatch_attrib_handler: Set DST_MAC_ADDRESS, key:neighbor ip fc00::62 rif 16, val:[08:bc:27:af:cc:67]
```
@liat-grozovik
Copy link
Collaborator

@prsunny could you please provide some kind of estimation when this issue can be addressed?
the 'neighbour_mac_noptf' is keep failing and we need this issue to be fixed as part of 201811 release.

@prsunny
Copy link
Contributor

prsunny commented Feb 1, 2019

When I tested manually on another platform, I see that the neighbor entry is updated immediately in ASIC DB. Please note the swss and sairedis rec. Can you check the same manually and verify?

ip neigh show
fc00::a dev PortChannel0002 lladdr 52:54:00:28:ff:94 REACHABLE

sudo ip -6 neigh change fc00::a dev PortChannel0002 lladdr 00:54:00:28:ff:94

SWSS Rec

2019-01-31.23:52:35.900837|NEIGH_TABLE:PortChannel0002:fc00::a|SET|neigh:00:54:00:28:ff:94|family:IPv6

SAIREDIS rec
2019-01-31.23:52:35.901075|s|SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fc00::a","rif":"oid:0x6000000000519","switch_id":"oid:0x21000000000000"}|SAI_NEIGHBOR_ENTRY_ATTR_DST_MAC_ADDRESS=00:54:00:28:FF:94

Syslog:

Jan 31 23:52:35.901447 str-a7170-acs-1 NOTICE swss#orchagent: :- addNeighbor: Updated neighbor 00:54:00:28:ff:94 on PortChannel0002

Notes: This is on the latest master, with T0 topology with BGP sessions up.

@prsunny
Copy link
Contributor

prsunny commented Feb 1, 2019

I also tested on the Nov branch, and observed the same result when changing the Neighbor MAC address manually.

@keboliu
Copy link
Collaborator Author

keboliu commented Feb 12, 2019

@prsunny thanks for sharing the test result. do you know the flow for the neighbor entry update? I would like to further investigate on our platform.

@keboliu
Copy link
Collaborator Author

keboliu commented Apr 3, 2019

@prsunny This issue observed from time to time on our testbed.

In one certain test change the neighbor mac two times

Apr 2 19:15:11.117851 arc-mtbc-1001 INFO ansible-<stdin>: Invoked with warn=True executable=None chdir=None _raw_params=ip -6 neigh change fc00::62 lladdr 08:bc:27:af:cc:65 dev Ethernet96 removes=None creates=None _uses_shell=False

Apr 2 19:15:11.408204 arc-mtbc-1001 INFO ansible-<stdin>: Invoked with warn=True executable=None chdir=None _raw_params=ip -6 neigh change fc00::62 lladdr 08:bc:27:af:cc:67 dev Ethernet96 removes=None creates=None _uses_shell=False

but only after 10s syncd start to handle the second time change, seems that the first time change was "lost"

Apr 2 19:15:21.249048 arc-mtbc-1001 INFO syncd#supervisord: syncd Apr 02 19:15:21 NOTICE SAI_UTILS: mlnx_sai_utils.c[2307]- set_dispatch_attrib_handler: Set DST_MAC_ADDRESS, key:neighbor ip fc00::62 rif 16, val:[08:bc:27:af:cc:67]

Is it the flow that neighsyncd get neighbor change via RTLINK from kernel and then trigger syncd to finish all the change?

syslog.zip

@liat-grozovik
Copy link
Collaborator

@prsunny we cont to see this issue on testbed from time to time. these failures i believe are real and even if it succeed there is something wrong in it.

@stephenxs
Copy link
Collaborator

@prsunny
I've built a debug image with some additional logs inserted at the following position:

  1. NeighSync::onMsg which is called in neighsyncd when receiving the rtnl message notifying neighbor update from kernel and push it to the APPL_DB.
  2. NeighOrch::addNeighbor which is called in orchagent when handling the newly updated/added neighbor and inserting them into ASIC_DB.

I've also add a snippet of code which queries APPL_DB to determine whether the updated mac address has been written into APPL_DB.

Having tested this case on the image, I find:

  1. a rtnl message has been received almost immediately once the command is issued, see log below:
    Jun 12 09:42:34.085081 mtbc-sonic-01-2410 INFO ansible-<stdin>: Invoked with warn=True executable=None chdir=None _raw_params=ip -6 neigh change fc00::3e lladdr 08:bc:27:af:cc:65 dev Ethernet60 removes=None creates=None _uses_shell=False
    Jun 12 09:42:34.092500 mtbc-sonic-01-2410 INFO swss#neighsyncd: :- onMsg: ND-TEST RECEIVING NLMSG neighbour update Ethernet60:fc00::3e 08:bc:27:af:cc:65
    Jun 12 09:42:34.349619 mtbc-sonic-01-2410 INFO ansible-<stdin>: Invoked with warn=True executable=None chdir=None _raw_params=ip -6 neigh change fc00::3e lladdr 08:bc:27:af:cc:67 dev Ethernet60 removes=None creates=None _uses_shell=False
    Jun 12 09:42:34.411200 mtbc-sonic-01-2410 INFO swss#neighsyncd: :- onMsg: ND-TEST RECEIVING NLMSG neighbour update Ethernet60:fc00::3e 08:bc:27:af:cc:67
  2. According to the log, there is no log indicating the nd has been handled by orchagent(neighagent) before ASIC_DB is checked.
  3. The updated mac has not written into APPL_DB either when test failed.

According to the above points, I tend to consider that neighbor update notification has been backlogged in somewhere in orchagent.
Essentially orchagent has a single thread architect which means all work is handled sequentially. Under a heavy workload some work, like neighbor update, can be blocked/delayed if there have already been a lot of works in the queue, which might be what we faced in this issue. Even though under a light workload it can take time to propagate the neighbor update from neighsyncd to orchagent(APPL_DB) and then to syncd(ASIC_DB).
However in the script there is not delay between updating the neighbor in CLI and checking ASIC_DB. I think a 5 second delay should be introduced here in order to solve this issue.

stepanblyschak pushed a commit to stepanblyschak/sonic-mgmt that referenced this issue Jun 14, 2019
Remove the workaround in neighbour_mac_noptf, which temporarily adjust the log level to DEBUG for orchagent, neighsync and nbrmgrd ahead of nd test and recover to INFO after that.
This workaround was introduced in order to provide detailed debug information for the issue sonic-net/sonic-buildimage#2414 (DUT takes more than 7 seconds to finish update ip v6 neighbor). However, we found those information can hardly help.

Signed-off-by: Stephen Sun <stephens@mellanox.com>
@stepanblyschak
Copy link
Collaborator

@prsunny @keboliu @stephenxs
my 2 cents into this issue:

So first in the test goes IPv4 neighbor MAC change test which passes.
This means BGPv4 neighbor will go down because neighbor is not reachable with this MAC so orchagent will go though each route updating next hop group.
While orchagent is updating v4 routes IPv6 neighbor MAC change test starts and orchagent simply has no resources to deal with this update in time.
The idea is to change Neighbor MAC test to configure static neighbors and test, so that it will not affect BGP.

@stephenxs
Copy link
Collaborator

@prsunny @keboliu @stephenxs
my 2 cents into this issue:

So first in the test goes IPv4 neighbor MAC change test which passes.
This means BGPv4 neighbor will go down because neighbor is not reachable with this MAC so orchagent will go though each route updating next hop group.
While orchagent is updating v4 routes IPv6 neighbor MAC change test starts and orchagent simply has no resources to deal with this update in time.
The idea is to change Neighbor MAC test to configure static neighbors and test, so that it will not affect BGP.

i think your assumption is reasonable. and to test with a static neighbor is also a good idea.
in addition, as bgp has a hold time of 10 seconds which is long enough for the test, is it possible to recover the mac address of the neighbor at the end of test so that it won't affect the further bgp processing?
also, we can log the bgp neighbor state at the beginning of the test, and also log at the end if failed. by checking the bgp neighbor state before and after the test we can get your assumption proved.

@prsunny
Copy link
Contributor

prsunny commented Aug 6, 2019

Closing this as the fix is made as part of referenced commit #1026

@prsunny prsunny closed this as completed Aug 6, 2019
@zbud-msft zbud-msft mentioned this issue Sep 15, 2022
7 tasks
dgsudharsan added a commit to dgsudharsan/sonic-buildimage that referenced this issue Sep 21, 2022
Update sonic-swss submodule pointer to include the following:
* 8eea92e [202205][counters] Revert PR sonic-net#2432 for the buffer queue/pg counters improvement ([sonic-net#2462](sonic-net/sonic-swss#2462))
* 5d8636a [202205] Enhance orchagent and buffer manager in error handling (sonic-net#2414) ([sonic-net#2449](sonic-net/sonic-swss#2449))
* aa22237 [Everflow/ERSPAN] Set correct destination port and mac address when the nexthop is updated for ERSPAN mirror destination (sonic-net#2392) ([sonic-net#2455](sonic-net/sonic-swss#2455))
* 04ce7be check state_db for po before sending ARP/ND pkts (sonic-net#2444) ([sonic-net#2450](sonic-net/sonic-swss#2450))
* f0138a2 [portmgr] Fixed the orchagent crash due to late arrival of notif (sonic-net#2431) ([sonic-net#2451](sonic-net/sonic-swss#2451))
* 7cfde48 Change the log messages in addKernelNeigh/Route from ERROR to INFO ([sonic-net#2437](sonic-net/sonic-swss#2437))
* 2c5116e [202205][counters] Improve performance by polling only configured ports buffer queue/pg counters ([sonic-net#2432](sonic-net/sonic-swss#2432))

Signed-off-by: dgsudharsan <sudharsand@nvidia.com>
prsunny pushed a commit that referenced this issue Sep 21, 2022
Update sonic-swss submodule pointer to include the following:
* 8eea92e [202205][counters] Revert PR #2432 for the buffer queue/pg counters improvement ([#2462](sonic-net/sonic-swss#2462))
* 5d8636a [202205] Enhance orchagent and buffer manager in error handling (#2414) ([#2449](sonic-net/sonic-swss#2449))
* aa22237 [Everflow/ERSPAN] Set correct destination port and mac address when the nexthop is updated for ERSPAN mirror destination (#2392) ([#2455](sonic-net/sonic-swss#2455))
* 04ce7be check state_db for po before sending ARP/ND pkts (#2444) ([#2450](sonic-net/sonic-swss#2450))
* f0138a2 [portmgr] Fixed the orchagent crash due to late arrival of notif (#2431) ([#2451](sonic-net/sonic-swss#2451))
* 7cfde48 Change the log messages in addKernelNeigh/Route from ERROR to INFO ([#2437](sonic-net/sonic-swss#2437))
* 2c5116e [202205][counters] Improve performance by polling only configured ports buffer queue/pg counters ([#2432](sonic-net/sonic-swss#2432))
theasianpianist added a commit to theasianpianist/sonic-buildimage that referenced this issue Oct 5, 2022
Include:

df92fb7 Improve verbosity level and provide more info in the log (sonic-net#2472)
e81ed20 [intfmgr]: Enable `accept_untracked_na` kernel param (sonic-net#2436)
24d29f1 [orchdaemon]: Fixed sairedis record file rotation (sonic-net#2299)
b8ee07d [build] add missing package libyang-dev in lgtm.yml (sonic-net#2475)
e46dd29 [crm] Fix issue with continues EXCEEDED and CLEAR logs for ACL group/table counters (sonic-net#2463)
b61d24c [doc]: Update README.md (sonic-net#2456)
b9ade5d [orchagent] Fix issue: ip prefix shall be inited even if VRF/VNET is not ready (sonic-net#2461)
f0f1eb4 Revert "[counters] Improve performance by polling only configured ports buffer queue/pg counters (sonic-net#2360)" (sonic-net#2458)
3d757a8 [ci][asan] add DVS tests run with ASAN (sonic-net#2441)
04fbc8e [ci] Only when test stage succeeded or succeededwithissues, PR run Gcov (sonic-net#2460)
7cc035f [orchagent]: Publish identified events via structured-events channel (sonic-net#2446)
efa0f01 [QoS] Enforce drop probability only for colors whose WRED are enabled (sonic-net#2422)
05c5c2f [swss] Replace memset functions (sonic-net#2423)
9ff993d Modified the test file to remove click commands and do the REDIS-DB u… (sonic-net#2264)
9e376af Install libyang in azure pipeline. (sonic-net#2445)
c1eb99a check state_db for po before sending ARP/ND pkts (sonic-net#2444)
43cc486 [portmgr] Fixed the orchagent crash due to late arrival of notif (sonic-net#2431)
b62c716 Enhance orchagent and buffer manager in error handling (sonic-net#2414)
13bda3c [Everflow/ERSPAN] Set correct destination port and mac address when the nexthop is updated for ERSPAN mirror destination (sonic-net#2392)
0ccb315 Revert "[VS Test] Skip failing subport tests (sonic-net#2370)" (sonic-net#2421)
ac8a83f [UT] [Portsyncd] Added Unit Tests for portsyncd (sonic-net#2297)
83a186a Change the log messages in addKernelNeigh/Route from ERROR to INFO (sonic-net#2437)
9c23389 [BFD]Clean up state_db BFD entries on swss restart (sonic-net#2434)
d41aebf EntityBulker SIGSEGV when create_entry attr_count 0 (sonic-net#2224)
f52a7b1 Fix the Fec Mode Setting of gbsyncd (sonic-net#2430)
8cc0a45 [neighsyncd] Enabling ipv4 link local entries for non-dualtor (sonic-net#2427)
5624e87 Revert "[ci][asan] add DVS tests run with ASAN (sonic-net#2363)" (sonic-net#2433)
a26b26a Dynamic port configuration - add port buffer cfg to the port ref counter (sonic-net#2194)
486939a tlm_teamd: Filter portchannel subinterface events from STATE_DB LAG_TABLE (sonic-net#2408)
a4b8992 [counters] Improve performance by polling only configured ports buffer queue/pg counters (sonic-net#2360)
4aaeec9 added support for Xsight platform (sonic-net#2426)
ca9edca [ci][asan] add DVS tests run with ASAN (sonic-net#2363)
dec4570 Handle dual ToR neighbor miss scenario (sonic-net#2151)
9eb4422 Upstream new development on p4orch (sonic-net#2237)
e9be2c0 [lgtm] Fix dependency (sonic-net#2419)
c0168f3 [muxorch] Returning true if nbr in skip_neighbor_ in isNeighborActive() (sonic-net#2415)
cfcf3d8 [macsec]: Set MTU for MACsec (sonic-net#2398)
8346034 Delete Invalid if condition in intfsorch.cpp (sonic-net#2411)

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
theasianpianist added a commit that referenced this issue Oct 7, 2022
Include:

df92fb7 Improve verbosity level and provide more info in the log (#2472)
e81ed20 [intfmgr]: Enable `accept_untracked_na` kernel param (#2436)
24d29f1 [orchdaemon]: Fixed sairedis record file rotation (#2299)
b8ee07d [build] add missing package libyang-dev in lgtm.yml (#2475)
e46dd29 [crm] Fix issue with continues EXCEEDED and CLEAR logs for ACL group/table counters (#2463)
b61d24c [doc]: Update README.md (#2456)
b9ade5d [orchagent] Fix issue: ip prefix shall be inited even if VRF/VNET is not ready (#2461)
f0f1eb4 Revert "[counters] Improve performance by polling only configured ports buffer queue/pg counters (#2360)" (#2458)
3d757a8 [ci][asan] add DVS tests run with ASAN (#2441)
04fbc8e [ci] Only when test stage succeeded or succeededwithissues, PR run Gcov (#2460)
7cc035f [orchagent]: Publish identified events via structured-events channel (#2446)
efa0f01 [QoS] Enforce drop probability only for colors whose WRED are enabled (#2422)
05c5c2f [swss] Replace memset functions (#2423)
9ff993d Modified the test file to remove click commands and do the REDIS-DB u… (#2264)
9e376af Install libyang in azure pipeline. (#2445)
c1eb99a check state_db for po before sending ARP/ND pkts (#2444)
43cc486 [portmgr] Fixed the orchagent crash due to late arrival of notif (#2431)
b62c716 Enhance orchagent and buffer manager in error handling (#2414)
13bda3c [Everflow/ERSPAN] Set correct destination port and mac address when the nexthop is updated for ERSPAN mirror destination (#2392)
0ccb315 Revert "[VS Test] Skip failing subport tests (#2370)" (#2421)
ac8a83f [UT] [Portsyncd] Added Unit Tests for portsyncd (#2297)
83a186a Change the log messages in addKernelNeigh/Route from ERROR to INFO (#2437)
9c23389 [BFD]Clean up state_db BFD entries on swss restart (#2434)
d41aebf EntityBulker SIGSEGV when create_entry attr_count 0 (#2224)
f52a7b1 Fix the Fec Mode Setting of gbsyncd (#2430)
8cc0a45 [neighsyncd] Enabling ipv4 link local entries for non-dualtor (#2427)
5624e87 Revert "[ci][asan] add DVS tests run with ASAN (#2363)" (#2433)
a26b26a Dynamic port configuration - add port buffer cfg to the port ref counter (#2194)
486939a tlm_teamd: Filter portchannel subinterface events from STATE_DB LAG_TABLE (#2408)
a4b8992 [counters] Improve performance by polling only configured ports buffer queue/pg counters (#2360)
4aaeec9 added support for Xsight platform (#2426)
ca9edca [ci][asan] add DVS tests run with ASAN (#2363)
dec4570 Handle dual ToR neighbor miss scenario (#2151)
9eb4422 Upstream new development on p4orch (#2237)
e9be2c0 [lgtm] Fix dependency (#2419)
c0168f3 [muxorch] Returning true if nbr in skip_neighbor_ in isNeighborActive() (#2415)
cfcf3d8 [macsec]: Set MTU for MACsec (#2398)
8346034 Delete Invalid if condition in intfsorch.cpp (#2411)

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
tshalvi pushed a commit to tshalvi/sonic-buildimage that referenced this issue Dec 20, 2022
What I did
Enhance orchagent and buffer manager

Buffer manager: do not insert buffer queue into cache if the profile is illegal, which prevents an empty string from being inserted into APPL_DB during initialization.
orchagent: handle the case that a field referencing other objects is an empty string.
There had been such logic that was broken by a PR last year.
Signed-off-by: Stephen Sun stephens@nvidia.com

Why I did it
Enhance the error handling logic.
In most cases, a user will not encounter such scenarios in a production environment because it's the front-ends' (eg. CLI) responsibility to identify the wrong configuration and prevent them from being inserted to CONFIG_DB.
However, in some cases, like a wrong config_db.json composed and copied to the switch, front-ends can not prevent that.

How I verified it
Manual and mock tests.

Details if related
For the improvement in buffer manager:

previously, the logic was:
declare a reference portQueue to m_portQueueLookup[port][queues] and then assign fvValue(i) to portQueue.running_profile_name
But [] operation on C++ map has a side-effect -- it will insert a new element into the map if there wasn't one. In case the validation check in checkBufferProfileDirection failed and there was not one in the map, the portQueue.running_profile_name will keep empty. This is not what we want.
In case there was an item configured in the map, we should not remove it on failure because we want to prevent the user from being affected by misconfiguration and alert user to correct the error. There is log in checkBufferProfileDirection
Now it is improved in this way:
Avoid using reference and initialize m_portQueueLookup[port][queues] only if there is a valid egress profile configured
StormLiangMS added a commit that referenced this issue Mar 1, 2023
b8173428 - [GCU] Add Sample Unit Test for RDMA Headroom Pool Size Tuning ([device/celestica] blacklist gpio_ich kernel module on haliburton #2692) (5 hours ago) [isabelmsft]
6f84aae7 - Add begin logs to config reload/config minigraph/warm-reboot/fast-reboot (RPC syncd docker image does not start RPC server automatically #2694) (27 hours ago) [Junchao-Mellanox]
e98011f8 - Revert "Secure upgrade ([sonic-cfggen] Multi-key should be in form of (a,b) instead of 'a|b' #2337)" (Add Broadcom XLR/GTS ((BCM9COMX2XMC) support for Tomahawk switch (BCM956960K) #2675) (34 hours ago) [StormLiangMS]
eda4e91b - [show][muxcable] add some new commands health, reset-cause, queue_info support for muxcable (DUT takes more than 7 seconds to finish update ip v6 neighbor #2414) (4 days ago) [vdahiya12]
54e26359 - Replace pickle by json (Add autoneg to 7170-Q59S20 #2636) (4 days ago) [Mai Bui]
StormLiangMS added a commit that referenced this issue Mar 8, 2023
Why I did it
8c7ddf56 - [warm/fast-reboot] Backup logs from tmpfs to disk during fast/warm shutdown ([swss]: update swss docker to stretch #2714) (3 hours ago) [Vaibhav Hemant Dixit]
f2a31b30 - [ci] Fix pipeline issue caused by sonic-slave-* change. ([201803] Modify Debian apt repos to reflect changes made by maintainers #2709) (3 hours ago) [Liu Shilong]
586ecf0e - [dhcp_relay] Fix dhcp_relay restart error while add/del vlan ([thrift] add a patch to revert THRIFT-3650 #2688) (3 hours ago) [Yaqiang Zhu]
07b0ef4c - [portstat CLI] don't print reminder if use json format ([devices] add new accton platform minipack. #2670) (3 hours ago) [wenyiz2021]
48d3d3ef - [show][muxcable] add some new commands health, reset-cause, queue_info support for muxcable (DUT takes more than 7 seconds to finish update ip v6 neighbor #2414) (3 hours ago) [vdahiya12]
How I did it
How to verify it
xumia pushed a commit to xumia/sonic-buildimage-1 that referenced this issue Mar 10, 2023
b8173428 - [GCU] Add Sample Unit Test for RDMA Headroom Pool Size Tuning ([device/celestica] blacklist gpio_ich kernel module on haliburton sonic-net#2692) (5 hours ago) [isabelmsft]
6f84aae7 - Add begin logs to config reload/config minigraph/warm-reboot/fast-reboot (RPC syncd docker image does not start RPC server automatically sonic-net#2694) (27 hours ago) [Junchao-Mellanox]
e98011f8 - Revert "Secure upgrade ([sonic-cfggen] Multi-key should be in form of (a,b) instead of 'a|b' sonic-net#2337)" (Add Broadcom XLR/GTS ((BCM9COMX2XMC) support for Tomahawk switch (BCM956960K) sonic-net#2675) (34 hours ago) [StormLiangMS]
eda4e91b - [show][muxcable] add some new commands health, reset-cause, queue_info support for muxcable (DUT takes more than 7 seconds to finish update ip v6 neighbor sonic-net#2414) (4 days ago) [vdahiya12]
54e26359 - Replace pickle by json (Add autoneg to 7170-Q59S20 sonic-net#2636) (4 days ago) [Mai Bui]
qiluo-msft pushed a commit that referenced this issue Mar 20, 2023
…14048)

For sonic-platform-daemons following commits are added to the submodule

dd8fbae (HEAD -> 202012, origin/202012) [ycabled] add more coverage to ycabled; add minor name change for vendor API CLI return key-values pairs (#338)
846555e [thermalctld] fix some redundant removal of state DB tables (#315)
3d92fb9 Use github code scanning instead of LGTM (#316)

For sonic-utilities the following commits are added in this PR to the submodule
git log --oneline 39cdb49c..202012
ec4c6ea5 (HEAD -> 202012, origin/202012) [show][muxcable] add some new commands health, reset-cause, queue_info support for muxcable (#2414) (#2704)
03ef272e [202012][vlan] Remove add field of vlanid to DHCP_RELAY table while adding vlan (#2681)
e00a81ac [202012][dhcp-relay] Add support for dhcp_relay config cli (#2640)
274184e1 [vlan] Refresh dhcpv6_relay config while adding/deleting a vlan (#2660) (#2668

#### Why I did it
updating the submodule of sonic-platform-daemons, sonic-utilities

#### How I did it

updated the submodule
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants