Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[py-swss/config] config load-minigraph failure leaves database in wrong state #1629

Open
taoyl-ms opened this issue Apr 20, 2018 · 2 comments
Assignees
Labels

Comments

@taoyl-ms
Copy link
Contributor

Description
A not-complete config load-minigraph will leave config DB in a wrong state and later operations might not behave correctly.

Steps to reproduce the issue:

  1. Modify /etc/sonic/minigraph.xml so that it contains a syntax error.
  2. Run config load-minigraph -y, it will fail because of the syntax error.
  3. Fix the minigraph and run config load-minigraph -y again. The command will now hang forever.
@taoyl-ms taoyl-ms self-assigned this Apr 20, 2018
@lguohan
Copy link
Collaborator

lguohan commented Aug 17, 2018

it should timeout, and print error message that db is not initialized.

@zhenggen-xu
Copy link
Collaborator

"config reload" and "config load_minigraph" will flush the configDB first, then try to load things to DB, if failed, the flag was flushed already. However, any subsequent config reload will wait for that flag (CONFIG_DB_INITIALIZED) in configDB when creating ConfigDBConnector instance. We will be stuck at this situation regardless timeout until we manually set the configDB flag or restart database docker.

If this flag "CONFIG_DB_INITIALIZED" was to protect multiple writers going to configDB, then if "config reload" or "config load_minigraph" failed, we should recover the flag automatically. If this flag meant for the initial initialization of database docker, then, "config reload" and "config load_minigraph" should not clear this flag at all.

We should fix this problem one way or another.

@yxieca yxieca assigned lguohan and unassigned taoyl-ms Sep 19, 2019
yxieca added a commit to yxieca/sonic-buildimage that referenced this issue May 28, 2021
sonic-utilities:
* 8b98d45 2021-05-25 | [show] support for show muxcable firmware version of only active banks (sonic-net#1629) (HEAD -> 202012) [vdahiya12]
* afd0975 2021-05-20 | [show] add support for muxcable metrics (sonic-net#1615) [vdahiya12]

sonic-swss
* 7611df5 2021-05-27 | [tunneldecaporch] Set default MTU for the overlay loopback interface (sonic-net#1756) (HEAD -> 202012) [Volodymyr Samotiy]
* 22fbb5c 2021-05-27 | [202012] Resolve neighbor when nexthop does not exist (sonic-net#1759) (github/202012) [Shi Su]
* ec7710c 2021-05-27 | [Bulk mode] Limit the size of bulker (sonic-net#1760) [Shi Su]

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
yxieca added a commit that referenced this issue May 28, 2021
sonic-utilities:
* 8b98d45 2021-05-25 | [show] support for show muxcable firmware version of only active banks (#1629) (HEAD -> 202012) [vdahiya12]
* afd0975 2021-05-20 | [show] add support for muxcable metrics (#1615) [vdahiya12]

sonic-swss
* 7611df5 2021-05-27 | [tunneldecaporch] Set default MTU for the overlay loopback interface (#1756) (HEAD -> 202012) [Volodymyr Samotiy]
* 22fbb5c 2021-05-27 | [202012] Resolve neighbor when nexthop does not exist (#1759) (github/202012) [Shi Su]
* ec7710c 2021-05-27 | [Bulk mode] Limit the size of bulker (#1760) [Shi Su]

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
lguohan pushed a commit that referenced this issue Jun 17, 2021
19615e3 Fixing db_migrator for Feature table (#1674)
d1c1c61 [tests]: skip some dynamic port breakout unit tests (#1677)
25669c3 [CI] sonic-config-engine now depends on SONiC YANG packages (#1675)
3ff68c4 [neighbor-advertiser] delete the tunnel maps appropriately (#1663)
a425ca2 [config] support for configuring muxcable to manual mode of operation  (#1642)
25e17de [show platform summary] Add chassis hardware info to platform summary and version (#1624)
f5f2a00 [db_migrator] fix old 1911 feature config migration to a new one. (#1635)
56db162 [config] Fix config int add incorrect ip (#1414)
1da879c [db_migrator][Mellanox] Update Mellanox buffer migrator with 2km-cable supported (#1564)
c2b760f [sonic_package_manager] flush once finished saving docker image into temporary file (#1638)
cd69473 Replace swsssdk.ConfigDBConnector and SonicDBConfig with swsscommon implementation (#1620)
5f20365 Change to use rvtysh when calling the show commands (#1572)
51d6bf5 Fix Aboot breakage in sonic package manager in sonic-installer (#1625)
18bed46 [console][show] Force refresh all lines status during show line (#1641)
b616cd9 [TPID CONFIG] Added TPID configuration CLI support (#1618)
01eb4b1 [show] support for show muxcable firmware version of only active banks (#1629)
7744c8d [fdb]cli: fdb entries are cleared according to vlan or port or vlan&&port (#657)
e23c5ee Add psu hardware revision to psushow table (#1601)
f1726fe Make advance_version_for_expected_database available for other db migrator test cases as well (#1614)
5d1ad05 [show] add support for muxcable metrics (#1615)
feeab29 [config] Sort Config Db When Saving (#1623)
carl-nokia pushed a commit to carl-nokia/sonic-buildimage that referenced this issue Aug 7, 2021
19615e3 Fixing db_migrator for Feature table (sonic-net#1674)
d1c1c61 [tests]: skip some dynamic port breakout unit tests (sonic-net#1677)
25669c3 [CI] sonic-config-engine now depends on SONiC YANG packages (sonic-net#1675)
3ff68c4 [neighbor-advertiser] delete the tunnel maps appropriately (sonic-net#1663)
a425ca2 [config] support for configuring muxcable to manual mode of operation  (sonic-net#1642)
25e17de [show platform summary] Add chassis hardware info to platform summary and version (sonic-net#1624)
f5f2a00 [db_migrator] fix old 1911 feature config migration to a new one. (sonic-net#1635)
56db162 [config] Fix config int add incorrect ip (sonic-net#1414)
1da879c [db_migrator][Mellanox] Update Mellanox buffer migrator with 2km-cable supported (sonic-net#1564)
c2b760f [sonic_package_manager] flush once finished saving docker image into temporary file (sonic-net#1638)
cd69473 Replace swsssdk.ConfigDBConnector and SonicDBConfig with swsscommon implementation (sonic-net#1620)
5f20365 Change to use rvtysh when calling the show commands (sonic-net#1572)
51d6bf5 Fix Aboot breakage in sonic package manager in sonic-installer (sonic-net#1625)
18bed46 [console][show] Force refresh all lines status during show line (sonic-net#1641)
b616cd9 [TPID CONFIG] Added TPID configuration CLI support (sonic-net#1618)
01eb4b1 [show] support for show muxcable firmware version of only active banks (sonic-net#1629)
7744c8d [fdb]cli: fdb entries are cleared according to vlan or port or vlan&&port (sonic-net#657)
e23c5ee Add psu hardware revision to psushow table (sonic-net#1601)
f1726fe Make advance_version_for_expected_database available for other db migrator test cases as well (sonic-net#1614)
5d1ad05 [show] add support for muxcable metrics (sonic-net#1615)
feeab29 [config] Sort Config Db When Saving (sonic-net#1623)
volodymyrsamotiy added a commit to volodymyrsamotiy/sonic-buildimage that referenced this issue Sep 17, 2021
* 4be4a9b [tlm teamd] Add retry mechanism before logging the ERR in get_dumps. (sonic-net#1629)
* 2c0ce38 [fgnhgorch] Enable packet flow when no FG ECMP neighbors are resolved (sonic-net#1900)
* fddb298 Innovium platform specific changes PFC Detect lua script for SONiC 202012 (sonic-net#1893)

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
lguohan pushed a commit that referenced this issue Sep 18, 2021
* 4be4a9b [tlm teamd] Add retry mechanism before logging the ERR in get_dumps. (#1629)
* 2c0ce38 [fgnhgorch] Enable packet flow when no FG ECMP neighbors are resolved (#1900)
* fddb298 Innovium platform specific changes PFC Detect lua script for SONiC 202012 (#1893)

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
judyjoseph added a commit that referenced this issue Sep 27, 2021
sonic-swss:

bb69ca2 [portsorch] Avoid orchagent crash when set invalid interface types to port (#1906)
6e1bacc [pfcwd] Fix the polling interval time granularity (#1912)
564785b [teammgrd]: Improve LAGs cleanup on shutdown: send SIGTERM directly to PID. (#1841)
7ee8d26 [tlm teamd] Add retry mechanism before logging the ERR in get_dumps. (#1629)
7f57d3d [fgnhgorch] Enable packet flow when no FG ECMP neighbors are resolved (#1900)
08d009f Mux state order change (#1902)

sonic-utilities:

1bc0f07 Provide support to install platform extensions (#1578)
968c781 [config reload] Removed job-mode for sonic.target restart (#1820)
prsunny pushed a commit that referenced this issue Sep 28, 2021
*[Submodule] update for swss with following commits:
a3fdaf4 QOS fieldvalue reference ABNF format to string changes ([sonic-platform-daemons] Update submodule #1754)
a8fcadf Add sleep to ensure starting route perf test after the vs is stable ([mellanox]: Update hw-mgmt service with the stop action #1929)
a89d1f8 Fix failing DPB LAG tests ([socat]: build socat with readline #1919)
86b4ede [portsorch] Avoid orchagent crash when set invalid interface types to port (Upgrade azure-keyvault to known compatible version #1906)
025032f [VS] Skip failing test - test_recirc_port ([rsyslog]: use # to separate container and program name in syslog msg #1918)
d338bd0 [pfcwd] Fix the polling interval time granularity (Download newer version (8.23.0-2) of rsyslog from jessie-backports in hopes of eliminating memory leaks #1912)
14c937e Enabling copp tests ([Mellanox] Update hw-management service config #1914)
fbdcaae [teammgrd]: Improve LAGs cleanup on shutdown: send SIGTERM directly to PID. ([docker-syncd-mlnx] add new mlnx-sfpd daemon to docker-syncd-mlnx #1841)
002bb1d [tlm teamd] Add retry mechanism before logging the ERR in get_dumps. ([py-swss/config] config load-minigraph failure leaves database in wrong state #1629)
57d21e7 [pfcwd] Convert polling interval from ms to us in LUA scripts ([interfaces]: Move IP/MTU information from interfaces file into database #1908)
d01524d [fgnhgorch] Enable packet flow when no FG ECMP neighbors are resolved (Update arista driver submodule to includes interrupt handling changes #1900)
8cf355d Mux state order change ([submodule] update snmpagent and dbsyncd, extending/implementing ieee802.1ab, rfc3433, rfc2737 MIBs #1902)
theasianpianist pushed a commit to theasianpianist/sonic-buildimage that referenced this issue Feb 5, 2022
…onic-net#1629)

Fix sonic-net#6632

There has been cases when the get_dumps API in tlm_teamd process is not able to get the right data and logs an error message. 

The issue occurs very rarely and it is due to the race condition between teammgrd/teamsyncd/tlm_teamd when a Portchannel is removed. In the teamd telemetry module there are two places where the get_dumps() is called.
1. When the portchannel object is add/removed. [https://github.com/Azure/sonic-swss/blob/master/tlm_teamd/main.cpp#L101]
2. On timeout of 1 sec. [https://github.com/Azure/sonic-swss/blob/master/tlm_teamd/main.cpp#L108]

In case of timeout call for get_dumps(), there could be an inconsistent state where the portchannel/teamd process is getting removed by teammgrd but the STATE table update to remove the lag interface is still not received by the tlm_teamd module.  

Seen below on a bad case where the get_dumps() call from TIMEOUT handler throws an ERR message - as the remove_lag message is not yet received.

On a good case 
```
Feb  7 02:03:27.576078 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' has been removed.
Feb  7 02:03:28.453829 vlab-01 INFO teamd#supervisord 2021-02-07 02:03:28,451 INFO reaped unknown pid 4747 (exit status 0)
Feb  7 02:03:28.458616 vlab-01 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel999
```
On a bad case 

```
Feb  7 02:03:33.037401 vlab-01 ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel999'. Skipping
Feb  7 02:03:33.046179 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' has been removed.
Feb  7 02:03:33.997639 vlab-01 INFO teamd#supervisord 2021-02-07 02:03:33,996 INFO reaped unknown pid 4775 (exit status 0)
Feb  7 02:03:34.040126 vlab-01 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel999
```


**How I did it**

Add retry mechanism before logging the ERR in get_dumps API(). The number of retries is set as 3. So that if the same error repeats 3 times - it is logged, other wise it is considered a transient condition - not an error. 

Additionally added a **to_retry** flag to get_dumps() API so that the caller can decide whether to use the retry mechanism or not.


**How I verified it**
Verified that the error message is no more seen in the syslog.
Confirmed by running ~ 200 times portchannel creation (which had reproduced the issue earlier on VS testbed). 

The new NOTICE message added in remove_lag shows that we had indeed hit the original issue earlier and clearing flags here.

```
admin@vlab-01:/var/log$ sudo zgrep -i "get dump for LAG" syslog*; sudo zgrep -i "clearing it" syslog*
syslog.1:Feb  8 06:41:54.995716 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.2.gz:Feb  8 06:31:32.360135 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.2.gz:Feb  8 06:36:16.617283 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.2.gz:Feb  8 06:37:56.906306 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.3.gz:Feb  8 06:25:44.442474 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.3.gz:Feb  8 06:27:02.539413 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.3.gz:Feb  8 06:27:42.785533 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.3.gz:Feb  8 06:29:33.510933 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.5.gz:Feb  8 06:08:03.643106 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
```
taras-keryk pushed a commit to taras-keryk/sonic-buildimage that referenced this issue Apr 28, 2022
sonic-net#1629)

Signed-off-by: vaibhav-dahiya vdahiya@microsoft.com

This PR adds support for an option to display firmware version of muxcable of only active banks.

The new output would look like this in case an active flag is passed to the command line

admin@STR43-0101-0101-01LT0:~$ show muxcable firmware version Ethernet0 --active
{
"version_self_active": "0.7MS",
"version_peer_active": "0.7MS",
"version_nic_active": "0.7MS",
}

What I did
added an option to display active banks only for display muxcable firmware version

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants