[chassis] Chassis DB cleanup when asic comes up #16213

vganesan-nokia · 2023-08-19T18:19:28Z

Why I did it

In an operational system, if configuration is changed in a way that causes line cards to create entries in the chassis db, and a config reload or system reboot is then done without saving the new configuration, the entries created by the unsaved configuration are still present in the chassis db when the system comes up. The corresponding voq system entries (such as system interface, system neighbor and so on) therefore also remain in all other line cards. These stale entries may affect the accuracy of the current entries and hence the intended operation of the chassis. To fix this, we clean up the chassis db when an asic comes up. Only the entries created by the asic that is coming up are cleaned up.

Work item tracking

N/A

How I did it

Clean up the entries from the following tables in the chassis app db in the redis_chassis server on the supervisor:

(1) SYSTEM_NEIGH
(2) SYSTEM_INTERFACE
(3) SYSTEM_LAG_MEMBER_TABLE
(4) SYSTEM_LAG_TABLE

As part of the cleanup, only those entries created by the asic that is coming up are deleted. The LAG IDs used by that asic are also de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET.
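The per-asic scoping described above can be sketched as a key filter. This is only an illustration, not the code added to swss.sh: the helper name `filter_keys_for_asic` is hypothetical, and the key layout `TABLE|<hostname>|<asic>|...` and the sample keys are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the chassis app db keys that belong to one asic.
# Assumes keys look like "TABLE|<hostname>|<asic>|<rest>"; the real layout may differ.
filter_keys_for_asic() {
    local lc="$1" asic="$2" key
    while IFS= read -r key; do
        case "$key" in
            # The pattern is quoted, so "|" is matched literally.
            *"|${lc}|${asic}|"*) printf '%s\n' "$key" ;;
        esac
    done
}

# Sample input standing in for a key listing from the chassis app db.
printf '%s\n' \
    'SYSTEM_INTERFACE|lc1|asic0|Ethernet0' \
    'SYSTEM_INTERFACE|lc2|asic0|Ethernet0' \
    'SYSTEM_NEIGH|lc1|asic0|Ethernet0:10.0.0.1' \
    'SYSTEM_NEIGH|lc1|asic1|Ethernet8:10.0.0.5' |
    filter_keys_for_asic lc1 asic0
```

Only the two `lc1`/`asic0` keys are printed. Keys created under a different hostname or asic name are left alone.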

How to verify it

When the chassis is up and running, change the configuration so that entries are created in the tables mentioned above. Make sure that the entries are created by changes to the existing configuration.
Note that the chassis db has entries for the newly changed configuration.
Without saving the changed configuration, do a config reload, reboot the line card, or reboot the chassis.
Observe that the chassis db no longer has the entries created by the changed configuration.

Which release branch to backport (provide reason below if selected)

N/A

Tested branch (Please provide the tested image version)

master

Description for the changelog

Changes the swss.sh service script, which is invoked when the swss service is started. After cleaning up the critical tables in the local redis database, the chassis db entries are cleaned up. The cleanup uses the hostname and asic name currently detected by the service initialization script. If this hostname and asic name differ from those used to create the entries in the chassis db, the entries will not be cleaned up.

Link to config_db schema for YANG module changes

None

A picture of a cute animal (not mandatory but encouraged)

gechiang · 2023-08-23T01:59:35Z

files/scripts/swss.sh

+    end
+    return " 0 $lc $asic
+
+    sleep 30


What are the criteria for picking this sleep value? If this is merely cleaning up redis DB entries, is a "sleep xx" required at all?
Just looking at how we can spend less time bringing up a module and only wait when it is absolutely necessary, so that the overall chassis/module bring-up time can be reduced...

This sleep time is required. At the line card/asic, we need to make sure that the neighbors associated with a system interface are removed in syncd/SAI before the system interface is deleted. Without this delay, we observed an issue where the system interface refcount in orchagent is zero but the syncd meta layer still has a non-zero refcount. Because of this out-of-sync state, which is in turn due to the time difference between orchagent and syncd, syncd returns an error and orchagent exits. No specific criteria were used to select this value. It is the lowest rounded time with which we did not see the issue.

Thanks for the explanation. Please add a comment to document this, so that in the future nobody tries to remove it, or at least anyone trying to tune the value knows what to verify so that it does not end up breaking something...

Added comment.

gechiang · 2023-08-23T02:00:21Z

files/scripts/swss.sh

+    end
+    return " 0 $lc $asic
+
+    sleep 15


Same question as above... can we eliminate the sleep if it is not really needed?

This is for a similar reason as explained for the previous comment, but for the SYSTEM_LAG_TABLE and SYSTEM_LAG_MEMBER_TABLE entries. At the line card/asic, a system lag is deleted only after all of its members are removed. Even though we send the system lag member deletes before the system lag delete, the order of processing and other dependencies (such as system interface and neighbor deletes, which introduce timing differences) mean that by the time the system lag delete is attempted, the system lag sometimes still has a system interface pending deletion. This delay (the lowest rounded time we tested with which we did not see the issue) helps make sure that all system lag members are deleted before the system lag is deleted.

Thanks for the explanation. Please add a comment to document this, so that in the future nobody tries to remove it, or at least anyone trying to tune the value knows what to verify so that it does not end up breaking something...

Added comment
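The ordering constraint from the two threads above can be sketched as follows. `delete_table_entries` is a hypothetical stand-in for the real chassis app db deletion, and the relative order of the interface and LAG steps is an assumption. The diff only shows that a 30-second delay precedes the system interface cleanup and a 15-second delay precedes the system lag cleanup.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for deleting this asic's entries from one chassis app db table.
delete_table_entries() {
    echo "delete $1"
}

# Delete dependents first, wait for remote asics to drop their references, then
# delete the objects they referenced. Delays default to the values in the diff.
cleanup_in_order() {
    local neigh_delay="${1:-30}" lag_delay="${2:-15}"
    delete_table_entries SYSTEM_NEIGH
    sleep "$neigh_delay"
    delete_table_entries SYSTEM_INTERFACE
    delete_table_entries SYSTEM_LAG_MEMBER_TABLE
    sleep "$lag_delay"
    delete_table_entries SYSTEM_LAG_TABLE
}

cleanup_in_order 0 0   # zero delays so the sketch finishes immediately
```

With the default arguments the two waits add up to the roughly 45 seconds mentioned later in the review.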

deepak-singhal0408

Changes look good to me.. Thanks

arlakshm · 2023-08-29T23:22:08Z

files/scripts/swss.sh

+    if [[ !($($SONIC_DB_CLI CHASSIS_APP_DB PING | grep -c True) -gt 0) ]]; then
+        return
+    fi
+


Can you add a check to do the cleanup for VoQ linecards only? As it is now, the code adds a delay of about 45 seconds on the supervisor and on non-VoQ linecards.

Added check to run the clean up only for voq switches.
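A guard of this kind might look like the sketch below. The function name `is_voq_switch` is hypothetical, and the way the switch type is obtained is an assumption rather than something shown in the diff.

```shell
#!/usr/bin/env bash
# Hypothetical guard: succeed only when the given switch type is "voq".
# In a real script the value would be read from the device configuration;
# the lookup is left out here because it is not part of the quoted diff.
is_voq_switch() {
    [[ "$1" == "voq" ]]
}

for switch_type in voq chassis-packet ""; do
    if is_voq_switch "$switch_type"; then
        echo "'$switch_type': run chassis db cleanup"
    else
        echo "'$switch_type': skip cleanup, no added delay"
    fi
done
```

Returning early for anything other than a VoQ switch keeps the supervisor and non-VoQ linecards from paying for the two sleeps.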

Changes to add comments describing reason for adding delay before system interface and system lag entries clean up. Signed-off-by: vedganes <veda.ganesan@nokia.com>

judyjoseph · 2023-08-31T21:14:39Z

files/scripts/swss.sh

+    # but the syncd (meta) still has no-zero refcount. Because of this, orchagent gets "object still in use"
+    # error and aborts.
+
+    sleep 30


Can we also add a check to see whether the number of SYSTEM_NEIGHBOR entries (and similarly SYSTEM_LAG_MEMBER_TABLE entries below) is non-zero before adding this sleep()? In case of retries, for example when the swss systemd service restarts multiple times, we need not wait 45 sec each time.

The problem is not the number of system neighbors or system lag members in the chassis app db. The problem is the processing time in the other remote asics and linecards. It depends on how long the remote asics/linecards take to remove the referenced entries.

@vganesan-nokia, I think the ask is for this script, while cleaning up the various stale entries, to know whether anything actually got cleaned up during the iteration. If so, do wait for the needed sleep time, whether it be just one entry or 1000 entries... but if NONE is cleaned up, let's say for the nexthop resource, then we should not dead wait for the 30 seconds there...
This would work out great when swss keeps restarting, let's say because after the cleanup was all done it encountered some other issue that caused it to restart again. Since the cleanup was already done on the first restart, it has already waited during that iteration, so on the second restart the cleanup steps will find nothing left to clean up and there is no need to sleep at all... Is this something achievable?

That is true. I was pointing to a few cases, as below, where we can skip the sleep altogether:

(a) no SYSTEM_NEIGHBORS/LAG for that asic/LC (so ideally nothing to clean locally or remotely)
or, in the scenario below,
(b) where swss restarts, we check and clean up SYSTEM_NEIGHBORS for that asic/LC in chassis_db, wait for 30 secs, and make sure that the processing in the remote asic/linecard is done.

But now, for some reason, swss restarts again (assume orchagent crashed). Again we will do the same cleanup of SYSTEM_NEIGHBORS for that asic/LC in chassis_db, but this time there are no entries, as we cleaned them all up in the last round.
So now there is no need to wait for 30 sec and 15 sec, as there are no SYSTEM_NEIGHBOR/LAG entries relating to my LC/asic present in CHASSIS_DB.

But don't block the merge for this comment; we can fine-tune this in subsequent commits too.

Thanks for the clarification. I'll fix this.
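The request in this thread could be sketched as a wait that depends on how many entries a cleanup step removed. `wait_if_deleted` and its `num_deleted` argument are hypothetical; the sketch assumes the deletion step can report a count.

```shell
#!/usr/bin/env bash
# Sleep only when the preceding cleanup step actually removed something.
# num_deleted is assumed to be reported by the deletion step (hypothetical here).
wait_if_deleted() {
    local num_deleted="$1" delay="$2"
    if (( num_deleted > 0 )); then
        sleep "$delay"
        echo "waited ${delay}s after deleting ${num_deleted} entries"
    else
        echo "nothing deleted, no wait"
    fi
}

wait_if_deleted 3 0    # first swss start: stale entries were removed (delay set to 0 for the demo)
wait_if_deleted 0 30   # later restart: nothing left, so the 30 s wait is skipped
```

The wait still does not scale with the entry count, in line with the earlier reply that the delay depends on processing time in the remote asics rather than on the number of entries.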

Following changes done for review comments: - Added check to run the chassis db clean up only for voq switches. Signed-off-by: vedganes <veda.ganesan@nokia.com>

gechiang · 2023-09-01T00:39:06Z

BRM build failed... retrying the pipeline again...
Azure Pipelines / Azure.sonic-buildimage

gechiang · 2023-09-01T00:40:27Z

/azp run

azure-pipelines · 2023-09-01T00:40:31Z

Commenter does not have sufficient privileges for PR 16213 in repo sonic-net/sonic-buildimage

arlakshm · 2023-09-01T00:41:46Z

/Azp run Azure.sonic-buildimage

azure-pipelines · 2023-09-01T00:41:56Z

Azure Pipelines successfully started running 1 pipeline(s).

gechiang · 2023-08-30T00:02:38Z

files/scripts/swss.sh

+    # but the syncd (meta) still has no-zero refcount. Because of this, orchagent gets "object still in use"
+    # error and aborts.
+
+    sleep 30


We have some concerns about the sleep 30 here and the sleep 15 after the LAG cleanup.
Is there a way to check for the case where no entries need to be removed, so that we can skip the sleep statements? Basically, let's not blindly wait unless there is a need to wait... it creates unnecessary delays on swss restart.

gechiang · 2023-09-01T06:39:50Z

@yxieca , @StormLiangMS
MSFT ADO: 25038392
Appreciate approval for the corresponding branches. Thanks!

mssonicbld · 2023-09-03T03:28:58Z

Cherry-pick PR to 202305: #16417

The test case tacacs/test_ro_disk.py::test_ro_disk performs a reboot in its finally block. Currently, the condition for a successful reboot is that the docker is running. The enhancement is to wait until all the critical processes in the docker are up and, in case the supervisor is rebooted, until the interfaces are also up on all the line cards. This improvement is required due to the changes in PR sonic-net/sonic-buildimage#16213. The parent PR on master, #9893, was merged.

What is the motivation for this PR?

The parent master PR is #9893. This improvement is required due to the changes in PR sonic-net/sonic-buildimage#16213.

How did you do it?

The assertion for a successful reboot of the supervisor card is updated to check that all the processes are up on the critical dockers and that the interfaces are also up on the line cards.

How did you verify/test it?

Ran the test cases against a multi-asic line card in a T2 chassis.

vganesan-nokia requested a review from lguohan as a code owner August 19, 2023 18:19

deepak-singhal0408 requested review from judyjoseph, arlakshm and deepak-singhal0408 August 23, 2023 00:37

rlhui added the P0 Priority of the issue label Aug 23, 2023

gechiang reviewed Aug 23, 2023

rlhui assigned vganesan-nokia Aug 23, 2023

gechiang approved these changes Aug 29, 2023

deepak-singhal0408 approved these changes Aug 29, 2023

gechiang requested a review from rlhui August 29, 2023 23:07

arlakshm reviewed Aug 29, 2023

vganesan-nokia added 2 commits August 29, 2023 22:16

[swss]Chassis db cleanup when asic is coming up - 2

262db4e

judyjoseph reviewed Aug 31, 2023

[swss] Chassis db cleanup when asic comes up - 3

b3ebd82

vganesan-nokia force-pushed the chdbcon branch from fea9612 to b3ebd82 Compare August 31, 2023 21:57

arlakshm approved these changes Sep 1, 2023

gechiang approved these changes Sep 1, 2023

gechiang added Request for 202205 Branch Request for 202211 Branch Request for 202305 Branch labels Sep 1, 2023

rlhui merged commit 5fded5c into sonic-net:master Sep 1, 2023
16 checks passed

yxieca added Approved for 202205 Branch Approved for 202211 Branch labels Sep 1, 2023

This was referenced Sep 1, 2023

[action] [PR:16213] [chassis] Chassis DB cleanup when asic comes up #16378

Merged

[action] [PR:16213] [chassis] Chassis DB cleanup when asic comes up #16379

Merged

mssonicbld added a commit that referenced this pull request Sep 1, 2023

[chassis] Chassis DB cleanup when asic comes up (#16213) (#16379)

26e1d59

mssonicbld added Included in 202211 Branch and removed Approved for 202211 Branch Created PR to 202211 Branch labels Sep 1, 2023

mssonicbld added Included in 202205 Branch and removed Approved for 202205 Branch Created PR to 202205 Branch labels Sep 1, 2023

StormLiangMS added the Approved for 202305 Branch label Sep 3, 2023

mssonicbld added the Created PR to 202305 Branch label Sep 3, 2023

mssonicbld mentioned this pull request Sep 3, 2023

[action] [PR:16213] [chassis] Chassis DB cleanup when asic comes up #16417

Merged

mssonicbld added a commit that referenced this pull request Sep 3, 2023

[chassis] Chassis DB cleanup when asic comes up (#16213) (#16417)

ebe24a1

mssonicbld added Included in 202305 Branch and removed Created PR to 202305 Branch labels Sep 3, 2023

dgsudharsan mentioned this pull request Sep 5, 2023

Config reload results in ERR sonic-db-cli: :- getDbInfo: Failed to find CHASSIS_APP_DB database in namespace in syslog #16451

Closed

vganesan-nokia mentioned this pull request Sep 6, 2023

[swss] Chassis db clean up optimization and bug fixes #16454

Merged

vikshaw-Nokia mentioned this pull request Sep 7, 2023

New Test Cases for Chassis DB cleanup when asic comes up PR (sonic-buildimage/pull/16213) sonic-net/sonic-mgmt#9866

Closed

vikshaw-Nokia mentioned this pull request Sep 7, 2023

test_ro_disk: Enhancement for reboot of supervisor card sonic-net/sonic-mgmt#9893

Merged

vikshaw-Nokia mentioned this pull request Oct 3, 2023

New Test Cases for Chassis DB cleanup when asic comes up PR (sonic-buildimage/pull/16213) sonic-net/sonic-mgmt#10218

Merged

mssonicbld mentioned this pull request Oct 19, 2023

[action] [PR:9893] test_ro_disk: Enhancement for reboot of supervisor card sonic-net/sonic-mgmt#10386

Merged

vikshaw-Nokia mentioned this pull request Oct 30, 2023

[202205] test_ro_disk: Enhancement for reboot of supervisor card sonic-net/sonic-mgmt#10528

Merged
