Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[chassis] Chassis DB cleanup when asic comes up #16213

Merged
merged 3 commits into from
Sep 1, 2023

Conversation

vganesan-nokia
Copy link
Contributor

Why I did it

In an operational system, after some configuration is changed (that involves line cards generating entries to chassis db), if config reload or system reboot is done without saving the new configuration changes and when system comes up, entries created by the new config are still present in the chassis db and hence the corresponding voq system entries (such as system interface, system neighbor and so on ) in all other line cards. These stale entries may affect the accuracy of the current entries and hence the intended operation of chassis. To fix this, we cleanup the chassis db when the an asic comes up. The chassis db is cleaned up for entries created by the asic that is coming up.

Work item tracking

N/A

How I did it

Cleanup the entries from the following tables in chassis app db in redis_chassis server in the supervisor

(1) SYSTEM_NEIGH
(2) SYSTEM_INTERFACE
(3) SYSTEM_LAG_MEMBER_TABLE
(4) SYSTEM_LAG_TABLE

As part of the clean up only those entries created by the asic that is coming up are deleted. The LAG IDs used by the asics are also de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET

How to verify it

  • When chassis is up and running, change configuration that will create entries for the tables mentioned above. Make sure that the entries created are due to changes in the existing configuration.
  • Note that chassis db has entries for new changed configutations.
  • Without saving the changed configuration, do a config reload or reboot the line card or reboot the chassis.
  • Observe that the chassis db does not have the entries created by the changed configuration.

Which release branch to backport (provide reason below if selected)

N/A

Tested branch (Please provide the tested image version)

master

Description for the changelog

  • Changes swss.sh service script. Which is invoked with swss service is started. After cleaning up the critical tables in local redis database, the chassis db entries are cleaned up. The clean up use the hostname and asic name currently detected by the service initialization script. If this hostname and asic name are different than those used to create the entries in the chassis db, the entries will not be cleaned up.

Link to config_db schema for YANG module changes

None

A picture of a cute animal (not mandatory but encouraged)

end
return " 0 $lc $asic

sleep 30
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the criteria to pick this value to sleep? If it is merely cleanup the redis DB entry, does it require "sleep xx" at all?
Just looking at how we can spend less time to bring up a module and only wait when it is absolutely necessary so the overall chassis/module bring up time can be reduced...

Copy link
Contributor Author

@vganesan-nokia vganesan-nokia Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sleep time is required. At the line card/asic, we need to make sure that neighbors associated with a system interface is removed in the syncd/SAI before deleting the system interface. If this delay was not there, we observed an issue that at the orchangent system interface refcount will be zero but the syncd meta layer will stil have non-zero. Due to this out of sync issue which is in turn diue to time difference between the orchangent and syncd, syncd returns error and hence orchagent exits. There is no specific criteria used to select this value. This is thie lowest rounded time we used with which we did not see the issue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. Please add a comment to disclose this so that in future someone may not try to remove it or at least when try to better tunning the value will know what to verify so it will not end up breaking something...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment.

end
return " 0 $lc $asic

sleep 15
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as above... can we eliminate unnecessary sleep if it is not really needed...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the similar reason as explained for the previous comment but for SYSTEM_LAG_TABLE and SYSTEM_LAG_MEMBER_TABLE entries. At the line card/asic, deletion of system lag will happen only if all the members of the system lag are removed. Even though we send system lag member delete before sending the system lag delete, due to the order of processing and other dependency (like system inteface delete and neighbours delete which introduces timing differences), by the time system lag is tried to be deleted, sometimes, the a system lag still has system interface pending to be deleted. Having this delay (the lowest rounded time we tested with which we did not see the issue) helps to make sure that all system lag members are deleted before deleting system lag

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. Please add a comment to disclose this so that in future someone may not try to remove it or at least when try to better tunning the value will know what to verify so it will not end up breaking something...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment

Copy link
Contributor

@deepak-singhal0408 deepak-singhal0408 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me.. Thanks

@gechiang gechiang requested a review from rlhui August 29, 2023 23:07
if [[ !($($SONIC_DB_CLI CHASSIS_APP_DB PING | grep -c True) -gt 0) ]]; then
return
fi

Copy link
Contributor

@arlakshm arlakshm Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a check to do the clean-up for VoQ Linecards only. The code as it is now will add delay of about 45 seconds on supervisor and non-voq linecards.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added check to run the clean up only for voq switches.

Cleanup the entries from the following tables in chassis app db in
redis_chassis server in the supervisor
(1) SYSTEM_NEIGH
(2) SYSTEM_INTERFACE
(3) SYSTEM_LAG_MEMBER_TABLE
(4) SYSTEM_LAG_TABLE
As part of the clean up only those entries created by the asic that
is coming up are deleted. The LAG IDs used by the asics are also
de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET

Signed-off-by: vedganes <veda.ganesan@nokia.com>
Changes to add comments describing reason for adding delay before system
interface and system lag entries clean up.

Signed-off-by: vedganes <veda.ganesan@nokia.com>
# but the syncd (meta) still has no-zero refcount. Because of this, orchagent gets "object still in use"
# error and aborts.

sleep 30
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add a check to see if the number of SYSTEM_NEIGHBOR entries ( and similarly below for SYSTEM_LAG_MEMBER_TABLE) are non-zero before adding this sleep(). In case of retries when swss systemd service restarts multiple times etc we need not each time wait for 45 sec.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is not with number of system neighbors or system lag members in the chassis app db. The problem is with timing in processing in other remote asics and linecards. It depends on how much time the remote asics/linecards take to remove the referenced entries in that asic/linecard.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vganesan-nokia , I think the asks is for this script while executing to cleanup the various stale entries to be able to know whether anything actually did get cleaned up during the iteration. if so, do wait for the needed sleep time whether it be just one entry or 1000 entries... but if NONE is cleaned up let's say for nexthop resource, then we should not dead wait for the 30 seconds there...
This would work out great when swss keep restarting let's say after the cleanup all done it encountered some other issues that caused it to restart again. since the cleanup already done on first restart, it should have already waited during the first restart iteration so that on the second restart, the clean up steps will end up with nothing needs to be clean up any more and therefore for that case no need to sleep at all... Is this something achievable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true, I was pointing to a few cases as below where we can skip the sleep altogether

(a) no SYSTEM_NEIGHBORS/LAG for that asic/LC ( so ideally nothing to clean locally and remote)
or in scenario below
(b) where swss restarts, we check and cleanup SYSTEM_NEIGHBORS for that asic/LC in chassis_db. Wait for 30 secs and make sure that the processing in remote asic/linecard is done.

But not for some reason swss restarts again (assume orchgent crashed). Again we will do same cleanup SYSTEM_NEIGHBORS for that asic/LC in chassis_db -- but this time there are no entries as in last round we cleaned up all .
So now there is no need to wait for 30 sec and 15 sec as there is no SYSTEM_NEIGHBOR/LAG entries relating to my LC/asic present in CHASSIS_DB

But don't block merge for this comment -- we can fine tune this in subsequent commits too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification. I'll fix this.

Follwing changes done for review comments:
    - Added check to run the chassis db clean up only for voq switches.

Signed-off-by: vedganes <veda.ganesan@nokia.com>
@gechiang
Copy link
Collaborator

gechiang commented Sep 1, 2023

BRM build failed... retrying the pipeline again...
Azure Pipelines / Azure.sonic-buildimage

@gechiang
Copy link
Collaborator

gechiang commented Sep 1, 2023

/azp run

@azure-pipelines
Copy link

Commenter does not have sufficient privileges for PR 16213 in repo sonic-net/sonic-buildimage

@arlakshm
Copy link
Contributor

arlakshm commented Sep 1, 2023

/Azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

# but the syncd (meta) still has no-zero refcount. Because of this, orchagent gets "object still in use"
# error and aborts.

sleep 30
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have some concerns about the sleep 30 here and the sleep 15 after the LAG cleanup.
Is there a way you can check for the case where there are no entries need to be removed, we can skip over the sleep statements? Basically, let's not blindly wait unless there is a need to wait... it creates unnecessary delays on swss restart.

# but the syncd (meta) still has no-zero refcount. Because of this, orchagent gets "object still in use"
# error and aborts.

sleep 30
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vganesan-nokia , I think the asks is for this script while executing to cleanup the various stale entries to be able to know whether anything actually did get cleaned up during the iteration. if so, do wait for the needed sleep time whether it be just one entry or 1000 entries... but if NONE is cleaned up let's say for nexthop resource, then we should not dead wait for the 30 seconds there...
This would work out great when swss keep restarting let's say after the cleanup all done it encountered some other issues that caused it to restart again. since the cleanup already done on first restart, it should have already waited during the first restart iteration so that on the second restart, the clean up steps will end up with nothing needs to be clean up any more and therefore for that case no need to sleep at all... Is this something achievable?

@rlhui rlhui merged commit 5fded5c into sonic-net:master Sep 1, 2023
16 checks passed
@gechiang
Copy link
Collaborator

gechiang commented Sep 1, 2023

@yxieca , @StormLiangMS
MSFT ADO: 25038392
Appreciate approval for the corresponding branches. Thanks!

rlhui pushed a commit that referenced this pull request Sep 1, 2023
* [chassis]Chassis DB cleanup when asic comes up

Cleanup the entries from the following tables in chassis app db in
redis_chassis server in the supervisor
(1) SYSTEM_NEIGH
(2) SYSTEM_INTERFACE
(3) SYSTEM_LAG_MEMBER_TABLE
(4) SYSTEM_LAG_TABLE
As part of the clean up only those entries created by the asic that
is coming up are deleted. The LAG IDs used by the asics are also
de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET

- Added check to run the chassis db clean up only for voq switches.

Signed-off-by: vedganes <veda.ganesan@nokia.com>
Co-authored-by: vganesan-nokia <67648637+vganesan-nokia@users.noreply.github.com>
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Sep 3, 2023
* [chassis]Chassis DB cleanup when asic comes up

Cleanup the entries from the following tables in chassis app db in
redis_chassis server in the supervisor
(1) SYSTEM_NEIGH
(2) SYSTEM_INTERFACE
(3) SYSTEM_LAG_MEMBER_TABLE
(4) SYSTEM_LAG_TABLE
As part of the clean up only those entries created by the asic that
is coming up are deleted. The LAG IDs used by the asics are also
de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET

- Added check to run the chassis db clean up only for voq switches.

Signed-off-by: vedganes <veda.ganesan@nokia.com>
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202305: #16417

sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
* [chassis]Chassis DB cleanup when asic comes up

Cleanup the entries from the following tables in chassis app db in
redis_chassis server in the supervisor
(1) SYSTEM_NEIGH
(2) SYSTEM_INTERFACE
(3) SYSTEM_LAG_MEMBER_TABLE
(4) SYSTEM_LAG_TABLE
As part of the clean up only those entries created by the asic that
is coming up are deleted. The LAG IDs used by the asics are also
de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET

- Added check to run the chassis db clean up only for voq switches.

Signed-off-by: vedganes <veda.ganesan@nokia.com>
wangxin pushed a commit to sonic-net/sonic-mgmt that referenced this pull request Nov 1, 2023
)

The Test Case tacacs/test_ro_disk.py::test_ro_disk, in finally block a reboot is performed, currently, the condition for reboot to be successful is to check the docker is running, the enhancement done is to wait till all the critical process in the docker are up and the interfaces are also up for all the line cards in case the supervisor is rebooted
The Particular improvement is required due to the changes with respect to PR sonic-net/sonic-buildimage#16213. The Parent PR on Master was merged #9893

What is the motivation for this PR?
The Parent Master PR #9893
The Particular improvement is required due to the changes with respect to PR sonic-net/sonic-buildimage#16213.

How did you do it?
Assertion for successful reboot of supervisor card is updated to check all the process are up on critical docker and interfaces are also up on line cards.

How did you verify/test it?
Ran the test cases against a multi-asic line card in a T2 chassis.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

9 participants