
[config reload] Fix config reload failure due to sonic.target job cancellation #1814

Merged: 2 commits, Sep 13, 2021


vivekrnv
Contributor

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>

What I did

Fixes sonic-net/sonic-buildimage#7508

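How I did it

Pass `--job-mode replace-irreversibly` to `systemctl stop sonic.target` in `_stop_services()`, so that the queued stop job cannot be replaced (and thereby cancelled) by a conflicting job submitted in parallel, such as a manual `systemctl start` of one of the services. Following review, the same option was also added to `systemctl restart sonic.target` in `_restart_services()`.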

How to verify it

With this change:

```
root@sonic:/home/admin# config reload -y -f
Running command: rm -rf /tmp/dropstat-*
Disabling container monitoring ...
Stopping SONiC target ...
Running command: /usr/local/bin/sonic-cfggen -j /etc/sonic/init_cfg.json -j /etc/sonic/config_db.json --write-to-db
Running command: /usr/local/bin/db_migrator.py -o migrate
Running command: /usr/local/bin/sonic-cfggen -d -y /etc/sonic/sonic_version.yml -t /usr/share/sonic/templates/sonic-environment.j2,/etc/sonic/sonic-environment
Restarting SONiC target ...
Enabling container monitoring ...
Reloading Monit configuration ...
Reinitializing monit daemon
```

In parallel:

```
admin@sonic:~$ sudo systemctl start teamd.service
Failed to start teamd.service: Transaction for teamd.service/start is destructive (radv.service has 'stop' job queued, but 'start' is included in transaction).
See system logs and 'systemctl status teamd.service' for details.
```

Without this change:

```
root@r-lionfish-16:/home/admin# config reload -y -f
Running command: rm -rf /tmp/dropstat-*
Disabling container monitoring ...
Stopping SONiC target ...
Job for sonic.target canceled.
```

In parallel:

```
admin@sonic:~$ sudo systemctl start teamd.service
```
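The two outcomes above follow from how systemd treats conflicting queued jobs. In the default `replace` job mode, a new conflicting job replaces (cancels) the queued one. A job queued with `replace-irreversibly` cannot be replaced, so the conflicting request is refused instead. The toy model below is a hypothetical sketch: `JobQueueModel` is not part of systemd or sonic-utilities. It reproduces both outcomes for a single unit. It ignores dependencies, so it does not model how a job on one unit (e.g. `radv.service`) relates to the job on `sonic.target`.

```python
class JobQueueModel:
    """Toy model of systemd's pending-job table: at most one job per unit.

    Simplifications (assumptions, not systemd's real algorithm): units are
    independent, only "start"/"stop" jobs exist, and only the "replace" and
    "replace-irreversibly" job modes are modelled.
    """

    MODES = ("replace", "replace-irreversibly")

    def __init__(self):
        self.pending = {}  # unit name -> (action, irreversible flag)

    def enqueue(self, unit, action, mode="replace"):
        """Queue `action` on `unit`; return the action of a job it cancelled, or None."""
        if mode not in self.MODES:
            raise ValueError(f"unsupported job mode in this model: {mode!r}")
        irreversible = mode == "replace-irreversibly"
        current = self.pending.get(unit)
        if current is None:
            self.pending[unit] = (action, irreversible)
            return None
        queued_action, queued_irreversible = current
        if queued_action == action:
            # Same job already queued: merge, staying irreversible if either one was.
            self.pending[unit] = (action, queued_irreversible or irreversible)
            return None
        if queued_irreversible:
            # A conflicting job may not replace an irreversible one.
            raise RuntimeError(
                f"Transaction for {unit}/{action} is destructive ({unit} has "
                f"'{queued_action}' job queued, but '{action}' is included in transaction)."
            )
        # Default "replace" behaviour: the conflicting queued job is cancelled.
        self.pending[unit] = (action, irreversible)
        return queued_action

    def finish(self, unit):
        """Mark the queued job on `unit` as completed, removing it from the queue."""
        self.pending.pop(unit, None)


# Without the fix: a parallel start replaces (cancels) the queued stop.
q = JobQueueModel()
q.enqueue("radv.service", "stop")
print(q.enqueue("radv.service", "start"))  # prints "stop" (the cancelled job)

# With the fix: the parallel start is refused instead.
q = JobQueueModel()
q.enqueue("radv.service", "stop", mode="replace-irreversibly")
try:
    q.enqueue("radv.service", "start")
except RuntimeError as err:
    print(err)
```

The same rule underlies the review concern about `restart` further down: if the restart is queued with `replace-irreversibly`, a stop issued before those jobs finish is refused in the same way. It succeeds once the queue has drained, which `finish()` stands in for here.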

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
@vivekrnv
Contributor Author

@rajendra-dendukuri, please review

```diff
@@ -686,7 +686,7 @@ def _stop_services():
 pass
 
 click.echo("Stopping SONiC target ...")
-clicommon.run_command("sudo systemctl stop sonic.target")
+clicommon.run_command("sudo systemctl stop sonic.target --job-mode replace-irreversibly")
```
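The change above is a one-flag edit to a hand-written command string. As a hypothetical sketch, the job mode could instead be an explicit, validated parameter. `systemctl_command` and `KNOWN_JOB_MODES` are illustrative names and are not part of sonic-utilities. The mode set is a subset of the values documented for `--job-mode` in systemctl(1).

```python
import shlex

# Subset of the values accepted by `systemctl --job-mode=` (see systemctl(1)).
KNOWN_JOB_MODES = frozenset({
    "fail", "replace", "replace-irreversibly", "isolate",
    "ignore-dependencies", "ignore-requirements", "flush",
})


def systemctl_command(action, unit, job_mode=None, sudo=True):
    """Return a shell command string such as "sudo systemctl stop sonic.target"."""
    if job_mode is not None and job_mode not in KNOWN_JOB_MODES:
        raise ValueError(f"unknown systemctl job mode: {job_mode!r}")
    argv = ["sudo"] if sudo else []
    argv += ["systemctl", action, unit]
    if job_mode is not None:
        argv += ["--job-mode", job_mode]
    # shlex.join quotes any argument that would need quoting in a shell.
    return shlex.join(argv)


print(systemctl_command("stop", "sonic.target", job_mode="replace-irreversibly"))
# prints: sudo systemctl stop sonic.target --job-mode replace-irreversibly
```

In `_stop_services()` this would read `clicommon.run_command(systemctl_command("stop", "sonic.target", job_mode="replace-irreversibly"))`. Omitting `job_mode` leaves systemctl's default mode (`replace`) in effect.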
Contributor

Can we also add this option to the `systemctl restart sonic.target` call at line 709?

Contributor Author

Added!

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
@liat-grozovik
Collaborator

@vivekreddynv could you please confirm that this PR can be cleanly cherry-picked to 202106 and 202012? If not, please create separate PRs.

@vivekrnv
Contributor Author

> @vivekreddynv could you please confirm this PR can be cherry picked to 202106 and 202012 cleanly? if not please create separated PRs.

Verified manually: the changes can be cleanly cherry-picked to 202012 and 202106.

@nazariig
Collaborator

@v-wfarris

Where can I get an updated .bin install image that includes this fix? I would like to try it on the Mellanox 2700 I have been working with, which I had trouble configuring because of this bug.
Thanks
Wes

@vivekrnv
Contributor Author

The submodule update for this repo has been raised here: sonic-net/sonic-buildimage#8741.

Once it is merged, you can get the image from the build artifacts here: https://dev.azure.com/mssonic/build/_build

Alternatively, once CI completes for that PR, you can use its artifacts here: https://dev.azure.com/mssonic/build/_build/results?buildId=<buildId>&view=artifacts&type=publishedArtifacts

@v-wfarris

Wonderful, thanks for the info. May I ask what the ETA is for the merge? No hurry, I am just curious. Thanks again for your response.
Wes

@nazariig
Collaborator

nazariig commented Sep 14, 2021

@v-wfarris there might be some issues with that fix, so we will probably need yet another PR.

```diff
@@ -706,7 +706,7 @@ def _reset_failed_services():
 
 def _restart_services():
     click.echo("Restarting SONiC target ...")
-    clicommon.run_command("sudo systemctl restart sonic.target")
+    clicommon.run_command("sudo systemctl restart sonic.target --job-mode replace-irreversibly")
```
Collaborator

@vivekreddynv please double-check this change.
@stepanblyschak please elaborate on the consequences.

Contributor

@vivekreddynv It looks like if restart is executed with `--job-mode replace-irreversibly`, we will still have the same issue. The start job is placed in systemd's job queue as irreversible, so the stop job of the next `config reload` will be rejected by systemd because of that irreversible start job, leading to an error that looks something like this:

```
Transaction for sonic.target/stop is destructive
```

Could you please double-check? I think only stopping the services needs the guarantee that the job will be executed successfully.

Contributor Author

Once the restart job (and all the dependent jobs) have finished and been removed from systemd's job queue, I don't think the next stop job will be cancelled.

However, config reloads run in quick succession can lead to the behavior you described:

```
admin@sonic:~$ sudo systemctl restart sonic.target --job-mode replace-irreversibly
# Run immediately afterwards:
admin@sonic:~$ sudo systemctl stop sonic.target --job-mode replace-irreversibly
Failed to stop sonic.target: Transaction for sonic.target/stop is destructive (ntp-config.service has 'start' job queued, but 'stop' is included in transaction).
See system logs and 'systemctl status sonic.target' for details.
# After all the dependent jobs of the sonic.target restart are done, it works:
admin@sonic:~$ sudo systemctl stop sonic.target --job-mode replace-irreversibly
admin@sonic:
```

The problem here is that, unlike the `sonic.target` stop, the `sonic.target` start is not a blocking call.

Nevertheless, the goal of this PR is that `config reload` should not fail, so it makes sense to remove the option from the restart call. I'll raise a separate PR.

judyjoseph pushed a commit that referenced this pull request Sep 14, 2021
qiluo-msft pushed a commit that referenced this pull request Sep 15, 2021
#### What I did

As discussed in PR #1814 (comment), only the stop job should have its job mode set to `replace-irreversibly`.

Otherwise, config reloads run in quick succession can fail with a "Transaction for sonic.target/stop is destructive" error.

Note, however, that once the restart job (and all the dependent jobs) have finished and been removed from systemd's job queue, the next stop job is not cancelled.
qiluo-msft pushed a commit that referenced this pull request Sep 15, 2021
judyjoseph pushed a commit that referenced this pull request Sep 27, 2021
stepanblyschak pushed a commit to stepanblyschak/sonic-utilities that referenced this pull request Apr 18, 2022
* d03ba4f [202012] [portstat, intfstat] added rates and utilization  (sonic-net#1812)
* 499ad3f [config reload] Fix config reload failure due to sonic.target job cancellation (sonic-net#1814)
* 96d658c [202012][sonic installer] Add swap setup support (sonic-net#1815)
* a9c6970 platform pre-check for reboot in 202012 branch (sonic-net#1788)
* 0e0478b Unify the number format in the output of portstat and pfcstat in all cases (sonic-net#1795)
* 2d1e00e [ecnconfig] Fix exception seen during display and add unit tests (sonic-net#1784) (sonic-net#1789)

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
malletvapid23 added a commit to malletvapid23/Sonic-Utility that referenced this pull request Aug 3, 2023
Development

Successfully merging this pull request may close these issues.

[services] Job for sonic.target canceled.
8 participants