Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[scripts/fast-reboot] Shutdown remaining containers through systemd #2133

Merged
merged 2 commits into from
Apr 13, 2022

Conversation

stepanblyschak
Copy link
Contributor

@stepanblyschak stepanblyschak commented Apr 8, 2022

The current implementation has two issues:

  1. In case containers from "docker ps" output are ordered in a way that database is first in the list, the "systemctl stop database" followed by "docker kill database" will stop all other containers through systemd and ruin this optimization
  2. After "docker kill database" there are lots of errors from daemons like hostcfgd, system-healthd, caclmgrd, etc. Also it causes those daemons to hang when received SIGTERM making a delay on following "systemctl stop database".

In the new implementation, services are implicitly stopped by systemd in the order that is correct. If a certain container needs an optimization that will kill the container instead of stopping it the container may implement this optimization in its /usr/local/bin/*.sh script.

It is also more optimal since independent services might be stopped in parallel.

Signed-off-by: Stepan Blyschak stepanb@nvidia.com

What I did

Stop services using systemd

How I did it

Stop services using systemd

How to verify it

Run warm-reboot.

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

Required to chery-pick: 202012, 202111.

DEPENDS ON: sonic-net/sonic-buildimage#10510 sonic-net/sonic-buildimage#10511

The current implementation has two issues:

1. In case containers from "docker ps" output are ordered in a way that
database is first in the list, the "systemctl stop database" followed by
"docker kill database" will stop all other containers through systemd
and ruin this optimization
2. After "docker kill database" there are lots of errors from daemons
like hostcfgd, system-healthd, caclmgrd, etc. Also it causes those
daemons to hang when received SIGTERM making a delay on following
"systemctl stop database".

In the new implementation, services are implicitelly stopped by systemd
in the order that is correct. If a certain container needs an
optimization that will kill the container instead of stopping it the
container may implement this optimization in its /usr/local/bin/*.sh
script.

It is also more optimal since independent services might be stopped in
parallel.

NOTE: This fix is relevant for regular SONiC image and not for
Kubernetes enabled SONiC image. Kubernetes integration in SONiC might
have this issue still, which might be a design flawn.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
if test -f /usr/local/bin/ctrmgr_tools.py
then
debug "Stopping all remaining containers ..."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this debug flag misplaced, considering the kill/shutdown happens implicitly now through systemd?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this branch is for kubernetes and can be removed. @renukamanavalan

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@yxieca yxieca requested a review from vaibhavhd April 12, 2022 19:14
@liat-grozovik
Copy link
Collaborator

@renukamanavalan could you please refer to the question above? if this can be removed lets go a head

Copy link
Contributor

@renukamanavalan renukamanavalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current solution of stopping docker service would work.

@liat-grozovik liat-grozovik merged commit 23e9398 into sonic-net:master Apr 13, 2022
qiluo-msft pushed a commit that referenced this pull request Apr 13, 2022
…2133)

The current implementation has two issues:

1. In case containers from "docker ps" output are ordered in a way that database is first in the list, the "systemctl stop database" followed by "docker kill database" will stop all other containers through systemd
and ruin this optimization
2. After "docker kill database" there are lots of errors from daemons like hostcfgd, system-healthd, caclmgrd, etc. Also it causes those daemons to hang when received SIGTERM making a delay on following "systemctl stop database".

In the new implementation, services are implicitly stopped by systemd in the order that is correct. If a certain container needs an optimization that will kill the container instead of stopping it the container may implement this optimization in its /usr/local/bin/*.sh script.

It is also more optimal since independent services might be stopped in parallel.

- What I did
Stop services using systemd

- How I did it
Stop services using systemd

- How to verify it
Run warm-reboot.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
judyjoseph pushed a commit that referenced this pull request Apr 18, 2022
…2133)

The current implementation has two issues:

1. In case containers from "docker ps" output are ordered in a way that database is first in the list, the "systemctl stop database" followed by "docker kill database" will stop all other containers through systemd
and ruin this optimization
2. After "docker kill database" there are lots of errors from daemons like hostcfgd, system-healthd, caclmgrd, etc. Also it causes those daemons to hang when received SIGTERM making a delay on following "systemctl stop database".

In the new implementation, services are implicitly stopped by systemd in the order that is correct. If a certain container needs an optimization that will kill the container instead of stopping it the container may implement this optimization in its /usr/local/bin/*.sh script.

It is also more optimal since independent services might be stopped in parallel.

- What I did
Stop services using systemd

- How I did it
Stop services using systemd

- How to verify it
Run warm-reboot.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
stepanblyschak pushed a commit to stepanblyschak/sonic-utilities that referenced this pull request Apr 18, 2022
How I did it
Advance swss submodule head to include:

c3fb52b 2022-02-04 | Fix for missing lossless PG profile on certain ports (sonic-swss-common update for Vnet tables sonic-net#2133) (HEAD -> 202012, github/202012) [Ying Xie]

Signed-off-by: Ying Xie ying.xie@microsoft.com
stepanblyschak added a commit to stepanblyschak/sonic-utilities that referenced this pull request Apr 18, 2022
```
08495be [scripts/fast-reboot] Shutdown remaining containers through systemd (sonic-net#2133)
fa07373 [scripts/fast-reboot] stop timers in advance (sonic-net#2131)
00ef80e [show][muxcable] Decrease the timeout for show mux status/hwmode (sonic-net#2130)
```

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Junchao-Mellanox added a commit to Junchao-Mellanox/sonic-utilities that referenced this pull request May 11, 2022
liat-grozovik pushed a commit that referenced this pull request May 11, 2022
…ystemd (#2133)" (#2161)

This reverts commit 23e9398.

- What I did
Revert "[scripts/fast-reboot] Shutdown remaining containers through systemd (#2133)"

This reverted PR is part of a story that refactors warm/fast shutdown sequence to gracefully stop services instead of killing them without any ordering and dependency requirements which creates several issues and is error prone for the future.

This PR must come together with sonic-net/sonic-buildimage#10510.
However, #10510 is blocked due to an issue in swss-common sonic-net/sonic-swss-common#603
And a fix by MSFT is in review sonic-net/sonic-swss-common#606

I am reverting it because its dependency is still blocked and we cannot update submodule pointer. Once the dependency of the reverted PR is resolved, it shall be re-committed.
dprital added a commit to dprital/sonic-utilities that referenced this pull request May 11, 2022
liat-grozovik pushed a commit that referenced this pull request May 11, 2022
…ystemd (#2133)" (#2166)

- What I did
This reverts commit a5f55aa.

Revert "[scripts/fast-reboot] Shutdown remaining containers through systemd (#2133)"

This reverted PR is part of a story that refactors warm/fast shutdown sequence to gracefully stop services instead of killing them without any ordering and dependency requirements which creates several issues and is error prone for the future.

This PR must come together with sonic-net/sonic-buildimage#10510.
However, #10510 is blocked due to an issue in swss-common sonic-net/sonic-swss-common#603
And a fix by MSFT is in review sonic-net/sonic-swss-common#606

I am reverting it because its dependency is still blocked and we cannot update submodule pointer. Once the dependency of the reverted PR is resolved, it shall be re-committed.

- How I did it
git revert a5f55aa

- How to verify it
Run tests
dprital added a commit to dprital/sonic-utilities that referenced this pull request May 11, 2022
liat-grozovik pushed a commit to sonic-net/sonic-buildimage that referenced this pull request May 15, 2022
Revert "[scripts/fast-reboot] Shutdown remaining containers through systemd (sonic-net/sonic-utilities#2133)" (sonic-net/sonic-utilities#2166)
stepanblyschak added a commit to stepanblyschak/sonic-utilities that referenced this pull request May 24, 2022
vaibhavhd pushed a commit that referenced this pull request Jul 25, 2022
…hrough systemd (#2133)" (#2161)" (#2184)

Reverts #2161
Revert a revert. This must be merged together with sonic-net/sonic-buildimage#10510
yxieca pushed a commit that referenced this pull request Dec 7, 2022
…2133)

The current implementation has two issues:

1. In case containers from "docker ps" output are ordered in a way that database is first in the list, the "systemctl stop database" followed by "docker kill database" will stop all other containers through systemd
and ruin this optimization
2. After "docker kill database" there are lots of errors from daemons like hostcfgd, system-healthd, caclmgrd, etc. Also it causes those daemons to hang when received SIGTERM making a delay on following "systemctl stop database".

In the new implementation, services are implicitly stopped by systemd in the order that is correct. If a certain container needs an optimization that will kill the container instead of stopping it the container may implement this optimization in its /usr/local/bin/*.sh script.

It is also more optimal since independent services might be stopped in parallel.

- What I did
Stop services using systemd

- How I did it
Stop services using systemd

- How to verify it
Run warm-reboot.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants