Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hostcfgd] Configure service auto-restart in hostcfgd. #5744

Merged
merged 23 commits into from
Jun 29, 2021

Conversation

stepanblyschak
Copy link
Collaborator

@stepanblyschak stepanblyschak commented Oct 29, 2020

Before this change, a process running inside every SONiC container dealt with FEATURE table 'auto_restart' field and depending on the value decided whether a container has to be killed or not.
If killed service auto restart mechanism restarts the container.
This change moves the logic from container to the host daemon - hostcfgd.
The 'auto_restart' handling is kept in supervisor-proc-exit-listener but now it is not required for container that wants to support auto restart feature.

  • hostcfgd refactoring - move feature handling in another class.
  • override systemd service Restart= setting from hostcfgd.
  • remove default systemd Restart=always.

Signed-off-by: Stepan Blyshchak stepanb@nvidia.com

- Why I did it

Remove the need to deal with container orchestration logic from the container itself. Leave this logic to the orchestrator - host OS.

- How I did it

hostcfgd configures 'Restart=' value for systemd service.

- How to verify it

root@r-tigon-11:/home/admin# sudo config feature autorestart lldp enabled
root@r-tigon-11:/home/admin# show feature status | grep lldp
lldp            enabled   enabled
root@r-tigon-11:/home/admin# docker exec -it lldp pkill -9 lldpd
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Exited (0) 20 seconds ago                       lldp
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Up 5 seconds                            lldp
root@r-tigon-11:/home/admin# sudo config feature autorestart lldp disabled
root@r-tigon-11:/home/admin# docker exec -it lldp pkill -9 lldpd
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Up 35 seconds                           lldp
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Exited (0) 3 seconds ago                       lldp
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Exited (0) 39 seconds ago                       lldp
root@r-tigon-11:/home/admin#

- Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

@stepanblyschak
Copy link
Collaborator Author

retest mellanox please

@stepanblyschak
Copy link
Collaborator Author

retest vsimage please

Before this change, a process runnning inside every SONiC
container dealt with FEATURE table 'auto_restart' field and
depending on the value decided wether a container has to be
killed or not. If killed service auto restart mechanism
restarts the container. This change moves the logic from
container to the host daemon - hostcfgd.

* hostcfgd refactoring - move feature handling in another class.
* override systemd service Restart= setting from hostcfgd.
* remove code that deals with FEATURE table from supervisor-proc-exit-listener.
* remove default systemd Restart=always.

Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
@lgtm-com
Copy link

lgtm-com bot commented Nov 16, 2020

This pull request introduces 1 alert when merging 6f47365 into 261a81d - view on LGTM.com

new alerts:

  • 1 for Unused import

Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
…restart_cfg

Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
…restart_cfg

Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
start_cmds.append("sudo systemctl start {}.{}".format(feature_name_suffix, feature_suffixes[-1]))
for cmd in start_cmds:
syslog.syslog(syslog.LOG_INFO, "Running cmd: '{}'".format(cmd))
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we enhance run_cmd() use it to return error code as well as log the error read from the exception

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, yes, please check the enhanced run_cmd()

stop_cmds.append("sudo systemctl mask {}.{}".format(feature_name_suffix, suffix))
for cmd in stop_cmds:
syslog.syslog(syslog.LOG_INFO, "Running cmd: '{}'".format(cmd))
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as before to use run_cmd()

…it in feature handler code

Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
@stepanblyschak
Copy link
Collaborator Author

@lguohan @jleveque

@jleveque
Copy link
Contributor

@stepanblyschak: This PR appears to change the container behavior upon critical process exit. Currently, supervisor-proc-exit-listener will only stop the container if auto-restart is enabled. With this PR, it appears the container will always stop, but it will only be restarted if auto-restart is enabled. Can you explain why you believe this behavior is preferred?

@stepanblyschak
Copy link
Collaborator Author

@jleveque
This has been discussed before over emails and with App.Ext sub-group community.
Considering the possibility to run 3rd party dockerized applications on SONiC the auto-restart functionality is highly benefitial, although the current approach limits the use of this feature only to SONiC containers which developers explicitely coded support for it. Thus, the choice was to make auto-restart functionality support for free for any container that is managed by systemd. On the other hand there is a change in container lifetime when critical process dies. Consider running an arbitrary container the user or container orchestrator has no control over its entrypoint process exits unexpectedly. Usually in microservices there is only one process running in container and if this process dies so does the container. Given that, I do not see a benefit from having the container running while one or more critical daemons have exited other than debug purpose, but for that a simple tweak in critical_processes file is enough to make the container still running. If you still have a requirement to control this behaviour from CONFIG DB I suggest that we have a CONTAINER_FEATURE table to control this behaviour. We can declare that to support this functionality 3rd party dockers have to explicitly add support for it, but not for auto-restart.

…restart_cfg

Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
@stepanblyschak stepanblyschak added the appext Application Extension label Jan 27, 2021
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@jleveque
Copy link
Contributor

@lguohan: A few Azure Pipelines check builds are stuck in the "Expected — Waiting for status to be reported" state, and I cannot re-trigger these tests. There are a few PRs like this and I've even tried closing/reopening the PRs to no avail. Can you please help here?

@liat-grozovik
Copy link
Collaborator

/AzurePipleines run

@liat-grozovik
Copy link
Collaborator

@lguohan tests are stuck for 2weeks. can you reset it? AzurepPipelines run is not working.
@jleveque any further comments or this can be approved?

@jleveque
Copy link
Contributor

Closing and reopening PR in hopes of getting stuck Azure Pipelines jobs running.

@jleveque jleveque closed this Mar 12, 2021
syslog.syslog(syslog.LOG_INFO, "Feature '{}' service is '{}'"
.format(feature_name, invariant_state))
entry = self.config_db.get_entry('FEATURE', feature_name)
entry['state'] = invariant_state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stepanblyschak Following my previous comment, if the state at here is always_disabled and invariant_state is always_enabled, the code at here will update the state field of feature to invariant_state. However, the code from line 761 ~ 764 will disable this feature. So I think the code at line 758 should be entry['state'] = state, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No sure I get this comment, are you saying there is a bug in original code?
Could you please point me to a document describing feature state transitions?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yozhao101 Do you have this comment still?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yozhao101 could you please check if the last commit addresses your comment?

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@renukamanavalan
Copy link
Contributor

/AzurePipelines run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
if cached_feature.state is None:
enable = feature.state in ("always_enabled", "enabled")
disable = feature.state in ("always_disabled", "disabled")
elif cached_feature.state == ("always_enabled", "always_disabled"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in instead of ==: elif cached_feature.state in ("always_enabled", "always_disabled"):

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for noticing this!

src/sonic-host-services/scripts/hostcfgd Show resolved Hide resolved
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@liat-grozovik
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@renukamanavalan
Copy link
Contributor

@tahmed-dev can you please provide your approval?

@renukamanavalan
Copy link
Contributor

@yozhao101 can you please provide your review/approval ASAP?

@renukamanavalan
Copy link
Contributor

@jleveque, can you please provide your review or approval?

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Copy link
Contributor

@jleveque jleveque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from my perspective. @yozhao101: Can you please review again to make sure all your concerns have been addressed?

@yozhao101
Copy link
Contributor

Looks good from my perspective. @yozhao101: Can you please review again to make sure all your concerns have been addressed?

Thanks, Joe and Renuka ! I am checking ...

except Exception as err:
if log_err:
syslog.syslog(syslog.LOG_ERR, "{} - failed: return code - {}, output:\n{}"
.format(err.cmd, err.returncode, err.output))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we only have output from child process if it was captured by run() or check_output(). Otherwise, None. Please see: https://docs.python.org/3/library/subprocess.html#subprocess.CalledProcessError.output

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue is relevant for existing code as well - https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-host-services/scripts/hostcfgd#L66. This PR has no intend to fix this issue.

@renukamanavalan
Copy link
Contributor

@yozhao101 can you please resolve/provide your reviews and approval ?

@renukamanavalan
Copy link
Contributor

Waiting for build to succeed. @stepanblyschak, if you can, please ping me, when build succeeds.

@stepanblyschak
Copy link
Collaborator Author

tests/hostcfgd/hostcfgd_radius_test.py:10: in <module>
    from parameterized import parameterized
E   ModuleNotFoundError: No module named 'parameterized

Same error appears again - https://dev.azure.com/mssonic/build/_build/results?buildId=19697&view=logs&j=88ce9a53-729c-5fa9-7b6e-3d98f2488e3f&t=8d99be27-49d0-54d0-99b1-cfc0d47f0318&l=527

@liat-grozovik
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@renukamanavalan renukamanavalan merged commit 9ce7c6d into sonic-net:master Jun 29, 2021
carl-nokia pushed a commit to carl-nokia/sonic-buildimage that referenced this pull request Aug 7, 2021
Before this change, a process running inside every SONiC container dealt with FEATURE table 'auto_restart' field and depending on the value decided whether a container has to be killed or not.
If killed service auto restart mechanism restarts the container.
This change moves the logic from container to the host daemon - hostcfgd.
The 'auto_restart' handling is kept in supervisor-proc-exit-listener but now it is not required for container that wants to support auto restart feature.

hostcfgd refactoring - move feature handling in another class.
override systemd service Restart= setting from hostcfgd.
remove default systemd Restart=always.
Signed-off-by: Stepan Blyshchak stepanb@nvidia.com

- Why I did it

Remove the need to deal with container orchestration logic from the container itself. Leave this logic to the orchestrator - host OS.

- How I did it

hostcfgd configures 'Restart=' value for systemd service.

- How to verify it

root@r-tigon-11:/home/admin# sudo config feature autorestart lldp enabled
root@r-tigon-11:/home/admin# show feature status | grep lldp
lldp            enabled   enabled
root@r-tigon-11:/home/admin# docker exec -it lldp pkill -9 lldpd
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Exited (0) 20 seconds ago                       lldp
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Up 5 seconds                            lldp
root@r-tigon-11:/home/admin# sudo config feature autorestart lldp disabled
root@r-tigon-11:/home/admin# docker exec -it lldp pkill -9 lldpd
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Up 35 seconds                           lldp
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Exited (0) 3 seconds ago                       lldp
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Exited (0) 39 seconds ago                       lldp
root@r-tigon-11:/home/admin#
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
appext Application Extension
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants