Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[supervisord] Monitoring the critical processes with supervisord. #6242

Merged
merged 21 commits into from
Jan 21, 2021
Merged

[supervisord] Monitoring the critical processes with supervisord. #6242

merged 21 commits into from
Jan 21, 2021

Conversation

yozhao101
Copy link
Contributor

@yozhao101 yozhao101 commented Dec 17, 2020

- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.

Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.

- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:

  1. The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.

  2. If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.

- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.

Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.

- Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • [x ] 202006

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

of Monit.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
"""
while True:
time.sleep(60)
if not is_process_running(process_name):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this process is not running, then you are trapped in this loop, you never get it out.

Copy link
Contributor Author

@yozhao101 yozhao101 Dec 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a critical process was not running, event listener need periodically (every 1 minute) write an alerting message into syslog like Monit, right?

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Copy link
Collaborator

@lguohan lguohan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use one event listener.

@lguohan
Copy link
Collaborator

lguohan commented Jan 13, 2021

any update?

'readline()'.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@yozhao101
Copy link
Contributor Author

any update?

Updated. Please help me review this PR.

files/scripts/supervisor-proc-exit-listener Outdated Show resolved Hide resolved
files/scripts/supervisor-proc-exit-listener Outdated Show resolved Hide resolved
files/scripts/supervisor-proc-exit-listener Outdated Show resolved Hide resolved
files/scripts/supervisor-proc-exit-listener Outdated Show resolved Hide resolved
files/scripts/supervisor-proc-exit-listener Outdated Show resolved Hide resolved
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
alerts.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
files/scripts/supervisor-proc-exit-listener Outdated Show resolved Hide resolved
files/scripts/supervisor-proc-exit-listener Outdated Show resolved Hide resolved
files/scripts/supervisor-proc-exit-listener Outdated Show resolved Hide resolved
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
be set to "READY" state.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@lguohan
Copy link
Collaborator

lguohan commented Jan 19, 2021

now, code is much better readable now. please remove is_running_process() and subscribe proc running state, then I can sign-off this pr.

@yozhao101
Copy link
Contributor Author

now, code is much better readable now. please remove is_running_process() and subscribe proc running state, then I can sign-off this pr.

I think this suggestion is great and also clean. Will update the PR.

Comment on lines 115 to 120
if not namespace_prefix or not namespace_id:
syslog.syslog(syslog.LOG_ERR, "Process '{}' is not running in 'host'.".format(process_name))
else:
namespace = namespace_prefix + namespace_id
syslog.syslog(
syslog.LOG_ERR, "Process '{}' is not running in namespace '{}'.".format(process_name, namespace))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest simplifying as follows:

    if not namespace_prefix or not namespace_id:
        namespace = "host"
    else:
        namespace = namespace_prefix + namespace_id

    syslog.syslog(syslog.LOG_ERR, "Process '{}' is not running in namespace '{}'.".format(process_name, namespace))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

"RUNNING", the event listener will receive "PROCESS_STATE_RUNNING" event
and not alert anymore.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@lguohan
Copy link
Collaborator

lguohan commented Jan 20, 2021

please resolve the conflict.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
process_under_alerting[process_name] = time.time()

# Handle the PROCESS_STATE_RUNNING event
if headers['eventname'] == 'PROCESS_STATE_RUNNING':
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> else if

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@lguohan
Copy link
Collaborator

lguohan commented Jan 20, 2021

i think several containers are missing. can you make sure you have changes all container file?

configuration files and fix if...elif... in event listener.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@yozhao101
Copy link
Contributor Author

i think several containers are missing. can you make sure you have changes all container file?

Fixed.

@lguohan
Copy link
Collaborator

lguohan commented Jan 20, 2021

there are still containers missing? can you do a complete check??!!!

configuration file.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@yozhao101
Copy link
Contributor Author

there are still containers missing? can you do a complete check??!!!

Can you help me review again please?

@jleveque
Copy link
Contributor

there are still containers missing? can you do a complete check??!!!

Can you help me review again please?

It appears you have now covered all containers which are currently configured to use supervisor-proc-exit-listener. However, there are a handful of containers which are not using supervisor-proc-exit-listener:

dockers/docker-iccpd
dockers/docker-macsec
dockers/docker-sonic-mgmt-framework
platform/barefoot/docker-saiserver-bfn
platform/broadcom/docker-saiserver-brcm
platform/cavium/docker-saiserver-cavm
platform/innovium/docker-syncd-invm
platform/mellanox/docker-saiserver-mlnx

@lguohan: Should we add supervisor-proc-exit-listener to these containers along with (empty?) critical_processes files as part of this PR, or leave them for now and ask the appropriate owners to add it?

@lguohan
Copy link
Collaborator

lguohan commented Jan 21, 2021

retest vs please

@yozhao101
Copy link
Contributor Author

Retest vs please

@yozhao101
Copy link
Contributor Author

Retest vsimage please

@yozhao101
Copy link
Contributor Author

Retest broadcom please

@yozhao101
Copy link
Contributor Author

Retest mellanox please

@lguohan
Copy link
Collaborator

lguohan commented Jan 21, 2021

@jleveque can you check to see if you approve this?

@yozhao101 yozhao101 merged commit be3c036 into sonic-net:master Jan 21, 2021
stepanblyschak added a commit to stepanblyschak/sonic-buildimage that referenced this pull request Jan 22, 2021
lguohan pushed a commit that referenced this pull request Jan 28, 2021
)

- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.

Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.

- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:

The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.

If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.

- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.

Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.

- Which release branch to backport (provide reason below if selected)

 201811
 201911
[x ] 202006
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants