[supervisord] Monitoring the critical processes with supervisord. #6242

Merged 21 commits on Jan 21, 2021
Changes from 5 commits
Commits
abb9c3e
[supervisord] Monitoring the critical processes with supervisord instead
yozhao101 Dec 17, 2020
f42f034
[Supervisord] Delete an unuseful variable.
yozhao101 Dec 17, 2020
97101bb
[Supervisord] Delete an extra empty line.
yozhao101 Dec 17, 2020
83e7545
[supervisord] Use python3 instead of python.
yozhao101 Dec 17, 2020
e8823ac
[supervisord] Delete extra space and period.
yozhao101 Dec 18, 2020
c7fc86f
[Supervisord] Use non-blocking method 'select(...)' instead of
yozhao101 Jan 13, 2021
8b42bb2
[Supervisord] Fix the existing typo.
yozhao101 Jan 13, 2021
e999c62
[Supervisord] Use python3.
yozhao101 Jan 13, 2021
d51dfd6
[Supervisord] Combine the excepts together.
yozhao101 Jan 13, 2021
52ae7f2
[Supervisord] Remove trailing space and fix the alignment.
yozhao101 Jan 14, 2021
88257dd
[Supervisord] Change the variable name.
yozhao101 Jan 14, 2021
7483821
[Supervisord] Reorganize the logic to process the event and log the
yozhao101 Jan 14, 2021
fb0c79c
[Supervisord] Reorganize the comments.
yozhao101 Jan 14, 2021
630d60f
[Supervisord] Fix the comment about the status transition.
yozhao101 Jan 14, 2021
1f9bb83
[Supervisord] Use a flag to indicate whether the event listener should
yozhao101 Jan 14, 2021
24c0278
[Supervisord]
yozhao101 Jan 18, 2021
ba7f5ab
[Supervisord] If the status of a process was changed from "EXITED" to
yozhao101 Jan 19, 2021
23358dd
Merge branch 'master' into monitoring_critical_processes
yozhao101 Jan 20, 2021
9bbb629
[Supervisord] Remove the outlier parenthesis at line 136.
yozhao101 Jan 20, 2021
f54bf9c
[Supervisord] Add the event "PROCESS_STATE_RUNNING" in Superviord
yozhao101 Jan 20, 2021
3c4484a
[Supervisord] Add the "PROCESS_STATE_RUNNING" in dhcp-relay
yozhao101 Jan 20, 2021
2 changes: 2 additions & 0 deletions dockers/docker-fpm-frr/frr/supervisord/supervisord.conf.j2
@@ -15,6 +15,8 @@ buffer_size=50
[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name bgp
events=PROCESS_STATE_EXITED
numprocs=5
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=unexpected
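
A note on the two added lines: when `numprocs` is greater than 1, supervisord requires the `process_name` expression to include `%(process_num)s` so that every listener instance gets a unique name. The sketch below only imitates that expansion with Python's `%` formatting. `expand_process_names` is a hypothetical helper, not part of this PR or of supervisor, and numbering from 0 assumes the default `numprocs_start`.

```python
def expand_process_names(program_name, numprocs,
                         template="%(program_name)s_%(process_num)02d",
                         start=0):
    """Imitate how the process_name template yields one name per instance.

    'start' stands in for supervisord's numprocs_start (assumed default 0).
    """
    return [
        template % {"program_name": program_name, "process_num": num}
        for num in range(start, start + numprocs)
    ]


# Five listener instances, as configured for the bgp container above.
names = expand_process_names("supervisor-proc-exit-listener", 5)
for name in names:
    print(name)
```

With several listener instances in the pool, one listener that is busy with an exited process does not prevent supervisord from dispatching later events to another instance.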

2 changes: 2 additions & 0 deletions dockers/docker-lldp/supervisord.conf.j2
@@ -15,6 +15,8 @@ buffer_size=25
[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name lldp
events=PROCESS_STATE_EXITED
numprocs=3
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=unexpected

2 changes: 2 additions & 0 deletions dockers/docker-nat/supervisord.conf
@@ -15,6 +15,8 @@ buffer_size=25
[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name nat
events=PROCESS_STATE_EXITED
numprocs=2
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=unexpected

2 changes: 2 additions & 0 deletions dockers/docker-orchagent/supervisord.conf
@@ -15,6 +15,8 @@ buffer_size=100
[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name swss
events=PROCESS_STATE_EXITED
numprocs=11
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=unexpected

@@ -15,6 +15,8 @@ buffer_size=100
[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name pmon
events=PROCESS_STATE_EXITED
numprocs=3
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=unexpected

2 changes: 2 additions & 0 deletions dockers/docker-snmp/supervisord.conf
@@ -15,6 +15,8 @@ buffer_size=50
[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name snmp
events=PROCESS_STATE_EXITED
numprocs=3
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=unexpected

2 changes: 2 additions & 0 deletions dockers/docker-sonic-telemetry/supervisord.conf
@@ -15,6 +15,8 @@ buffer_size=50
[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name telemetry
events=PROCESS_STATE_EXITED
numprocs=2
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=false

153 changes: 122 additions & 31 deletions files/scripts/supervisor-proc-exit-listener
@@ -5,8 +5,11 @@ import os
import signal
import sys
import syslog
import subprocess
import time

import swsssdk

from supervisor import childutils

# Each line of this file should specify either one critical process or one
@@ -20,10 +23,12 @@ CRITICAL_PROCESSES_FILE = '/etc/supervisor/critical_processes'
# The FEATURE table in config db contains auto-restart field
FEATURE_TABLE_NAME = 'FEATURE'

# Read the critical processes/group names from CRITICAL_PROCESSES_FILE


def get_critical_group_and_process_list():
    """
    @summary: Read the critical processes/group names from CRITICAL_PROCESSES_FILE.
    @return: Two lists which contain critical processes and group names respectively.
    """
    critical_group_list = []
    critical_process_list = []

@@ -49,6 +54,106 @@ def get_critical_group_and_process_list():
    return critical_group_list, critical_process_list


def get_command_result(command):
    """
    @summary: Execute the command and return result.
    @return: A string which contains the execution result.
    """
    command_stdout = ""

    try:
        proc_instance = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                                         shell=True, universal_newlines=True)
        command_stdout, command_stderr = proc_instance.communicate()
        if proc_instance.returncode != 0:
            syslog.syslog(syslog.LOG_ERR, "Failed to execute the command '{}'. Return code: '{}'".format(
                command, proc_instance.returncode))
            sys.exit(7)
    except OSError as os_err:
        syslog.syslog(syslog.LOG_ERR, "Failed to execute the command '{}'. Error: '{}'".format(command, os_err))
        sys.exit(8)
    except ValueError as val_err:
        syslog.syslog(syslog.LOG_ERR, "Failed to execute the command '{}'. Error: '{}'".format(command, val_err))
        sys.exit(9)

    return command_stdout
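
For comparison, here is a minimal sketch of the same idea without `shell=True` and with the two `except` clauses combined (a later commit in this PR is titled "Combine the excepts together", but this is not that commit's code). `run_command` is a hypothetical name, and returning the error to the caller instead of calling `sys.exit` is an assumption made so the example can be run on its own.

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as an argument list.

    Returns (return_code, stdout). If the command cannot be started,
    returns (None, error_text) instead of exiting the process.
    """
    try:
        completed = subprocess.run(args, stdout=subprocess.PIPE,
                                   stderr=subprocess.PIPE,
                                   universal_newlines=True)
    except (OSError, ValueError) as err:
        # OSError covers a missing executable; ValueError covers bad arguments.
        return None, str(err)
    return completed.returncode, completed.stdout


# Use the current interpreter as a command that is known to exist.
code, out = run_command([sys.executable, "-c", "print('ok')"])
missing_code, missing_err = run_command(["/nonexistent-command-for-this-sketch"])
```

Passing an argument list avoids shell parsing of the command string; whether that matters for a fixed command such as `supervisorctl status` is a judgment call.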


def is_process_running(process_name):
    """
    @summary: Determine whether a critical process was running or not.
    @return: Return 'True' if process was running. Otherwise return 'False'.
    """
    supervisorctl_status_command = "supervisorctl status"
    command_stdout = ""
    is_running = False

    command_stdout = get_command_result(supervisorctl_status_command)
    for line in command_stdout.split("\n"):
        if process_name in line:
            status = line.split()[1].strip()
            if status == "RUNNING":
                is_running = True
                break

    return is_running
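
One thing worth noting about the lookup above: `process_name in line` is a substring test, so a name such as `zebra` would also match the status line of any other process whose name contains it. The sketch below compares the first column exactly instead. `is_running_in_status` is a hypothetical helper, `zebra_helper` is an invented process name, and the sample text is made up to resemble `supervisorctl status` output; none of it is taken from this PR.

```python
def is_running_in_status(status_output, process_name):
    """Return True only if the exact process name is listed as RUNNING."""
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, state = fields[0], fields[1]
        # supervisorctl prints grouped processes as "group:process".
        if name == process_name or name.split(":")[-1] == process_name:
            return state == "RUNNING"
    return False


# Made-up sample output for illustration only.
SAMPLE_STATUS = """\
zebra_helper                     RUNNING   pid 57, uptime 0:10:02
zebra                            EXITED    Jan 13 08:01 AM
"""

print(is_running_in_status(SAMPLE_STATUS, "zebra"))
print(is_running_in_status(SAMPLE_STATUS, "zebra_helper"))
```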


def generate_alerting_message(process_name):
    """
    @summary: If a critical process was not running, this function will determine whether it was
              running in host or in a specific namespace. Then an alerting message will be
              written into syslog.
    """
    namespace_prefix = os.environ.get("NAMESPACE_PREFIX")
    namespace_id = os.environ.get("NAMESPACE_ID")

    if not namespace_prefix or not namespace_id:
        syslog.syslog(syslog.LOG_ERR, "Process '{}' is not running in 'host'.".format(process_name))
    else:
        namespace = namespace_prefix + namespace_id
        syslog.syslog(
            syslog.LOG_ERR, "Process '{}' is not running in namespace '{}'.".format(process_name, namespace))

Contributor

Suggest simplifying as follows:

    if not namespace_prefix or not namespace_id:
        namespace = "host"
    else:
        namespace = namespace_prefix + namespace_id

    syslog.syslog(syslog.LOG_ERR, "Process '{}' is not running in namespace '{}'.".format(process_name, namespace))

Contributor Author

Updated.
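
The suggestion above can be taken one small step further by building the message separately from logging it, which makes the namespace logic easy to check in isolation. This is only a sketch: `build_alert_message` and its `environ` parameter are hypothetical and do not appear in the PR.

```python
import os


def build_alert_message(process_name, environ=os.environ):
    """Build the alert text; fall back to 'host' when no namespace is set."""
    namespace_prefix = environ.get("NAMESPACE_PREFIX")
    namespace_id = environ.get("NAMESPACE_ID")

    if not namespace_prefix or not namespace_id:
        namespace = "host"
    else:
        namespace = namespace_prefix + namespace_id

    return "Process '{}' is not running in namespace '{}'.".format(process_name, namespace)


# Explicit dictionaries stand in for the container's environment.
host_msg = build_alert_message("orchagent", {})
asic_msg = build_alert_message(
    "orchagent", {"NAMESPACE_PREFIX": "asic", "NAMESPACE_ID": "0"})
```

The caller would then pass the returned string to `syslog.syslog(syslog.LOG_ERR, ...)` as in the original function.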



def monitoring_process_and_alerting(process_name):
    """
    @summary: This function will determine whether a critical process was running or not every minute.
              If it was not running, then an alerting message will be written into syslog. Otherwise,
              this function will exit.
    """
    while True:
        time.sleep(60)
        if not is_process_running(process_name):

Collaborator

If this process is not running, you are trapped in this loop and never get out of it.

Contributor Author
yozhao101 Dec 18, 2020

If a critical process is not running, the event listener needs to write an alerting message to syslog periodically (every minute), as Monit does, right?

            generate_alerting_message(process_name)
        else:
            break
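
The concern raised in this thread is that a listener stuck in `time.sleep(60)` cannot read new events. A later commit in this PR is titled "Use non-blocking method 'select(...)'"; the sketch below is not that commit's code, only an illustration of the general technique of waiting on input with a timeout. `has_pending_input` is a hypothetical helper, the pipe stands in for the listener's stdin, and the example assumes a POSIX system where `select` accepts pipe descriptors.

```python
import os
import select


def has_pending_input(fd, timeout_secs):
    """Return True if fd becomes readable within timeout_secs, else False."""
    readable, _, _ = select.select([fd], [], [], timeout_secs)
    return bool(readable)


# A pipe plays the role of the event stream coming from supervisord.
read_fd, write_fd = os.pipe()

idle = has_pending_input(read_fd, 0.1)    # nothing written yet: times out
os.write(write_fd, b"event header\n")
ready = has_pending_input(read_fd, 0.1)   # data is waiting: returns at once

os.close(read_fd)
os.close(write_fd)
```

With a wait like this, a listener could use the timeout as its periodic alerting tick and still handle a new event as soon as one arrives.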


def get_autorestart_state(container_name):
    """
    @summary: Read the status of auto-restart feature from Config_DB.
    @return: Return the status of auto-restart feature.
    """
    config_db = swsssdk.ConfigDBConnector()
    config_db.connect()
    features_table = config_db.get_table(FEATURE_TABLE_NAME)
    if not features_table:
        syslog.syslog(syslog.LOG_ERR, "Unable to retrieve features table from Config DB. Exiting...")
        sys.exit(2)

    if container_name not in features_table:
        syslog.syslog(syslog.LOG_ERR, "Unable to retrieve feature '{}'. Exiting...".format(container_name))
        sys.exit(3)

    restart_feature = features_table[container_name].get('auto_restart')
    if not restart_feature:
        syslog.syslog(
            syslog.LOG_ERR, "Unable to determine auto-restart feature status for '{}'. Exiting...".format(container_name))
        sys.exit(4)

    return restart_feature


def main(argv):
    container_name = None
    opts, args = getopt.getopt(argv, "c:", ["container-name="])
@@ -70,9 +175,6 @@ def main(argv):
        headers = childutils.get_headers(line)
        payload = sys.stdin.read(int(headers['len']))

        # Transition from READY to ACKNOWLEDGED
        childutils.listener.ok()

        # We only care about PROCESS_STATE_EXITED events
        if headers['eventname'] == 'PROCESS_STATE_EXITED':
            payload_headers, payload_data = childutils.eventdata(payload + '\n')
@@ -81,32 +183,21 @@ def main(argv):
            processname = payload_headers['processname']
            groupname = payload_headers['groupname']

            # Read the status of auto-restart feature from Config_DB.
            config_db = swsssdk.ConfigDBConnector()
            config_db.connect()
            features_table = config_db.get_table(FEATURE_TABLE_NAME)
            if not features_table:
                syslog.syslog(syslog.LOG_ERR, "Unable to retrieve features table from Config DB. Exiting...")
                sys.exit(2)

            if container_name not in features_table:
                syslog.syslog(syslog.LOG_ERR, "Unable to retrieve feature '{}'. Exiting...".format(container_name))
                sys.exit(3)

            restart_feature = features_table[container_name].get('auto_restart')
            if not restart_feature:
                syslog.syslog(
                    syslog.LOG_ERR, "Unable to determine auto-restart feature status for '{}'. Exiting...".format(container_name))
                sys.exit(4)

            # If auto-restart feature is not disabled and at the same time
            # a critical process exited unexpectedly, terminate supervisor
            if (restart_feature != 'disabled' and expected == 0 and
                    (processname in critical_process_list or groupname in critical_group_list)):
                MSG_FORMAT_STR = "Process {} exited unexpectedly. Terminating supervisor..."
                msg = MSG_FORMAT_STR.format(payload_headers['processname'])
                syslog.syslog(syslog.LOG_INFO, msg)
                os.kill(os.getppid(), signal.SIGTERM)
            if ((processname in critical_process_list or groupname in critical_group_list)
                    and expected == 0):
                if container_name != "database":
                    restart_feature = get_autorestart_state(container_name)

                if container_name == "database" or restart_feature != "disabled":
                    MSG_FORMAT_STR = "Process '{}' exited unexpectedly. Terminating supervisor '{}'"
                    msg = MSG_FORMAT_STR.format(payload_headers['processname'], container_name)
                    syslog.syslog(syslog.LOG_INFO, msg)
                    os.kill(os.getppid(), signal.SIGTERM)
                else:
                    monitoring_process_and_alerting(processname)

        # Transition from READY to ACKNOWLEDGED
        childutils.listener.ok()
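
The branching introduced in this hunk can be summarized as a small pure function, which may make the three outcomes easier to review. This is a sketch under assumptions: `decide_action` and the returned action labels are hypothetical names, and the Config DB lookup is replaced by a plain `auto_restart_state` argument.

```python
def decide_action(container_name, is_critical, expected, auto_restart_state):
    """Mirror the decision made for a PROCESS_STATE_EXITED event.

    is_critical: the process or its group is listed in critical_processes.
    expected: the 'expected' field of the event payload (0 means unexpected exit).
    auto_restart_state: value of 'auto_restart' for the container, or None.
    """
    if not is_critical or expected != 0:
        return "ignore"
    # The database container is handled without consulting Config DB.
    if container_name == "database" or auto_restart_state != "disabled":
        return "terminate_supervisor"
    return "monitor_and_alert"


print(decide_action("swss", True, 0, "enabled"))
print(decide_action("swss", True, 0, "disabled"))
print(decide_action("database", True, 0, None))
print(decide_action("swss", False, 0, "enabled"))
```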


if __name__ == "__main__":
2 changes: 2 additions & 0 deletions platform/broadcom/docker-syncd-brcm/supervisord.conf
@@ -15,6 +15,8 @@ buffer_size=25
[eventlistener:supervisor-proc-exit-listener]
command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED
numprocs=2
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=unexpected

2 changes: 2 additions & 0 deletions platform/nephos/docker-syncd-nephos/supervisord.conf
@@ -15,6 +15,8 @@ buffer_size=25
[eventlistener:supervisor-proc-exit-listener]
command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED
numprocs=2
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=unexpected
