[supervisord] Monitoring the critical processes with supervisord. #6242

Merged 21 commits on Jan 21, 2021
Changes from 12 commits

Commits
abb9c3e
[supervisord] Monitoring the critical processes with supervisord instead
yozhao101 Dec 17, 2020
f42f034
[Supervisord] Delete an unuseful variable.
yozhao101 Dec 17, 2020
97101bb
[Supervisord] Delete an extra empty line.
yozhao101 Dec 17, 2020
83e7545
[supervisord] Use python3 instead of python.
yozhao101 Dec 17, 2020
e8823ac
[supervisord] Delete extra space and period.
yozhao101 Dec 18, 2020
c7fc86f
[Supervisord] Use non-blocking method 'select(...)' instead of
yozhao101 Jan 13, 2021
8b42bb2
[Supervisord] Fix the existing typo.
yozhao101 Jan 13, 2021
e999c62
[Supervisord] Use python3.
yozhao101 Jan 13, 2021
d51dfd6
[Supervisord] Combine the excepts together.
yozhao101 Jan 13, 2021
52ae7f2
[Supervisord] Remove trailing space and fix the alignment.
yozhao101 Jan 14, 2021
88257dd
[Supervisord] Change the variable name.
yozhao101 Jan 14, 2021
7483821
[Supervisord] Reorganize the logic to process the event and log the
yozhao101 Jan 14, 2021
fb0c79c
[Supervisord] Reorganize the comments.
yozhao101 Jan 14, 2021
630d60f
[Supervisord] Fix the comment about the status transition.
yozhao101 Jan 14, 2021
1f9bb83
[Supervisord] Use a flag to indicate whether the event listener should
yozhao101 Jan 14, 2021
24c0278
[Supervisord]
yozhao101 Jan 18, 2021
ba7f5ab
[Supervisord] If the status of a process was changed from "EXITED" to
yozhao101 Jan 19, 2021
23358dd
Merge branch 'master' into monitoring_critical_processes
yozhao101 Jan 20, 2021
9bbb629
[Supervisord] Remove the outlier parenthesis at line 136.
yozhao101 Jan 20, 2021
f54bf9c
[Supervisord] Add the event "PROCESS_STATE_RUNNING" in Superviord
yozhao101 Jan 20, 2021
3c4484a
[Supervisord] Add the "PROCESS_STATE_RUNNING" in dhcp-relay
yozhao101 Jan 20, 2021
186 changes: 141 additions & 45 deletions files/scripts/supervisor-proc-exit-listener
@@ -2,11 +2,14 @@

import getopt
import os
import select
import signal
import subprocess
import sys
import syslog
import time

import swsssdk

from supervisor import childutils

# Each line of this file should specify either one critical process or one
@@ -20,10 +23,18 @@ CRITICAL_PROCESSES_FILE = '/etc/supervisor/critical_processes'
# The FEATURE table in config db contains auto-restart field
FEATURE_TABLE_NAME = 'FEATURE'

# Read the critical processes/group names from CRITICAL_PROCESSES_FILE
# Value of parameter 'timeout' in select(...) method
SELECT_TIMEOUT_SECS = 1.0

# An alerting message is written to syslog at this interval (in seconds)
ALERTING_INTERVAL_SECS = 60


def get_critical_group_and_process_list():
"""
@summary: Read the critical processes/group names from CRITICAL_PROCESSES_FILE.
@return: Two lists which contain critical processes and group names respectively.
"""
critical_group_list = []
critical_process_list = []

@@ -49,6 +60,90 @@ def get_critical_group_and_process_list():
return critical_group_list, critical_process_list


def get_command_result(command):
"""
@summary: Execute the command and return result.
@return: A string which contains the execution result.
"""
command_stdout = ""

try:
proc_instance = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
shell=True, universal_newlines=True)
command_stdout, command_stderr = proc_instance.communicate()
if proc_instance.returncode != 0:
syslog.syslog(syslog.LOG_ERR, "Failed to execute the command '{}'. Return code: '{}'".format(
command, proc_instance.returncode))
sys.exit(7)
except (OSError, ValueError) as err:
syslog.syslog(syslog.LOG_ERR, "Failed to execute the command '{}'. Error: '{}'".format(command, err))
sys.exit(8)

return command_stdout


def process_running(process_name):
"""
@summary: Determine whether a critical process is running.
@return: Return 'True' if the process is running. Otherwise return 'False'.
"""
supervisorctl_status_command = "supervisorctl status"
command_stdout = ""
is_running = False

command_stdout = get_command_result(supervisorctl_status_command)

for line in command_stdout.split("\n"):
if process_name in line:
status = line.split()[1].strip()
if status == "RUNNING":
is_running = True
break

return is_running


def generate_alerting_message(process_name):
"""
@summary: If a critical process is not running, determine whether it belongs to the host
or to a specific namespace, then write an alerting message into syslog.
"""
namespace_prefix = os.environ.get("NAMESPACE_PREFIX")
namespace_id = os.environ.get("NAMESPACE_ID")

if not namespace_prefix or not namespace_id:
syslog.syslog(syslog.LOG_ERR, "Process '{}' is not running in 'host'.".format(process_name))
else:
namespace = namespace_prefix + namespace_id
syslog.syslog(
syslog.LOG_ERR, "Process '{}' is not running in namespace '{}'.".format(process_name, namespace))
Contributor:

Suggest simplifying as follows:

    if not namespace_prefix or not namespace_id:
        namespace = "host"
    else:
        namespace = namespace_prefix + namespace_id

    syslog.syslog(syslog.LOG_ERR, "Process '{}' is not running in namespace '{}'.".format(process_name, namespace))

Contributor Author:

Updated.
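
The simplification suggested above can be sketched as a small pure function, which also makes the namespace logic easy to test. This is only an illustration, not the code that was merged: the helper names `resolve_namespace` and `format_alert` and the optional `environ` parameter are assumptions introduced here.

```python
import os


def resolve_namespace(environ=None):
    """Return 'host' unless both namespace variables are set and non-empty.

    `environ` is a hypothetical parameter added for testability; it defaults
    to the real process environment.
    """
    env = os.environ if environ is None else environ
    namespace_prefix = env.get("NAMESPACE_PREFIX")
    namespace_id = env.get("NAMESPACE_ID")

    if not namespace_prefix or not namespace_id:
        return "host"
    return namespace_prefix + namespace_id


def format_alert(process_name, environ=None):
    """Build the single alerting message used for both host and namespace cases."""
    return "Process '{}' is not running in namespace '{}'.".format(
        process_name, resolve_namespace(environ))
```

With this shape, the caller only needs one `syslog.syslog(syslog.LOG_ERR, format_alert(process_name))` call instead of two branches.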



def get_autorestart_state(container_name):
"""
@summary: Read the status of auto-restart feature from Config_DB.
@return: Return the status of auto-restart feature.
"""
config_db = swsssdk.ConfigDBConnector()
config_db.connect()
features_table = config_db.get_table(FEATURE_TABLE_NAME)
if not features_table:
syslog.syslog(syslog.LOG_ERR, "Unable to retrieve features table from Config DB. Exiting...")
sys.exit(2)

if container_name not in features_table:
syslog.syslog(syslog.LOG_ERR, "Unable to retrieve feature '{}'. Exiting...".format(container_name))
sys.exit(3)

restart_feature = features_table[container_name].get('auto_restart')
if not restart_feature:
syslog.syslog(
syslog.LOG_ERR, "Unable to determine auto-restart feature status for '{}'. Exiting...".format(container_name))
sys.exit(4)

return restart_feature


def main(argv):
container_name = None
opts, args = getopt.getopt(argv, "c:", ["container-name="])
@@ -62,51 +157,52 @@ def main(argv):

critical_group_list, critical_process_list = get_critical_group_and_process_list()

process_under_alerting = {}
# Transition from ACKNOWLEDGED to READY
childutils.listener.ready()

while True:
# Transition from ACKNOWLEDGED to READY
childutils.listener.ready()

line = sys.stdin.readline()
headers = childutils.get_headers(line)
payload = sys.stdin.read(int(headers['len']))

# Transition from READY to ACKNOWLEDGED
childutils.listener.ok()

# We only care about PROCESS_STATE_EXITED events
if headers['eventname'] == 'PROCESS_STATE_EXITED':
payload_headers, payload_data = childutils.eventdata(payload + '\n')

expected = int(payload_headers['expected'])
processname = payload_headers['processname']
groupname = payload_headers['groupname']

# Read the status of auto-restart feature from Config_DB.
config_db = swsssdk.ConfigDBConnector()
config_db.connect()
features_table = config_db.get_table(FEATURE_TABLE_NAME)
if not features_table:
syslog.syslog(syslog.LOG_ERR, "Unable to retrieve features table from Config DB. Exiting...")
sys.exit(2)

if container_name not in features_table:
syslog.syslog(syslog.LOG_ERR, "Unable to retrieve feature '{}'. Exiting...".format(container_name))
sys.exit(3)

restart_feature = features_table[container_name].get('auto_restart')
if not restart_feature:
syslog.syslog(
syslog.LOG_ERR, "Unable to determine auto-restart feature status for '{}'. Exiting...".format(container_name))
sys.exit(4)

# If auto-restart feature is not disabled and at the same time
# a critical process exited unexpectedly, terminate supervisor
if (restart_feature != 'disabled' and expected == 0 and
(processname in critical_process_list or groupname in critical_group_list)):
MSG_FORMAT_STR = "Process {} exited unxepectedly. Terminating supervisor..."
msg = MSG_FORMAT_STR.format(payload_headers['processname'])
syslog.syslog(syslog.LOG_INFO, msg)
os.kill(os.getppid(), signal.SIGTERM)
file_descriptor_list = select.select([sys.stdin], [], [], SELECT_TIMEOUT_SECS)[0]
if len(file_descriptor_list) != 0:
line = file_descriptor_list[0].readline()
headers = childutils.get_headers(line)
payload = sys.stdin.read(int(headers['len']))

# We only care about PROCESS_STATE_EXITED events
if headers['eventname'] == 'PROCESS_STATE_EXITED':
payload_headers, payload_data = childutils.eventdata(payload + '\n')

expected = int(payload_headers['expected'])
processname = payload_headers['processname']
groupname = payload_headers['groupname']

if ((processname in critical_process_list or groupname in critical_group_list)
and expected == 0):
if container_name != "database":
restart_feature = get_autorestart_state(container_name)

if container_name == "database" or restart_feature != "disabled":
MSG_FORMAT_STR = "Process '{}' exited unexpectedly. Terminating supervisor '{}'"
msg = MSG_FORMAT_STR.format(payload_headers['processname'], container_name)
syslog.syslog(syslog.LOG_INFO, msg)
os.kill(os.getppid(), signal.SIGTERM)
else:
process_under_alerting[processname] = time.time()

# Transition from READY to ACKNOWLEDGED
childutils.listener.ok()

# Transition from ACKNOWLEDGED to READY
childutils.listener.ready()

# Check whether we need to write alerting messages into syslog.
# Iterate over a copy of the keys because entries may be popped inside the loop.
for process in list(process_under_alerting.keys()):
epoch_time = time.time()
if process_running(process):
Collaborator:

This is unnecessary. If the process gets started, you will receive a message from the proc listener, so why do you need to use supervisorctl to check?

Contributor Author:

Since this event listener is only interested in the event PROCESS_STATE_EXITED (see line 176), it will not receive the notification from Supervisord about the event PROCESS_STATE_RUNNING, which is emitted when the status of a process changes, for example, from EXITED to RUNNING. This is why I used the command supervisorctl status to check the status of each process.
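
To make the filtering described in this comment concrete, here is a minimal sketch of how an event header line can be split into fields and checked against the event names the script cares about. It avoids importing `supervisor` so it runs standalone; `parse_tokens`, `is_wanted`, and the sample header values are assumptions for illustration, and the real script uses `childutils.get_headers(...)` for this step. Which event types are delivered at all is set by the `events=` option of the event listener section in the supervisord configuration.

```python
def parse_tokens(line):
    """Split 'key:value key:value ...' text into a dict of strings."""
    return dict(token.split(":", 1) for token in line.split())


# Event names the listener reacts to (hypothetical constant for this sketch).
WANTED_EVENTS = {"PROCESS_STATE_EXITED"}


def is_wanted(header_line, wanted=WANTED_EVENTS):
    """Return True if the header's eventname is one the listener handles."""
    return parse_tokens(header_line).get("eventname") in wanted


# Illustrative header line; the field values are made up.
sample_header = ("ver:3.0 server:supervisor serial:21 pool:listener "
                 "poolserial:10 eventname:PROCESS_STATE_EXITED len:58")
headers = parse_tokens(sample_header)
```

Adding `"PROCESS_STATE_RUNNING"` to the set of wanted names is the in-script half of the change discussed in the replies that follow.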

Collaborator:

Why not update that? In this case, the proc exit listener should be interested in both the EXITED and RUNNING states. By the way, what are the other states?

Contributor Author:

I will update the PR and remove the functions is_process_running(...) and get_command_result(...). The event types related to process status are PROCESS_STATE_RUNNING, PROCESS_STATE_STARTING, PROCESS_STATE_BACKOFF, PROCESS_STATE_EXITED, PROCESS_STATE_STOPPING, PROCESS_STATE_STOPPED, PROCESS_STATE_UNKNOWN and PROCESS_STATE_FATAL.
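
The event-driven approach agreed on in this thread could look roughly like the sketch below: a PROCESS_STATE_EXITED event puts a critical process under alerting, a PROCESS_STATE_RUNNING event clears it, and no `supervisorctl status` call is needed. This is a sketch under assumptions, not the merged implementation: the function names `handle_event` and `due_alerts` are hypothetical, group-name matching is left out, and the auto-restart branch that terminates supervisor is omitted for brevity.

```python
ALERTING_INTERVAL_SECS = 60  # same value as in the diff


def handle_event(eventname, processname, expected, critical, under_alerting, now):
    """Update the alerting table from one process-state event."""
    if processname not in critical:
        return
    if eventname == "PROCESS_STATE_EXITED" and expected == 0:
        # Unexpected exit: start (or restart) the alerting timer.
        under_alerting[processname] = now
    elif eventname == "PROCESS_STATE_RUNNING":
        # The process is back: stop alerting for it.
        under_alerting.pop(processname, None)


def due_alerts(under_alerting, now, interval=ALERTING_INTERVAL_SECS):
    """Return the processes whose alert is due and reset their timers."""
    due = []
    # Iterate over a snapshot so the table can be updated safely in the loop.
    for name, last_alert in list(under_alerting.items()):
        if now - last_alert >= interval:
            under_alerting[name] = now
            due.append(name)
    return due
```

In the listener loop, `handle_event(...)` would be called for each event read from stdin, and `due_alerts(...)` once per `select(...)` timeout, with one syslog message written per returned name.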

process_under_alerting.pop(process)
elif epoch_time - process_under_alerting[process] >= ALERTING_INTERVAL_SECS:
process_under_alerting[process] = epoch_time
generate_alerting_message(process)


if __name__ == "__main__":