[supervisord] Monitoring the critical processes with supervisord. #6242

Merged
merged 21 commits into from
Jan 21, 2021
abb9c3e
[supervisord] Monitoring the critical processes with supervisord instead
yozhao101 Dec 17, 2020
f42f034
[Supervisord] Delete an unuseful variable.
yozhao101 Dec 17, 2020
97101bb
[Supervisord] Delete an extra empty line.
yozhao101 Dec 17, 2020
83e7545
[supervisord] Use python3 instead of python.
yozhao101 Dec 17, 2020
e8823ac
[supervisord] Delete extra space and period.
yozhao101 Dec 18, 2020
c7fc86f
[Supervisord] Use non-blocking method 'select(...)' instead of
yozhao101 Jan 13, 2021
8b42bb2
[Supervisord] Fix the existing typo.
yozhao101 Jan 13, 2021
e999c62
[Supervisord] Use python3.
yozhao101 Jan 13, 2021
d51dfd6
[Supervisord] Combine the excepts together.
yozhao101 Jan 13, 2021
52ae7f2
[Supervisord] Remove trailing space and fix the alignment.
yozhao101 Jan 14, 2021
88257dd
[Supervisord] Change the variable name.
yozhao101 Jan 14, 2021
7483821
[Supervisord] Reorganize the logic to process the event and log the
yozhao101 Jan 14, 2021
fb0c79c
[Supervisord] Reorganize the comments.
yozhao101 Jan 14, 2021
630d60f
[Supervisord] Fix the comment about the status transition.
yozhao101 Jan 14, 2021
1f9bb83
[Supervisord] Use a flag to indicate whether the event listener should
yozhao101 Jan 14, 2021
24c0278
[Supervisord]
yozhao101 Jan 18, 2021
ba7f5ab
[Supervisord] If the status of a process was changed from "EXITED" to
yozhao101 Jan 19, 2021
23358dd
Merge branch 'master' into monitoring_critical_processes
yozhao101 Jan 20, 2021
9bbb629
[Supervisord] Remove the outlier parenthesis at line 136.
yozhao101 Jan 20, 2021
f54bf9c
[Supervisord] Add the event "PROCESS_STATE_RUNNING" in Superviord
yozhao101 Jan 20, 2021
3c4484a
[Supervisord] Add the "PROCESS_STATE_RUNNING" in dhcp-relay
yozhao101 Jan 20, 2021
2 changes: 1 addition & 1 deletion dockers/docker-database/supervisord.conf.j2
@@ -5,7 +5,7 @@ nodaemon=true

 [eventlistener:supervisor-proc-exit-listener]
 command=/usr/bin/supervisor-proc-exit-listener --container-name database
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

@@ -14,7 +14,7 @@ buffer_size=50

 [eventlistener:supervisor-proc-exit-listener]
 command=/usr/bin/supervisor-proc-exit-listener --container-name dhcp_relay
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

@@ -14,7 +14,7 @@ buffer_size=50

 [eventlistener:supervisor-proc-exit-listener]
 command=/usr/bin/supervisor-proc-exit-listener --container-name bgp
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

2 changes: 1 addition & 1 deletion dockers/docker-fpm-gobgp/supervisord.conf
@@ -5,7 +5,7 @@ nodaemon=true

 [eventlistener:supervisor-proc-exit-listener]
 command=/usr/bin/supervisor-proc-exit-listener --container-name bgp
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

2 changes: 1 addition & 1 deletion dockers/docker-fpm-quagga/supervisord.conf
@@ -5,7 +5,7 @@ nodaemon=true

 [eventlistener:supervisor-proc-exit-listener]
 command=/usr/bin/supervisor-proc-exit-listener --container-name bgp
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

2 changes: 1 addition & 1 deletion dockers/docker-lldp/supervisord.conf.j2
@@ -14,7 +14,7 @@ buffer_size=25

 [eventlistener:supervisor-proc-exit-listener]
 command=/usr/bin/supervisor-proc-exit-listener --container-name lldp
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

2 changes: 1 addition & 1 deletion dockers/docker-nat/supervisord.conf
@@ -14,7 +14,7 @@ buffer_size=25

 [eventlistener:supervisor-proc-exit-listener]
 command=/usr/bin/supervisor-proc-exit-listener --container-name nat
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

2 changes: 1 addition & 1 deletion dockers/docker-orchagent/supervisord.conf
@@ -14,7 +14,7 @@ buffer_size=100

 [eventlistener:supervisor-proc-exit-listener]
 command=/usr/bin/supervisor-proc-exit-listener --container-name swss
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

@@ -14,7 +14,7 @@ buffer_size=100

 [eventlistener:supervisor-proc-exit-listener]
 command=/usr/bin/supervisor-proc-exit-listener --container-name pmon
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

@@ -14,7 +14,7 @@ buffer_size=25

 [eventlistener:supervisor-proc-exit-script]
 command=/usr/bin/supervisor-proc-exit-listener --container-name radv
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

2 changes: 1 addition & 1 deletion dockers/docker-sflow/supervisord.conf
@@ -14,7 +14,7 @@ buffer_size=25

 [eventlistener:supervisor-proc-exit-listener]
 command=/usr/bin/supervisor-proc-exit-listener --container-name sflow
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

2 changes: 1 addition & 1 deletion dockers/docker-snmp/supervisord.conf
@@ -14,7 +14,7 @@ buffer_size=50

 [eventlistener:supervisor-proc-exit-listener]
 command=/usr/bin/supervisor-proc-exit-listener --container-name snmp
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

2 changes: 1 addition & 1 deletion dockers/docker-sonic-restapi/supervisord.conf
@@ -14,7 +14,7 @@ buffer_size=25

 [eventlistener:supervisor-proc-exit-listener]
 command=/usr/bin/supervisor-proc-exit-listener --container-name restapi
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=false

2 changes: 1 addition & 1 deletion dockers/docker-sonic-telemetry/supervisord.conf
@@ -14,7 +14,7 @@ buffer_size=50

 [eventlistener:supervisor-proc-exit-listener]
 command=/usr/bin/supervisor-proc-exit-listener --container-name telemetry
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=false

2 changes: 1 addition & 1 deletion dockers/docker-teamd/supervisord.conf
@@ -14,7 +14,7 @@ buffer_size=50

 [eventlistener:supervisor-proc-exit-listener]
 command=/usr/bin/supervisor-proc-exit-listener --container-name teamd
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

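
The supervisord.conf hunks subscribe the listener to PROCESS_STATE_RUNNING in addition to PROCESS_STATE_EXITED. As a rough illustration of what that means on the listener side, here is a minimal sketch that parses a supervisor event header and payload (both are space-separated `key:value` tokens) and classifies the two event types. It does not use the `supervisor` package; `parse_tokens` and `classify` are hypothetical helper names, not part of this PR.

```python
def parse_tokens(line):
    """Parse space-separated key:value tokens into a dict."""
    return dict(token.split(":", 1) for token in line.split())


def classify(header_line, payload):
    """Return (kind, process_name, expected) for one event.

    `expected` is only present in PROCESS_STATE_EXITED payloads, so it is
    None for every other event type.
    """
    headers = parse_tokens(header_line)
    fields = parse_tokens(payload)
    event = headers["eventname"]

    if event == "PROCESS_STATE_EXITED":
        return ("exited", fields["processname"], int(fields["expected"]))
    if event == "PROCESS_STATE_RUNNING":
        return ("running", fields["processname"], None)
    # Any event the listener did not subscribe to is ignored in this sketch.
    return ("ignored", fields.get("processname"), None)
```

With only PROCESS_STATE_EXITED subscribed, a listener has no signal that a process came back; adding PROCESS_STATE_RUNNING is what makes the "running" branch reachable.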
146 changes: 101 additions & 45 deletions files/scripts/supervisor-proc-exit-listener
@@ -2,11 +2,14 @@

 import getopt
 import os
+import select
 import signal
 import sys
 import syslog
+import time

 import swsssdk

 from supervisor import childutils

 # Each line of this file should specify either one critical process or one
@@ -20,10 +23,18 @@ CRITICAL_PROCESSES_FILE = '/etc/supervisor/critical_processes'
 # The FEATURE table in config db contains auto-restart field
 FEATURE_TABLE_NAME = 'FEATURE'

-# Read the critical processes/group names from CRITICAL_PROCESSES_FILE
+# Value of parameter 'timeout' in select(...) method
+SELECT_TIMEOUT_SECS = 1.0
+
+# Alerting message will be written into syslog in the following interval
+ALERTING_INTERVAL_SECS = 60


 def get_critical_group_and_process_list():
+    """
+    @summary: Read the critical processes/group names from CRITICAL_PROCESSES_FILE.
+    @return: Two lists which contain critical processes and group names respectively.
+    """
     critical_group_list = []
     critical_process_list = []
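
This hunk introduces SELECT_TIMEOUT_SECS as the timeout for `select(...)`. A minimal sketch of that polling pattern follows, using a pipe in place of `sys.stdin` so it can run on its own. `poll_line` and `demo` are hypothetical names, and the sketch assumes a POSIX system, since `select` on pipes is not available on Windows.

```python
import os
import select

SELECT_TIMEOUT_SECS = 1.0  # same value as the constant added in the diff


def poll_line(readable, timeout=SELECT_TIMEOUT_SECS):
    """Return one line from `readable` if it becomes readable within
    `timeout` seconds, otherwise None."""
    ready, _, _ = select.select([readable], [], [], timeout)
    if not ready:
        return None
    return ready[0].readline()


def demo():
    """Poll an empty pipe, then poll again after one line is written."""
    read_fd, write_fd = os.pipe()
    with os.fdopen(read_fd, "r") as reader, os.fdopen(write_fd, "w") as writer:
        before = poll_line(reader, timeout=0.05)  # nothing written yet
        writer.write("eventname:PROCESS_STATE_RUNNING len:0\n")
        writer.flush()
        after = poll_line(reader, timeout=0.05)
    return before, after
```

Because the call returns after the timeout even when no event arrives, the caller gets a regular opportunity to do periodic work, which a blocking `readline()` would not allow.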

@@ -49,6 +60,47 @@ def get_critical_group_and_process_list():
     return critical_group_list, critical_process_list


+def generate_alerting_message(process_name):
+    """
+    @summary: If a critical process was not running, this function will determine it resides in host
+              or in a specific namespace. Then an alerting message will be written into syslog.
+    """
+    namespace_prefix = os.environ.get("NAMESPACE_PREFIX")
+    namespace_id = os.environ.get("NAMESPACE_ID")
+
+    if not namespace_prefix or not namespace_id:
+        namespace = "host"
+    else:
+        namespace = namespace_prefix + namespace_id
+
+    syslog.syslog(syslog.LOG_ERR, "Process '{}' is not running in namespace '{}'.".format(process_name, namespace))
+
+
+def get_autorestart_state(container_name):
+    """
+    @summary: Read the status of auto-restart feature from Config_DB.
+    @return: Return the status of auto-restart feature.
+    """
+    config_db = swsssdk.ConfigDBConnector()
+    config_db.connect()
+    features_table = config_db.get_table(FEATURE_TABLE_NAME)
+    if not features_table:
+        syslog.syslog(syslog.LOG_ERR, "Unable to retrieve features table from Config DB. Exiting...")
+        sys.exit(2)
+
+    if container_name not in features_table:
+        syslog.syslog(syslog.LOG_ERR, "Unable to retrieve feature '{}'. Exiting...".format(container_name))
+        sys.exit(3)
+
+    is_auto_restart = features_table[container_name].get('auto_restart')
+    if not is_auto_restart:
+        syslog.syslog(
+            syslog.LOG_ERR, "Unable to determine auto-restart feature status for '{}'. Exiting...".format(container_name))
+        sys.exit(4)
+
+    return is_auto_restart
+
+
 def main(argv):
     container_name = None
     opts, args = getopt.getopt(argv, "c:", ["container-name="])
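
The namespace rule in `generate_alerting_message` (fall back to "host" unless both NAMESPACE_PREFIX and NAMESPACE_ID are set) can be exercised without syslog. The sketch below takes the environment as a parameter so it is easy to test; `resolve_namespace` and `format_alert` are hypothetical names, and the "asic"/"0" values in the usage note are illustrative assumptions rather than values taken from this PR.

```python
import os


def resolve_namespace(env=None):
    """Return "host" unless both namespace variables are set and non-empty."""
    env = os.environ if env is None else env
    namespace_prefix = env.get("NAMESPACE_PREFIX")
    namespace_id = env.get("NAMESPACE_ID")

    if not namespace_prefix or not namespace_id:
        return "host"
    return namespace_prefix + namespace_id


def format_alert(process_name, env=None):
    """Build the same message text the diff writes to syslog."""
    return "Process '{}' is not running in namespace '{}'.".format(
        process_name, resolve_namespace(env))
```

For example, `format_alert("orchagent", {"NAMESPACE_PREFIX": "asic", "NAMESPACE_ID": "0"})` names namespace `asic0`, while an empty mapping yields `host`.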
@@ -62,51 +114,55 @@ def main(argv):

     critical_group_list, critical_process_list = get_critical_group_and_process_list()

+    process_under_alerting = {}
+    # Transition from ACKNOWLEDGED to READY
+    childutils.listener.ready()
+
     while True:
-        # Transition from ACKNOWLEDGED to READY
-        childutils.listener.ready()
-
-        line = sys.stdin.readline()
-        headers = childutils.get_headers(line)
-        payload = sys.stdin.read(int(headers['len']))
-
-        # Transition from READY to ACKNOWLEDGED
-        childutils.listener.ok()
-
-        # We only care about PROCESS_STATE_EXITED events
-        if headers['eventname'] == 'PROCESS_STATE_EXITED':
-            payload_headers, payload_data = childutils.eventdata(payload + '\n')
-
-            expected = int(payload_headers['expected'])
-            processname = payload_headers['processname']
-            groupname = payload_headers['groupname']
-
-            # Read the status of auto-restart feature from Config_DB.
-            config_db = swsssdk.ConfigDBConnector()
-            config_db.connect()
-            features_table = config_db.get_table(FEATURE_TABLE_NAME)
-            if not features_table:
-                syslog.syslog(syslog.LOG_ERR, "Unable to retrieve features table from Config DB. Exiting...")
-                sys.exit(2)
-
-            if container_name not in features_table:
-                syslog.syslog(syslog.LOG_ERR, "Unable to retrieve feature '{}'. Exiting...".format(container_name))
-                sys.exit(3)
-
-            restart_feature = features_table[container_name].get('auto_restart')
-            if not restart_feature:
-                syslog.syslog(
-                    syslog.LOG_ERR, "Unable to determine auto-restart feature status for '{}'. Exiting...".format(container_name))
-                sys.exit(4)
-
-            # If auto-restart feature is not disabled and at the same time
-            # a critical process exited unexpectedly, terminate supervisor
-            if (restart_feature != 'disabled' and expected == 0 and
-                (processname in critical_process_list or groupname in critical_group_list)):
-                MSG_FORMAT_STR = "Process {} exited unxepectedly. Terminating supervisor..."
-                msg = MSG_FORMAT_STR.format(payload_headers['processname'])
-                syslog.syslog(syslog.LOG_INFO, msg)
-                os.kill(os.getppid(), signal.SIGTERM)
+        file_descriptor_list = select.select([sys.stdin], [], [], SELECT_TIMEOUT_SECS)[0]
+        if len(file_descriptor_list) > 0:
+            line = file_descriptor_list[0].readline()
+            headers = childutils.get_headers(line)
+            payload = sys.stdin.read(int(headers['len']))
+
+            # Handle the PROCESS_STATE_EXITED event
+            if headers['eventname'] == 'PROCESS_STATE_EXITED':
+                payload_headers, payload_data = childutils.eventdata(payload + '\n')
+
+                expected = int(payload_headers['expected'])
+                process_name = payload_headers['processname']
+                group_name = payload_headers['groupname']
+
+                if (process_name in critical_process_list or group_name in critical_group_list) and expected == 0:
+                    is_auto_restart = get_autorestart_state(container_name)
+                    if is_auto_restart != "disabled":
+                        MSG_FORMAT_STR = "Process '{}' exited unexpectedly. Terminating supervisor '{}'"
+                        msg = MSG_FORMAT_STR.format(payload_headers['processname'], container_name)
+                        syslog.syslog(syslog.LOG_INFO, msg)
+                        os.kill(os.getppid(), signal.SIGTERM)
+                    else:
+                        process_under_alerting[process_name] = time.time()
+
+            # Handle the PROCESS_STATE_RUNNING event
+            elif headers['eventname'] == 'PROCESS_STATE_RUNNING':
+                payload_headers, payload_data = childutils.eventdata(payload + '\n')
+                process_name = payload_headers['processname']
+
+                if process_name in process_under_alerting:
+                    process_under_alerting.pop(process_name)
+
+            # Transition from BUSY to ACKNOWLEDGED
+            childutils.listener.ok()
+
+            # Transition from ACKNOWLEDGED to READY
+            childutils.listener.ready()
+
+        # Check whether we need write alerting messages into syslog
+        for process in process_under_alerting.keys():
+            epoch_time = time.time()
+            if epoch_time - process_under_alerting[process] >= ALERTING_INTERVAL_SECS:
+                process_under_alerting[process] = epoch_time
+                generate_alerting_message(process)


 if __name__ == "__main__":
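
The bookkeeping around `process_under_alerting` in this hunk amounts to a small throttle: record a timestamp when a critical process exits with auto-restart disabled, forget it when the process is running again, and emit an alert each time ALERTING_INTERVAL_SECS has elapsed since the last recorded time. The sketch below restates that logic with the clock passed in as an argument so it can be tested deterministically. The function names are hypothetical and it returns the names to alert on instead of writing to syslog.

```python
ALERTING_INTERVAL_SECS = 60  # same value as the constant added in the diff


def on_exited(pending, process_name, now):
    """Start (or restart) the alerting timer for a process that exited."""
    pending[process_name] = now


def on_running(pending, process_name):
    """Stop alerting once the process is reported running again."""
    pending.pop(process_name, None)


def due_alerts(pending, now, interval=ALERTING_INTERVAL_SECS):
    """Return the processes whose interval has elapsed and reset their timers."""
    due = []
    # Iterate over a snapshot so the dict can be updated inside the loop.
    for name, last_time in list(pending.items()):
        if now - last_time >= interval:
            pending[name] = now
            due.append(name)
    return due
```

Note that, as in the diff, the first alert fires one full interval after the exit is recorded, not immediately.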
2 changes: 1 addition & 1 deletion platform/barefoot/docker-syncd-bfn/supervisord.conf
@@ -14,7 +14,7 @@ buffer_size=25

 [eventlistener:supervisor-proc-exit-listener]
 command=/usr/bin/supervisor-proc-exit-listener --container-name syncd
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

2 changes: 1 addition & 1 deletion platform/broadcom/docker-syncd-brcm/supervisord.conf
@@ -14,7 +14,7 @@ buffer_size=25

 [eventlistener:supervisor-proc-exit-listener]
 command=/usr/bin/supervisor-proc-exit-listener --container-name syncd
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

2 changes: 1 addition & 1 deletion platform/cavium/docker-syncd-cavm/supervisord.conf
@@ -5,7 +5,7 @@ nodaemon=true

 [eventlistener:supervisor-proc-exit-listener]
 command=/usr/bin/supervisor-proc-exit-listener --container-name syncd
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

@@ -14,7 +14,7 @@ buffer_size=25

 [eventlistener:supervisor-proc-exit-listener]
 command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

2 changes: 1 addition & 1 deletion platform/centec/docker-syncd-centec/supervisord.conf
@@ -13,7 +13,7 @@ events=PROCESS_STATE

 [eventlistener:supervisor-proc-exit-listener]
 command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

@@ -14,7 +14,7 @@ buffer_size=25

 [eventlistener:supervisor-proc-exit-listener]
 command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

@@ -14,7 +14,7 @@ buffer_size=25

 [eventlistener:supervisor-proc-exit-listener]
 command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

2 changes: 1 addition & 1 deletion platform/marvell/docker-syncd-mrvl/supervisord.conf
@@ -14,7 +14,7 @@ buffer_size=25

 [eventlistener:supervisor-proc-exit-listener]
 command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected

2 changes: 1 addition & 1 deletion platform/mellanox/docker-syncd-mlnx/supervisord.conf
@@ -14,7 +14,7 @@ buffer_size=25

 [eventlistener:supervisor-proc-exit-listener]
 command=/usr/bin/supervisor-proc-exit-listener --container-name syncd
-events=PROCESS_STATE_EXITED
+events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
 autostart=true
 autorestart=unexpected
