Skip to content

Commit

Permalink
[supervisord] Monitoring the critical processes with supervisord. (#6242
Browse files Browse the repository at this point in the history
)

- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.

Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.

- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:

The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.

If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.

- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.

Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.

- Which release branch to backport (provide reason below if selected)

 201811
 201911
[x ] 202006
  • Loading branch information
yozhao101 authored and lguohan committed Jan 28, 2021
1 parent fc825f9 commit cc9c3f5
Show file tree
Hide file tree
Showing 30 changed files with 130 additions and 74 deletions.
2 changes: 1 addition & 1 deletion dockers/docker-database/supervisord.conf.j2
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ nodaemon=true

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name database
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=50

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name dhcp_relay
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
2 changes: 1 addition & 1 deletion dockers/docker-fpm-frr/frr/supervisord/supervisord.conf.j2
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=50

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name bgp
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
2 changes: 1 addition & 1 deletion dockers/docker-fpm-gobgp/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ nodaemon=true

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name bgp
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
2 changes: 1 addition & 1 deletion dockers/docker-fpm-quagga/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ nodaemon=true

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name bgp
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
2 changes: 1 addition & 1 deletion dockers/docker-lldp/supervisord.conf.j2
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=25

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name lldp
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
2 changes: 1 addition & 1 deletion dockers/docker-nat/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=25

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name nat
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
2 changes: 1 addition & 1 deletion dockers/docker-orchagent/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=100

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name swss
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=100

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name pmon
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=25

[eventlistener:supervisor-proc-exit-script]
command=/usr/bin/supervisor-proc-exit-listener --container-name radv
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
2 changes: 1 addition & 1 deletion dockers/docker-sflow/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=25

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name sflow
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
2 changes: 1 addition & 1 deletion dockers/docker-snmp/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=50

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name snmp
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
2 changes: 1 addition & 1 deletion dockers/docker-sonic-restapi/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=25

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name restapi
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=false

Expand Down
2 changes: 1 addition & 1 deletion dockers/docker-sonic-telemetry/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=50

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name telemetry
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=false

Expand Down
2 changes: 1 addition & 1 deletion dockers/docker-teamd/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=50

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name teamd
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
146 changes: 101 additions & 45 deletions files/scripts/supervisor-proc-exit-listener
Original file line number Diff line number Diff line change
Expand Up @@ -2,11 +2,14 @@

import getopt
import os
import select
import signal
import sys
import syslog
import time

import swsssdk

from supervisor import childutils

# Each line of this file should specify either one critical process or one
Expand All @@ -20,10 +23,18 @@ CRITICAL_PROCESSES_FILE = '/etc/supervisor/critical_processes'
# The FEATURE table in config db contains auto-restart field
FEATURE_TABLE_NAME = 'FEATURE'

# Read the critical processes/group names from CRITICAL_PROCESSES_FILE
# Value of parameter 'timeout' in select(...) method
SELECT_TIMEOUT_SECS = 1.0

# Alerting message will be written into syslog in the following interval
ALERTING_INTERVAL_SECS = 60


def get_critical_group_and_process_list():
"""
@summary: Read the critical processes/group names from CRITICAL_PROCESSES_FILE.
@return: Two lists which contain critical processes and group names respectively.
"""
critical_group_list = []
critical_process_list = []

Expand All @@ -49,6 +60,47 @@ def get_critical_group_and_process_list():
return critical_group_list, critical_process_list


def generate_alerting_message(process_name):
"""
@summary: If a critical process was not running, this function will determine it resides in host
or in a specific namespace. Then an alerting message will be written into syslog.
"""
namespace_prefix = os.environ.get("NAMESPACE_PREFIX")
namespace_id = os.environ.get("NAMESPACE_ID")

if not namespace_prefix or not namespace_id:
namespace = "host"
else:
namespace = namespace_prefix + namespace_id

syslog.syslog(syslog.LOG_ERR, "Process '{}' is not running in namespace '{}'.".format(process_name, namespace))


def get_autorestart_state(container_name):
"""
@summary: Read the status of auto-restart feature from Config_DB.
@return: Return the status of auto-restart feature.
"""
config_db = swsssdk.ConfigDBConnector()
config_db.connect()
features_table = config_db.get_table(FEATURE_TABLE_NAME)
if not features_table:
syslog.syslog(syslog.LOG_ERR, "Unable to retrieve features table from Config DB. Exiting...")
sys.exit(2)

if container_name not in features_table:
syslog.syslog(syslog.LOG_ERR, "Unable to retrieve feature '{}'. Exiting...".format(container_name))
sys.exit(3)

is_auto_restart = features_table[container_name].get('auto_restart')
if not is_auto_restart:
syslog.syslog(
syslog.LOG_ERR, "Unable to determine auto-restart feature status for '{}'. Exiting...".format(container_name))
sys.exit(4)

return is_auto_restart


def main(argv):
container_name = None
opts, args = getopt.getopt(argv, "c:", ["container-name="])
Expand All @@ -62,51 +114,55 @@ def main(argv):

critical_group_list, critical_process_list = get_critical_group_and_process_list()

process_under_alerting = {}
# Transition from ACKNOWLEDGED to READY
childutils.listener.ready()

while True:
# Transition from ACKNOWLEDGED to READY
childutils.listener.ready()

line = sys.stdin.readline()
headers = childutils.get_headers(line)
payload = sys.stdin.read(int(headers['len']))

# Transition from READY to ACKNOWLEDGED
childutils.listener.ok()

# We only care about PROCESS_STATE_EXITED events
if headers['eventname'] == 'PROCESS_STATE_EXITED':
payload_headers, payload_data = childutils.eventdata(payload + '\n')

expected = int(payload_headers['expected'])
processname = payload_headers['processname']
groupname = payload_headers['groupname']

# Read the status of auto-restart feature from Config_DB.
config_db = swsssdk.ConfigDBConnector()
config_db.connect()
features_table = config_db.get_table(FEATURE_TABLE_NAME)
if not features_table:
syslog.syslog(syslog.LOG_ERR, "Unable to retrieve features table from Config DB. Exiting...")
sys.exit(2)

if container_name not in features_table:
syslog.syslog(syslog.LOG_ERR, "Unable to retrieve feature '{}'. Exiting...".format(container_name))
sys.exit(3)

restart_feature = features_table[container_name].get('auto_restart')
if not restart_feature:
syslog.syslog(
syslog.LOG_ERR, "Unable to determine auto-restart feature status for '{}'. Exiting...".format(container_name))
sys.exit(4)

# If auto-restart feature is not disabled and at the same time
# a critical process exited unexpectedly, terminate supervisor
if (restart_feature != 'disabled' and expected == 0 and
(processname in critical_process_list or groupname in critical_group_list)):
MSG_FORMAT_STR = "Process {} exited unxepectedly. Terminating supervisor..."
msg = MSG_FORMAT_STR.format(payload_headers['processname'])
syslog.syslog(syslog.LOG_INFO, msg)
os.kill(os.getppid(), signal.SIGTERM)
file_descriptor_list = select.select([sys.stdin], [], [], SELECT_TIMEOUT_SECS)[0]
if len(file_descriptor_list) > 0:
line = file_descriptor_list[0].readline()
headers = childutils.get_headers(line)
payload = sys.stdin.read(int(headers['len']))

# Handle the PROCESS_STATE_EXITED event
if headers['eventname'] == 'PROCESS_STATE_EXITED':
payload_headers, payload_data = childutils.eventdata(payload + '\n')

expected = int(payload_headers['expected'])
process_name = payload_headers['processname']
group_name = payload_headers['groupname']

if (process_name in critical_process_list or group_name in critical_group_list) and expected == 0:
is_auto_restart = get_autorestart_state(container_name)
if is_auto_restart != "disabled":
MSG_FORMAT_STR = "Process '{}' exited unexpectedly. Terminating supervisor '{}'"
msg = MSG_FORMAT_STR.format(payload_headers['processname'], container_name)
syslog.syslog(syslog.LOG_INFO, msg)
os.kill(os.getppid(), signal.SIGTERM)
else:
process_under_alerting[process_name] = time.time()

# Handle the PROCESS_STATE_RUNNING event
elif headers['eventname'] == 'PROCESS_STATE_RUNNING':
payload_headers, payload_data = childutils.eventdata(payload + '\n')
process_name = payload_headers['processname']

if process_name in process_under_alerting:
process_under_alerting.pop(process_name)

# Transition from BUSY to ACKNOWLEDGED
childutils.listener.ok()

# Transition from ACKNOWLEDGED to READY
childutils.listener.ready()

# Check whether we need write alerting messages into syslog
for process in process_under_alerting.keys():
epoch_time = time.time()
if epoch_time - process_under_alerting[process] >= ALERTING_INTERVAL_SECS:
process_under_alerting[process] = epoch_time
generate_alerting_message(process)


if __name__ == "__main__":
Expand Down
2 changes: 1 addition & 1 deletion platform/barefoot/docker-syncd-bfn/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=25

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
2 changes: 1 addition & 1 deletion platform/broadcom/docker-syncd-brcm/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=25

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
2 changes: 1 addition & 1 deletion platform/cavium/docker-syncd-cavm/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ nodaemon=true

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
2 changes: 1 addition & 1 deletion platform/centec-arm64/docker-syncd-centec/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=25

[eventlistener:supervisor-proc-exit-listener]
command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
2 changes: 1 addition & 1 deletion platform/centec/docker-syncd-centec/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ events=PROCESS_STATE

[eventlistener:supervisor-proc-exit-listener]
command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
2 changes: 1 addition & 1 deletion platform/marvell-arm64/docker-syncd-mrvl/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=25

[eventlistener:supervisor-proc-exit-listener]
command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
2 changes: 1 addition & 1 deletion platform/marvell-armhf/docker-syncd-mrvl/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=25

[eventlistener:supervisor-proc-exit-listener]
command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
2 changes: 1 addition & 1 deletion platform/marvell/docker-syncd-mrvl/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=25

[eventlistener:supervisor-proc-exit-listener]
command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
2 changes: 1 addition & 1 deletion platform/mellanox/docker-syncd-mlnx/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ buffer_size=25

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected

Expand Down
Loading

0 comments on commit cc9c3f5

Please sign in to comment.