[config] Eliminate race condition between reloading Monit config and (s…
…onic-net#1543)

Signed-off-by: Yong Zhao <yozhao@microsoft.com>

What I did
A nightly test found a failure when running the command sudo config reload or sudo config load_minigraph. The error message is:

admin@str-a7050-acs-1:~$ sudo config reload -y
Disabling container monitoring ...
Stopping SONiC target ...
Running command: /usr/local/bin/sonic-cfggen -j /etc/sonic/init_cfg.json -j /etc/sonic/config_db.json --write-to-db
Running command: /usr/local/bin/db_migrator.py -o migrate
Resetting failed status on bgp.service
Resetting failed status on caclmgrd.service
Resetting failed status on dhcp_relay.service
Resetting failed status on hostcfgd.service
Resetting failed status on hostname-config.service
Resetting failed status on interfaces-config.service
Resetting failed status on lldp.service
Resetting failed status on ntp-config.service
Resetting failed status on pmon.service
Resetting failed status on procdockerstatsd.service
Resetting failed status on radv.service
Resetting failed status on rsyslog-config.service
Resetting failed status on swss.service
Resetting failed status on syncd.service
Resetting failed status on teamd.service
Resetting failed status on telemetry.timer
Restarting SONiC target ...
Reloading Monit configuration ...
Reinitializing monit daemon
Enabling container monitoring ...
Unix socket /var/run/monit.sock connection error -- No such file or directory
The root cause is an implicit race condition between the command sudo monit reload at line 701 and
the command sudo monit monitor container_checker at line 706. Both commands need to access the monit.sock socket file
under the directory /var/run/.

Specifically, sudo monit reload at line 701 re-initializes the Monit daemon, which deletes the old monit.sock file and then creates a new one. During this re-initialization, sudo monit status at line 704 always succeeds because it runs before the old monit.sock file is deleted, but sudo monit monitor container_checker at line 706 succeeds only if the new monit.sock has already been created; otherwise it fails with the error message shown above.
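The interleaving described above can be reproduced with a small toy model. This is only a sketch: FakeMonit, daemon_step, old_order and new_order are hypothetical names, the socket is modelled as a plain file in a temporary directory, and the exact timing of the daemon's delete and create steps is an assumption chosen to show the failing case. It does not call Monit.

```python
import os
import tempfile


class FakeMonit:
    """Toy stand-in for the Monit daemon; it only models the socket file."""

    def __init__(self, run_dir):
        self.sock = os.path.join(run_dir, "monit.sock")
        open(self.sock, "w").close()  # daemon is up, so the socket exists
        self.reload_pending = False

    def reload(self):
        # Returns immediately; re-initialization happens in daemon_step().
        self.reload_pending = True

    def daemon_step(self):
        # One unit of background work: first delete the old socket,
        # then, on the following step, create the new one.
        if self.reload_pending and os.path.exists(self.sock):
            os.remove(self.sock)
        elif self.reload_pending:
            open(self.sock, "w").close()
            self.reload_pending = False

    def client_command(self, name):
        # Every client command needs the socket file to exist.
        if not os.path.exists(self.sock):
            raise FileNotFoundError(f"{name}: {self.sock} connection error")
        return f"{name}: ok"


def old_order(monit):
    """Reload first, then talk to the daemon (the failing sequence)."""
    monit.reload()
    results = [monit.client_command("status")]  # old socket still present
    monit.daemon_step()                         # old socket deleted
    results.append(monit.client_command("monitor container_checker"))
    return results


def new_order(monit):
    """Talk to the daemon first, reload last (the sequence after the fix)."""
    results = [
        monit.client_command("status"),
        monit.client_command("monitor container_checker"),
    ]
    monit.reload()
    monit.daemon_step()  # old socket deleted
    monit.daemon_step()  # new socket created
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as run_dir:
        try:
            old_order(FakeMonit(run_dir))
        except FileNotFoundError as exc:
            print("old order failed:", exc)
    with tempfile.TemporaryDirectory() as run_dir:
        print("new order:", new_order(FakeMonit(run_dir)))
```

In the toy model the old order fails only because a daemon step is forced between the two client commands; on a real device the outcome depends on timing, which is why the failure showed up intermittently in nightly runs.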

How I did it
I swapped the order of the two operations: monitoring of container_checker is now enabled first, and the Monit configuration is reloaded afterwards.
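The new ordering can be sketched as a standalone function. This is an illustration under assumptions, not the code in config/main.py: the function name enable_monitoring_then_reload and the injectable run parameter are hypothetical, and subprocess.check_call is used for all three commands, whereas the real code uses clicommon.run_command for two of them.

```python
import subprocess


def enable_monitoring_then_reload(run=subprocess.check_call):
    """Enable container_checker monitoring first, reload Monit last.

    `run` is a hypothetical injection point so the ordering can be
    exercised without a Monit daemon.
    """
    try:
        # Probe the daemon; a non-zero exit raises CalledProcessError.
        run("sudo monit status", shell=True,
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        run("sudo monit monitor container_checker", shell=True)
    except subprocess.CalledProcessError:
        # Monit is not reachable; skip enabling monitoring.
        pass

    # Reload last, so no later client command in this function
    # races with the socket being recreated.
    run("sudo monit reload", shell=True)
```

Passing a recording callable as run makes it possible to check the command order in a unit test without touching the system.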

How to verify it
I verified this change on the DUT str-a7050-acs-1 by running the commands sudo config reload -y and sudo config load_minigraph -y and confirming that the error is no longer raised.

Previous command output (if the output of a command-line utility has changed)
admin@str-a7050-acs-1:~$ sudo config reload -y
Disabling container monitoring ...
Stopping SONiC target ...
Running command: /usr/local/bin/sonic-cfggen -j /etc/sonic/init_cfg.json -j /etc/sonic/config_db.json --write-to-db
Running command: /usr/local/bin/db_migrator.py -o migrate
Resetting failed status on bgp.service
Resetting failed status on caclmgrd.service
Resetting failed status on dhcp_relay.service
Resetting failed status on hostcfgd.service
Resetting failed status on hostname-config.service
Resetting failed status on interfaces-config.service
Resetting failed status on lldp.service
Resetting failed status on ntp-config.service
Resetting failed status on pmon.service
Resetting failed status on procdockerstatsd.service
Resetting failed status on radv.service
Resetting failed status on rsyslog-config.service
Resetting failed status on swss.service
Resetting failed status on syncd.service
Resetting failed status on teamd.service
Resetting failed status on telemetry.timer
Restarting SONiC target ...
Reloading Monit configuration ...
Reinitializing monit daemon
Enabling container monitoring ...
New command output (if the output of a command-line utility has changed)
admin@str-a7050-acs-1:~$ sudo config reload -y
Disabling container monitoring ...
Stopping SONiC target ...
Running command: /usr/local/bin/sonic-cfggen -j /etc/sonic/init_cfg.json -j /etc/sonic/config_db.json --write-to-db
Running command: /usr/local/bin/db_migrator.py -o migrate
Resetting failed status on bgp.service
Resetting failed status on caclmgrd.service
Resetting failed status on dhcp_relay.service
Resetting failed status on hostcfgd.service
Resetting failed status on hostname-config.service
Resetting failed status on interfaces-config.service
Resetting failed status on lldp.service
Resetting failed status on ntp-config.service
Resetting failed status on pmon.service
Resetting failed status on procdockerstatsd.service
Resetting failed status on radv.service
Resetting failed status on rsyslog-config.service
Resetting failed status on swss.service
Resetting failed status on syncd.service
Resetting failed status on teamd.service
Resetting failed status on telemetry.timer
Restarting SONiC target ...
Enabling container monitoring ...
Reloading Monit configuration ...
Reinitializing monit daemon
yozhao101 authored Apr 4, 2021
1 parent 87b2481 commit 9bbc25f
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions config/main.py
@@ -696,17 +696,17 @@ def _restart_services():
     click.echo("Restarting SONiC target ...")
     clicommon.run_command("sudo systemctl restart sonic.target")

-    # Reload Monit configuration to pick up new hostname in case it changed
-    click.echo("Reloading Monit configuration ...")
-    clicommon.run_command("sudo monit reload")
-
     try:
         subprocess.check_call("sudo monit status", shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
         click.echo("Enabling container monitoring ...")
         clicommon.run_command("sudo monit monitor container_checker")
     except subprocess.CalledProcessError as err:
         pass

+    # Reload Monit configuration to pick up new hostname in case it changed
+    click.echo("Reloading Monit configuration ...")
+    clicommon.run_command("sudo monit reload")
+

 def interface_is_in_vlan(vlan_member_table, interface_name):
     """ Check if an interface is in a vlan """