[Services] Restart DHCP-Relay service upon unexpected critical process exit. #3667

yozhao101 · 2019-10-25T19:02:58Z

What I did
Restart the DHCP-Relay service if any critical process running in the DHCP-Relay container exits unexpectedly or crashes.
How I did it
I followed the framework created by Joe to implement this feature in the DHCP-Relay container.
First, add the supervisor-proc-exit-listener event listener to the supervisord configuration file in the DHCP-Relay docker container. The listener reads the critical_processes file, which lists the processes to monitor for unexpected exits and crashes.
In the DHCP-Relay container, the critical processes are managed as a supervisord group, so we only need to put the group name in critical_processes. We also add code to the supervisor-proc-exit-listener script that retrieves the group name of the exited process and checks whether it appears in critical_processes.
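For illustration, the group-aware check could look roughly like the sketch below. This is not the actual supervisor-proc-exit-listener code: the function names are hypothetical, the `expected` check is an assumption about how "unexpected" is decided, and the payload layout (space-separated `key:value` tokens such as `processname`, `groupname`, `expected`) follows supervisord's PROCESS_STATE_EXITED event format.

```python
def parse_critical_entries(text):
    """Return the set of non-empty, stripped lines from critical_processes content."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def parse_event_payload(payload):
    """Split a supervisord event payload ('key:value key:value ...') into a dict."""
    return dict(token.split(":", 1) for token in payload.split())


def is_critical_exit(payload, critical_entries):
    """True if an unexpectedly exited process is critical by its own name or its group name."""
    fields = parse_event_payload(payload)
    if fields.get("expected") == "1":
        return False  # assumption: exits supervisord marks as expected are ignored
    return (fields.get("processname") in critical_entries
            or fields.get("groupname") in critical_entries)


# Illustrative data only: the group name comes from the commit message in this PR,
# the process name and pid are made up.
entries = parse_critical_entries("isc-dhcp-relay\n")
payload = ("processname:isc-dhcp-relay-Vlan1000 groupname:isc-dhcp-relay "
           "from_state:RUNNING expected:0 pid:42")
print(is_critical_exit(payload, entries))
```

Because the check accepts either name, a single group entry covers every process in that group without listing each one.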
Second, configure dhcp-relay.service to always auto-restart the service if it stops, with a delay of 30 seconds. Also set a rate limit of 3 restarts within 20 minutes (1200 seconds).
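To make the rate limit concrete, here is a simplified model of "at most 3 starts per 1200 seconds" as a fixed-window counter. It is only a sketch for reasoning about the numbers above: the class and method names are hypothetical, and systemd's real accounting (for example, whether the initial start counts toward the limit) may differ.

```python
RESTART_DELAY_SEC = 30       # delay before each restart, from the PR description
LIMIT_INTERVAL_SEC = 1200    # 20 minutes, from the PR description
LIMIT_BURST = 3              # allowed starts per interval, from the PR description


class StartLimiter:
    """Fixed-window counter: at most `burst` starts per `interval` seconds."""

    def __init__(self, interval=LIMIT_INTERVAL_SEC, burst=LIMIT_BURST):
        self.interval = interval
        self.burst = burst
        self.window_start = None
        self.count = 0

    def allow(self, now):
        """Record a start attempt at time `now` (seconds); return False once the limit is hit."""
        if self.window_start is None or now - self.window_start > self.interval:
            self.window_start = now  # open a new window
            self.count = 1
            return True
        if self.count < self.burst:
            self.count += 1
            return True
        return False


limiter = StartLimiter()
# Hypothetical crash loop: the service dies right after every start.
attempts = [limiter.allow(t) for t in range(0, 5 * RESTART_DELAY_SEC, RESTART_DELAY_SEC)]
print(attempts)
```

In this model the first three attempts are allowed and later ones within the same 20-minute window are refused, which is the point at which the service would stay down until someone intervenes.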
How to verify it
On the switch, run docker ps to list all running docker containers.
Then run docker exec -it container_id /bin/bash to log in to the target container. Run top
in the shell to display the running processes and find the process ID of one
of the critical processes. Finally, run kill -9 process_id to terminate that process.
After exiting the container, run watch -n 1 docker ps to watch the DHCP-Relay
container restart.
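As a scripted alternative to watching docker ps by hand, the status check could be sketched as below. The helper names are hypothetical and "dhcp_relay" is an assumed container name; `docker ps --format` with the `{{.Names}}` and `{{.Status}}` placeholders is the only Docker feature relied on.

```python
import subprocess


def container_status(ps_output, name):
    """Return the status text for `name` from tab-separated 'name<TAB>status' lines, or None."""
    for line in ps_output.splitlines():
        container, _, status = line.partition("\t")
        if container == name:
            return status
    return None


def current_status(name):
    """Ask the local Docker CLI for the container's status (requires docker on PATH)."""
    out = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return container_status(out, name)


# Made-up sample output; "dhcp_relay" is an assumed container name.
sample = "dhcp_relay\tUp 5 seconds\nswss\tUp 3 hours\n"
print(container_status(sample, "dhcp_relay"))
```

After killing a critical process, calling current_status repeatedly should show the container disappear and then return with a short uptime once the service has been restarted.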

jleveque · 2019-10-25T21:02:27Z

Looks good. However, I feel that it's not clear that one can now add group names to the "critical_processes" file (because the file name doesn't mention groups). I don't want to rename the file to something long, like "critical_processes_and_groups", though. Any suggestions?

yozhao101 · 2019-10-29T17:19:56Z

I thought about this issue for a while. To stay consistent with other containers, we could put
the actual process names in this file instead of the group name. Alternatively, we could divide critical_processes
into two sections: a critical process section and a group section. For dhcp-relay, we would leave the critical
process section empty and put a single group name in the group section. For now, can we create a critical_processes.j2 file to handle this?

jleveque · 2019-10-29T18:52:28Z

I don’t think we should take the templated approach, as using the group name is now shown to work, is much simpler and will require far less maintenance in the future. I think we can keep this as-is for now, but I would like to distinguish between processes and groups in the future. Maybe once all of the containers are managed properly, we can update the critical_processes syntax to match the supervisor.conf syntax. E.g.,

program:x
program:y
group:z

Then we can update the event listener's parsing logic. This separates the individual processes from the groups and also makes it clear to the user.
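A rough sketch of what that parsing logic might look like, assuming one `program:<name>` or `group:<name>` entry per line as in the example above. The function names and error handling are hypothetical, not existing listener code.

```python
def parse_critical_processes(text):
    """Parse 'program:<name>' / 'group:<name>' lines into (programs, groups) sets."""
    programs, groups = set(), set()
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        kind, sep, name = line.partition(":")
        name = name.strip()
        if not sep or not name:
            raise ValueError(f"line {lineno}: expected 'program:<name>' or 'group:<name>'")
        if kind == "program":
            programs.add(name)
        elif kind == "group":
            groups.add(name)
        else:
            raise ValueError(f"line {lineno}: unknown entry type {kind!r}")
    return programs, groups


def is_critical(processname, groupname, programs, groups):
    """A process is critical if its name is listed as a program or its group as a group."""
    return processname in programs or groupname in groups


programs, groups = parse_critical_processes("program:x\nprogram:y\ngroup:z\n")
print(sorted(programs), sorted(groups))
```

Keeping the two sets separate means a process name can never be mistaken for a group name, which is the ambiguity the current plain-name file has.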

jleveque · 2019-11-01T23:23:46Z

Retest this please

jleveque · 2019-11-06T01:12:54Z

Retest vs please

…s exit. (sonic-net#3667) Signed-off-by: Yong Zhao <yozhao@microsoft.com> [Services] Restart Platform-monitor service upon unexpected critical process exit. (sonic-net#3689) Signed-off-by: Yong Zhao <yozhao@microsoft.com> Signed-off-by: Sangita Maity <sangitamaity0211@gmail.com> RB=2126600 G=lnos-reviewers R=pchaudha,pmao,vapatil,zxu A=zxu

yozhao101 added 5 commits October 25, 2019 11:21

[docker-dhcp-relay] Create a file named critical_processes. For dhcp-…

2474fab

…relay service, this file contains a single groupname: isc-dhcp-relay. Signed-off-by: Yong Zhao <yozhao@microsoft.com>

[docker-dhcp-relay] Add paths of supervisord listener script and crit…

1cfa8e6

…ical processes file into dockfile.j2. Signed-off-by: Yong Zhao <yozhao@microsoft.com>

[docker-dhcp-relay] Make event listener autostart by adding option in…

847aed0

… supervisord conf file. Signed-off-by: Yong Zhao <yozhao@microsoft.com>

[docker-dhcp-relay] Configure systemd to stop restarting dhcp-relay i…

2ed2adb

…f it attempts to restart this container more than 3 times in 20 minutes. Signed-off-by: Yong Zhao <yozhao@microsoft.com>

[docker-dhcp-relay] Add macro $(SUPERVISOR_PROC_EXIT_LISTENER_SCRIPT)…

e2ae9d1

… to shared Makefile docker-dhcp-relay.mk. Signed-off-by: Yong Zhao <yozhao@microsoft.com>

yozhao101 requested a review from jleveque October 25, 2019 19:02

jleveque added the Enhancement ➕ label Oct 25, 2019

yozhao101 added 2 commits November 4, 2019 16:32

[docker-dhcp-relay] Event listener will also be guided by a groupname…

b705b35

… which monitors a bunch of processes. Signed-off-by: Yong Zhao <yozhao@microsoft.com>

[docker-dhcp-relay] Add event listener option in test conf file.

7ca5fc8

Signed-off-by: Yong Zhao <yozhao@microsoft.com>

jleveque approved these changes Nov 6, 2019

jleveque merged commit ed79f54 into sonic-net:master Nov 6, 2019

zhenggen-xu pushed a commit to zhenggen-xu/sonic-buildimage that referenced this pull request Jan 10, 2020

[Services] Restart DHCP-Relay service upon unexpected critical proces…

47989f6

…s exit. (sonic-net#3667) Signed-off-by: Yong Zhao <yozhao@microsoft.com>

yozhao101 mentioned this pull request Jun 25, 2020

[doc] Monitoring and Auto-mitigating the unhealthy of docker containers in SONiC sonic-net/SONiC#564

Open

jleveque added the DHCP Relay label Jul 9, 2020
