Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Services] Restart DHCP-Relay service upon unexpected critical process exit. #3667

Merged
merged 7 commits into from
Nov 6, 2019
Merged

[Services] Restart DHCP-Relay service upon unexpected critical process exit. #3667

merged 7 commits into from
Nov 6, 2019

Conversation

yozhao101
Copy link
Contributor

  • What I did
    Restart DHCP-Relay service if one of critical processes running in DHCP-Relay container exited or crashed abnormally.

  • How I did it
    Generally I follow the framework created by Joe to implement this feature in DHCP-Relay container.
    First, add supervisor-proc-exit-listener event listener option in Supervisord configuration file in DHCP-Relay docker container. Supervisord will read a list of critical processes for which to monitor the unexpected crashed and exited.
    For DHCP-Relay container, since a bunch of critical processes will be monitored by a group, we only need put the groupname in the file critical_processes. At the same time, we also add source code in supervisor-proc-exit-listener script to retrieve the groupname and then decide whether it appears in critical_processes.
    Second, configure dhcp-relay.service to always auto-restart the service if it stops, with a delay of 30 seconds. Also set a rate limit of 3 restarts within 20 minutes (1200 seconds).

  • How to verify it
    On your switch device, please use docker ps command to list all running docker containers.
    Then use docker exec -it container_id /bin/bash to login target container. Typing top command
    on the shell will display all the processes dynamically and you will spot the process id of one
    of the critical processes. Finally type the command kill -9 process_id to terminate one process.
    After exiting the container, you can use watch -n 1 docker ps to dynamically see the restart
    of DHCP-Relay container.

…relay service, this file contains a single groupname: isc-dhcp-relay.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
…ical processes file into dockfile.j2.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
… supervisord conf file.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
…f it attempts to restart this container more than 3 times in 20 minutes.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
… to shared Makefile docker-dhcp-relay.mk.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@jleveque
Copy link
Contributor

Looks good. However, I feel that it's not clear that one can now add group names to the "critical_processes" file (because the file name doesn't mention groups). I don't want to rename the file to something long, like "critical_processes_and_groups", though. Any suggestions?

@yozhao101
Copy link
Contributor Author

I though this issue for a while. In order to keep consistency with other containers, we can put
actual process names in this file not the group name. Or we can divide the critical_processes
into two sections: critical process section and group section. For dhcp-relay, we can leave the critical
process section empty and just put a single group name in group section. For now, can we create a critical_processes.j2 file to handle this issue?

@jleveque
Copy link
Contributor

I don’t think we should take the templated approach, as using the group name is now shown to work, is much simpler and will require far less maintenance in the future. I think we can keep this as-is for now, but I would like to distinguish between processes and groups in the future. Maybe once all of the containers are managed properly, we can update the critical_processes syntax to match the supervisor.conf syntax. E.g.,

program:x
program:y
group:z

Then we can update the event listener's parsing logic. This separates the individual processes from the groups and also makes it clear to the user.

@jleveque
Copy link
Contributor

jleveque commented Nov 1, 2019

Retest this please

… which monitors a bunch of processes.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@jleveque
Copy link
Contributor

jleveque commented Nov 6, 2019

Retest vs please

@jleveque jleveque merged commit ed79f54 into sonic-net:master Nov 6, 2019
zhenggen-xu pushed a commit to zhenggen-xu/sonic-buildimage that referenced this pull request Jan 10, 2020
…s exit. (sonic-net#3667)

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
praveen-li pushed a commit to praveen-li/sonic-buildimage that referenced this pull request Feb 9, 2021
…s exit. (sonic-net#3667)

Signed-off-by: Yong Zhao <yozhao@microsoft.com>

[Services] Restart Platform-monitor service upon unexpected critical process exit. (sonic-net#3689)
Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Signed-off-by: Sangita Maity <sangitamaity0211@gmail.com>

    RB=2126600
    G=lnos-reviewers
    R=pchaudha,pmao,vapatil,zxu
    A=zxu
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants