Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[doc] Monitoring and Auto-mitigating the unhealthy of docker containers in SONiC #564

Open
wants to merge 57 commits into
base: master
Choose a base branch
from

Conversation

yozhao101
Copy link
Contributor

This document will introduce the motivation and design for monitoring, auto-mitigating the unhealthy of docker containers in SONiC.

the running status of critical process and resource usage.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
feature.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
doc/monitoring_containers/monitoring_containers.md Outdated Show resolved Hide resolved
doc/monitoring_containers/monitoring_containers.md Outdated Show resolved Hide resolved
doc/monitoring_containers/monitoring_containers.md Outdated Show resolved Hide resolved
doc/monitoring_containers/monitoring_containers.md Outdated Show resolved Hide resolved
doc/monitoring_containers/monitoring_containers.md Outdated Show resolved Hide resolved
doc/monitoring_containers/monitoring_containers.md Outdated Show resolved Hide resolved
doc/monitoring_containers/monitoring_containers.md Outdated Show resolved Hide resolved
doc/monitoring_containers/monitoring_containers.md Outdated Show resolved Hide resolved
doc/monitoring_containers/monitoring_containers.md Outdated Show resolved Hide resolved
doc/monitoring_containers/monitoring_containers.md Outdated Show resolved Hide resolved
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Copy link
Contributor

@jleveque jleveque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One new review comment added and one old comment is still unaddressed.

Resource Usage.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
auto-restart and warm re-boot. Add a paragraph to introduce how can
we use Monit to monitor multiple processes with the same command.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
},
"lldp": {
"auto_restart": "disabled",
"high_mem_alert": "104857600",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can thresholds should be human readable? Can it be possible to calculate threshold in % values ?

},
"snmp": {
"auto_restart": "enabled",
"high_mem_alert": "157286400",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you determine how much thresholds should configure? do you have anay recommendations?

admin@sonic:~$ show container feature autorestart
Container Name Status
-------------------- --------
database disabled

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can database container is consistent with data after auto restart ?

container stops, the systemd service which manages the container will also stop, but it is
configured to automatically restart the service, thus it will restart the container.

We also introduced a configuration option which can enable or disable this auto-restart feature

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does auto-restart works with dockers loaded dynamically?

Similar to monitoring the critical processes, we can employ Monit to monitor the resource usage
such as CPU, memory and disk for each process. Unfortunately Monit is unable to do the resource monitoring
in the container level. Thus we propose a new design to achieve such monitoring based on Monit.
Specifically Monit will monitor a script and check its exit status. This script

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this script be able to detect any hang/loop or deadlock situation for the processes or
threads inside the container?

1. Monit must provide the ability to generate an alert when a critical process has not
been alive for 5 minutes.
2. Monit must provide the ability to generate an alert when the resource usage of
a docker container is larger than the pre-defined threshold.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be saved in the DB for doing trend analysis of containers on the resource utilization?

@ben-gale
Copy link

What's the plan for this feature? Is it proceeding?

@jleveque
Copy link
Contributor

@ben-gale: Yes, it is proceeding. Most of the infrastructure is already in place in the master and 201911 branches.

@ben-gale
Copy link

@ben-gale: Yes, it is proceeding. Most of the infrastructure is already in place in the master and 201911 branches.

Thanks Joe - timeline for the code PRs to master?

@jleveque
Copy link
Contributor

@yozhao101: Can you please add a comment here with links to all the related PRs in sonic-buildimage and sonic-utilities thus far?

@yozhao101
Copy link
Contributor Author

@yozhao101: Can you please add a comment here with links to all the related PRs in sonic-buildimage and sonic-utilities thus far?

Yes, I will update with link of PRs.

@yozhao101
Copy link
Contributor Author

yozhao101 commented Jun 25, 2020

@ben-gale: Yes, it is proceeding. Most of the infrastructure is already in place in the master and 201911 branches.

Thanks Joe - timeline for the code PRs to master?
@jleveque

This document introduced three features which we plan to deploy into SONiC:

1.We proposed to employ Monit to monitor the running status of critical processes in docker containers. The PRs of this proposal in the public SONiC repo are as following:

sonic-net/sonic-buildimage#3940
sonic-net/sonic-buildimage#4033
sonic-net/sonic-buildimage#4706

2.We proposed to employ process monitoring/notification framework of supervisord to implement the auto-restart feature of docker containers. The PRs of this proposal in the public SONiC repo are as following:

[process monitoring/notification framework] https://github.com/Azure/sonic-buildimage/pull/2852/files
[process monitoring/notification framework] sonic-net/sonic-buildimage#4073

[Syncd] https://github.com/Azure/sonic-buildimage/pull/3534/files
[SWSS] https://github.com/Azure/sonic-buildimage/pull/2852/files
https://github.com/Azure/sonic-buildimage/pull/2845/files
[SNMP] sonic-net/sonic-buildimage#3650
[DHCP_Relay] sonic-net/sonic-buildimage#3667
[Radv] sonic-net/sonic-buildimage#3681
[PMon] sonic-net/sonic-buildimage#3689
[Teamd] sonic-net/sonic-buildimage#3703
[LLDP] sonic-net/sonic-buildimage#3713
[Sflow] sonic-net/sonic-buildimage#3751
[Telemetry] sonic-net/sonic-buildimage#3768
[Database] sonic-net/sonic-buildimage#4138
[BGP] sonic-net/sonic-buildimage#4207
[NAT] sonic-net/sonic-buildimage#4208

[CLI to check the state of autorestart feature of each container]
sonic-net/sonic-utilities#798
sonic-net/sonic-utilities#801

@ben-gale
Copy link

ben-gale commented Jun 30, 2020 via email

by Supervisord and high memory restart.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@yxieca yxieca force-pushed the master branch 2 times, most recently from 8498931 to 8837dc2 Compare April 15, 2022 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants