-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[memory_checker] Do not check memory usage of containers which are not created #11129
[memory_checker] Do not check memory usage of containers which are not created #11129
Conversation
device is (re)booted, then exits without checking its memory usage. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
This reverts commit a477dbb.
This reverts commit 28a1071.
This reverts commit add2fbf.
booted/rebooted, then `memory_checker` will not check its memory usage. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
/AzurePipelines run Azure.sonic-buildimage |
Azure Pipelines successfully started running 1 pipeline(s). |
list could not be retrieved. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@qiluo-msft Can you help me review this PR please? |
@@ -104,7 +131,13 @@ def main(): | |||
parser.add_argument("threshold_value", type=int, help="threshold value in bytes") | |||
args = parser.parse_args() | |||
|
|||
check_memory_usage(args.container_name, args.threshold_value) | |||
running_container_names = get_running_container_names() | |||
if args.container_name in running_container_names: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working on the test case and will post the PR link at here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sonic-mgmt test case is posted: sonic-net/sonic-mgmt#5823. If you are available, can you please help me review?
running_container_names: A list indicates names of running containers. | ||
""" | ||
running_container_names = [] | ||
docker_client = docker.DockerClient(base_url='unix://var/run/docker.sock') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yozhao101 do we need handle the case when docker engine is not started? Let's say memory checker starts simultaneously with docker service
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
memory_checker
script will be ran by Monit periodically and Monit is delayed to run this script after device is up 300 seconds.
Docker daemon will be started by systemd immediately once device is booted/rebooted.
If docker daemon crashed or is not started successfully, then memory_checker
will exit with error code and log an error message into syslog as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yozhao101 ok
…rted successfully. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@nazariig Can you please review again to see whether logging an error message is good way or not to address your question? |
container_list = container_obj.list(filters={"status": "running"}) | ||
for container in container_list: | ||
running_container_names.append(container.name) | ||
except (docker.errors.APIError, docker.errors.DockerException) as err: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yozhao101 i would suggest to rewrite like this:
try:
docker_client = docker.DockerClient(base_url='unix://var/run/docker.sock')
running_container_list = docker_client.containers.list(filters={"status": "running"})
running_container_names = [ container.name for container in running_container_list ]
except (docker.errors.APIError, docker.errors.DockerException) as err:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Please wait for other reviewers.
/AzurePipelines run Azure.sonic-buildimage |
1.What is the motivation for this PR? This PR aims to add a new test case which will check whether memory_checker can log a message into syslog if a container is not created during device is booted/rebooted. The image PR is: sonic-net/sonic-buildimage#11129. 2.How did you do it? This test case has the following steps: - Removes a container explicitly from DuT - Leverages Loganalyzer to analyze whether the message from memory_checker appears in syslog - Restarts the corresponding container on DuT and do health check 3.How did you verify/test it? I verified this new test case on lab device str-s6000-on-2
…t created (#11129) Signed-off-by: Yong Zhao yozhao@microsoft.com Why I did it This PR aims to fix an issue (#10088) by enhancing the script memory_checker. Specifically, if container is not created successfully during device is booted/rebooted, then memory_checker do not need check its memory usage. How I did it In the script memory_checker, a function is added to get names of running containers. If the specified container name is not in current running container list, then this script will exit without checking its memory usage. How to verify it I tested on a lab device by following the steps: Stops telemetry container with command sudo systemctl stop telemetry.service Removes telemetry container with command docker rm telemetry Checks whether the script memory_checker ran by Monit will generate the syslog message saying it will exit without checking memory usage of telemetry.
1.What is the motivation for this PR? This PR aims to add a new test case which will check whether memory_checker can log a message into syslog if a container is not created during device is booted/rebooted. The image PR is: sonic-net/sonic-buildimage#11129. 2.How did you do it? This test case has the following steps: - Removes a container explicitly from DuT - Leverages Loganalyzer to analyze whether the message from memory_checker appears in syslog - Restarts the corresponding container on DuT and do health check 3.How did you verify/test it? I verified this new test case on lab device str-s6000-on-2
…t created (#11129) Signed-off-by: Yong Zhao yozhao@microsoft.com Why I did it This PR aims to fix an issue (#10088) by enhancing the script memory_checker. Specifically, if container is not created successfully during device is booted/rebooted, then memory_checker do not need check its memory usage. How I did it In the script memory_checker, a function is added to get names of running containers. If the specified container name is not in current running container list, then this script will exit without checking its memory usage. How to verify it I tested on a lab device by following the steps: Stops telemetry container with command sudo systemctl stop telemetry.service Removes telemetry container with command docker rm telemetry Checks whether the script memory_checker ran by Monit will generate the syslog message saying it will exit without checking memory usage of telemetry.
…emon is not running (#11476) Fix in Monit memory_checker plugin. Skip fetching running containers if docker engine is down (can happen in deinit). This PR fixes issue #11472. Signed-off-by: liora liora@nvidia.com Why I did it In the case where Monit runs during deinit flow, memory_checker plugin is fetching the running containers without checking if Docker service is still running. I added this check. How I did it Use systemctl is-active to check if Docker engine is still running. How to verify it Use systemctl to stop docker engine and reload Monit, no errors in log and relevant print appears in log. Which release branch to backport (provide reason below if selected) The fix is required in 202205 and 202012 since the PR that introduced the issue was cherry picked to those branches (#11129).
…emon is not running (#11476) Fix in Monit memory_checker plugin. Skip fetching running containers if docker engine is down (can happen in deinit). This PR fixes issue #11472. Signed-off-by: liora liora@nvidia.com Why I did it In the case where Monit runs during deinit flow, memory_checker plugin is fetching the running containers without checking if Docker service is still running. I added this check. How I did it Use systemctl is-active to check if Docker engine is still running. How to verify it Use systemctl to stop docker engine and reload Monit, no errors in log and relevant print appears in log. Which release branch to backport (provide reason below if selected) The fix is required in 202205 and 202012 since the PR that introduced the issue was cherry picked to those branches (#11129).
…emon is not running (#11476) Fix in Monit memory_checker plugin. Skip fetching running containers if docker engine is down (can happen in deinit). This PR fixes issue #11472. Signed-off-by: liora liora@nvidia.com Why I did it In the case where Monit runs during deinit flow, memory_checker plugin is fetching the running containers without checking if Docker service is still running. I added this check. How I did it Use systemctl is-active to check if Docker engine is still running. How to verify it Use systemctl to stop docker engine and reload Monit, no errors in log and relevant print appears in log. Which release branch to backport (provide reason below if selected) The fix is required in 202205 and 202012 since the PR that introduced the issue was cherry picked to those branches (#11129).
…emon is not running (sonic-net#11476) Fix in Monit memory_checker plugin. Skip fetching running containers if docker engine is down (can happen in deinit). This PR fixes issue sonic-net#11472. Signed-off-by: liora liora@nvidia.com Why I did it In the case where Monit runs during deinit flow, memory_checker plugin is fetching the running containers without checking if Docker service is still running. I added this check. How I did it Use systemctl is-active to check if Docker engine is still running. How to verify it Use systemctl to stop docker engine and reload Monit, no errors in log and relevant print appears in log. Which release branch to backport (provide reason below if selected) The fix is required in 202205 and 202012 since the PR that introduced the issue was cherry picked to those branches (sonic-net#11129).
… service is not running. (#16018) #### Why I did it To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created #11129](#11129). There could be a scenario before the reboot, where 1. The `docker service` has stopped 2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry` In such scenario, the `memory_checker` script will throw an error to the syslog: ``` ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))' ``` But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog. #### How I did it Change the log severity to the warning and changed the return value. #### How to verify it It is really hard to catch the exact moment described in the `Why I did it` section. In order to check the logic: 1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](https://github.com/sonic-net/sonic-buildimage/blob/47742dfc2c0d1fa27198d69c9183ddc044e11b22/files/image_config/monit/memory_checker#L139) file on the switch. 2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry` 3. Check the syslog for such messages: ``` WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte d.', FileNotFoundError(2, 'No such file or directory'))' INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running! ```
… service is not running. (sonic-net#16018) #### Why I did it To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created sonic-net#11129](sonic-net#11129). There could be a scenario before the reboot, where 1. The `docker service` has stopped 2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry` In such scenario, the `memory_checker` script will throw an error to the syslog: ``` ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))' ``` But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog. #### How I did it Change the log severity to the warning and changed the return value. #### How to verify it It is really hard to catch the exact moment described in the `Why I did it` section. In order to check the logic: 1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](https://github.com/sonic-net/sonic-buildimage/blob/47742dfc2c0d1fa27198d69c9183ddc044e11b22/files/image_config/monit/memory_checker#L139) file on the switch. 2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry` 3. Check the syslog for such messages: ``` WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte d.', FileNotFoundError(2, 'No such file or directory'))' INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running! ```
… service is not running. (#16018) #### Why I did it To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created #11129](#11129). There could be a scenario before the reboot, where 1. The `docker service` has stopped 2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry` In such scenario, the `memory_checker` script will throw an error to the syslog: ``` ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))' ``` But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog. #### How I did it Change the log severity to the warning and changed the return value. #### How to verify it It is really hard to catch the exact moment described in the `Why I did it` section. In order to check the logic: 1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](https://github.com/sonic-net/sonic-buildimage/blob/47742dfc2c0d1fa27198d69c9183ddc044e11b22/files/image_config/monit/memory_checker#L139) file on the switch. 2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry` 3. Check the syslog for such messages: ``` WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte d.', FileNotFoundError(2, 'No such file or directory'))' INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running! ```
… service is not running. (sonic-net#16018) #### Why I did it To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created sonic-net#11129](sonic-net#11129). There could be a scenario before the reboot, where 1. The `docker service` has stopped 2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry` In such scenario, the `memory_checker` script will throw an error to the syslog: ``` ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))' ``` But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog. #### How I did it Change the log severity to the warning and changed the return value. #### How to verify it It is really hard to catch the exact moment described in the `Why I did it` section. In order to check the logic: 1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](https://github.com/sonic-net/sonic-buildimage/blob/47742dfc2c0d1fa27198d69c9183ddc044e11b22/files/image_config/monit/memory_checker#L139) file on the switch. 2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry` 3. Check the syslog for such messages: ``` WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte d.', FileNotFoundError(2, 'No such file or directory'))' INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running! ```
… service is not running. (sonic-net#16018) #### Why I did it To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created sonic-net#11129](sonic-net#11129). There could be a scenario before the reboot, where 1. The `docker service` has stopped 2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry` In such scenario, the `memory_checker` script will throw an error to the syslog: ``` ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))' ``` But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog. #### How I did it Change the log severity to the warning and changed the return value. #### How to verify it It is really hard to catch the exact moment described in the `Why I did it` section. In order to check the logic: 1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](https://github.com/sonic-net/sonic-buildimage/blob/47742dfc2c0d1fa27198d69c9183ddc044e11b22/files/image_config/monit/memory_checker#L139) file on the switch. 2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry` 3. Check the syslog for such messages: ``` WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte d.', FileNotFoundError(2, 'No such file or directory'))' INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running! ```
… service is not running. (#16018) #### Why I did it To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created #11129](#11129). There could be a scenario before the reboot, where 1. The `docker service` has stopped 2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry` In such scenario, the `memory_checker` script will throw an error to the syslog: ``` ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))' ``` But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog. #### How I did it Change the log severity to the warning and changed the return value. #### How to verify it It is really hard to catch the exact moment described in the `Why I did it` section. In order to check the logic: 1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](https://github.com/sonic-net/sonic-buildimage/blob/47742dfc2c0d1fa27198d69c9183ddc044e11b22/files/image_config/monit/memory_checker#L139) file on the switch. 2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry` 3. Check the syslog for such messages: ``` WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte d.', FileNotFoundError(2, 'No such file or directory'))' INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running! ```
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
This PR aims to fix an issue (#10088) by enhancing the script
memory_checker
.Specifically, if container is not created successfully during device is booted/rebooted, then
memory_checker
do not need check its memory usage.How I did it
In the script
memory_checker
, a function is added to get names of running containers. If the specified container name is not in current running container list, then this script will exit without checking its memory usage.How to verify it
I tested on a lab device by following the steps:
Stops
telemetry
container with commandsudo systemctl stop telemetry.service
Removes
telemetry
container with commanddocker rm telemetry
Checks whether the script
memory_checker
ran by Monit will generate the syslog message saying it will exit without checking memory usage oftelemetry
:Which release branch to backport (provide reason below if selected)
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)