[Pytest] Test inability of Monit to reset internal counter and its new syntax as a workaround #5492
Conversation
…y post-check failed. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
counter and test new syntax of Monit is a workaround for this failure. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@qiluo-msft can you please help me review this PR?
.format(container_name, dut_name, enum_rand_one_per_hwsku_frontend_hostname))
duthost = duthosts[dut_name]
logger.info("Restarting '{}' container ...".format(container_name))
duthost.shell("sudo systemctl restart {}.service".format(container_name))
Returns:
    None.
"""
logger.info("Backing up Monit configuration files on DuT '{}' ...".format(duthost.hostname))
duthost.shell("sudo cp -f /etc/monit/monitrc /tmp/")
purge. In reply to: 1101782789. Refers to: tests/memory_checker/test_memory_checker.py:150 in 3754436.
thread_pool = ThreadPool()
thread_pool.apply_async(consume_memory, (duthost, container_name, vm_workers))

logger.info("Sleep 100 seconds to wait for the alerting messages from syslog...")
time.sleep(100)
logger.info("Sleep 130 seconds to wait for the alerting messages from syslog ...")
Currently the polling interval (cycle) of Monit is 60 seconds. According to the Monit configuration file for telemetry, the container will be restarted if its memory usage is larger than 400MB for 1 time within 2 cycles, and 2 cycles equal 120 seconds. That is why the number 130 is used here. This number is not platform dependent.
thread_pool = ThreadPool()
thread_pool.apply_async(consume_memory, (duthost, container_name, vm_workers))

logger.info("Sleep 130 seconds to wait for the alerting messages from syslog ...")
Will use a named constant to represent this number.
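For illustration, a minimal sketch of what that named constant might look like; the constant names here are hypothetical and not taken from this PR:

```python
import time
import logging

logger = logging.getLogger(__name__)

# Hypothetical constant names; the PR may choose different ones.
MONIT_CYCLE_SECONDS = 60                                  # Monit polling interval in SONiC
WAIT_SYSLOG_ALERT_SECONDS = 2 * MONIT_CYCLE_SECONDS + 10  # 2 cycles (120s) plus a small margin

logger.info("Sleep %d seconds to wait for the alerting messages from syslog ...", WAIT_SYSLOG_ALERT_SECONDS)
time.sleep(WAIT_SYSLOG_ALERT_SECONDS)
```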
"Monit can reset counter and restart '{}'!".format(container_name)) | ||
logger.info("Monit was unable to reset its counter and '{}' can not be restarted!".format(container_name)) | ||
else: | ||
pytest_assert(analyzing_result["total"]["expected_match"] == 5, |
Number 5 is the length of the list expected_alerting_messages at line 334. I will update this to len(expected_alerting_messages).
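A sketch of the intended change, assuming expected_alerting_messages is still in scope at the assertion and pytest_assert is already imported by the test:

```python
# Replace the magic number 5 with the length of the expected-message list.
pytest_assert(analyzing_result["total"]["expected_match"] == len(expected_alerting_messages),
              "Monit can reset counter and restart '{}'!".format(container_name))
```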
@pytest.mark.parametrize("test_setup_and_cleanup",
    [' if status == 3 for 1 times within 2 cycles then exec "/usr/bin/restart_service telemetry"'],
enum_rand_one_per_hwsku_frontend_hostname):
"""Checks that Monit was unable to reset its counter. Specifically Monit will restart
the container if its memory usage is larger than the threshold for specific times within
a sliding window. However, Monit was unable to restart the container anymore if memory usage is
Will update.
@@ -225,25 +412,109 @@ def test_memory_checker(duthosts, enum_dut_feature_container, creds, enum_rand_o
# and number of vm_workers is hard coded. We will extend this testing on all containers after
# the feature 'memory_checker' is fully implemented.
container_name = "telemetry"
vm_workers = 4
vm_workers = 6
vm_workers means how many threads the stress utility will create to consume memory by doing malloc()/free() operations. Usually each thread will consume around 0MB ~ 256MB of physical memory. I tested the stress utility on a lab device and observed that the minimum memory usage of 1 vm_worker is 50MB~80MB. In order to ensure the total memory usage of all workers is larger than 400MB, I selected the number 6 instead of 4.
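For reference, a simplified stand-in for how the test could drive the stress utility inside the container; the exact command line used by consume_memory is not shown in this thread, so the flags below are assumptions:

```python
# Simplified stand-in for the PR's consume_memory(); flags are assumptions.
# 6 VM workers, each looping on malloc()/free() of ~256MB (the stress default),
# so the total usage comfortably exceeds the 400MB threshold.
def consume_memory(duthost, container_name, vm_workers):
    duthost.shell("docker exec {} stress --vm {} --timeout 300".format(container_name, vm_workers))
```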
…10288)
Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Why I did it
This PR aims to fix the Monit issue which shows Monit can't reset its counter when monitoring memory usage of the telemetry container. Specifically, the Monit configuration related to monitoring memory usage of the telemetry container is as follows:

    check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
        if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"

If memory usage of the telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), then it will be restarted. Recently we observed that after the telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted anymore during this 1 hour sliding window. The reason is that Monit can't reset its counter to count again, and Monit can reset its counter if and only if the status of the monitored service was changed from 'Status failed' to 'Status ok'. However, during this 1 hour sliding window, the status of the monitored service was not changed from 'Status failed' to 'Status ok'.

Currently, for each service monitored by Monit, there is an entry showing the monitoring status, monitoring mode, etc. For example, the following output from the command sudo monit status shows the status of the monitored service that monitors memory usage of telemetry:

    Program 'container_memory_telemetry'
        status            Status ok
        monitoring status Monitored
        monitoring mode   active
        on reboot         start
        last exit value   0
        last output       -
        data collected    Sat, 19 Mar 2022 19:56:26

Every 1 minute, Monit runs the script to check the memory usage of telemetry and updates the counter if memory usage is larger than 400MB. If Monit checked the counter and found memory usage of telemetry was larger than 400MB for 10 times within 20 minutes, then the telemetry container was restarted. Following is an example status of the monitored service:

    Program 'container_memory_telemetry'
        status            Status failed
        monitoring status Monitored
        monitoring mode   active
        on reboot         start
        last exit value   0
        last output       -
        data collected    Tue, 01 Feb 2022 22:52:55

After the telemetry container was restarted, we found memory usage of telemetry increased rapidly from around 100MB to more than 400MB during 1 minute, and the status of the monitored service did not have a chance to be changed from 'Status failed' to 'Status ok'.

How I did it
In order to provide a workaround for this issue, Monit recently introduced another syntax format, repeat every <n> cycles, related to exec. This new syntax format enables Monit to repeat executing the background script if the error persists for a given number of cycles.

How to verify it
I verified this change on lab device str-s6000-acs-12. Another pytest PR (sonic-net/sonic-mgmt#5492) is submitted in the sonic-mgmt repo for review.
change during testing and restore it after testing. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
duthost.shell("mv -f /etc/monit/conf.d/monit_* /tmp/") | ||
duthost.shell("cp -f /tmp/monit_telemetry /etc/monit/conf.d/") |
Initially I used cp -f /etc/monit/conf.d/monit_telemetry /tmp to only back up the Monit configuration file of telemetry. However, when I tested this pytest script on a lab device with the 20191130.83 image, it failed on two new test cases, test_monit_reset_counter_failure(...) and test_monit_new_syntax(...). I should answer why it fails with the 20191130.83 image.
In order to fully understand the following RCA, the prerequisite is to learn the asynchronous behavior of Monit: https://mmonit.com/monit/documentation/monit.html#PROGRAM%20STATUS%20TEST. One sentence to summarize this async behavior: at each polling cycle (every 1 minute in SONiC) Monit will fork a thread to execute the script in the configuration file, but it will collect the running status (exit code) of that script at the next polling cycle. For example, suppose the polling cycle of Monit is 1 minute and Monit forks a thread to execute the script memory_checker at 21:03:00; then Monit will collect the exit code of this script at 21:04:00.
I think this RCA will also answer why Monit can not mitigate the high memory spike issue of telemetry in the 202012 production image but can do so in the 20191130 production image.
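To make the one-cycle lag concrete, here is a small toy simulation (not Monit source code, just an illustration of the deferred exit-code collection described above):

```python
from collections import deque

def simulate_monit_cycles(memory_samples_mb, threshold_mb=400):
    """Each cycle Monit launches memory_checker asynchronously; the exit code
    is only collected and evaluated at the *next* cycle."""
    pending = deque()  # exit codes waiting to be collected at the next cycle
    for cycle, mem in enumerate(memory_samples_mb, start=1):
        if pending:
            exit_code = pending.popleft()
            print("cycle {}: collect exit code {} from the previous cycle".format(cycle, exit_code))
        # Launch the checker for this cycle; its result becomes visible one cycle later.
        pending.append(3 if mem > threshold_mb else 0)
        print("cycle {}: launch memory_checker (telemetry at {} MB)".format(cycle, mem))

simulate_monit_cycles([100, 450, 500, 80])
```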
- Why does this pytest script fail on a lab device with the 20191130.83 image?
  The logic of the test function test_monit_reset_counter_failure(...) is: First, the stress utility is leveraged to consume more than 400MB of physical memory in the telemetry container, then this function checks whether telemetry will be restarted or not. Second, if the telemetry container was restarted, the stress utility is leveraged again to consume more than 400MB of physical memory to check whether the telemetry container will be restarted again or not. Third, if the telemetry container was restarted again, this test function fails since that restart is not expected; otherwise it passes, proving Monit is unable to reset its internal counter.
  When this test function test_monit_reset_counter_failure(...) is executed on a lab device with the 20191130.83 image, at the Second step the telemetry container is restarted. This is unexpected behavior and thus the test case fails at the Third step.
- Why was the telemetry container restarted in the Second step?
  The telemetry container is restarted again because Monit can reset its counter.
- Why can Monit reset its counter in the Second step?
  Following is part of the Monit telemetry configuration on the lab device under testing:
      check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
          if status == 3 for 1 times within 2 cycles then exec "/usr/bin/restart_service telemetry"
  I drew a figure (attachment) to present this asynchronous behavior of Monit. The telemetry container can be restarted at the First step, which corresponds to the timestamp 21:02:00 in the figure. At timestamp 21:02:00, the exit code of memory_checker is collected, telemetry is being restarted, and at the same time memory_checker is being executed to check the memory usage of telemetry.
The interesting thing is that in the 20191130.83 image, the thread to execute memory_checker was scheduled by the kernel to run after telemetry was stopped but before telemetry was fully started. However, in the 20201231.54 image, this thread was scheduled to run before telemetry was stopped.
If the thread to run memory_checker was scheduled to run before telemetry was stopped, the thread running memory_checker will see that memory usage of telemetry is larger than 400MB. Then at timestamp 21:03:00, Monit will collect the exit code (3) of memory_checker and will not reset its internal counter.
If the thread to run memory_checker was scheduled to run after telemetry was stopped but before telemetry was fully started, memory_checker will see that memory usage of telemetry is around 70MB, much less than 400MB. Then at timestamp 21:03:00, Monit will collect the exit code (0) of memory_checker and then reset its internal counter.
- Why is the thread to run memory_checker in the 20201231 image scheduled before telemetry is stopped, while in the 20191130.83 image it is scheduled to run while telemetry is being started?
  In the 20201231 image, Monit monitors the memory usage of telemetry, route_checker, disk_checker, container_checker and system resource usage. At each polling cycle, Monit forks 3 or 4 threads and memory_checker is immediately scheduled to run. However, in the 20191130.83 image, Monit forks more than 30 threads, and the thread running memory_checker is scheduled and delayed to run after telemetry was stopped but before it is fully started. This observation is confirmed by the syslogs on the lab device.
@qiluo-msft Please also review the above comments and RCA!
Did you suggest I do not need to manually uninstall the
I mean. In reply to: 1101782789. Refers to: tests/memory_checker/test_memory_checker.py:150 in 3754436.
to delete binary file and also the related configuration files. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Update.
@qiluo-msft Can you please help me review this PR again?
Description of PR
Summary:
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
This PR adds two new test cases to test a Monit issue fixed by the PR (sonic-net/sonic-buildimage#10288) in SONiC image repo.
How did you do it?
Specifically, the 'stress' utility is leveraged to increase the memory usage of a container continuously, then:
- Test whether that container can be restarted by the script run by Monit.
- Test whether that container can be restarted by the script run by Monit; if that container was restarted, then test that the script run by Monit was unable to restart the container anymore because Monit failed to reset its internal counter.
- Test whether that container can be restarted by the script run by Monit; if that container was restarted, then test that the script run by Monit was able to restart the container with the help of the new Monit syntax even though Monit failed to reset its internal counter (see the sketch after this list).
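A hedged sketch of how the new-syntax test case might parametrize its Monit configuration line; the 'repeat every 2 cycles' value and the indirect parametrization are illustrative assumptions, not necessarily what this PR does:

```python
import pytest

# Hypothetical sketch: 'repeat every 2 cycles' is an illustrative value, and the
# fixture name mirrors the parametrization shown for the existing test case.
@pytest.mark.parametrize("test_setup_and_cleanup",
                         [' if status == 3 for 1 times within 2 cycles then exec'
                          ' "/usr/bin/restart_service telemetry" repeat every 2 cycles'],
                         indirect=["test_setup_and_cleanup"])
def test_monit_new_syntax(duthosts, enum_rand_one_per_hwsku_frontend_hostname, test_setup_and_cleanup):
    # Expected flow: stress the container past 400MB, confirm Monit restarts it,
    # stress it again, and confirm Monit restarts it again thanks to 'repeat every'.
    ...
```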
How did you verify/test it?
I tested manually against the DuT str-s6000-0n-5 with 20201231.54 and 20191130.83 images. Following are the testing results with the 20201231.54 image:
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation