[Pytest] Test inability of Monit to reset internal counter and its new syntax as a workaround #5492
Conversation
…y post-check failed. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
counter and test new syntax of Monit is a workaround for this failure. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@qiluo-msft can you please help me review this PR?
.format(container_name, dut_name, enum_rand_one_per_hwsku_frontend_hostname))
duthost = duthosts[dut_name]
logger.info("Restarting '{}' container ...".format(container_name))
duthost.shell("sudo systemctl restart {}.service".format(container_name))
Returns:
    None.
"""
logger.info("Backing up Monit configuration files on DuT '{}' ...".format(duthost.hostname))
duthost.shell("sudo cp -f /etc/monit/monitrc /tmp/")
purge. In reply to: 1101782789. Refers to: tests/memory_checker/test_memory_checker.py:150 in 3754436.
thread_pool = ThreadPool()
thread_pool.apply_async(consume_memory, (duthost, container_name, vm_workers))

logger.info("Sleep 100 seconds to wait for the alerting messages from syslog...")
time.sleep(100)
logger.info("Sleep 130 seconds to wait for the alerting messages from syslog ...")
Currently the polling interval (cycle) of Monit is 60 seconds. According to the Monit configuration file for telemetry, the container will be restarted if its memory usage is larger than 400MB for 1 time within 2 cycles, and 2 cycles equal 120 seconds. That is why the number 130 is used here. This number is not platform dependent.
thread_pool = ThreadPool()
thread_pool.apply_async(consume_memory, (duthost, container_name, vm_workers))

logger.info("Sleep 130 seconds to wait for the alerting messages from syslog ...")
Will use a named constant to represent this number.
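For illustration, a minimal sketch of what that named constant might look like; the constant names here are hypothetical and not taken from this PR:

```python
import time
import logging

logger = logging.getLogger(__name__)

# Hypothetical constant names; the PR may choose different ones.
MONIT_CYCLE_SECONDS = 60                                  # Monit polling interval in SONiC
WAIT_SYSLOG_ALERT_SECONDS = 2 * MONIT_CYCLE_SECONDS + 10  # 2 cycles (120s) plus a small margin

logger.info("Sleep %d seconds to wait for the alerting messages from syslog ...", WAIT_SYSLOG_ALERT_SECONDS)
time.sleep(WAIT_SYSLOG_ALERT_SECONDS)
```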
"Monit can reset counter and restart '{}'!".format(container_name)) | ||
logger.info("Monit was unable to reset its counter and '{}' can not be restarted!".format(container_name)) | ||
else: | ||
pytest_assert(analyzing_result["total"]["expected_match"] == 5, |
Number 5 is the length of the list expected_alerting_messages at line 334. I will update this to len(expected_alerting_messages).
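A sketch of the intended change, assuming expected_alerting_messages is still in scope at the assertion and pytest_assert is already imported by the test:

```python
# Replace the magic number 5 with the length of the expected-message list.
pytest_assert(analyzing_result["total"]["expected_match"] == len(expected_alerting_messages),
              "Monit can reset counter and restart '{}'!".format(container_name))
```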
@pytest.mark.parametrize("test_setup_and_cleanup",
    [' if status == 3 for 1 times within 2 cycles then exec "/usr/bin/restart_service telemetry"'],
enum_rand_one_per_hwsku_frontend_hostname):
"""Checks that Monit was unable to reset its counter. Specifically Monit will restart
the container if its memory usage is larger than the threshold for specific times within
a sliding window. However, Monit was unable to restart the container anymore if memory usage is
Will update.
@@ -225,25 +412,109 @@ def test_memory_checker(duthosts, enum_dut_feature_container, creds, enum_rand_o
# and number of vm_workers is hard coded. We will extend this testing on all containers after
# the feature 'memory_checker' is fully implemented.
container_name = "telemetry"
vm_workers = 4
vm_workers = 6
vm_workers means how many threads the stress utility will create to consume memory by doing malloc()/free() operations. Usually each thread will consume around 0MB ~ 256MB of physical memory. I tested the stress utility on a lab device and observed that the minimum memory usage of 1 vm_worker is 50MB~80MB. In order to ensure the total memory usage of all workers is larger than 400MB, I selected the number 6 instead of 4.
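For reference, a simplified stand-in for how the test could drive the stress utility inside the container; the exact command line used by consume_memory is not shown in this thread, so the flags below are assumptions:

```python
# Simplified stand-in for the PR's consume_memory(); flags are assumptions.
# 6 VM workers, each looping on malloc()/free() of ~256MB (the stress default),
# so the total usage comfortably exceeds the 400MB threshold.
def consume_memory(duthost, container_name, vm_workers):
    duthost.shell("docker exec {} stress --vm {} --timeout 300".format(container_name, vm_workers))
```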
…10288)
Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Why I did it
This PR aims to fix the Monit issue which shows Monit can't reset its counter when monitoring memory usage of the telemetry container. Specifically, the Monit configuration related to monitoring memory usage of the telemetry container is as follows:

    check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
        if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"

If memory usage of the telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), then it will be restarted. Recently we observed that after the telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted anymore during this 1 hour sliding window. The reason is that Monit can't reset its counter to count again, and Monit can reset its counter if and only if the status of the monitored service was changed from 'Status failed' to 'Status ok'. However, during this 1 hour sliding window, the status of the monitored service was not changed from 'Status failed' to 'Status ok'.

Currently, for each service monitored by Monit, there is an entry showing the monitoring status, monitoring mode, etc. For example, the following output from the command sudo monit status shows the status of the monitored service that monitors memory usage of telemetry:

    Program 'container_memory_telemetry'
        status            Status ok
        monitoring status Monitored
        monitoring mode   active
        on reboot         start
        last exit value   0
        last output       -
        data collected    Sat, 19 Mar 2022 19:56:26

Every 1 minute, Monit runs the script to check the memory usage of telemetry and updates the counter if memory usage is larger than 400MB. If Monit checked the counter and found memory usage of telemetry was larger than 400MB for 10 times within 20 minutes, then the telemetry container was restarted. Following is an example status of the monitored service:

    Program 'container_memory_telemetry'
        status            Status failed
        monitoring status Monitored
        monitoring mode   active
        on reboot         start
        last exit value   0
        last output       -
        data collected    Tue, 01 Feb 2022 22:52:55

After the telemetry container was restarted, we found memory usage of telemetry increased rapidly from around 100MB to more than 400MB during 1 minute, and the status of the monitored service did not have a chance to be changed from 'Status failed' to 'Status ok'.

How I did it
In order to provide a workaround for this issue, Monit recently introduced another syntax format, repeat every <n> cycles, related to exec. This new syntax format enables Monit to repeat executing the background script if the error persists for a given number of cycles.

How to verify it
I verified this change on lab device str-s6000-acs-12. Another pytest PR (sonic-net/sonic-mgmt#5492) is submitted in the sonic-mgmt repo for review.
change during testing and restore it after testing. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
duthost.shell("mv -f /etc/monit/conf.d/monit_* /tmp/") | ||
duthost.shell("cp -f /tmp/monit_telemetry /etc/monit/conf.d/") |
Initially I used cp -f /etc/monit/conf.d/monit_telemetry /tmp to only back up the Monit configuration file of telemetry. However, when I tested this pytest script on a lab device with the 20191130.83 image, it failed on two new test cases, test_monit_reset_counter_failure(...) and test_monit_new_syntax(...). I should answer why it fails with the 20191130.83 image.
In order to fully understand the following RCA, the prerequisite is to learn the asynchronous behavior of Monit: https://mmonit.com/monit/documentation/monit.html#PROGRAM%20STATUS%20TEST. One sentence to summarize this async behavior: at each polling cycle (every 1 minute in SONiC) Monit will fork a thread to execute the script in the configuration file, but it will collect the running status (exit code) of that script at the next polling cycle. For example, suppose the polling cycle of Monit is 1 minute and Monit forks a thread to execute the script memory_checker at 21:03:00; then Monit will collect the exit code of this script at 21:04:00.
I think this RCA will also answer why Monit can not mitigate the high memory spike issue of telemetry in the 202012 production image but can do so in the 20191130 production image.
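To make the one-cycle lag concrete, here is a small toy simulation (not Monit source code, just an illustration of the deferred exit-code collection described above):

```python
from collections import deque

def simulate_monit_cycles(memory_samples_mb, threshold_mb=400):
    """Each cycle Monit launches memory_checker asynchronously; the exit code
    is only collected and evaluated at the *next* cycle."""
    pending = deque()  # exit codes waiting to be collected at the next cycle
    for cycle, mem in enumerate(memory_samples_mb, start=1):
        if pending:
            exit_code = pending.popleft()
            print("cycle {}: collect exit code {} from the previous cycle".format(cycle, exit_code))
        # Launch the checker for this cycle; its result becomes visible one cycle later.
        pending.append(3 if mem > threshold_mb else 0)
        print("cycle {}: launch memory_checker (telemetry at {} MB)".format(cycle, mem))

simulate_monit_cycles([100, 450, 500, 80])
```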
- Why does this pytest script fail on a lab device with the 20191130.83 image?
  The logic of the test function test_monit_reset_counter_failure(...) is: First, the stress utility is leveraged to consume more than 400MB of physical memory in the telemetry container, then this function checks whether telemetry will be restarted or not. Second, if the telemetry container was restarted, the stress utility is leveraged again to consume more than 400MB of physical memory to check whether the telemetry container will be restarted again or not. Third, if the telemetry container was restarted again, this test function fails since that restart is not expected; otherwise it passes, proving Monit is unable to reset its internal counter.
  When this test function test_monit_reset_counter_failure(...) is executed on a lab device with the 20191130.83 image, at the Second step the telemetry container is restarted. This is unexpected behavior and thus the test case fails at the Third step.
- Why was the telemetry container restarted in the Second step?
  The telemetry container is restarted again because Monit can reset its counter.
- Why can Monit reset its counter in the Second step?
  Following is part of the Monit telemetry configuration on the lab device under testing:
      check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
          if status == 3 for 1 times within 2 cycles then exec "/usr/bin/restart_service telemetry"
  I drew a figure (attachment) to present this asynchronous behavior of Monit. The telemetry container can be restarted at the First step, which corresponds to the timestamp 21:02:00 in the figure. At timestamp 21:02:00, the exit code of memory_checker is collected, telemetry is being restarted, and at the same time memory_checker is being executed to check the memory usage of telemetry.
The interesting thing is that in the 20191130.83 image, the thread to execute memory_checker was scheduled by the kernel to run after telemetry was stopped but before telemetry was fully started. However, in the 20201231.54 image, this thread was scheduled to run before telemetry was stopped.
If the thread to run memory_checker was scheduled to run before telemetry was stopped, the thread running memory_checker will see that memory usage of telemetry is larger than 400MB. Then at timestamp 21:03:00, Monit will collect the exit code (3) of memory_checker and will not reset its internal counter.
If the thread to run memory_checker was scheduled to run after telemetry was stopped but before telemetry was fully started, memory_checker will see that memory usage of telemetry is around 70MB, much less than 400MB. Then at timestamp 21:03:00, Monit will collect the exit code (0) of memory_checker and then reset its internal counter.
- Why is the thread to run memory_checker in the 20201231 image scheduled before telemetry is stopped, while in the 20191130.83 image it is scheduled to run while telemetry is being started?
  In the 20201231 image, Monit monitors the memory usage of telemetry, route_checker, disk_checker, container_checker and system resource usage. At each polling cycle, Monit forks 3 or 4 threads and memory_checker is immediately scheduled to run. However, in the 20191130.83 image, Monit forks more than 30 threads, and the thread running memory_checker is scheduled and delayed to run after telemetry was stopped but before it is fully started. This observation is confirmed by the syslogs on the lab device.
@qiluo-msft Please also review the above comments and RCA!
Did you suggest I do not need to manually uninstall the
I mean. In reply to: 1101782789. Refers to: tests/memory_checker/test_memory_checker.py:150 in 3754436.
to delete binary file and also the related configuration files. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Update.
@qiluo-msft Can you please help me review this PR again?
Description of PR
Summary:
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
This PR adds two new test cases to test a Monit issue fixed by the PR (sonic-net/sonic-buildimage#10288) in SONiC image repo.
How did you do it?
Specifically, the 'stress' utility is leveraged to increase the memory usage of a container continuously, then:
- Test whether that container can be restarted by the script run by Monit.
- Test whether that container can be restarted by the script run by Monit; if that container was restarted, then test that the script run by Monit was unable to restart the container anymore because Monit failed to reset its internal counter.
- Test whether that container can be restarted by the script run by Monit; if that container was restarted, then test that the script run by Monit was able to restart the container with the help of the new Monit syntax even though Monit failed to reset its internal counter (see the sketch after this list).
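A hedged sketch of how the new-syntax test case might parametrize its Monit configuration line; the 'repeat every 2 cycles' value and the indirect parametrization are illustrative assumptions, not necessarily what this PR does:

```python
import pytest

# Hypothetical sketch: 'repeat every 2 cycles' is an illustrative value, and the
# fixture name mirrors the parametrization shown for the existing test case.
@pytest.mark.parametrize("test_setup_and_cleanup",
                         [' if status == 3 for 1 times within 2 cycles then exec'
                          ' "/usr/bin/restart_service telemetry" repeat every 2 cycles'],
                         indirect=["test_setup_and_cleanup"])
def test_monit_new_syntax(duthosts, enum_rand_one_per_hwsku_frontend_hostname, test_setup_and_cleanup):
    # Expected flow: stress the container past 400MB, confirm Monit restarts it,
    # stress it again, and confirm Monit restarts it again thanks to 'repeat every'.
    ...
```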
How did you verify/test it?
I tested manually against the DuT str-s6000-0n-5 with 20201231.54 and 20191130.83 images. Following are the testing results with the 20201231.54 image:
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation