
[Pytest] Test inability of Monit to reset internal counter and its new syntax as a workaround #5492

Merged

Conversation

yozhao101
Contributor

@yozhao101 yozhao101 commented Apr 8, 2022

Description of PR

Summary:
Fixes # (issue)

Type of change

  • [ ] Bug fix
  • [ ] Testbed and Framework (new/improvement)
  • [x] Test case (new/improvement)

Back port request

  • [ ] 201911
  • [x] 202012

Approach

What is the motivation for this PR?

This PR adds two new test cases covering a Monit issue fixed by PR sonic-net/sonic-buildimage#10288 in the SONiC image repo.

How did you do it?

Specifically, the 'stress' utility is leveraged to continuously increase the memory usage of a container, and then (a condensed sketch of this flow follows the list):

  • Test whether that container can be restarted by the script run by Monit.

  • Test whether that container can be restarted by the script run by Monit; if the container was restarted, verify that the
    script run by Monit is unable to restart the container again because Monit fails to reset its internal counter.

  • Test whether that container can be restarted by the script run by Monit; if the container was restarted, verify that the
    script run by Monit is able to restart the container with the help of the new Monit syntax, even though Monit fails to
    reset its internal counter.
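
The following is a condensed, illustrative sketch of that flow; consume_memory mirrors the helper in the test, while the constants, the background/wait structure, and the comments are simplifying assumptions rather than the exact code in tests/memory_checker/test_memory_checker.py:

    import time
    from multiprocessing.pool import ThreadPool

    CONTAINER_NAME = "telemetry"
    VM_WORKERS = 6                 # number of 'stress' memory workers ('stress -m 6')
    WAIT_FOR_ALERTS_SECONDS = 130  # ~2 Monit cycles (60 seconds each) plus a small margin

    def consume_memory(duthost, container_name, workers):
        # Each worker loops on malloc()/free(), pushing the container above the
        # 400MB threshold watched by Monit's memory_checker script.
        duthost.shell("docker exec {} stress -m {}".format(container_name, workers),
                      module_ignore_errors=True)

    def stress_container_and_wait(duthost):
        # Run 'stress' in the background, then give Monit time to detect the high
        # memory usage and execute the restart script.
        pool = ThreadPool()
        pool.apply_async(consume_memory, (duthost, CONTAINER_NAME, VM_WORKERS))
        time.sleep(WAIT_FOR_ALERTS_SECONDS)
        # ... the test then analyzes syslog to decide whether the container was restarted.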

How did you verify/test it?

I tested manually against the DuT str-s6000-on-5 with the 20201231.54 and 20191130.83 images. The following are the test results with the 20201231.54 image:

    ----------------------------------------------------------------------------------------------- live log setup -----------------------------------------------------------------------------------------------
    18:56:31 __init__.set_default L0049 INFO | Completeness level not set during test execution. Setting to default level: CompletenessLevel.basic
    18:56:31 __init__.check_test_completeness L0138 INFO | Test has no defined levels. Continue without test completeness checks
    18:56:31 __init__.loganalyzer L0048 INFO | Log analyzer is disabled
    18:56:31 __init__._fixture_generator_decorator L0071 INFO | -------------------- fixture test_setup_and_cleanup setup starts --------------------
    18:56:31 test_memory_checker.install_stress_utili L0120 INFO | Installing 'stress' utility in 'telemetry' container ...
    18:56:31 test_memory_checker.install_stress_utili L0127 INFO | Shuting down all BGP sessions ...
    18:56:34 test_memory_checker.install_stress_utili L0129 INFO | All BGP sessions are shut down!...
    18:56:54 test_memory_checker.install_stress_utili L0136 INFO | 'stress' utility was installed.
    18:56:54 test_memory_checker.backup_monit_config_ L0059 INFO | Backing up Monit configuration files on DuT 'str-s6000-on-5' ...
    18:56:57 test_memory_checker.backup_monit_config_ L0062 INFO | Monit configuration files on DuT 'str-s6000-on-5' is backed up.
    18:56:57 test_memory_checker.customize_monit_conf L0074 INFO | Modifying Monit config to eliminate start delay and decrease interval ...
    18:57:01 test_memory_checker.customize_monit_conf L0078 INFO | Modifying Monit config to eliminate start delay and decrease interval are done.
    18:57:01 test_memory_checker.restart_monit_servic L0105 INFO | Restarting Monit service ...
    18:57:02 test_memory_checker.restart_monit_servic L0107 INFO | Monit service is restarted.
    18:57:02 __init__._fixture_generator_decorator L0075 INFO | -------------------- fixture test_setup_and_cleanup setup ends --------------------
    ----------------------------------------------------------------------------------------------- live log call ------------------------------------------------------------------------------------------------
    18:57:02 test_memory_checker.test_monit_reset_cou L0465 INFO | Checks whether 'telemetry' is running ...
    18:57:04 test_memory_checker.test_monit_reset_cou L0471 INFO | 'telemetry' is running on DuT!
    18:57:08 test_memory_checker.consumes_memory_and_ L0349 INFO | Sleep 130 seconds to wait for the alerting messages from syslog ...
    18:57:08 test_memory_checker.consume_memory L0207 INFO | Executing command 'stress -m 6' in 'telemetry' container ...
    18:59:18 test_memory_checker.consumes_memory_and_ L0352 INFO | Checking the alerting messages related to container restart ...
    18:59:31 test_memory_checker.consumes_memory_and_ L0354 INFO | Found all the expected alerting messages from syslog!
    18:59:31 test_memory_checker.consumes_memory_and_ L0356 INFO | Waiting for 'telemetry' container to be restarted ...
    18:59:32 test_memory_checker.consumes_memory_and_ L0362 INFO | 'telemetry' container is restarted.
    18:59:32 test_memory_checker.consumes_memory_and_ L0364 INFO | Running 'stress' utility again in 'telemetry' ...
    18:59:32 test_memory_checker.consume_memory L0207 INFO | Executing command 'stress -m 6' in 'telemetry' container ...
    18:59:33 test_memory_checker.consumes_memory_and_ L0369 INFO | Checking memory usage of 'telemetry' every 30 seconds for 6 times ...
    18:59:38 test_memory_checker.consumes_memory_and_ L0373 INFO | Memory usage of 'telemetry' is '901.7MiB'
    19:00:13 test_memory_checker.consumes_memory_and_ L0373 INFO | Memory usage of 'telemetry' is '1.268GiB'
    19:00:48 test_memory_checker.consumes_memory_and_ L0373 INFO | Memory usage of 'telemetry' is '495.5MiB'
    19:01:22 test_memory_checker.consumes_memory_and_ L0373 INFO | Memory usage of 'telemetry' is '1.24GiB'
    19:01:57 test_memory_checker.consumes_memory_and_ L0373 INFO | Memory usage of 'telemetry' is '777.8MiB'
    19:02:32 test_memory_checker.consumes_memory_and_ L0373 INFO | Memory usage of 'telemetry' is '761.8MiB'
    19:03:02 test_memory_checker.consumes_memory_and_ L0376 INFO | Analyzing syslog messages to verify whether 'telemetry' is restarted ...
    19:03:20 test_memory_checker.consumes_memory_and_ L0381 INFO | Monit was unable to reset its counter and 'telemetry' can not be restarted!
    PASSED [ 47%]
    --------------------------------------------------------------------------------------------- live log teardown ----------------------------------------------------------------------------------------------
    19:03:20 __init__._fixture_generator_decorator L0083 INFO | -------------------- fixture test_setup_and_cleanup teardown starts --------------------
    19:03:20 test_memory_checker.restore_monit_config L0090 INFO | Restoring original Monit configuration files on DuT 'str-s6000-on-5' ...
    19:03:23 test_memory_checker.restore_monit_config L0093 INFO | Original Monit configuration files on DuT 'str-s6000-on-5' is restored.
    19:03:23 test_memory_checker.restart_monit_servic L0105 INFO | Restarting Monit service ...
    19:03:26 test_memory_checker.restart_monit_servic L0107 INFO | Monit service is restarted.
    19:03:26 test_memory_checker.restart_container L0047 INFO | Restarting 'telemetry' container ...
    19:03:39 test_memory_checker.restart_container L0049 INFO | 'telemetry' is restarted.
    19:03:39 test_memory_checker.remove_stress_utilit L0150 INFO | Removing 'stress' utility from 'telemetry' container ...
    19:03:56 test_memory_checker.remove_stress_utilit L0154 INFO | 'stress' utility was removed.
    19:03:56 test_memory_checker.remove_stress_utilit L0156 INFO | Bringing up all BGP sessions ...
    19:04:00 test_memory_checker.remove_stress_utilit L0158 INFO | BGP sessions are started up.
    19:04:00 test_memory_checker.postcheck_critical_p L0239 INFO | Checking the running status of critical processes in 'telemetry' container ...
    19:04:06 sonic.critical_process_status L0572 INFO | ====== supervisor process status for service telemetry ======
    19:04:06 test_memory_checker.postcheck_critical_p L0245 INFO | All critical processes in 'telemetry' container are running.
    19:04:06 __init__._fixture_generator_decorator L0092 INFO | -------------------- fixture test_setup_and_cleanup teardown ends --------------------


      19:04:06 __init__.set_default L0049 INFO | Completeness level not set during test execution. Setting to default level: CompletenessLevel.basic
      19:04:06 __init__.check_test_completeness L0138 INFO | Test has no defined levels. Continue without test completeness checks
      19:04:06 __init__.loganalyzer L0048 INFO | Log analyzer is disabled
      19:04:06 __init__._fixture_generator_decorator L0071 INFO | -------------------- fixture test_setup_and_cleanup setup starts --------------------
      19:04:06 test_memory_checker.install_stress_utili L0120 INFO | Installing 'stress' utility in 'telemetry' container ...
      19:04:06 test_memory_checker.install_stress_utili L0127 INFO | Shuting down all BGP sessions ...
      19:04:10 test_memory_checker.install_stress_utili L0129 INFO | All BGP sessions are shut down!...
      19:04:29 test_memory_checker.install_stress_utili L0136 INFO | 'stress' utility was installed.
      19:04:29 test_memory_checker.backup_monit_config_ L0059 INFO | Backing up Monit configuration files on DuT 'str-s6000-on-5' ...
      19:04:32 test_memory_checker.backup_monit_config_ L0062 INFO | Monit configuration files on DuT 'str-s6000-on-5' is backed up.
      19:04:32 test_memory_checker.customize_monit_conf L0074 INFO | Modifying Monit config to eliminate start delay and decrease interval ...
      19:04:36 test_memory_checker.customize_monit_conf L0078 INFO | Modifying Monit config to eliminate start delay and decrease interval are done.
      19:04:36 test_memory_checker.restart_monit_servic L0105 INFO | Restarting Monit service ...
      19:04:38 test_memory_checker.restart_monit_servic L0107 INFO | Monit service is restarted.
      19:04:38 __init__._fixture_generator_decorator L0075 INFO | -------------------- fixture test_setup_and_cleanup setup ends --------------------
      ----------------------------------------------------------------------------------------------- live log call ------------------------------------------------------------------------------------------------
      19:04:38 test_memory_checker.test_monit_new_synta L0512 INFO | Checks whether 'telemetry' is running ...
      19:04:40 test_memory_checker.test_monit_new_synta L0518 INFO | 'telemetry' is running on DuT!
      19:04:44 test_memory_checker.consumes_memory_and_ L0349 INFO | Sleep 130 seconds to wait for the alerting messages from syslog ...
      19:04:44 test_memory_checker.consume_memory L0207 INFO | Executing command 'stress -m 6' in 'telemetry' container ...
      19:06:55 test_memory_checker.consumes_memory_and_ L0352 INFO | Checking the alerting messages related to container restart ...
      19:07:06 test_memory_checker.consumes_memory_and_ L0354 INFO | Found all the expected alerting messages from syslog!
      19:07:06 test_memory_checker.consumes_memory_and_ L0356 INFO | Waiting for 'telemetry' container to be restarted ...
      19:07:07 test_memory_checker.consumes_memory_and_ L0362 INFO | 'telemetry' container is restarted.
      19:07:07 test_memory_checker.consumes_memory_and_ L0364 INFO | Running 'stress' utility again in 'telemetry' ...
      19:07:07 test_memory_checker.consume_memory L0207 INFO | Executing command 'stress -m 6' in 'telemetry' container ...
      19:07:09 test_memory_checker.consumes_memory_and_ L0369 INFO | Checking memory usage of 'telemetry' every 30 seconds for 6 times ...
      19:07:14 test_memory_checker.consumes_memory_and_ L0373 INFO | Memory usage of 'telemetry' is '1.041GiB'
      19:07:49 test_memory_checker.consumes_memory_and_ L0373 INFO | Memory usage of 'telemetry' is '0B'
      19:08:22 test_memory_checker.consumes_memory_and_ L0373 INFO | Memory usage of 'telemetry' is '93.53MiB'
      19:08:56 test_memory_checker.consumes_memory_and_ L0373 INFO | Memory usage of 'telemetry' is '93.77MiB'
      19:09:31 test_memory_checker.consumes_memory_and_ L0373 INFO | Memory usage of 'telemetry' is '93.8MiB'
      19:10:05 test_memory_checker.consumes_memory_and_ L0373 INFO | Memory usage of 'telemetry' is '95.88MiB'
      19:10:35 test_memory_checker.consumes_memory_and_ L0376 INFO | Analyzing syslog messages to verify whether 'telemetry' is restarted ...
      19:10:45 test_memory_checker.consumes_memory_and_ L0385 INFO | Monit was able to restart 'telemetry' with the help of new syntax!
      PASSED [ 80%]
      --------------------------------------------------------------------------------------------- live log teardown ----------------------------------------------------------------------------------------------
      19:10:45 __init__._fixture_generator_decorator L0083 INFO | -------------------- fixture test_setup_and_cleanup teardown starts --------------------
      19:10:45 test_memory_checker.restore_monit_config L0090 INFO | Restoring original Monit configuration files on DuT 'str-s6000-on-5' ...
      19:10:48 test_memory_checker.restore_monit_config L0093 INFO | Original Monit configuration files on DuT 'str-s6000-on-5' is restored.
      19:10:48 test_memory_checker.restart_monit_servic L0105 INFO | Restarting Monit service ...
      19:10:51 test_memory_checker.restart_monit_servic L0107 INFO | Monit service is restarted.
      19:10:51 test_memory_checker.restart_container L0047 INFO | Restarting 'telemetry' container ...
      19:11:00 test_memory_checker.restart_container L0049 INFO | 'telemetry' is restarted.
      19:11:00 test_memory_checker.remove_stress_utilit L0150 INFO | Removing 'stress' utility from 'telemetry' container ...
      19:11:17 test_memory_checker.remove_stress_utilit L0154 INFO | 'stress' utility was removed.
      19:11:17 test_memory_checker.remove_stress_utilit L0156 INFO | Bringing up all BGP sessions ...
      19:11:20 test_memory_checker.remove_stress_utilit L0158 INFO | BGP sessions are started up.
      19:11:20 test_memory_checker.postcheck_critical_p L0239 INFO | Checking the running status of critical processes in 'telemetry' container ...
      19:11:26 sonic.critical_process_status L0572 INFO | ====== supervisor process status for service telemetry ======
      19:11:26 test_memory_checker.postcheck_critical_p L0245 INFO | All critical processes in 'telemetry' container are running.
      19:11:26 __init__._fixture_generator_decorator L0092 INFO | -------------------- fixture test_setup_and_cleanup teardown ends --------------------

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@yozhao101 yozhao101 changed the title [Pytest] Test inability of Monit to reset its internal counter and its new syntax as workaround [Pytest] Test inability of Monit to reset internal counter and its new syntax as a workaround Apr 8, 2022
@yozhao101 yozhao101 marked this pull request as ready for review April 11, 2022 21:45
@yozhao101 yozhao101 requested a review from a team as a code owner April 11, 2022 21:45
@yozhao101
Contributor Author

@qiluo-msft can you please help me review this PR?

.format(container_name, dut_name, enum_rand_one_per_hwsku_frontend_hostname))
duthost = duthosts[dut_name]
logger.info("Restarting '{}' container ...".format(container_name))
duthost.shell("sudo systemctl restart {}.service".format(container_name))
Contributor

@qiluo-msft qiluo-msft Apr 18, 2022

sudo

sudo is not needed. #Closed

Returns:
None.
"""
logger.info("Backing up Monit configuration files on DuT '{}' ...".format(duthost.hostname))
duthost.shell("sudo cp -f /etc/monit/monitrc /tmp/")
Contributor

@qiluo-msft qiluo-msft Apr 18, 2022

sudo

The same, many times in this file. #Closed

@qiluo-msft
Contributor

qiluo-msft commented Apr 18, 2022

remove_cmd_result = duthost.shell("docker exec {} apt-get remove stress -y".format(container_name))

purge


In reply to: 1101782789


Refers to: tests/memory_checker/test_memory_checker.py:150 in 3754436.

thread_pool = ThreadPool()
thread_pool.apply_async(consume_memory, (duthost, container_name, vm_workers))

logger.info("Sleep 100 seconds to wait for the alerting messages from syslog...")
time.sleep(100)
logger.info("Sleep 130 seconds to wait for the alerting messages from syslog ...")
Contributor

@qiluo-msft qiluo-msft Apr 18, 2022

130

How do you get the magic number of seconds?
Is it platform dependent? If yes, what is the guideline to calculate for a new platform? #Closed

Contributor Author

Currently the polling interval (cycle) of Monit is 60 seconds. In the Monit configuration file for telemetry, the container is restarted if its memory usage is larger than 400MB for 1 time within 2 cycles, and 2 cycles is 120 seconds. That's why the number 130 is used here.

This number is not platform dependent.

thread_pool = ThreadPool()
thread_pool.apply_async(consume_memory, (duthost, container_name, vm_workers))

logger.info("Sleep 130 seconds to wait for the alerting messages from syslog ...")
Contributor

@qiluo-msft qiluo-msft Apr 18, 2022

130

magic number. if it is the same as above, use a named constant. #Closed

Contributor Author

Will use a named constant to represent this number.
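
A minimal sketch of what the named constant could look like, following the derivation in the reply above (60-second Monit cycle × 2 cycles plus a small margin); the constant names are illustrative, not necessarily the ones used in the final code:

    import time

    MONIT_CYCLE_SECONDS = 60     # Monit polling interval in SONiC
    MONIT_CHECK_CYCLES = 2       # '... within 2 cycles' in the telemetry Monit config
    WAIT_FOR_ALERTS_SECONDS = MONIT_CYCLE_SECONDS * MONIT_CHECK_CYCLES + 10   # = 130

    time.sleep(WAIT_FOR_ALERTS_SECONDS)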

"Monit can reset counter and restart '{}'!".format(container_name))
logger.info("Monit was unable to reset its counter and '{}' can not be restarted!".format(container_name))
else:
pytest_assert(analyzing_result["total"]["expected_match"] == 5,
Contributor

@qiluo-msft qiluo-msft Apr 18, 2022

5

magic number. #Closed

Contributor Author

Number 5 is the length of the list expected_alerting_messages at line 334. I will replace it with len(expected_alerting_messages).
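
A sketch of the resulting assertion (the failure message is illustrative; pytest_assert is the helper already used throughout sonic-mgmt tests):

    from tests.common.helpers.assertions import pytest_assert

    pytest_assert(analyzing_result["total"]["expected_match"] == len(expected_alerting_messages),
                  "Not all expected alerting messages were found in syslog!")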



@pytest.mark.parametrize("test_setup_and_cleanup",
[' if status == 3 for 1 times within 2 cycles then exec "/usr/bin/restart_service telemetry"'],
Contributor

@qiluo-msft qiluo-msft Apr 18, 2022

status

Provide the reference link for status values? #Closed


enum_rand_one_per_hwsku_frontend_hostname):
"""Checks that Monit was unable to reset its counter. Specifically Monit will restart
the contianer if memory usage of it is larger than the threshold for specific times within
a sliding window. However, Monit was unable to restart the contianer anymore if memory usage is
Contributor

@qiluo-msft qiluo-msft Apr 18, 2022

contianer

typo #Closed

Contributor Author

Will update.

@@ -225,25 +412,109 @@ def test_memory_checker(duthosts, enum_dut_feature_container, creds, enum_rand_o
# and number of vm_workers is hard coded. We will extend this testing on all containers after
# the feature 'memory_checker' is fully implemented.
container_name = "telemetry"
vm_workers = 4
vm_workers = 6
Contributor

@qiluo-msft qiluo-msft Apr 18, 2022

vm_workers

What does vm_worker mean?
Why magic number 6? #Closed

Contributor Author

@yozhao101 yozhao101 Apr 20, 2022

vm_worker means how many memory workers the 'stress' utility will create to consume memory by doing malloc()/free() operations.

Usually each worker consumes around 0MB~256MB of physical memory. I tested the stress utility on a lab device and observed that the minimum memory usage of 1 vm_worker is 50MB~80MB.

To ensure the total memory usage of all workers is larger than 400MB, I selected 6 instead of 4.
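
For reference, vm_workers simply becomes the -m argument of the 'stress' invocation seen in the logs above; the sizing comment restates the observations from this reply:

    vm_workers = 6   # 6 workers at ~50MB-80MB minimum each pushes total usage past the 400MB threshold in practice
    cmd = "docker exec telemetry stress -m {}".format(vm_workers)   # -> 'stress -m 6'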

yozhao101 added a commit to sonic-net/sonic-buildimage that referenced this pull request Apr 21, 2022
…10288)

Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Why I did it
This PR aims to fix a Monit issue in which Monit can't reset its counter when monitoring the memory usage of the telemetry container.

Specifically, the Monit configuration related to monitoring the memory usage of the telemetry container is as follows:

  check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
      if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
If the memory usage of the telemetry container is larger than 400MB for 10 times within 20 cycles (a cycle is one minute), then the container is restarted.
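
For reference, the 419430400 argument in the check above is simply the 400MB threshold expressed in bytes:

    MEMORY_THRESHOLD_BYTES = 400 * 1024 * 1024   # = 419430400, the value passed to memory_checker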
Recently we observed that after the telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but the container was not restarted again during this 1-hour sliding window.

The reason is that Monit can't reset its counter to start counting again: Monit resets its counter if and only if the status of the monitored service changes from Status failed to Status ok. However, during this 1-hour sliding window, the status of the monitored service never changed from Status failed to Status ok.

Currently, for each service monitored by Monit, there is an entry showing the monitoring status, monitoring mode, etc. For example, the following output from the command sudo monit status shows the status of the monitored service that checks the memory usage of telemetry:

    Program 'container_memory_telemetry'
         status                             Status ok
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Sat, 19 Mar 2022 19:56:26
Every 1 minute, Monit runs the script to check the memory usage of telemetry and updates the counter if the memory usage is larger than 400MB. If Monit checks the counter and finds that the memory usage of telemetry was larger than 400MB for 10 times within 20 minutes, then the telemetry container is restarted. The following is an example status of the monitored service:

    Program 'container_memory_telemetry'
         status                             Status failed
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Tue, 01 Feb 2022 22:52:55
After the telemetry container was restarted, we found that the memory usage of telemetry increased rapidly from around 100MB to more than 400MB within 1 minute, so the status of the monitored service did not have a chance to change from Status failed to Status ok.

How I did it
In order to provide a workaround for this issue, Monit recently introduced another syntax format, repeat every <n> cycles, related to exec. This new syntax enables Monit to repeatedly execute the background script if the error persists for a given number of cycles.
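
For illustration, with the new syntax the check could look like the following (the repeat interval shown here is an example; the exact values used in the fix are in sonic-net/sonic-buildimage#10288):

    check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
        if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
            repeat every 20 cycles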

How to verify it
I verified this change on the lab device str-s6000-acs-12. Another pytest PR (sonic-net/sonic-mgmt#5492) was submitted in the sonic-mgmt repo for review.
Comment on lines +63 to +64
duthost.shell("mv -f /etc/monit/conf.d/monit_* /tmp/")
duthost.shell("cp -f /tmp/monit_telemetry /etc/monit/conf.d/")
Contributor Author

@yozhao101 yozhao101 Apr 21, 2022

Initially I used cp -f /etc/monit/conf.d/monit_telemetry /tmp to back up only the Monit configuration file of telemetry. However, when I tested this pytest script on a lab device with the 20191130.83 image, it failed on the two new test cases test_monit_reset_counter_failure(...) and test_monit_new_syntax(...). So why does it fail with the 20191130.83 image?

In order to fully understand the following RCA, the prerequisite is to learn the asynchronous behavior of Monit: https://mmonit.com/monit/documentation/monit.html#PROGRAM%20STATUS%20TEST. To summarize this async behavior in one sentence: at each polling cycle (every 1 minute in SONiC) Monit forks a thread to execute the script in the configuration file, but collects the running status (exit code) of that script at the next polling cycle. For example, suppose the polling cycle of Monit is 1 minute and it forks a thread to execute the script memory_checker at 21:03:00; Monit will then collect the exit code of this script at 21:04:00.

I think this RCA also answers why Monit cannot mitigate the high memory spike issue of telemetry in the 202012 production image but can do so in the 20191130 production image.

  • Why does this pytest script fail on a lab device with the 20191130.83 image?
    The logic of the test function test_monit_reset_counter_failure(...) is: First, the stress utility is leveraged to consume
    more than 400MB of physical memory in the telemetry container, and the function checks whether telemetry is restarted.
    Second, if the telemetry container was restarted, stress is leveraged again to consume more than 400MB of physical memory,
    to check whether the telemetry container is restarted a second time. Third, if the telemetry container was restarted again,
    the test fails, since this restart is not expected; otherwise it passes, proving that Monit is unable to reset its internal
    counter.

    When test_monit_reset_counter_failure(...) is executed on a lab device with the 20191130.83 image, the telemetry container
    is restarted at the second step. This is unexpected behavior, and thus the test case fails at the third step.

  • Why was the telemetry container restarted in the second step?
    The telemetry container is restarted again because Monit was able to reset its counter.

  • Why can Monit reset its counter in the second step?
    The following is part of the Monit telemetry configuration on the lab device under testing:

      check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
          if status == 3 for 1 times within 2 cycles then exec "/usr/bin/restart_service telemetry"
    

    I drew a figure (attached) to illustrate this asynchronous behavior of Monit. The telemetry container can be restarted at the first step, which corresponds to the timestamp 21:02:00 in the figure. At timestamp 21:02:00, the exit code of memory_checker is collected, telemetry is being restarted, and at the same time memory_checker is being executed to check the memory usage of telemetry.

[Figure: demonstration_async_Monit — timeline of Monit's asynchronous script execution and exit-code collection]

The interesting thing is that in the 20191130.83 image, the thread executing memory_checker was scheduled by the kernel to run after telemetry was stopped but before telemetry was fully started. However, in the 20201231.54 image, this thread was scheduled to run before telemetry was stopped.

If the thread running memory_checker was scheduled to run before telemetry was stopped, it will see that the memory usage of telemetry is larger than 400MB. Then at timestamp 21:03:00, Monit collects the exit code (3) of memory_checker and does not reset its internal counter.

If the thread running memory_checker was scheduled to run after telemetry was stopped but before telemetry was fully started, memory_checker will see that the memory usage of telemetry is around 70MB, much less than 400MB. Then at timestamp 21:03:00, Monit collects the exit code (0) of memory_checker and resets its internal counter.

  • Why is the thread running memory_checker scheduled before telemetry is stopped in the 20201231 image, but scheduled to run while telemetry is being started in the 20191130.83 image?
    In the 20201231 image, Monit monitors the memory usage of telemetry, route_checker, disk_checker, container_checker and system resource usage. At each polling cycle, Monit forks 3 or 4 threads, and memory_checker is immediately scheduled to run. However, in the 20191130.83 image, Monit forks more than 30 threads, and the thread running memory_checker is delayed until after telemetry was stopped but before it is fully started. This observation is confirmed by the syslogs on the lab device.
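
To make the counter behavior concrete, here is a minimal, self-contained illustration (not the production code) of the delayed exit-code collection described above; exit codes 0 and 3 follow the 'if status == 3' condition in the Monit configuration:

    EXIT_OK = 0        # memory_checker saw usage below the threshold
    EXIT_EXCEEDED = 3  # memory_checker saw usage above the threshold

    def counter_after_collection(exit_code, counter):
        """Monit collects the exit code one cycle after launching the script; only a
        'Status failed' -> 'Status ok' transition (exit code 0) resets the counter."""
        return 0 if exit_code == EXIT_OK else counter

    # 20191130.83: memory_checker ran after telemetry stopped, saw ~70MB -> exit 0 -> counter reset
    print(counter_after_collection(EXIT_OK, counter=1))        # prints 0
    # 20201231.54: memory_checker ran before telemetry stopped, saw >400MB -> exit 3 -> counter stuck
    print(counter_after_collection(EXIT_EXCEEDED, counter=1))  # prints 1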

Contributor Author

@qiluo-msft Please also review the above comments and RCA!

@yozhao101
Contributor Author

remove_cmd_result = duthost.shell("docker exec {} apt-get remove stress -y".format(container_name))

purge

Refers to: tests/memory_checker/test_memory_checker.py:150 in 3754436.

Are you suggesting that I do not need to manually uninstall the stress utility after testing?

@qiluo-msft
Contributor

remove_cmd_result = duthost.shell("docker exec {} apt-get remove stress -y".format(container_name))

I mean apt-get purge which saves more disk space.


In reply to: 1101782789


Refers to: tests/memory_checker/test_memory_checker.py:150 in 3754436.
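
A sketch of the suggested change, assuming the same duthost and container_name variables used in the cleanup fixture; purge also removes the package's configuration files:

    remove_cmd_result = duthost.shell("docker exec {} apt-get purge -y stress".format(container_name))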

qiluo-msft pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Apr 21, 2022
@yozhao101
Contributor Author

remove_cmd_result = duthost.shell("docker exec {} apt-get remove stress -y".format(container_name))

I mean apt-get purge which saves more disk space.

In reply to: 1101782789

Refers to: tests/memory_checker/test_memory_checker.py:150 in 3754436.

Update.

@yozhao101
Contributor Author

remove_cmd_result = duthost.shell("docker exec {} apt-get remove stress -y".format(container_name))

I mean apt-get purge which saves more disk space.

In reply to: 1101782789

Refers to: tests/memory_checker/test_memory_checker.py:150 in 3754436.

@qiluo-msft Can you please help me review this PR again?

@yozhao101 yozhao101 merged commit 0b0ddb3 into sonic-net:master Apr 27, 2022
@yozhao101 yozhao101 deleted the test_monit_counter_and_workaround branch April 27, 2022 16:27
wangxin pushed a commit that referenced this pull request Apr 28, 2022
…w syntax as a workaround (#5492)
