[bug fix][test_container_checker] change config of monit to stablize the test #7

JibinBao · 2021-08-10T10:46:39Z

Description of PR

Summary:
Monit's sampling interval is too long (60s) compared with the syncd container's restart time (sometimes only about 30s), and the alert message rule is too strict. As a result, Monit sometimes fails to detect syncd as down 2 times within 2 minutes, so no syncd alert messages appear in syslog. Changing the relevant Monit config stabilizes the test.

Fixes # (issue)

Type of change

Bug fix
Testbed and Framework(new/improvement)
Test case(new/improvement)

Back port request

201911

Approach

What is the motivation for this PR?

Stabilize test_container_checker by changing some config of Monit.

How did you do it?

Changing the sampling interval to 10 seconds in /etc/monit/monitrc ensures that Monit can detect the syncd container going down.
Changing the start delay to 10 seconds in /etc/monit/monitrc ensures that Monit starts sooner than syncd does.

## Start Monit in the background (run as a daemon):
#
  set daemon 10             # check services at 10-second intervals (was 60)
    with start delay 10    # delay Monit monitoring by 10 seconds (was 5 minutes)
                            # so that Monit is already monitoring when syncd
                            # comes back up.
#
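The effect of the shorter sampling interval can be illustrated with a small Python sketch. It is a simplified model, not Monit's actual scheduler: it assumes polls happen at exactly `start_delay_s`, `start_delay_s + interval_s`, and so on, and the outage timeline (down at t=65s for 30s) is a hypothetical example.

```python
def polls_during_outage(interval_s, start_delay_s, outage_start_s, outage_len_s):
    """Return the poll timestamps (in seconds) that land inside the outage.

    Simplifying assumption: polls happen at start_delay_s,
    start_delay_s + interval_s, start_delay_s + 2 * interval_s, ...
    """
    outage_end_s = outage_start_s + outage_len_s
    hits = []
    t = start_delay_s
    while t < outage_end_s:
        if t >= outage_start_s:
            hits.append(t)
        t += interval_s
    return hits


# Hypothetical timeline: the container goes down at t=65s and recovers 30s later.
print(polls_during_outage(60, 0, 65, 30))   # 60s interval: no poll sees the outage
print(polls_during_outage(10, 10, 65, 30))  # 10s interval: several polls see it
```

With a 60s interval a 30s outage can fall entirely between two polls, whereas a 10s interval places about three polls inside the same window.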

Changing the alert rule in /etc/monit/conf.d/sonic-host makes it easier for Monit to send alert messages.

check program container_checker with path "/usr/bin/container_checker"
    if status != 0 for 1 times within 1 cycles then alert repeat every 1 cycles
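To see why a looser rule helps, here is a minimal Python sketch of a "for X times within Y cycles" condition. It is an approximation written for illustration, not Monit's implementation: the function name `alert_cycles` and the sample status sequence are hypothetical, and the `repeat every 1 cycles` part of the rule is not modeled.

```python
from collections import deque


def alert_cycles(statuses, times, within):
    """Return the cycle indices at which an alert would fire.

    Simplified model: an alert fires in a cycle when at least `times`
    of the most recent `within` cycles returned a non-zero status.
    """
    window = deque(maxlen=within)  # sliding window of recent failure flags
    fired = []
    for cycle, status in enumerate(statuses):
        window.append(status != 0)
        if sum(window) >= times:
            fired.append(cycle)
    return fired


# Hypothetical run: the checker fails in exactly one cycle.
statuses = [0, 1, 0, 0]
print(alert_cycles(statuses, times=1, within=1))  # the rule used in this PR
print(alert_cycles(statuses, times=2, within=2))  # a stricter, illustrative rule
```

Under this model the "1 times within 1 cycles" rule alerts on the single failing cycle, while a rule that needs two failures in two cycles never fires for the same sequence.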

How did you verify/test it?

Run the test:
py.test container_checker/test_container_checker.py --inventory "../ansible/inventory, ../ansible/veos" --host-pattern arc-switch1025 --module-path ../ansible/library/ --testbed arc-switch1025-t0 --testbed_file ../ansible/testbed.csv --allow_recover

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

…the test #7 (sonic-net#4008)

fix monit sample too long

572cd7d

JibinBao force-pushed the fix_container_check branch from 276ddde to 572cd7d Compare August 11, 2021 05:44

JibinBao changed the title ~~[bug fix][test_container_checker] change sampling time of monit to stablize the test, because syncd start rather quicker~~ [bug fix][test_container_checker] change some config of monit to stablize the test, because syncd start rather quicker Aug 11, 2021

JibinBao changed the title ~~[bug fix][test_container_checker] change some config of monit to stablize the test, because syncd start rather quicker~~ [bug fix][test_container_checker] change config of monit to stablize the test Aug 11, 2021

keboliu approved these changes Aug 11, 2021
