[bug fix][test_container_checker] change config of monit to stablize the test #7

Open
wants to merge 1 commit into base: master

@JibinBao (Owner) commented Aug 10, 2021

Description of PR

Summary:
The Monit sampling interval is long (60 s), the syncd container can restart quickly (sometimes in about 30 s), and the alert rule is strict. As a result, Monit sometimes cannot observe syncd down 2 times within 2 minutes, so no syncd alert message appears in syslog. Changing the relevant Monit configuration stabilizes the test.
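
The timing argument can be checked with a little arithmetic. The sketch below is a simplified model, not Monit's actual scheduler: it assumes Monit polls at a fixed period and that syncd is down for one contiguous window. The function names are made up for illustration.

```python
import math


def min_polls_during_outage(outage_s: float, interval_s: float) -> int:
    """Fewest polls that can land inside an outage lasting outage_s seconds."""
    return math.floor(outage_s / interval_s)


def max_polls_during_outage(outage_s: float, interval_s: float) -> int:
    """Most polls that can land inside an outage lasting outage_s seconds."""
    return math.ceil(outage_s / interval_s)


if __name__ == "__main__":
    outage = 30  # seconds, the restart time quoted in the summary
    for interval in (60, 10):
        low = min_polls_during_outage(outage, interval)
        high = max_polls_during_outage(outage, interval)
        print(f"interval={interval}s: between {low} and {high} polls see syncd down")
```

Under these assumptions a 60 s interval gives at most one failing check during a 30 s outage, so a rule that needs two failing checks cannot fire, while a 10 s interval gives three.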

Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 201911

Approach

What is the motivation for this PR?

Stabilize test_container_checker by changing the Monit configuration.

How did you do it?

Setting the sampling interval to 10 seconds in /etc/monit/monitrc ensures that Monit can detect the syncd container going down.
Setting the start delay to 10 seconds in /etc/monit/monitrc ensures that Monit starts sooner than syncd.

## Start Monit in the background (run as a daemon):
#
  set daemon 10             # check services at 10-second intervals
    with start delay 10     # delay the start of monitoring by only 10 seconds
                            # so that Monit is running before syncd
                            # comes back up.
#
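
For illustration, the two values could be changed programmatically rather than by hand. This is a minimal sketch, not the method used in this PR: `set_monit_timing` is a hypothetical helper, and the sample values 60 and 300 are assumed from the "1-minute" and "5 minutes" wording in the stock comments.

```python
import re


def set_monit_timing(monitrc_text: str, interval_s: int, start_delay_s: int) -> str:
    """Return monitrc_text with the daemon interval and start delay replaced."""
    # Rewrite the number after "set daemon", keeping the original indentation.
    text = re.sub(
        r"^(\s*set\s+daemon\s+)\d+",
        rf"\g<1>{interval_s}",
        monitrc_text,
        flags=re.MULTILINE,
    )
    # Rewrite the number after "with start delay" in the same way.
    text = re.sub(
        r"^(\s*with\s+start\s+delay\s+)\d+",
        rf"\g<1>{start_delay_s}",
        text,
        flags=re.MULTILINE,
    )
    return text


# Assumed original values; only the two numbers are touched.
SAMPLE = "  set daemon 60\n    with start delay 300\n"

if __name__ == "__main__":
    print(set_monit_timing(SAMPLE, 10, 10))
```

Monit would still need to reload its configuration before the new values take effect.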

Relaxing the alert rule in /etc/monit/conf.d/sonic-host makes Monit send an alert message after a single failed check.

check program container_checker with path "/usr/bin/container_checker"
    if status != 0 for 1 times within 1 cycles then alert repeat every 1 cycles
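
To see why the relaxed rule fires more readily, here is a simplified model of a "for N times within M cycles" condition. It is a sketch under assumptions, not Monit's implementation: `alert_cycles` is a hypothetical name, and the `repeat every` clause is ignored.

```python
from collections import deque
from typing import List, Sequence


def alert_cycles(statuses: Sequence[int], times: int, within: int) -> List[int]:
    """Return the cycle indices at which the rule would match.

    The rule matches when at least `times` of the last `within` checks
    returned a non-zero exit status.
    """
    window = deque(maxlen=within)  # failure flags for the most recent cycles
    fired = []
    for cycle, status in enumerate(statuses):
        window.append(status != 0)
        if sum(window) >= times:
            fired.append(cycle)
    return fired


if __name__ == "__main__":
    # One failing check (cycle 1) surrounded by passing checks.
    exit_codes = [0, 1, 0, 0]
    print(alert_cycles(exit_codes, times=1, within=1))  # relaxed rule from this PR
    print(alert_cycles(exit_codes, times=2, within=2))  # a stricter rule, for contrast
```

With a single failing check, the relaxed rule matches on that cycle, whereas the stricter example rule never matches.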

How did you verify/test it?

Run the test:
py.test container_checker/test_container_checker.py --inventory "../ansible/inventory, ../ansible/veos" --host-pattern arc-switch1025 --module-path ../ansible/library/ --testbed arc-switch1025-t0 --testbed_file ../ansible/testbed.csv --allow_recover

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@JibinBao changed the title from "[bug fix][test_container_checker] change sampling time of monit to stablize the test, because syncd start rather quicker" to "[bug fix][test_container_checker] change some config of monit to stablize the test, because syncd start rather quicker" Aug 11, 2021
@JibinBao changed the title from "[bug fix][test_container_checker] change some config of monit to stablize the test, because syncd start rather quicker" to "[bug fix][test_container_checker] change config of monit to stablize the test" Aug 11, 2021
JibinBao added a commit that referenced this pull request Sep 8, 2021
…the test #7 (sonic-net#4008)
