Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pmon daemon regression test] Correct the log messages with right daemon name and fix term_and_start_status failure #3922

Merged
merged 8 commits into from
Aug 19, 2021

Conversation

sujinmkang
Copy link
Contributor

@sujinmkang sujinmkang commented Jul 30, 2021

Description of PR

Summary:
Fixes #(sonic-net/sonic-buildimage#8239)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 201911
  • 202012

Approach

What is the motivation for this PR?

Needs to fix pmon daemon regression test issues

How did you do it?

  1. Use daemon_name instead of the specific daemon name for log message
  2. Support for both old/new db key support for checking the db value.
  3. Skip term_and_start_status test cases for 201911 and 201811 since it behaves differently
  4. Fix the expected daemon status after sending sig_term based on latest daemon behavior for term_and_start_status
  5. Remove disable_and_enable_autorestart to make sure the daemon regression test doesn't affect any other containers.

How did you verify/test it?

Run test_ledd.py on arista 7050 platform which reports issue.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@sujinmkang sujinmkang marked this pull request as ready for review August 3, 2021 15:00
@lguohan
Copy link
Contributor

lguohan commented Aug 4, 2021

can you update your pr description and title to reflect the update?

@sujinmkang sujinmkang changed the title [ledd regression test] Add restart of the daemon when autorestart is not set [ledd regression test] Correct the log messages with right daemon name Aug 4, 2021
@sujinmkang sujinmkang changed the title [ledd regression test] Correct the log messages with right daemon name [pmon daemon regression test] Correct the log messages with right daemon name Aug 4, 2021
@sujinmkang sujinmkang changed the title [pmon daemon regression test] Correct the log messages with right daemon name [pmon daemon regression test] Correct the log messages with right daemon name and fix term_and_start_status failure Aug 11, 2021
@sujinmkang sujinmkang requested a review from lguohan August 13, 2021 00:33
@sujinmkang sujinmkang requested a review from vdahiya12 August 13, 2021 04:09
@antonptashnik
Copy link
Contributor

@sujinmkang could you please specify why is it needed to disable all features' autostart before running these tests ?

@sujinmkang
Copy link
Contributor Author

@antonptashnik originally I don't want to make the PMON container restart during this test since this regression test of some daemons in the critical_process can make the restart the PMON container. I just want to focus on the daemon behavior when it has its running status change. Do you have any concern or suggestion?

@antonptashnik
Copy link
Contributor

@antonptashnik originally I don't want to make the PMON container restart during this test since this regression test of some daemons in the critical_process can make the restart the PMON container. I just want to focus on the daemon behavior when it has its running status change. Do you have any concern or suggestion?

So you say some of these tests can cause PMON to restart and for the tests to notice this moment you disable PMON autorestart for it to stay stopped, right ? If so I understand why we disable autorestart for PMON but I suppose autorestart for other features can be left as is

since we want to make sure that pmon and any other containers are not restarted
during this test.
@@ -145,26 +113,27 @@ def test_pmon_ledd_term_and_start_status(check_daemon_status, duthosts, rand_one
@summary: This test case is to check the ledd terminated and restarted status
"""
duthost = duthosts[rand_one_dut_hostname]

if "201811" in duthost.os_version or "201911" in duthost.os_version:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sujinmkang what is the difference in behavior for SIGTERM across different version?

Copy link
Contributor Author

@sujinmkang sujinmkang Aug 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor master and 202012 daemon has the signal_handler implemented to return the exitcode (128 + signal_number) but previous version just return exit code 0 for sigterm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really want to skip this test for 201811 and 201911? If not, it would be better we handle them differently here or make the exitcode same across 201811, 201911. What about 202012?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor this change was based on the discussion during 202012 issue triage. If there would be any back port happening, then we will need another changes for this.

"Pcied expected terminated status is {} but is {}".format(expected_exited_status, daemon_status))
pytest_assert(daemon_pid == -1,
"Pcied expected pid is -1 but is {}".format(daemon_pid))
logger.info("{} daemon got killed unexpectedly and it is in {} with pid {}".format(daemon_name, daemon_status, daemon_pid))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you logging the info without checking the status if the daemon actually killed or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The daemon_status check is followed right after but this is just added for debugging purpose. I can remove this part.

@@ -180,16 +149,17 @@ def test_pmon_ledd_kill_and_start_status(check_daemon_status, duthosts, rand_one
duthost.stop_pmon_daemon(daemon_name, SIG_KILL, pre_daemon_pid)

daemon_status, daemon_pid = duthost.get_pmon_daemon_status(daemon_name)
logger.info("{} daemon got killed unexpectedly and it is {} with pid {}".format(daemon_name, daemon_status, daemon_pid))
logger.info("{} daemon got killed unexpectedly and it is in {} with pid {}".format(daemon_name, daemon_status, daemon_pid))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check the daemon_status before logging "daemon got killed"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The daemon_status check is followed right after but this is just added for debugging purpose. I can remove this part.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better put a log "sending SIG_KILL to daemon {}" before calling stop_pmon_daemon() for debugging and remove this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test name(test_pmon_ledd_kill_and_start_status) is already indicating this is sending SIG_KILL command.

@@ -154,26 +121,27 @@ def test_pmon_pcied_term_and_start_status(check_daemon_status, duthosts, rand_on
@summary: This test case is to check the pcied terminated and restarted status
"""
duthost = duthosts[rand_one_dut_hostname]

if "201811" in duthost.os_version or "201911" in duthost.os_version:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SIGTERM signal is standard signal and the behavior should be same for all POSIX compliant systems. Can you elaborate what different behavior are you seeing across different os version?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor please see the previous comment.

@sujinmkang sujinmkang merged commit 42a3a81 into sonic-net:master Aug 19, 2021
vmittal-msft pushed a commit to vmittal-msft/sonic-mgmt that referenced this pull request Sep 28, 2021
…mon name and fix term_and_start_status failure (sonic-net#3922)

What is the motivation for this PR?
Needs to fix pmon daemon regression test issues

How did you do it?
Use daemon_name instead of the specific daemon name for log message
Support for both old/new db key support for checking the db value.
Skip term_and_start_status test cases for 201911 and 201811 since it behaves differently
Fix the expected daemon status after sending sig_term based on latest daemon behavior for term_and_start_status
Remove disable_and_enable_autorestart to make sure the daemon regression test doesn't affect any other containers.
How did you verify/test it?
Run test_ledd.py on arista 7050 platform which reports issue.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants