-
Notifications
You must be signed in to change notification settings - Fork 727
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add warmboot infra for testing with a sonic neighbor #7961
Add warmboot infra for testing with a sonic neighbor #7961
Conversation
The pre-commit check detected issues in the files touched by this pull request. For old issues, it is not mandatory to fix them because they were not caused by this change. It is unfair to blame Detailed pre-commit check results: To run the pre-commit checks locally, you can follow below steps:
|
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
…d on neighbor type Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
789b23b
to
747353d
Compare
The pre-commit check detected issues in the files touched by this pull request. For old issues, it is not mandatory to fix them because they were not caused by this change. It is unfair to blame Detailed pre-commit check results: To run the pre-commit checks locally, you can follow below steps:
|
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
The pre-commit check detected issues in the files touched by this pull request. For old issues, it is not mandatory to fix them because they were not caused by this change. It is unfair to blame Detailed pre-commit check results: To run the pre-commit checks locally, you can follow below steps:
|
@@ -924,7 +929,7 @@ def generate_bidirectional(self): | |||
def put_nowait(self, queue, data): | |||
try: | |||
queue.put_nowait(data) | |||
except queue.Full: | |||
except Queue.Full: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks wrong as Queue
is not same as queue
arg that is passed to method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That line should have an exception type, not an instance. In my case, I got a Python error for Full
being an unknown type.
@@ -1433,7 +1438,8 @@ def neigh_lag_status_check(self): | |||
Ensure there are no interface flaps after warm-boot | |||
""" | |||
for neigh in self.ssh_targets: | |||
self.neigh_handle = Arista(neigh, None, self.test_params) | |||
self.test_params['port_channel_intf_idx'] = [x['ptf_ports'][0] for x in self.vm_dut_map if x['mgmt_addr'] == neigh][0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x['ptf_ports'][0]
will this always select first lag member, and not all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed to take all lag members.
def verify_bgp_neigh_state(self, dut=None, state="Active"): | ||
raise NotImplementedError | ||
|
||
import arista |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import
statements added to EOF. Please move them to top.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is intentional, so that when arista
and sonic
get imported, the HostDevice
class is defined. There probably is a better Pythonic way to do this though....
import ptf | ||
from ptf.base_tests import BaseTest | ||
from ptf import config | ||
import ptf.testutils as testutils |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not used. Also overlaps w/ from ptf.testutils import *
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some other unused imports in this file too. Please check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imports cleaned up.
@@ -0,0 +1,618 @@ | |||
import ptf | |||
from ptf.base_tests import BaseTest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import not used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import removed.
self.log('Checking BGP GR peer status on VM') | ||
self.check_gr_peer_status(data) | ||
cli_data = {} | ||
#cli_data['lacp'] = self.check_series_status(data, "lacp", "LACP session") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not collect lacp data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't realize that there was a command show interface portchannel
that would fit well here. I'll add that in.
continue | ||
log_time = datetime.datetime.strptime(str(datetime.datetime.now().year) + " " + m.group(1), "%Y %b %d %X") | ||
# Python 3 version (Python 2 doesn't have timestamp(): | ||
#raw_data.append((log_time.timestamp(), m.group(2), m.group(3))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to support testing on both Python2 and 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That line will work for both Python 2 and 3; if we do move to Python 3-only, then the next line could be used for conciseness (although looking at it now, that Python 3 version needs to be updated to match the current code).
|
||
def verify_neigh_lag_no_flap(self): | ||
flap_cnt = sys.maxint | ||
output = self.do_cmd('show interfaces Po1 | json') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will not work for SONiC, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it won't. There's a fair number of things in this file that won't work for SONiC, but I've kept them for now mostly because either they're not being executed for regular warm reboot, or they don't happen to break anything in a bad way.
cli_data['lacp'] = (0, 0) | ||
cli_data['bgp_v4'] = self.check_series_status(data, "bgp_route_v4", "BGP v4 routes") | ||
cli_data['bgp_v6'] = self.check_series_status(data, "bgp_route_v6", "BGP v6 routes") | ||
cli_data['po'] = self.check_lag_flaps("PortChannel1", log_lines, start_time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any place where this information about lag flap is consumed and failure is raised. I might have been lost, so please point me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It gets printed into the log file.
self.cli_info[ip]['po'][1], |
return 0, 0 | ||
|
||
num_lag_flaps = len([x for x in result_lag_flaps[interface] if x[1] == "DOWN"]) | ||
return 0, num_lag_flaps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the first value being returned is always 0 then why return it? Just return num_lag_flaps
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just copied the same return semantics as Arista has for their check_change_time
function (it's doing the equivalent work as this function). cli_data['po']
would still need two values, since advanced-reboot.py
is always reading the second value (and ignoring the first value).
Handle multiple port channel interfaces, update LACP PDU timing collection code to current version, clean up imports a bit. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
The pre-commit check detected issues in the files touched by this pull request. Detailed pre-commit check results: To run the pre-commit checks locally, you can follow below steps:
|
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
The pre-commit check detected issues in the files touched by this pull request. Detailed pre-commit check results: To run the pre-commit checks locally, you can follow below steps:
|
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
If testing with a SONiC neighbor, fail the warmboot test if there is a port channel flap. This is in preparation for the teamd retry count feature. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
This is used on actual hardware warm-reboot tests. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
The pre-commit check detected issues in the files touched by this pull request. Detailed pre-commit check results: To run the pre-commit checks locally, you can follow below steps:
|
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
… for a command Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
The pre-commit check detected issues in the files touched by this pull request. Detailed pre-commit check results: To run the pre-commit checks locally, you can follow below steps:
|
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
def verify_neigh_lag_no_flap(self): | ||
# Note: this function may have false-positives (with regards to link flaps). The start time used here is | ||
# the system's boot time, not the test start time, which means any LAG flaps before the start of the test | ||
# would get included here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If, on the current VM, there are link flaps prior to the test being run, this may return a false-positive. Any suggestions on fixing this?
@vaibhavhd @yxieca could you re-review this when you get a chance? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update description w/ details about the tests that were done to validate this change.
Why I did it This is to add support for specifying custom retry counts for LACP sessions. This is to make warmboot easier on low-storage and low-memory platforms, by allowing more than 90 seconds of downtime. How I did it How to verify it Tested manually with these cases: Verify that changing the retry count using teamdctl PortChannel101 state item set runner.retry_count 5 takes effect Verify that the retry count change actually affects when the LAG goes down by forcefully killing teamd on one side (i.e. setting the retry count to 5 causes the LAG to go down after 150 seconds) Verify that the retry count gets reset to 3 after the LAG goes down for whatever reason Verify that the retry count gets reset to 3 after some period of time (30 seconds * retry count) Test cases are in sonic-net/sonic-mgmt#7961 and sonic-net/sonic-mgmt#8152. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
…et#13453) Why I did it This is to add support for specifying custom retry counts for LACP sessions. This is to make warmboot easier on low-storage and low-memory platforms, by allowing more than 90 seconds of downtime. How I did it How to verify it Tested manually with these cases: Verify that changing the retry count using teamdctl PortChannel101 state item set runner.retry_count 5 takes effect Verify that the retry count change actually affects when the LAG goes down by forcefully killing teamd on one side (i.e. setting the retry count to 5 causes the LAG to go down after 150 seconds) Verify that the retry count gets reset to 3 after the LAG goes down for whatever reason Verify that the retry count gets reset to 3 after some period of time (30 seconds * retry count) Test cases are in sonic-net/sonic-mgmt#7961 and sonic-net/sonic-mgmt#8152. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
* Add initial framework for testing warm reboot with SONiC neighbors Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Add support for choosing between Arista and Sonic implementation based on neighbor type Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Include timestamps in log output for easier time comparisons from log Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Bug fixes for when there's no port channel flaps Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Update SONiC warm reboot test case Handle multiple port channel interfaces, update LACP PDU timing collection code to current version, clean up imports a bit. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * fixup! Update SONiC warm reboot test case Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Fix some precommit errors Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Add return in case of attempts exceeded Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Reduce to 5 attempts. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Fix pre-commit checks Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Fix more precommit issues Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Fail the warmboot test if there's a port channel flap for SONiC neighbor If testing with a SONiC neighbor, fail the warmboot test if there is a port channel flap. This is in preparation for the teamd retry count feature. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Implement verify_neigh_lag_no_flap This is used on actual hardware warm-reboot tests. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Remove unused import Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Remove unused log message, and add log when failing to get the output for a command Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Remove unused variable Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Add retry count flag for the warm-reboot command on supported images. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> --------- Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
* Add initial framework for testing warm reboot with SONiC neighbors Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Add support for choosing between Arista and Sonic implementation based on neighbor type Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Include timestamps in log output for easier time comparisons from log Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Bug fixes for when there's no port channel flaps Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Update SONiC warm reboot test case Handle multiple port channel interfaces, update LACP PDU timing collection code to current version, clean up imports a bit. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * fixup! Update SONiC warm reboot test case Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Fix some precommit errors Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Add return in case of attempts exceeded Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Reduce to 5 attempts. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Fix pre-commit checks Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Fix more precommit issues Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Fail the warmboot test if there's a port channel flap for SONiC neighbor If testing with a SONiC neighbor, fail the warmboot test if there is a port channel flap. This is in preparation for the teamd retry count feature. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Implement verify_neigh_lag_no_flap This is used on actual hardware warm-reboot tests. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Remove unused import Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Remove unused log message, and add log when failing to get the output for a command Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Remove unused variable Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Add retry count flag for the warm-reboot command on supported images. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> --------- Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Description of PR
Summary:
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
Add the necessary infra to test warmboot with a sonic neighbor. Currently, it at least does the warmboot and monitors LACPDUs sent by the T0. Fastboot is not tested at all. In addition, for now, it's assumed that it will always be a sonic neighbor; code needs to be added to determine if the neighbors are sonic or Arista.
How did you do it?
How did you verify/test it?
Ran basic warm reboot test cases with sonic neighbor, and verified the output seemed right. Other types of reboot test cases such as fast reboot and SAD test cases have not been tested.
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation