
fix for mocked T0 DToR TC failures due to config push delta #8363

Conversation

@AnantKishorSharma (Contributor) commented May 19, 2023


Description of PR

Summary:
Fixes # (issue)
(A) The test fails because execution of the mux toggle JSON file fails, as the swss container is not running.
(B) The test fails because the trigger happens before the mux toggle config has been pushed from orchagent for all 36 ports and has taken effect at sairedis. Ports are selected randomly, so the issue is intermittent (a run passes only if the config for the selected ports has taken effect at sairedis by the time the trigger happens). Across ~10 runs, it took between 18 and 21 s from the time the ansible command for the JSON file was executed until the config finished at sairedis for all 36 ports. In the mocked T0 DToR case we cannot check the mux status, so we rely on a sleep for the config to finish.

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 201911
  • 202012
  • 202205

Approach

What is the motivation for this PR?

(A) The test fails because execution of the mux toggle JSON file fails.
The JSON file execution fails because swss is not running.
swss is not running because it is allowed to restart only 3 times in a 20-minute interval and hits that limit.
The restart limit is hit because, for ASIC type "gb", this test restarts swss 4 times (twice each for v4 and v6).
reset-failed is called for swss before each restart, but it does not appear to flush the restart rate counter for swss.
Log excerpts for issue (A):

Apr 20 14:55:42.355098 t0-yy38 INFO systemd[1]: swss.service: Scheduled restart job, restart counter is at 3.
Apr 20 14:55:42.355384 t0-yy38 INFO systemd[1]: Stopped switch state service.
Apr 20 14:55:42.355605 t0-yy38 WARNING systemd[1]: swss.service: Start request repeated too quickly.
Apr 20 14:55:42.355796 t0-yy38 WARNING systemd[1]: swss.service: Failed with result 'start-limit-hit'.
Apr 20 14:55:42.355978 t0-yy38 ERR systemd[1]: Failed to start switch state service.
Apr 20 14:55:42.356166 t0-yy38 WARNING systemd[1]: Dependency failed for SNMP container.
Apr 20 14:55:42.356353 t0-yy38 NOTICE systemd[1]: snmp.service: Job snmp.service/start failed with result 'dependency'.
Apr 20 14:55:42.356546 t0-yy38 WARNING systemd[1]: swss.service: Start request repeated too quickly.
**Apr 20 14:55:42.356724 t0-yy38 WARNING systemd[1]: swss.service: Failed with result 'start-limit-hit'.**
Apr 20 14:55:42.356900 t0-yy38 ERR systemd[1]: Failed to start switch state service.
Apr 20 14:55:42.357047 t0-yy38 WARNING systemd[1]: Dependency failed for SNMP container.
Apr 20 14:55:42.357168 t0-yy38 NOTICE systemd[1]: snmp.service: Job snmp.service/start failed with result 'dependency'.
Apr 20 14:56:34.321778 t0-yy38 INFO dockerd[565]: time="2023-04-20T14:56:34.321374481Z" level=error msg="Error setting up exec command in container swss: Container d11306c4041ad1a0bf7d15f81a6ad1066e3879745d726234fc12e162164a7b33 is not running"
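The restart-limit arithmetic above can be sketched with a minimal model of systemd's start rate limiting, assuming a 20-minute window (StartLimitIntervalSec=1200) and a burst of 3 for swss.service as described; the helper name and the timestamps are illustrative, not taken from the PR:

```python
def hits_start_limit(restart_times, interval_sec=1200, burst=3):
    """Return True if the latest restart exceeds the allowed burst
    within the rate-limit window (timestamps in seconds)."""
    window_start = restart_times[-1] - interval_sec
    recent = [t for t in restart_times if t > window_start]
    return len(recent) > burst

# Four swss restarts (twice each for v4 and v6 mock config) inside 20 min:
print(hits_start_limit([0, 100, 200, 300]))   # True  -> 'start-limit-hit'
# Skipping the config.bcm restart on Cisco gb leaves only three:
print(hits_start_limit([0, 100, 200]))        # False -> swss starts normally
```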

(B) In a failing (NOK) run, step 3 happens before step 2:
1) When the ansible command was executed (syslog):

syslog.1:Jun  6 **14:07:24.093000** mth-t0-64 INFO python[596206]: ansible-command Invoked with _uses_shell=True _raw_params=docker exec swss sh -c "swssconfig /muxactive.json" warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
**Last config push from orchagent:**
syslog.1:Jun  6 **14:07:43.668501** mth-t0-64 NOTICE swss#orchagent: :- addOperation: Mux State set to active for port Ethernet96

2) When it took effect at sairedis (sairedis.rec):

sairedis.rec.1:2023-06-06.14:07:44.241430|c|SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"192.168.0.26","rif":"oid:0x600000000099d","switch_id":"oid:0x21000000000000"}|SAI_NEIGHBOR_ENTRY_ATTR_DST_MAC_ADDRESS=40:A6:B7:43:75:27

sairedis.rec.1:2023-06-06.14:07:44.242265|c|SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000ae9|SAI_NEXT_HOP_ATTR_TYPE=SAI_NEXT_HOP_TYPE_IP|SAI_NEXT_HOP_ATTR_IP=192.168.0.26|SAI_NEXT_HOP_ATTR_ROUTER_INTERFACE_ID=oid:0x600000000099d
**last one:**
sairedis.rec.1:2023-06-06.**14:07:44.278459**|c|SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000afb|SAI_NEXT_HOP_ATTR_TYPE=SAI_NEXT_HOP_TYPE_IP|SAI_NEXT_HOP_ATTR_IP=192.168.0.9|SAI_NEXT_HOP_ATTR_ROUTER_INTERFACE_ID=oid:0x600000000099d

3) When the trigger happened (test log):

06/06/2023 **14:07:45** testutils.verify_packet                  L2400 DEBUG  | Checking for pkt on device 0, port 39
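For reference, the deltas between the three timestamps above can be computed directly (a small illustrative calculation; the datetime values are copied from the log lines, with the date added for parsing):

```python
from datetime import datetime

# Timestamps taken from the syslog/sairedis/test-log excerpts above.
ansible_cmd = datetime.strptime("2023-06-06 14:07:24.093000", "%Y-%m-%d %H:%M:%S.%f")
last_sairedis = datetime.strptime("2023-06-06 14:07:44.278459", "%Y-%m-%d %H:%M:%S.%f")
trigger = datetime.strptime("2023-06-06 14:07:45", "%Y-%m-%d %H:%M:%S")

config_delta = (last_sairedis - ansible_cmd).total_seconds()
margin = (trigger - last_sairedis).total_seconds()
print(config_delta)  # ~20.19 s for all 36 ports to take effect at sairedis
print(margin)        # well under a second of headroom before the trigger
```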

How did you do it?

(A) config.bcm generation is not required for the Cisco gb platform, so one swss restart was skipped to avoid hitting the restart-limit error.
(B) Introduced a delay of 30 s between the mux toggle on the DUT and sending packets from the T1 (PTF).
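A minimal sketch of fix (B): push the mux toggle JSON, then sleep for a fixed settle time, since mocked T0 DToR offers no mux status to poll. The `run_on_dut` callable is a hypothetical stand-in for a sonic-mgmt duthost shell helper; the names are illustrative, not the PR's actual code:

```python
import time

MUX_CONFIG_SETTLE_SEC = 30  # observed 18-21 s for 36 ports; 30 s adds margin

def toggle_mux_and_wait(run_on_dut, json_path="/muxactive.json",
                        settle_sec=MUX_CONFIG_SETTLE_SEC):
    """Push the mux toggle config on the DUT, then wait a fixed time
    for it to take effect at sairedis. A sleep stands in for a
    readiness check because mocked DToR has no queryable mux status."""
    run_on_dut('docker exec swss sh -c "swssconfig {}"'.format(json_path))
    time.sleep(settle_sec)

# Example: record the command instead of running it on a real DUT.
issued = []
toggle_mux_and_wait(issued.append, settle_sec=0)
print(issued[0])  # docker exec swss sh -c "swssconfig /muxactive.json"
```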

How did you verify/test it?

Verified that the mux config JSON is executed successfully, packets are sent to the DUT after the config has finished, and the test case passes.

Any platform specific information?

While applying the DToR mock config to the DUT, we do not need two swss restarts on Cisco platforms, as one of the restarts is for generating config.bcm, which is Broadcom-specific.

Supported testbed topology if it's a new test case?

Documentation

@Aravind-Subbaroyan

@kevinskwang - Could you review this?

@AnantKishorSharma AnantKishorSharma changed the title fix for failures in orchagent_standby_tor_downstream script fix for mocked T0 DToR TC failures due to config push delta Jun 15, 2023
@AnantKishorSharma (Contributor, Author)

@lolyu , @wsycqyz , could you please review this PR?

@kevinskwang (Collaborator)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@kevinskwang (Collaborator)

/azp run Semgrep

@azure-pipelines

No pipelines are associated with this pull request.

@Aravind-Subbaroyan

@kevinskwang - could you please help to merge?

@AnantKishorSharma (Contributor, Author) commented Oct 19, 2023

@kevinskwang Could you please help unblock the "expected check"? Is there anything we can do from our side to re-trigger it, other than pushing a new commit?

@AnantKishorSharma (Contributor, Author)

@kevinskwang , could you please merge this PR?

@AnantKishorSharma (Contributor, Author)

@kevinskwang , all the checks have passed after updating the branch today. Could you please merge?

@kevinskwang kevinskwang merged commit 7679ce2 into sonic-net:master Jan 8, 2024
13 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Jan 8, 2024
…t#8363)

* fix for failures in orchagent_standby_tor_downstream script

* Update test_orchagent_standby_tor_downstream.py

* fix for mocked T0 DToR TC failures due to config push delta
@mssonicbld (Collaborator)

Cherry-pick PR to 202305: #11213

mssonicbld pushed a commit that referenced this pull request Jan 8, 2024
* fix for failures in orchagent_standby_tor_downstream script

* Update test_orchagent_standby_tor_downstream.py

* fix for mocked T0 DToR TC failures due to config push delta
@mssonicbld (Collaborator)

@AnantKishorSharma PR conflicts with 202205 branch

@AnantKishorSharma (Contributor, Author) commented Feb 1, 2024

@AnantKishorSharma PR conflicts with 202205 branch

Created #11550 manually.

wangxin pushed a commit that referenced this pull request Feb 2, 2024
…ush delta (#11550)

Original PR: #8363

(A) The test fails because execution of the mux toggle JSON file fails, as the swss container is not running.
(B) The test fails because the trigger happens before the mux toggle config has been pushed from orchagent for all 36 ports and has taken effect at sairedis. Ports are selected randomly, so the issue is intermittent (a run passes only if the config for the selected ports has taken effect at sairedis by the time the trigger happens). Across ~10 runs, it took between 18 and 21 s from the time the ansible command for the JSON file was executed until the config finished at sairedis for all 36 ports. In the mocked T0 DToR case we cannot check the mux status, so we rely on a sleep for the config to finish.
@AnantKishorSharma (Contributor, Author)

@kevinskwang, @mssonicbld , @wangxin , could you please approve this PR for the 202311 branch? It is missing from 202311 and is causing the same failures there.

@AnantKishorSharma (Contributor, Author)

@kevinwangsk , could you please add the "Approved for 202311 branch" label to this PR for the cherry-pick? Please let us know if we need to create it manually.

mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request May 30, 2024
…t#8363)

* fix for failures in orchagent_standby_tor_downstream script

* Update test_orchagent_standby_tor_downstream.py

* fix for mocked T0 DToR TC failures due to config push delta
@mssonicbld (Collaborator)

Cherry-pick PR to 202311: #13051

@wsycqyz (Contributor) commented May 30, 2024

@AnantKishorSharma Please report back if the issue is still there after being merged to 202311

mssonicbld pushed a commit that referenced this pull request May 30, 2024
* fix for failures in orchagent_standby_tor_downstream script

* Update test_orchagent_standby_tor_downstream.py

* fix for mocked T0 DToR TC failures due to config push delta
@AnantKishorSharma (Contributor, Author)

@AnantKishorSharma Please report back if the issue is still there after being merged to 202311

Hi @wsycqyz , after merging this PR we still saw the failure and had to increase the delay; please review #13625.
