[neighsyncd] increase neighsyncd timeout #2209

Merged 1 commit on Mar 31, 2022

@stepanblyschak stepanblyschak commented Mar 29, 2022

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>

What I did

Increased the neighsyncd timeout.

Why I did it

restore_neighbors takes some time to start, so the neighsyncd timeout can expire before restore_neighbors has finished.

admin@arc-switch1004:~$ sudo zgrep 'restore_neigh' /var/log/syslog.4 | head -n5
Mar 29 09:21:32.330716 arc-switch1004 INFO swss#supervisord 2022-03-29 09:21:32,316 INFO spawned: 'restore_neighbors' with pid 67
Mar 29 09:21:32.336639 arc-switch1004 INFO swss#supervisord 2022-03-29 09:21:32,329 INFO success: restore_neighbors entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Mar 29 09:21:43.017175 arc-switch1004 INFO swss#restore_neighbor: restore_neighbors service is started

Mar 29 09:23:36.163758 arc-switch1004 INFO swss#restore_neighbor: restore_neighbor service is done for system warmreboot
admin@arc-switch1004:~$ sudo zgrep 'neighsyncd' /var/log/syslog.4 | head -n5
Mar 29 09:21:11.052492 arc-switch1004 NOTICE root: WARMBOOT_FINALIZER : Waiting for components: ' bgp orchagent neighsyncd' to reconcile ...
Mar 29 09:21:32.404728 arc-switch1004 INFO swss#supervisord 2022-03-29 09:21:32,401 INFO spawned: 'neighsyncd' with pid 69
Mar 29 09:21:32.499778 arc-switch1004 NOTICE swss#neighsyncd: :- checkWarmStart: neighsyncd doing warm start, restore count 2
Mar 29 09:21:32.500554 arc-switch1004 NOTICE swss#neighsyncd: :- getWarmStartTimer: warmStartTimer is not configured or invalid for docker: swss, app: neighsyncd
Mar 29 09:21:32.501239 arc-switch1004 NOTICE swss#neighsyncd: :- setWarmStartState: neighsyncd warm start state changed to initialized

Mar 29 09:21:33.534436 arc-switch1004 INFO swss#supervisord 2022-03-29 09:21:33,483 INFO success: neighsyncd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Mar 29 09:21:43.105898 arc-switch1004 NOTICE swss#python3: :- checkWarmStart: neighsyncd doing warm start, restore count 0
Mar 29 09:23:34.014537 arc-switch1004 ERR swss#neighsyncd: :- main: neighbor table restore is not finished after timed-out, exit!!!
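
The timing in the excerpt above can be checked with a little arithmetic. This is only a sketch: the timestamps are copied from the log lines, the year 2022 is taken from the supervisord entries, and treating the neighsyncd checkWarmStart line as the moment its wait begins is an assumption, not something the log states. The variable names are made up for illustration.

```python
from datetime import datetime

def ts(stamp: str) -> datetime:
    # The syslog prefix has no year; 2022 is taken from the supervisord entries.
    return datetime.strptime("2022 " + stamp, "%Y %b %d %H:%M:%S.%f")

restore_spawned = ts("Mar 29 09:21:32.330716")   # supervisord: spawned 'restore_neighbors'
neighsyncd_start = ts("Mar 29 09:21:32.499778")  # neighsyncd checkWarmStart (assumed start of its wait)
restore_started = ts("Mar 29 09:21:43.017175")   # "restore_neighbors service is started"
neighsyncd_abort = ts("Mar 29 09:23:34.014537")  # "neighbor table restore is not finished after timed-out"
restore_done = ts("Mar 29 09:23:36.163758")      # "restore_neighbor service is done"

# Time restore_neighbors spent between being spawned and logging that it started.
startup_delay = (restore_started - restore_spawned).total_seconds()
# How far behind neighsyncd the restore script was when it logged that it started.
start_lag = (restore_started - neighsyncd_start).total_seconds()
# How long neighsyncd waited before giving up.
neighsyncd_waited = (neighsyncd_abort - neighsyncd_start).total_seconds()
# How long the restore script ran once started.
restore_runtime = (restore_done - restore_started).total_seconds()
# How much longer neighsyncd would have had to wait.
missed_by = (restore_done - neighsyncd_abort).total_seconds()

for name, value in [
    ("startup_delay", startup_delay),
    ("start_lag", start_lag),
    ("neighsyncd_waited", neighsyncd_waited),
    ("restore_runtime", restore_runtime),
    ("missed_by", missed_by),
]:
    print(f"{name}: {value:.3f} s")
```

On these numbers restore_neighbors needed about 10.7 s to get from spawn to "started", ran for about 113.1 s, and finished about 2.1 s after neighsyncd had already given up after waiting about 121.5 s.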

How I verified it

py.test platform_tests/test_advanced_reboot.py::test_warm_reboot_sad[sad_lag_member] --inventory="../ansible/inventory,../ansible/veos" --host-pattern arc-switch1004 --module-path ../ansible/library/ --testbed arc-switch1004-t0-56 --testbed_file ../ansible/testbed.csv --allow_recover --log-cli-level info --skip_sanity

Details if related

Required for:

  • 202111
  • 202106
  • 202012

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@stepanblyschak
Contributor Author

/azpw run Azure.sonic-swss

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-swss

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@liat-grozovik
Collaborator

@neethajohn , @prsunny could you please help to review?

@prsunny
Collaborator

prsunny commented Mar 30, 2022

This has been a day 1 change. Just curious, why is it having an impact now?

@stepanblyschak
Contributor Author

@prsunny In a sad-path test case, the restore_neighbors.py script always waits the full 110 sec because some neighbors are no longer reachable. neighsyncd currently waits 120 sec, only 10 sec more, on the assumption that neighsyncd and restore_neighbors.py start their timers within 10 sec of each other. There is no guarantee of that: supervisord may start these daemons with some delay, and restore_neighbors.py itself starts slowly because of many heavy Python imports before it starts its timer. So it only happened to work because a second or two remained before the neighsyncd timeout expired. Any change in SONiC that causes a CPU usage spike while restore_neighbors.py is starting can add a bit more delay than before and make neighsyncd crash. If we want to lower the timeout, there should be a more robust mechanism for synchronizing these two components.
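
To make the margin in that explanation concrete, here is a minimal sketch of the race as a simple additive model. The 110 sec and 120 sec values come from the comment above. The function name, its parameters, and the idea that the start lag and the sad-path wait simply add up are assumptions for illustration, not code from neighsyncd or restore_neighbors.py.

```python
RESTORE_WAIT_S = 110.0        # sad-path wait in restore_neighbors.py (from the comment above)
NEIGHSYNCD_TIMEOUT_S = 120.0  # neighsyncd wait before this change (from the comment above)

def slack_seconds(start_lag_s: float, timeout_s: float = NEIGHSYNCD_TIMEOUT_S) -> float:
    """Seconds left on the neighsyncd timer when restore_neighbors.py finishes.

    start_lag_s is how long after neighsyncd starts waiting that
    restore_neighbors.py starts its own 110 sec timer (hypothetical parameter).
    A negative result means neighsyncd times out first and exits.
    """
    return timeout_s - (start_lag_s + RESTORE_WAIT_S)

# A start lag under 10 sec leaves a small positive margin.
print(slack_seconds(8.0))
# A slightly slower start, e.g. under a CPU spike, pushes the margin negative.
print(slack_seconds(10.5))
```

Under this model the whole safety margin is the 10 sec difference between the two timers, so any start lag above 10 sec is enough to make neighsyncd exit.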

@liat-grozovik liat-grozovik merged commit 5575935 into sonic-net:master Mar 31, 2022
qiluo-msft pushed a commit that referenced this pull request Apr 3, 2022
- What I did
Increased the neighsyncd timeout.

- Why I did it
Restore_neigh takes a bit more time to start thus it could be that the neighsyncd timeout is not enough to wait for restore_neighbors

- How I verified it
py.test platform_tests/test_advanced_reboot.py::test_warm_reboot_sad[sad_lag_member] --inventory="../ansible/inventory,../ansible/veos" --host-pattern arc-switch1004 --module-path ../ansible/library/ --testbed arc-switch1004-t0-56 --testbed_file ../ansible/testbed.csv --allow_recover --log-cli-level info --skip_sanity

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
judyjoseph pushed a commit that referenced this pull request Apr 4, 2022
vivekrnv pushed a commit to vivekrnv/sonic-swss that referenced this pull request Jul 25, 2022
@vivekrnv
Contributor

Request for 201911

abdosi pushed a commit that referenced this pull request Jul 29, 2022
preetham-singh pushed a commit to preetham-singh/sonic-swss that referenced this pull request Aug 6, 2022