[neighsyncd] increase neighsyncd timeout #2209
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
/azpw run Azure.sonic-swss
/AzurePipelines run Azure.sonic-swss
Azure Pipelines successfully started running 1 pipeline(s).
@neethajohn, @prsunny could you please help review?
This has been a day-1 change. Just curious, why is it having an impact now?
@prsunny In a sad-path test case, the restore_neighbors.py script will definitely wait 110 sec because some neighbors are no longer reachable. neighsyncd currently waits 120 sec, only 10 sec more, on the assumption that neighsyncd and restore_neighbors.py start their timers within 10 sec of each other. There is no guarantee of that: supervisord may start these daemons with some delay, and restore_neighbors.py itself starts a bit slowly because of many heavy Python imports before it starts its timer. So it just happened to work because a second or two was left before the neighsyncd timeout expired. Any change in SONiC could cause a CPU usage spike at the time restore_neighbors.py starts, adding a bit more delay than before and making neighsyncd crash. If we want to lower the timeout, there should be a more robust mechanism for synchronizing these two components.
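The timing argument above can be sketched as simple arithmetic. This is only an illustrative model, not code from neighsyncd or restore_neighbors.py: the function name, its parameters, and the idea of a single "start delay" are assumptions made for the example. The 110 sec and 120 sec values come from the comment above, and the strict `>` comparison at the boundary is a modeling choice.

```python
# Hypothetical model of the timer race described above.
# Names are illustrative and do not come from sonic-swss.

RESTORE_WAIT_SEC = 110        # sad-path wait in restore_neighbors.py (from the comment above)
NEIGHSYNCD_TIMEOUT_SEC = 120  # neighsyncd timeout before this change (from the comment above)


def neighsyncd_times_out(restore_start_delay_sec,
                         neighsyncd_timeout_sec=NEIGHSYNCD_TIMEOUT_SEC,
                         restore_wait_sec=RESTORE_WAIT_SEC):
    """Return True if restore_neighbors.py finishes after neighsyncd gives up.

    restore_start_delay_sec is how much later restore_neighbors.py starts
    its timer relative to the moment neighsyncd starts its own timer.
    """
    restore_done_at = restore_start_delay_sec + restore_wait_sec
    return restore_done_at > neighsyncd_timeout_sec


if __name__ == "__main__":
    # Try a few start delays to see where the 10 sec margin runs out.
    for delay in (2, 8, 12):
        print(delay, neighsyncd_times_out(delay))
```

In this model, any start delay above 10 sec (slow imports, a delayed start by supervisord, a CPU spike) is enough for the neighsyncd timer to expire first, which is why a larger timeout widens the margin without removing the underlying race.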
- What I did
  Increased the neighsyncd timeout.
- Why I did it
  restore_neighbors takes a bit longer to start, so the neighsyncd timeout may not be long enough to wait for it.
- How I verified it
  py.test platform_tests/test_advanced_reboot.py::test_warm_reboot_sad[sad_lag_member] --inventory="../ansible/inventory,../ansible/veos" --host-pattern arc-switch1004 --module-path ../ansible/library/ --testbed arc-switch1004-t0-56 --testbed_file ../ansible/testbed.csv --allow_recover --log-cli-level info --skip_sanity
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Request for 201911
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
What I did
Increased the neighsyncd timeout.
Why I did it
restore_neighbors takes a bit longer to start, so the neighsyncd timeout may not be long enough to wait for it.
How I verified it
Details if related
Required for: