Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[teammgr] Added LAG member check into addLagMember() #2464

Merged
merged 1 commit into from
Nov 1, 2022

Conversation

akokhan
Copy link
Contributor

@akokhan akokhan commented Sep 20, 2022

Signed-off-by: Andriy Kokhan andriyx.kokhan@intel.com

What I did
Added a check into addLagMember() whether this new LAG member still exists in the kernel.

Why I did it
During syncd container autorestart scenario, on syncd exit, the host interfaces (tun/tap netdevs) go to the DOWN state and then get removed.

Due to the validation as follows, the teammgr will receive the notification about the port state change (the information will be updated in the state DB and pubsub message sent) but the port state record will not be removed from the state DB on port delete:

if (master && nlmsg_type == RTM_DELLINK)

Due to this, on port state change notification, the isPortStateOk() will succeed and TeamMgr::addLagMember() will be executed even the host interface was actually removed.

The operation is expected to be ignored if the port is already enslaved:

if (isPortEnslaved(member))

The check fails since the port has already been removed:

return lstat(path.c_str(), &buf) == 0;

As a result, the TeamMgr::addLagMember() logic will be executed and failed:

Jun 21 11:47:12.265955 cab18-7-dut INFO teamd#/supervisord: teammgrd Cannot find device "Ethernet0"
Jun 21 11:47:12.294550 cab18-7-dut INFO teamd#/supervisord: teammgrd libteamdctl: cli_usock_process_msg: usock: Error message received: "NoSuchDev"
Jun 21 11:47:12.294550 cab18-7-dut INFO teamd#/supervisord: teammgrd libteamdctl: cli_usock_process_msg: usock: Error message content: "No such device."
Jun 21 11:47:12.294550 cab18-7-dut INFO teamd#/supervisord: teammgrd command call failed (Invalid argument)
Jun 21 11:47:12.322497 cab18-7-dut INFO teamd#/supervisord: teammgrd libteamdctl: cli_usock_process_msg: usock: Error message received: "NoSuchDev"
Jun 21 11:47:12.322497 cab18-7-dut INFO teamd#/supervisord: teammgrd libteamdctl: cli_usock_process_msg: usock: Error message content: "No such device."
Jun 21 11:47:12.322497 cab18-7-dut INFO teamd#/supervisord: teammgrd command call failed (Invalid argument)
Jun 21 11:47:12.328844 cab18-7-dut ERR teamd#teammgrd: :- checkPortIffUp: Failed to get port Ethernet0 flags
Jun 21 11:47:12.328844 cab18-7-dut ERR teamd#teammgrd: :- addLagMember: Failed to add Ethernet0 to port channel PortChannel102

The issue started to reproduce after #2233

How I verified it

autorestart/test_container_autorestart.py -k 'syncd' 

Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com>
@akokhan
Copy link
Contributor Author

akokhan commented Sep 22, 2022

@liorghub , @prsunny , please take a look

@akokhan akokhan changed the title Added LAG member check into addLagMember() [teammgr] Added LAG member check into addLagMember() Sep 23, 2022
@prsunny
Copy link
Collaborator

prsunny commented Sep 23, 2022

When syncd container auto-restart, it will also restart swss and cleanup the STATE_DB. so which scenario is we encounter this?

@prsunny
Copy link
Collaborator

prsunny commented Sep 23, 2022

@liorghub , seems to be introduced after #2233. Can you please review this?

@akokhan akokhan closed this Sep 27, 2022
@akokhan akokhan reopened this Sep 27, 2022
@akokhan
Copy link
Contributor Author

akokhan commented Sep 27, 2022

When syncd container auto-restart, it will also restart swss and cleanup the STATE_DB. so which scenario is we encounter this?

On syncd exit, the host interfaces (tun/tap netdevs) go to the DOWN state and then get removed. Looks like this triggers the notification to teammgr before STATE_DB gets cleaned up. That's why teammgr tries to add LAG members which do not exists any more.

@akokhan
Copy link
Contributor Author

akokhan commented Sep 29, 2022

@judyjoseph , @prsunny , please review. Thanks

1 similar comment
@akokhan
Copy link
Contributor Author

akokhan commented Sep 30, 2022

@judyjoseph , @prsunny , please review. Thanks

@judyjoseph
Copy link
Contributor

@judyjoseph , @prsunny , please review. Thanks

Taking a look @akokhan -- thanks

@akokhan
Copy link
Contributor Author

akokhan commented Oct 5, 2022

@judyjoseph , did you get a chance to review this? Thanks

@msosyak
Copy link
Contributor

msosyak commented Oct 19, 2022

@judyjoseph , @prsunny , please review. Thanks

@akokhan
Copy link
Contributor Author

akokhan commented Oct 25, 2022

@judyjoseph , did you get a chance to check this PR? Thanks

@akokhan
Copy link
Contributor Author

akokhan commented Oct 28, 2022

@prsunny , please approve and merge. Thank you.

@prsunny prsunny merged commit 52c561f into sonic-net:master Nov 1, 2022
yenlu-keith pushed a commit to yenlu-keith/sonic-swss-1 that referenced this pull request Feb 13, 2023
*[teammgr] Added LAG member check into addLagMember()
yxieca pushed a commit that referenced this pull request Feb 21, 2023
*[teammgr] Added LAG member check into addLagMember()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants