Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue: wrong teamd link watch state after warm reboot #14084

Merged
merged 4 commits into from
Apr 7, 2023

Conversation

stephenxs
Copy link
Collaborator

@stephenxs stephenxs commented Mar 6, 2023

Why I did it

Fix issue: wrong teamd link watch state after warm reboot due to TEAM_ATTR_PORT_CHANGED lost

The flag TEAM_ATTR_PORT_CHANGED is maintained by kernel team driver:

  • a flag "changed" is maintained in struct team_port struct
  • the flag is set by __team_port_change_send once relevant information is updated, including port linkup (together with speed, duplex), adding or removing
  • the flag is cleared by team_nl_fill_one_port_get once the updated information has been notified to user space via RTNL

In the userspace, the change flag is maintained by libteam in struct team_port.
The team daemon calls port_priv_change_handler_func on receiving port change event.
The logic in port_priv_change_handler_func

  1. creates the port if it did not exist, which triggers port add event and eventually calls lacp_port_added callback.
  2. triggers port change event if team_port->changed is true, which eventually calls lw_ethtool_event_watch_port_changed to update port state for link watch ethtool.
  3. removes the port if team_port->removed is removed

In lacp_port_added, it calls team_refresh to refresh ifinfo, port info, and option info from the kernel via RTNL.
In this step, port_priv_change_handler_func is called recursively.

  • In the inner call, it won't get TEAM_ATTR_PORT_CHANGED flag because kernel has already notified that.
  • As a result, team_port->changed flag is cleared in the libteam.
  • The port change event won't be triggered from either inner or outer call of port_priv_change_handler_func.

If the port has been up when the port is being added to the team device, the "port up" information is carried in the outer call but will be lost.

In case the flag TEAM_ATTR_PORT_CHANGED is set only in the inner call, function port_priv_change_handler_func can be called in the inner call.
However, it will fail to fetch "enable" options because option_list_init has not be called.

Signed-off-by: Stephen Sun stephens@nvidia.com

How I did it

Fix:
Do not call check_call_change_handlers when parsing RTNL function is called from another check_call_change_handlers recursively.

How to verify it

  • Manually test
  • Regression test
    • warm reboot
    • warm reboot sad lag
    • warm reboot sad lag member
    • warm reboot sad (partial)

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211

Description for the changelog

Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Stephen Sun <stephens@nvidia.com>
@liat-grozovik
Copy link
Collaborator

@vaibhavhd fyi, appreciate your comments.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs stephenxs marked this pull request as ready for review March 29, 2023 01:42
@stephenxs stephenxs requested a review from lguohan as a code owner March 29, 2023 01:42
@qiluo-msft qiluo-msft merged commit 3b5871f into sonic-net:master Apr 7, 2023
@stephenxs stephenxs deleted the fix-teamd-lw-warm-reboot branch April 8, 2023 09:48
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Apr 8, 2023
…14084)

#### Why I did it

Fix issue: wrong teamd link watch state after warm reboot due to TEAM_ATTR_PORT_CHANGED lost

The flag TEAM_ATTR_PORT_CHANGED is maintained by kernel team driver:
- a flag "changed" is maintained in struct team_port struct
- the flag is set by __team_port_change_send once relevant information is updated, including port linkup (together with speed, duplex), adding or removing
- the flag is cleared by team_nl_fill_one_port_get once the updated information has been notified to user space via RTNL

In the userspace, the change flag is maintained by libteam in struct team_port.
The team daemon calls port_priv_change_handler_func on receiving port change event.
The logic in port_priv_change_handler_func
1. creates the port if it did not exist, which triggers port add event and eventually calls lacp_port_added callback.
2. triggers port change event if team_port->changed is true, which eventually calls lw_ethtool_event_watch_port_changed to update port state for link watch ethtool.
3. removes the port if team_port->removed is removed

In lacp_port_added, it calls team_refresh to refresh ifinfo, port info, and option info from the kernel via RTNL.
In this step, port_priv_change_handler_func is called recursively.
- In the inner call, it won't get TEAM_ATTR_PORT_CHANGED flag because kernel has already notified that.
- As a result, team_port->changed flag is cleared in the libteam.
- The port change event won't be triggered from either inner or outer call of port_priv_change_handler_func.

If the port has been up when the port is being added to the team device, the "port up" information is carried in the outer call but will be lost.

In case the flag TEAM_ATTR_PORT_CHANGED is set only in the inner call, function port_priv_change_handler_func can be called in the inner call.
However, it will fail to fetch "enable" options because option_list_init has not be called.

Signed-off-by: Stephen Sun <stephens@nvidia.com>

#### How I did it

Fix:
Do not call check_call_change_handlers when parsing RTNL function is called from another check_call_change_handlers recursively.

#### How to verify it

- Manually test
- Regression test
  - warm reboot
  - warm reboot sad lag
  - warm reboot sad lag member
  - warm reboot sad (partial)
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202211: #14575

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants