Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't put down LAG interface when it starts in WR mode #2257

Merged
merged 3 commits into from
Nov 20, 2018

Conversation

pavel-shirshov
Copy link
Contributor

@pavel-shirshov pavel-shirshov commented Nov 14, 2018

- What I did
I changed teamd logic carrier logic for LAG interface in WR mode:

- How I did it
When teamd starts it reads a carrier state for LAG interface. If the state is down, we don't need any special logic here, so teamd will work with the carrier in normal mode. if the state is up, we enable wr_carrier mode. In this mode we will prevent the LAG interface going down. We exits from this mode when either teamd logic decides it's time to raise the interface up, or we have a timer expired.
Current timer value is 3 second. We introduce the timer to be sure we will exit from the mode eventually.

- How to verify it

  1. build code
  2. install it into your DUT
  3. pkill teamd
  4. enable WR mode
  5. restart teamd container. bgpd session must be up

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)


- /* initialize carrier control */
- err = team_carrier_set(ctx->th, false);
+ err = team_carrier_set(ctx->th, ctx->warm_start);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of warm start, should we keep the carrier state as is? I,E, don't set the carrier to kernel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we start teamd after system restart, the carrier is down. We need it to be up (if the carrier was up before reboot).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking more about teamd warm-restart. If the carrier was down before warm-restart of teamd docker, should we unconditionally set carrier to true after warm restart?

}
-
- lacp->carrier_up = false;
+ lacp->carrier_up = ctx->warm_start;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably could read the carrier state back and save it to "lacp->carrier_up"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we don't do system reboot your approach will work. When we do the system reboot the carrier state would be 'down' on reading

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.
If we want to support docker level warm restart, sounds like we may need save the carrier state before warm restart or system warm reboot, and restore them afterwards?

return lacp_set_carrier(lacp, true);
+ }
}
+ if (lacp->ctx->warm_start && !lacp->ctx->warm_start_started)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason we need this two lines here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to put the LAG interface up only once, when teamd started. After teamd calculate the interface is allowed to be up in legit way, we disable our shortcut here. So the shortcut mode should work only once, on the start of teamd.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your explanation. I probably missed something here, Isn't "lacp_carrier_init" to set the carrier state after teamd started and this function (lacp_update_carrier) is only hit if we do port update etc later? "lacp_carrier_init" should handle the different reboot modes, but I am not sure if we need do something special in this function for warm reboot/restart case.

@pavel-shirshov
Copy link
Contributor Author

Address PR comments

return lacp_set_carrier(lacp, true);
+ }
}
+ if (lacp->ctx->warm_start && !lacp->ctx->warm_start_started)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change, even if teamd was started with warmboot mode, we should still set the carrier according to the port enabled state (e,g, number of enabled port less than the min_ports , we should put it down. etc.) during the run time to keep the state right? Do you think lacp_carrier_init should be able to restore to whatever state before warmboot and we don't need do anything special here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we start teamd in WR mode we don't have any information about current desired state of the LAG interface.
If the box wasn't restarted, we could read the current state of teamd, and consider that no interfaces wasn't put down during transient period. Then teamd will read saved LACP PDUs, and restore previous state of itself. If some state changed since stop of previous instance of teamd, the new instance of teamd will catch it up eventually (up to 90 seconds in the worst case)
If the box was restarted, we must create LAG interface and change it state to the saved state ASAP. This patch doesn't have this behavior. Current patch doesn't change state the LAG interface

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because, if the previous state was down, we don't put the LAG interface up, only because we're in WR mode.

But if the previous state was up, we put the interface up as soon as we get into lacp_update_carrier function, which is fast. And after the interface is up after we have enough links in the group we put teamd behavior to the normal mode.

What is bad in my patch. If the previous state of LAG was down, and after start we have LAG in down state, it'd be up as soon as we add one interface to a group, so no min_links check. I'm thinking currently how I can fix that.

@pavel-shirshov
Copy link
Contributor Author

Introduced the WR carrier logic. Updated PR overview

@pavel-shirshov pavel-shirshov merged commit f6f8880 into master Nov 20, 2018
@pavel-shirshov pavel-shirshov deleted the pavelsh/fix_wr_start branch November 20, 2018 02:29
qiluo-msft pushed a commit that referenced this pull request Jan 27, 2023
Includes below commits
```
0d5e68f5a [GCU] Ignore bgpraw table in GCU operation (#2628)
22757b1f3 Add interface link-training command into the CLI doc (#2257)
f4f857e10 [GCU] Ignore bgpraw in GCU applier (#2623)
b5ac60036 [muxcable][config] Add support to enable/disable ceasing to be an advertisement interface when `radv` service is stopped (#2622)
981f9531e [chassis][voq] Add "show fabric reachability" command. (#2528)
fba87f43f Revert (#2599)
d6d7ab37f [warm-reboot] Use kexec_file_load instead of kexec_load when available (#2608)
db4683d40 fix show techsupport error (#2597)
3d8e9c62d [GCU] Prohibit removal of PFC_WD POLL_INTERVAL field (#2545)
163e766cc [techsupport] include APPL_STATE_DB dump (#2607)
8703773eb YANG Validation for ConfigDB Updates: RADIUS_SERVER (#2604)
c2d746d4f Remove TODO comment which is no longer relevant (#2600)
f09da9983 [show] Add bgpraw to show run all (#2537)
39ac5641b Extend fast-reboot STATE_DB entry timer (#2577)
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants