Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add teamd patches to solve traffic loss issue when removing port from LAG #14002

Merged
merged 3 commits into from
Apr 7, 2023

Conversation

liorghub
Copy link
Contributor

@liorghub liorghub commented Feb 27, 2023

Why I did it

When removing port from LAG while traffic is running thorough LAG there is traffic disruption of 60 seconds.
Fix issue #14381

How I did it

The patch I added introduces "port_removing" op and call it right before Kernel is asked to remove the port.
Implement the op in LACP runner to disable the port which leads to proper LACPDU send.

How to verify it

Set LAG between 2 switches.
Set LAGs to be router port and set ip address.
In switch A send ping to ip address of LAG in switch B.
In switch B, while ping is running remove port from LAG.
Verify ping is not stopping.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211

Description for the changelog

Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@liorghub liorghub requested a review from lguohan as a code owner February 27, 2023 16:19
@liorghub liorghub marked this pull request as draft February 27, 2023 16:19
@liat-grozovik
Copy link
Collaborator

@vaibhavhd fyi, appreciate your review comments

@liorghub liorghub marked this pull request as ready for review March 15, 2023 12:48
@liat-grozovik
Copy link
Collaborator

@stepanblyschak , @vaibhavhd appreciate if you can review.

@liat-grozovik
Copy link
Collaborator

@liorghub can you confirm this can be cleanly cherry picked to 202205?

@liorghub
Copy link
Contributor Author

@liat-grozovik

@liorghub can you confirm this can be cleanly cherry picked to 202205?

Yes, this PR can be cleanly cherry picked to 202205.

@dprital
Copy link
Collaborator

dprital commented Mar 21, 2023

@judyjoseph , @saiarcot895 , Can you please review ?

@liorghub
Copy link
Contributor Author

This PR fixes issue #14381

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a .patch suffix to this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@@ -10,3 +10,5 @@
0010-When-read-of-timerfd-returned-0-don-t-consider-this-.patch
0011-Remove-extensive-debug-output.patch
0012-Increase-min_ports-upper-limit-to-1024.patch
0013-set-port-to-disabled-state-during-removal.patch
0014-dont-move-the-port-state-from-disabled-when-admin-state-is-down
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be updated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, done.

@saiarcot895
Copy link
Contributor

I tried to repro the issue described here on a t0-64-32 topo with sonic neighbors (with config_db.json modified to remove the "min_links": 2 requirement), and didn't see any ping failures while I removed and added a LAG member from an existing LAG. I did see slightly higher RTT times (probably because the primary interface was being removed, and another interface had to be selected).

Is this happening only on physical hardware, or is it on a virtual setup as well?

@liorghub
Copy link
Contributor Author

I tried to repro the issue described here on a t0-64-32 topo with sonic neighbors (with config_db.json modified to remove the "min_links": 2 requirement), and didn't see any ping failures while I removed and added a LAG member from an existing LAG. I did see slightly higher RTT times (probably because the primary interface was being removed, and another interface had to be selected).

Is this happening only on physical hardware, or is it on a virtual setup as well?

It happens only on physical hardware when both sides are switches (when one side is host and we create bonding, issue is not reproduced).

Copy link
Contributor

@saiarcot895 saiarcot895 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified that this is sort-of happening on virtual switch as well. The ICMP response packet can arrive on the interface that was just removed from the port channel (and so outside of the port channel), if it happens to be the primary interface on the peer virtual switch. It doesn't manifest as a traffic drop because rp_filter checks are disabled, so the kernel accepts it as a valid packet, even though it's outside of the port channel.

@qiluo-msft qiluo-msft merged commit 71f2a6a into sonic-net:master Apr 7, 2023
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Apr 16, 2023
… LAG (sonic-net#14002)

#### Why I did it
When removing port from LAG while traffic is running thorough LAG there is traffic disruption of 60 seconds.
Fix issue sonic-net#14381

#### How I did it
The patch I added introduces "port_removing" op and call it right before Kernel is asked to remove the port. 
Implement the op in LACP runner to disable the port which leads to proper LACPDU send.

#### How to verify it
Set LAG between 2 switches.
Set LAGs to be router port and set ip address.
In switch A send ping to ip address of LAG in switch B.
In switch B, while ping is running remove port from LAG.
Verify ping is not stopping.
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202211: #14684

keboliu pushed a commit to keboliu/sonic-buildimage that referenced this pull request Apr 18, 2023
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Apr 19, 2023
… LAG (sonic-net#14002)

#### Why I did it
When removing port from LAG while traffic is running thorough LAG there is traffic disruption of 60 seconds.
Fix issue sonic-net#14381

#### How I did it
The patch I added introduces "port_removing" op and call it right before Kernel is asked to remove the port. 
Implement the op in LACP runner to disable the port which leads to proper LACPDU send.

#### How to verify it
Set LAG between 2 switches.
Set LAGs to be router port and set ip address.
In switch A send ping to ip address of LAG in switch B.
In switch B, while ping is running remove port from LAG.
Verify ping is not stopping.
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202205: #14716

mssonicbld pushed a commit that referenced this pull request Apr 19, 2023
… LAG (#14002)

#### Why I did it
When removing port from LAG while traffic is running thorough LAG there is traffic disruption of 60 seconds.
Fix issue #14381

#### How I did it
The patch I added introduces "port_removing" op and call it right before Kernel is asked to remove the port. 
Implement the op in LACP runner to disable the port which leads to proper LACPDU send.

#### How to verify it
Set LAG between 2 switches.
Set LAGs to be router port and set ip address.
In switch A send ping to ip address of LAG in switch B.
In switch B, while ping is running remove port from LAG.
Verify ping is not stopping.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants