
[teamd] When removing port from portchannel, traffic disruption for 60 seconds #12969

Closed
liorghub opened this issue Dec 6, 2022 · 3 comments
Labels: NVIDIA, Triaged (this issue has been triaged)

Comments

liorghub (Contributor) commented Dec 6, 2022

Description

Steps to reproduce the issue:

Configure the following on switch 1:
config int speed Ethernet32 10000
config int speed Ethernet40 10000
config portchannel add PortChannel0001
config portchannel member add PortChannel0001 Ethernet32
config portchannel member add PortChannel0001 Ethernet40
config interface ip add PortChannel0001 1.0.0.1/24
config interface ip add PortChannel0001 2001::1/64

Configure the following on switch 2:
config int speed Ethernet72 10000
config int speed Ethernet80 10000
config portchannel add PortChannel0001
config portchannel member add PortChannel0001 Ethernet72
config portchannel member add PortChannel0001 Ethernet80
config interface ip add PortChannel0001 1.0.0.2/24
config interface ip add PortChannel0001 2001::2/64

Run the following on switch 1:
ping 1.0.0.2
Check through which physical port traffic is flowing.
Remove this port from portchannel:
config portchannel member del PortChannel0001 Ethernet32

Describe the results you received:

Ping stops for 60 seconds.
After 60 seconds, ping resumes successfully.

Describe the results you expected:

Expected traffic loss of only a few seconds.

Output of show version:

root@sonic:/home/admin# show version

SONiC Software Version: SONiC.202205_1_rc.4-7aa8502d0_Internal
Distribution: Debian 11.5
Kernel: 5.10.0-12-2-amd64
Build commit: 7aa8502d0
Build date: Thu Nov 24 12:21:41 UTC 2022
Built by: sw-r2d2-bot@r-build-sonic-ci03-244

Platform: x86_64-mlnx_msn2410-r0
HwSKU: ACS-MSN2410
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1610X04134
Model Number: MSN2410-CB2F
Hardware Revision: Not Specified
Uptime: 17:23:06 up 0 min,  1 user,  load average: 3.72, 1.07, 0.37
Date: Tue 06 Dec 2022 17:23:06

Docker images:
REPOSITORY                                         TAG                                IMAGE ID       SIZE
docker-orchagent                                   202205_1_rc.4-7aa8502d0_Internal   4934c5e9a09c   478MB
docker-orchagent                                   latest                             4934c5e9a09c   478MB
docker-fpm-frr                                     202205_1_rc.4-7aa8502d0_Internal   df24b8dda4f0   489MB
docker-fpm-frr                                     latest                             df24b8dda4f0   489MB
docker-teamd                                       202205_1_rc.4-7aa8502d0_Internal   2b65c7a94402   459MB
docker-teamd                                       latest                             2b65c7a94402   459MB
docker-platform-monitor                            202205_1_rc.4-7aa8502d0_Internal   66bc72ac3bea   867MB
docker-platform-monitor                            latest                             66bc72ac3bea   867MB
docker-macsec                                      latest                             7eb2050765e6   461MB
docker-syncd-mlnx                                  202205_1_rc.4-7aa8502d0_Internal   c7ce148bb5aa   862MB
docker-syncd-mlnx                                  latest                             c7ce148bb5aa   862MB
docker-snmp                                        202205_1_rc.4-7aa8502d0_Internal   40c5b00be439   488MB
docker-snmp                                        latest                             40c5b00be439   488MB
docker-dhcp-relay                                  latest                             41cb40231b85   453MB
docker-sonic-telemetry                             202205_1_rc.4-7aa8502d0_Internal   1af21b53dbbf   524MB
docker-sonic-telemetry                             latest                             1af21b53dbbf   524MB
docker-lldp                                        202205_1_rc.4-7aa8502d0_Internal   ea3af087e7ce   486MB
docker-lldp                                        latest                             ea3af087e7ce   486MB
docker-database                                    202205_1_rc.4-7aa8502d0_Internal   b66c8dcf5852   443MB
docker-database                                    latest                             b66c8dcf5852   443MB
docker-mux                                         202205_1_rc.4-7aa8502d0_Internal   cb30d70c7d4a   492MB
docker-mux                                         latest                             cb30d70c7d4a   492MB
docker-router-advertiser                           202205_1_rc.4-7aa8502d0_Internal   36d2d71c98b3   443MB
docker-router-advertiser                           latest                             36d2d71c98b3   443MB
docker-sonic-mgmt-framework                        202205_1_rc.4-7aa8502d0_Internal   1df66ddde189   557MB
docker-sonic-mgmt-framework                        latest                             1df66ddde189   557MB
docker-nat                                         202205_1_rc.4-7aa8502d0_Internal   5b47aa79540c   430MB
docker-nat                                         latest                             5b47aa79540c   430MB
docker-sflow                                       202205_1_rc.4-7aa8502d0_Internal   5441306adc85   428MB
docker-sflow                                       latest                             5441306adc85   428MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/doroce      1.0.0-master-internal-8            5c9f7110ace8   611MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh   1.3.1-202205-internal-7            369c313f8355   648MB

root@sonic:/home/admin# 

Output of show techsupport:

sonic_dump_SW1.tar.gz
sonic_dump_SW2.tar.gz

Additional information you deem important (e.g. issue happens only occasionally):

The issue is in teamd.
In the current implementation, the NOS is called to remove the port from the LAG (via the TeamSync::TeamPortSync::onChange() callback) only after a 60-second timeout during which the port received no LACP PDU.
The current implementation relies on the netdev operational state going down (which happens when a port is removed from a LAG) to propagate to the peer switch.
The expected behavior is that the protocol itself handles removal of the peer port from the LAG, regardless of the peer's netdev operational state.
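The timeout behavior above can be sketched as a small, purely illustrative model (hypothetical class and method names; this is not teamd source code): the peer switch keeps a LAG member active until 60 seconds pass without an LACP PDU on it, so traffic keeps hashing onto the removed port for the whole interval.

```python
# Purely illustrative model of the 60-second expiry described above.
# Names and the constant mirror this report, not teamd internals.

class LagMember:
    """One LAG member as seen by the *peer* switch."""

    TIMEOUT = 60  # seconds without an LACP PDU before the member is expired

    def __init__(self, name):
        self.name = name
        self.last_pdu = 0.0
        self.active = True

    def on_lacpdu(self, now):
        # Receiving a PDU refreshes the expiry timer.
        self.last_pdu = now
        self.active = True

    def tick(self, now):
        # The peer only notices the removal once the timer expires;
        # until then it keeps hashing traffic onto the dead member.
        if now - self.last_pdu >= self.TIMEOUT:
            self.active = False


def active_members(members, now):
    """Members the peer still considers usable at time `now`."""
    for m in members:
        m.tick(now)
    return [m.name for m in members if m.active]
```

With two members where one stops sending PDUs at t=0, both are still reported active at t=59, and only the live one at t=60 — matching the 60 seconds of ping loss observed above.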

tjchadaga (Contributor) commented:

@liorghub - could you please clarify where exactly the ping packets are dropped? Please check whether the request or the response packets are dropped.

tjchadaga added the Triaged and NVIDIA labels Dec 7, 2022
liorghub (Contributor, Author) commented:

@tjchadaga - I debugged it. When we remove a port from the LAG on switch A, the request egresses from the other LAG member port (as expected) and arrives at peer switch B. Switch B does not know about the port removal on switch A and keeps sending the ICMP replies on the port whose peer was removed from the LAG.

liorghub (Contributor, Author) commented:

A fix for this issue was provided in #14002.
