warm-reboot test fails with #4125 #4159

Closed
stepanblyschak opened this issue Feb 18, 2020 · 2 comments

stepanblyschak commented Feb 18, 2020

Description

An image containing change #4125 does not pass the warm-reboot test.
An image with the arp_update change reverted passes the warm-reboot test every time.

This change causes VLAN neighbor flapping during the test.
The current assumption is that the change pings VLAN neighbors. Under warm-reboot test conditions there are no real neighbors, so the kernel probably removes the neighbor, sends an ARP request, and adds the neighbor back. This causes a gap in forwarding.

Steps to reproduce the issue:

  1. Run the ansible warm-reboot test on an image that includes "Update arp_update to refresh neighbor entries from APP_DB" (#4125)
  2. Observe a ~1.5 sec gap (~200 packets lost)
  3. Observe neighbor flapping in the orchagent log
  4. Remove the changes added in #4125
  5. Rerun the ansible warm-reboot test
  6. Observe that the test passes

Describe the results you received:

12:38:55         "2020-02-18 10:37:50 : Disruption between packet ID 21223 and 21225. For 0.0070 ", 
12:38:55         "2020-02-18 10:37:51 : Disruption between packet ID 21355 and 21357. For 0.0095 ", 
12:38:55         "2020-02-18 10:37:51 : Disruption between packet ID 21848 and 21850. For 0.0071 ", 
12:38:55         "2020-02-18 10:37:51 : Disruptions happen between 0:01:51.490926 and 0:02:54.568877 after the reboot.", 
12:38:55         "2020-02-18 10:37:51 : Total incoming packets captured 22560", 
12:38:55         "2020-02-18 10:38:44 : Filtered pcap dumped to /tmp/capture_filtered.pcap", 
12:38:55         "2020-02-18 10:38:44 : Packet flow examine finished after 0:06:23.224175", 
12:38:55         "2020-02-18 10:38:44 : The longest disruption lasted 0.039 seconds. 2 packet(s) lost.", 
12:38:55         "2020-02-18 10:38:44 : Total disruptions count is 150. All disruptions lasted 2.892 seconds. Total 160 packet(s) lost", 
Feb 18 09:52:36.587278 r-tigris-13 NOTICE swss#orchagent: :- addNeighbor: Created neighbor 72:06:00:01:01:44 on Vlan1000
Feb 18 09:52:36.587538 r-tigris-13 NOTICE swss#orchagent: :- addNextHop: Created next hop 192.168.0.146 on Vlan1000
Feb 18 09:52:36.744047 r-tigris-13 NOTICE swss#orchagent: :- removeNeighbor: Removed next hop 192.168.0.240 on Vlan1000
Feb 18 09:52:36.744190 r-tigris-13 NOTICE swss#orchagent: :- removeNeighbor: Removed neighbor 72:06:00:01:02:38 on Vlan1000
Feb 18 09:52:36.999140 r-tigris-13 NOTICE swss#orchagent: :- removeNeighbor: Removed next hop 192.168.0.205 on Vlan1000
Feb 18 09:52:36.999236 r-tigris-13 NOTICE swss#orchagent: :- removeNeighbor: Removed neighbor 72:06:00:01:02:03 on Vlan1000
Feb 18 09:52:37.254715 r-tigris-13 NOTICE swss#orchagent: :- removeNeighbor: Removed next hop 192.168.0.4 on Vlan1000
Feb 18 09:52:37.254715 r-tigris-13 NOTICE swss#orchagent: :- removeNeighbor: Removed neighbor 72:06:00:01:00:02 on Vlan1000
Feb 18 09:52:37.326942 r-tigris-13 NOTICE swss#orchagent: :- addNeighbor: Created neighbor 72:06:00:01:04:12 on Vlan1000
Feb 18 09:52:37.327085 r-tigris-13 NOTICE swss#orchagent: :- addNextHop: Created next hop 192.168.1.158 on Vlan1000
Feb 18 09:52:37.414938 r-tigris-13 NOTICE swss#orchagent: :- addNeighbor: Created neighbor 72:06:00:01:01:50 on Vlan1000
Feb 18 09:52:37.415237 r-tigris-13 NOTICE swss#orchagent: :- addNextHop: Created next hop 192.168.0.152 on Vlan1000
Feb 18 09:52:37.511207 r-tigris-13 NOTICE swss#orchagent: :- removeNeighbor: Removed next hop 192.168.0.98 on Vlan1000
Feb 18 09:52:37.511299 r-tigris-13 NOTICE swss#orchagent: :- removeNeighbor: Removed neighbor 72:06:00:01:00:96 on Vlan1000
Feb 18 09:52:37.575024 r-tigris-13 NOTICE swss#orchagent: :- addNeighbor: Created neighbor 72:06:00:01:04:93 on Vlan1000
Feb 18 09:52:37.575342 r-tigris-13 NOTICE swss#orchagent: :- addNextHop: Created next hop 192.168.1.239 on Vlan1000
Feb 18 09:52:37.627138 r-tigris-13 NOTICE swss#orchagent: :- addNeighbor: Created neighbor 72:06:00:01:02:72 on Vlan1000
Feb 18 09:52:37.627358 r-tigris-13 NOTICE swss#orchagent: :- addNextHop: Created next hop 192.168.1.18 on Vlan1000
Feb 18 09:52:37.674966 r-tigris-13 NOTICE swss#orchagent: :- addNeighbor: Created neighbor 72:06:00:01:02:03 on Vlan1000
Feb 18 09:52:37.675130 r-tigris-13 NOTICE swss#orchagent: :- addNextHop: Created next hop 192.168.0.205 on Vlan1000
Feb 18 09:52:37.742524 r-tigris-13 NOTICE swss#orchagent: :- addNeighbor: Created neighbor 72:06:00:01:00:54 on Vlan1000
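
The flapping in the orchagent log above can be picked out mechanically. The following is a minimal sketch, not part of the test suite: `find_flaps` and `LINE_RE` are hypothetical names, the regular expression only assumes the line layout seen in this excerpt, and the year is passed in because syslog timestamps do not carry one. It reports each neighbor MAC that was removed and later re-created, along with how long it was absent.

```python
import re
from datetime import datetime

# Matches orchagent NOTICE lines of the form seen in the excerpt, e.g.
#   Feb 18 09:52:36.999236 r-tigris-13 NOTICE swss#orchagent: :- removeNeighbor: Removed neighbor 72:06:00:01:02:03 on Vlan1000
# "next hop" lines are deliberately ignored; only neighbor add/remove is tracked.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}\.\d+) \S+ NOTICE swss#orchagent: :- "
    r"(?:addNeighbor|removeNeighbor): "
    r"(?P<action>Created|Removed) neighbor (?P<mac>[0-9a-f:]{17}) on (?P<intf>\S+)$"
)


def find_flaps(lines, year=2020):
    """Return (mac, interface, seconds_absent) for each neighbor that was
    removed and later re-created within the given log lines."""
    removed_at = {}
    flaps = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        # Syslog timestamps have no year, so one is supplied (assumption).
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S.%f")
        key = (m["mac"], m["intf"])
        if m["action"] == "Removed":
            removed_at[key] = ts
        elif key in removed_at:
            gap = (ts - removed_at.pop(key)).total_seconds()
            flaps.append((key[0], key[1], gap))
    return flaps


if __name__ == "__main__":
    # Four lines copied from the log excerpt in this issue.
    sample = [
        "Feb 18 09:52:36.999140 r-tigris-13 NOTICE swss#orchagent: :- removeNeighbor: Removed next hop 192.168.0.205 on Vlan1000",
        "Feb 18 09:52:36.999236 r-tigris-13 NOTICE swss#orchagent: :- removeNeighbor: Removed neighbor 72:06:00:01:02:03 on Vlan1000",
        "Feb 18 09:52:37.674966 r-tigris-13 NOTICE swss#orchagent: :- addNeighbor: Created neighbor 72:06:00:01:02:03 on Vlan1000",
        "Feb 18 09:52:37.675130 r-tigris-13 NOTICE swss#orchagent: :- addNextHop: Created next hop 192.168.0.205 on Vlan1000",
    ]
    for mac, intf, gap in find_flaps(sample):
        print(f"{mac} on {intf} was absent for {gap:.3f} s")
```

On the four sample lines, neighbor 72:06:00:01:02:03 (192.168.0.205) is removed at 09:52:36.999 and re-created at 09:52:37.675, an absence of roughly 0.68 s. Running the same function over a full syslog would need the file read line by line and the correct year supplied.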

Describe the results you expected:

13:00:25         "2020-02-18 10:54:01 : Pcap file dumped to /tmp/capture.pcap", 
13:00:25         "2020-02-18 10:54:01 : Packet flow examine started 0:08:36.122991 after the reboot", 
13:00:25         "2020-02-18 10:59:21 : Gaps in forwarding not found.", 
13:00:25         "2020-02-18 10:59:21 : Total incoming packets captured 20329", 
13:00:25         "2020-02-18 11:00:06 : Filtered pcap dumped to /tmp/capture_filtered.pcap

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**
Distribution: Debian 9.12
Kernel: 4.9.0-9-2-amd64
Build commit: 887ea003
Build date: Sun Feb 16 03:18:30 UTC 2020
Built by: johnar@jenkins-worker-8

Platform: x86_64-mlnx_msn3800-r0
HwSKU: ACS-MSN3800
ASIC: mellanox
Serial Number: MT1937X00527
Uptime: 11:46:59 up 1 min,  1 user,  load average: 6.52, 2.46, 0.90

Docker images:
REPOSITORY                    TAG                 IMAGE ID            SIZE
docker-syncd-mlnx             HEAD.37-887ea003    30f49954e471        383MB
docker-syncd-mlnx             latest              30f49954e471        383MB
docker-router-advertiser      HEAD.37-887ea003    342b19165cc5        287MB
docker-router-advertiser      latest              342b19165cc5        287MB
docker-sonic-mgmt-framework   HEAD.37-887ea003    a6427362690d        424MB
docker-sonic-mgmt-framework   latest              a6427362690d        424MB
docker-platform-monitor       HEAD.37-887ea003    9579180a6838        626MB
docker-platform-monitor       latest              9579180a6838        626MB
docker-fpm-frr                HEAD.37-887ea003    e3c98c87380a        332MB
docker-fpm-frr                latest              e3c98c87380a        332MB
docker-sflow                  HEAD.37-887ea003    4d6c3d5932c0        312MB
docker-sflow                  latest              4d6c3d5932c0        312MB
docker-lldp-sv2               HEAD.37-887ea003    81d22202eb1b        309MB
docker-lldp-sv2               latest              81d22202eb1b        309MB
docker-dhcp-relay             HEAD.37-887ea003    a0ecd89a6a59        297MB
docker-dhcp-relay             latest              a0ecd89a6a59        297MB
docker-database               HEAD.37-887ea003    bf1bcfc91997        287MB
docker-database               latest              bf1bcfc91997        287MB
docker-teamd                  HEAD.37-887ea003    47d427f46304        312MB
docker-teamd                  latest              47d427f46304        312MB
docker-snmp-sv2               HEAD.37-887ea003    c4077a1ed04f        344MB
docker-snmp-sv2               latest              c4077a1ed04f        344MB
docker-orchagent              HEAD.37-887ea003    40a6afd1a39e        330MB
docker-orchagent              latest              40a6afd1a39e        330MB
docker-nat                    HEAD.37-887ea003    4f09c8c75d74        313MB
docker-nat                    latest              4f09c8c75d74        313MB
docker-sonic-telemetry        HEAD.37-887ea003    48294a0583b1        349MB
docker-sonic-telemetry        latest              48294a0583b1        349MB


prsunny commented Feb 25, 2020

Can you please try with the fix in #4165?

stepanblyschak commented

Fix works, thanks!
