Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fast-reboot] ARP entries are not restored after fast-reboot #5217

Closed
stepanblyschak opened this issue Aug 19, 2020 · 12 comments · Fixed by #5391 or #5451
Closed

[fast-reboot] ARP entries are not restored after fast-reboot #5217

stepanblyschak opened this issue Aug 19, 2020 · 12 comments · Fixed by #5391 or #5451
Assignees

Comments

@stepanblyschak
Copy link
Collaborator

stepanblyschak commented Aug 19, 2020

Description

Fast-reboot should dump FDB, ARP entries known prior to reboot and set them on HW ASAP after boot.
This is not observed with ARP entries even if I workaround #5216.
Looking at logs I can observe that FDB entries were set, but neighbors were created only when ports become UP and traffic started to flow. I checked dumped arp.json and it is correct.

Steps to reproduce the issue:

  1. sudo fast-reboot -v
  2. show log -f | grep orchagent

Describe the results you received:
Neighbor entries are not restored after fast boot

Describe the results you expected:
Neighbor entries restored ASAP during fast boot

Additional information you deem important (e.g. issue happens only occasionally):

This issue is not 100% reproducible, never seen on SONiC.201911.142-52e45e82 and also it was not observed SONiC.201911.168-309a098.
Probably, it is caused by 5be374c that exposed some race between swssconfig and orchagent.

**Output of `show version`:**
SONiC Software Version: SONiC.201911.172-1abf7be7
Distribution: Debian 9.13
Kernel: 4.9.0-11-2-amd64
Build commit: 1abf7be7
Build date: Tue Aug 18 04:33:33 UTC 2020
Built by: johnar@jenkins-worker-8

Platform: x86_64-mlnx_msn2700-r0
HwSKU: ACS-MSN2700
ASIC: mellanox
Serial Number: MT1822K07815
Uptime: 09:18:51 up 17:21,  1 user,  load average: 0.13, 0.12, 0.10

Docker images:
REPOSITORY                    TAG                   IMAGE ID            SIZE
docker-syncd-mlnx             201911.172-1abf7be7   f1e750afd5fc        392MB
docker-syncd-mlnx             latest                f1e750afd5fc        392MB
docker-sonic-telemetry        201911.172-1abf7be7   7ba22719d1f6        353MB
docker-sonic-telemetry        latest                7ba22719d1f6        353MB
docker-router-advertiser      201911.172-1abf7be7   d6af74e64a7d        289MB
docker-router-advertiser      latest                d6af74e64a7d        289MB
docker-sonic-mgmt-framework   201911.172-1abf7be7   70dd81d12201        429MB
docker-sonic-mgmt-framework   latest                70dd81d12201        429MB
docker-platform-monitor       201911.172-1abf7be7   25bb9a6362ca        659MB
docker-platform-monitor       latest                25bb9a6362ca        659MB
docker-fpm-frr                201911.172-1abf7be7   127bc955a4f9        334MB
docker-fpm-frr                latest                127bc955a4f9        334MB
docker-sflow                  201911.172-1abf7be7   9c8697cd760a        314MB
docker-sflow                  latest                9c8697cd760a        314MB
docker-lldp-sv2               201911.172-1abf7be7   ea53736695ce        311MB
docker-lldp-sv2               latest                ea53736695ce        311MB
docker-dhcp-relay             201911.172-1abf7be7   04d8fc8cc983        299MB
docker-dhcp-relay             latest                04d8fc8cc983        299MB
docker-database               201911.172-1abf7be7   ae9a447b86a6        289MB
docker-database               latest                ae9a447b86a6        289MB
docker-teamd                  201911.172-1abf7be7   fd29716b8b6d        313MB
docker-teamd                  latest                fd29716b8b6d        313MB
docker-snmp-sv2               201911.172-1abf7be7   06e569191f99        347MB
docker-snmp-sv2               latest                06e569191f99        347MB
docker-orchagent              201911.172-1abf7be7   6ffa04ee3ed4        332MB
docker-orchagent              latest                6ffa04ee3ed4        332MB
docker-nat                    201911.172-1abf7be7   4bbbc474ec7d        315MB
docker-nat                    latest                4bbbc474ec7d        315MB
**Attach debug file `sudo generate_dump`:**

```

syslog.txt
(paste your output here)
```

@abdosi
Copy link
Contributor

abdosi commented Aug 31, 2020

Looking into this

@lguohan
Copy link
Collaborator

lguohan commented Sep 2, 2020

@tahmed-dev , can you look at this fast reboot related issue?

@anshuv-mfst
Copy link

@tahmed-dev to look into this issue.

@liat-grozovik
Copy link
Collaborator

@tahmed-dev can you please update on status?

@tahmed-dev
Copy link
Contributor

tahmed-dev commented Sep 14, 2020

@stepanblyschak, is the issue here the FDB are not restored or the restoration process is delayed? if it is the latter, what are quantitative expectations to have FDB entries restored? I can see from the following logs that the process is taking about 55s. However the order of restoration is not clear to me. @prsunny do you see a race condition here?

Aug 19 10:22:57.094601 arc-switch1004 NOTICE swss#swssconfig: :- main: Loading config from JSON file:/fdb.json...
Aug 19 10:22:57.540613 arc-switch1004 INFO swss#supervisord 2020-08-19 10:22:48,413 INFO success: rsyslogd entered RUNNING 
Aug 19 10:22:57.544241 arc-switch1004 INFO swss#supervisord 2020-08-19 10:22:52,088 INFO spawned: 'orchagent' with pid 50
Aug 19 10:22:57.548114 arc-switch1004 INFO swss#supervisord 2020-08-19 10:22:53,512 INFO success: orchagent entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Aug 19 10:22:57.549062 arc-switch1004 INFO swss#supervisord 2020-08-19 10:22:56,185 INFO spawned: 'restore_neighbors' with pid 54
Aug 19 10:22:57.549620 arc-switch1004 INFO swss#supervisord 2020-08-19 10:22:56,264 INFO success: restore_neighbors entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Aug 19 10:22:57.549921 arc-switch1004 INFO swss#supervisord 2020-08-19 10:22:56,483 INFO spawned: 'swssconfig' with pid 55
Aug 19 10:22:57.550014 arc-switch1004 INFO swss#supervisord 2020-08-19 10:22:56,541 INFO success: swssconfig entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aug 19 10:23:36.451437 arc-switch1004 NOTICE swss#orchagent: :- doPortTask: Set port Ethernet40 admin status to up
Aug 19 10:23:36.451437 arc-switch1004 NOTICE swss#orchagent: :- doPortTask: Set port Ethernet42 MTU to 9100
Aug 19 10:23:36.885010 arc-switch1004 NOTICE swss#orchagent: :- addFdbEntry: Create dynamic FDB 00:06:07:08:09:0a on Ethernet8
Aug 19 10:23:36.886084 arc-switch1004 INFO syncd#supervisord: syncd Aug 19 10:23:36 NOTICE  SAI_UTILS: mlnx_sai_utils.c[2397]- set_dispatch_attrib_handler: Set QOS_INGRESS_BUFFER_PROFILE_LIST, key:port 10200, val:1 : [ 300000019]
Aug 19 10:23:36.886336 arc-switch1004 NOTICE swss#orchagent: :- addFdbEntry: Create dynamic FDB 24:8a:07:9c:86:0a on Ethernet20
Aug 19 10:23:36.887604 arc-switch1004 NOTICE swss#orchagent: :- addFdbEntry: Create dynamic FDB 24:8a:07:9c:86:0b on Ethernet22
Aug 19 10:23:37.336898 arc-switch1004 NOTICE swss#orchagent: :- addFdbEntry: Create dynamic FDB 72:06:00:01:03:38 on Ethernet10
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aug 19 10:23:39.432375 arc-switch1004 INFO syncd#supervisord: syncd Aug 19 10:23:39 NOTICE  SAI_FDB: mlnx_sai_fdb.c[543]- mlnx_create_fdb_entry: Create FDB entry fdb entry mac [72:06:00:01:04:23] bv_id 3e800000026 (VLAN)
Aug 19 10:23:39.432900 arc-switch1004 INFO syncd#supervisord: syncd Aug 19 10:23:39 NOTICE  SAI_FDB: mlnx_sai_fdb.c[544]- mlnx_create_fdb_entry: Attribs #0 TYPE=DYNAMIC #1 BRIDGE_PORT_ID=BRIDGE_PORT,(0:0),f,0000,0 #2 PACKET_ACTION=FORWARD 
Aug 19 10:23:39.433448 arc-switch1004 INFO syncd#supervisord: syncd Aug 19 10:23:39 NOTICE  SAI_FDB: mlnx_sai_fdb.c[543]- mlnx_create_fdb_entry: Create FDB entry fdb entry mac [72:06:00:01:04:24] bv_id 3e800000026 (VLAN)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aug 19 10:23:42.792126 arc-switch1004 INFO kernel: [   89.880826] IPv6: ADDRCONF(NETDEV_CHANGE): Vlan1000: link becomes ready
Aug 19 10:23:42.812728 arc-switch1004 DEBUG lldp#lldpmgrd: Running command: 'lldpcli configure ports Ethernet6 lldp portidsubtype local etp2b description 'Servers2:eth0''
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aug 19 10:23:46.136631 arc-switch1004 NOTICE swss#orchagent: :- addNeighbor: Created neighbor 72:06:00:01:00:00 on Vlan1000
Aug 19 10:23:46.136631 arc-switch1004 NOTICE swss#orchagent: :- addNextHop: Created next hop 192.168.0.2 on Vlan1000
Aug 19 10:23:46.139355 arc-switch1004 INFO syncd#supervisord: syncd Aug 19 10:23:46 NOTICE  SAI_NEIGHBOR: mlnx_sai_neighbor.c[149]- mlnx_create_neighbor_entry: Create neighbor entry neighbor ip 192.168.0.2 rif 58
Aug 19 10:23:46.139355 arc-switch1004 INFO syncd#supervisord: syncd Aug 19 10:23:46 NOTICE  SAI_NEIGHBOR: mlnx_sai_neighbor.c[150]- mlnx_create_neighbor_entry: Attribs #0 DST_MAC_ADDRESS=[72:06:00:01:00:00] 
Aug 19 10:23:46.139355 arc-switch1004 INFO syncd#supervisord: syncd Aug 19 10:23:46 NOTICE  SAI_NEXT_HOP: mlnx_sai_nexthop.c[286]- mlnx_create_next_hop: Create next hop, #0 TYPE=IP #1 IP=192.168.0.2 #2 ROUTER_INTERFACE_ID=ROUTER_INTERFACE,(0:0),141,0000,0 
Aug 19 10:23:46.140067 arc-switch1004 INFO syncd#supervisord: syncd Aug 19 10:23:46 NOTICE  SAI_NEXT_HOP: mlnx_sai_nexthop.c[417]- mlnx_create_next_hop: Created next hop next hop id 2
Aug 19 10:23:46.149418 arc-switch1004 DEBUG lldp#lldpmgrd: Running command: 'lldpcli configure ports Ethernet42 lldp portidsubtype local etp11b description 'Servers16:eth0''
Aug 19 10:23:46.158124 arc-switch1004 NOTICE swss#orchagent: :- addNeighbor: Created neighbor 52:54:00:8e:db:0d on PortChannel0004
Aug 19 10:23:46.159026 arc-switch1004 NOTICE swss#orchagent: :- addNextHop: Created next hop 10.0.0.63 on PortChannel0004
Aug 19 10:23:46.162175 arc-switch1004 INFO syncd#supervisord: syncd Aug 19 10:23:46 NOTICE  SAI_NEIGHBOR: mlnx_sai_neighbor.c[149]- mlnx_create_neighbor_entry: Create neighbor entry neighbor ip 10.0.0.63 rif 55
Aug 19 10:23:46.162325 arc-switch1004 INFO syncd#supervisord: syncd Aug 19 10:23:46 NOTICE  SAI_NEIGHBOR: mlnx_sai_neighbor.c[150]- mlnx_create_neighbor_entry: Attribs #0 DST_MAC_ADDRESS=[52:54:00:8e:db:0d] 
Aug 19 10:23:46.162433 arc-switch1004 INFO syncd#supervisord: syncd Aug 19 10:23:46 NOTICE  SAI_NEXT_HOP: mlnx_sai_nexthop.c[286]- mlnx_create_next_hop: Create next hop, #0 TYPE=IP #1 IP=10.0.0.63 #2 ROUTER_INTERFACE_ID=ROUTER_INTERFACE,(0:0),101,0000,0 

@stepanblyschak
Copy link
Collaborator Author

@tahmed-dev FDBs are restored, but ARP entries are not. ARP is populated only after the port is operationally up and then the system learns ARP entries. Yesterday I did another test and found out that when I change neighsyncd to start when orchagent is running the ARPs are restored. Looks like a race between swssconfig and neighsyncd. Basically what I did is in https://github.com/Azure/sonic-buildimage/blob/master/dockers/docker-orchagent/supervisord.conf I changed:

[program:neighsyncd]
command=/usr/bin/neighsyncd
priority=7
autostart=false
autorestart=false
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=swssconfig:exited

to

[program:neighsyncd]
command=/usr/bin/neighsyncd
priority=7
autostart=false
autorestart=false
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=orchagent:running

Hope, that can help.

@tahmed-dev
Copy link
Contributor

Thanks @stepanblyschak ! I am a bit reluctant to accept that the above change resolves the issue for two reasons, 1) the issues as stated is hard to produces and so we might not be hitting the race condition, and/or 2) the swssconfig process does wait for orchagent process to be running and so the change does cause the neighsyncd to start a bit earlier.

I found a bug with the way arp_update process is started (or not so to speak) and I cannot see this process running on master. I am building a fix for it in order to get arp_update process running again. I also noticed that PR:4165 relocate 5min sleep before the the arp refresh from neigh table. Is this still an issue after enabling the arp_update process?

@tahmed-dev
Copy link
Contributor

Hey @stepanblyschak can you please try once more with PR:5391?

@lguohan lguohan linked a pull request Sep 17, 2020 that will close this issue
3 tasks
@tahmed-dev
Copy link
Contributor

@stepanblyschak please reopen if the issue is not resolved on your end.

@stepanblyschak
Copy link
Collaborator Author

Still reproducible on hash 13d28f9

@tahmed-dev
Copy link
Contributor

@stepanblyschak do you see arp_update process started in the syslog? Do you think PR:4165's relocation of 5min timer to be before the ping cmd has an effect?

@stepanblyschak
Copy link
Collaborator Author

Thanks @tahmed-dev!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants