Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix potential blackholing/looping traffic when link-local was used and refresh ipv6 neighbor to avoid CPU hit #1904

Merged
merged 4 commits into from
Aug 12, 2018

Conversation

zhenggen-xu
Copy link
Collaborator

Fix potential blackholing/looping traffic and refresh ipv6 neighbor to avoid CPU hit

In case ipv6 global addresses were configured on L3 interfaces and used for peering,
and routing protocol was using link-local addresses on the same interfaces as prefered nexthops,
the link-local addresses could be aged out after a while due to no activities towards the link-local
addresses themselves. And when we receive new routes with the link-local nexthops, SONiC won't insert
them to the HW, and thus cause looping or blackholing traffic.

Global ipv6 addresses on L3 interfaces between switches are refreshed by BGP keeplive and other messages.

On server facing side, traffic may hit fowarding plane only, and no refresh for the ipv6 neighbor entries regularly.
This could age-out the linux kernel ipv6 neighbor entries, and HW neighbor table entries could be removed,
and thus traffic going to those neighbors would hit CPU, and cause traffic drop and temperary CPU high load.

Also, if link-local addresses were not learned, we may not get them at all later.

It is intended to fix all above issues.

Changes:
Add ndisc6 package in swss docker and use it for ipv6 ndp ping to update the neighbors' state on Vlan interfaces
Change the default ipv6 neighbor reachable timer to 30mins
Add periodical ipv6 multicast ping to ff02::11 to get/refresh link-local neighbor info.

- What I did
Fix potential blackholing/looping traffic when ipv6 link-local was used, and refresh ipv6 neighbor to avoid CPU hit

- How I did it
Add ndisc6 package in swss docker and use it for ipv6 ndp ping to update the neighbors' state on Vlan interfaces
Change the default ipv6 neighbor reachable timer to 30mins
Add periodical ipv6 multicast ping to ff02::11 to get/refresh link-local neighbor info.

- How to verify it
-- ipv6 neighbors now stay at REACHABLE state for 30mins instead of 30seconds.

admin@lnos-x1-a-asw03:~$ clear arp
fe80::a83c:309a:5ca3:6593 dev Vlan100 lladdr 04:62:73:c4:eb:59 ref 1 used 2470/2470/39 probes 1 REACHABLE
2100::7 dev Vlan100 lladdr 04:62:73:8c:fd:6a ref 1 used 95/9/56 probes 4 REACHABLE
fe80::2a6f:7fff:feba:1cff dev eth0 lladdr 28:6f:7f:ba:1c:ff router used 30606/27426/22 probes 1 STALE
fe80::662:73ff:fe8c:fd6a dev Vlan100 lladdr 04:62:73:8c:fd:6a ref 1 used 85/85/5 probes 1 REACHABLE
fe80::2e0:ecff:fe3b:d6ac dev Ethernet122 lladdr 00:e0:ec:3b:d6:ac router ref 1 used 651/651/52 probes 1 REACHABLE
172.25.11.1 dev eth0 lladdr 28:6f:7f:ba:1c:ff ref 1 used 857/0/852 probes 1 REACHABLE
172.18.1.7 dev Vlan100 lladdr 04:62:73:8c:fd:6a ref 1 used 39523/56/56 probes 6 REACHABLE
172.25.11.46 dev eth0 lladdr 00:e0:ec:3c:09:9a ref 1 used 2475/2472/20 probes 1 REACHABLE

Round 1, deleting 8 entries
Flush is complete after 1 round

admin@lnos-x1-a-asw03:~$ ip neighbor show | grep -v FAILED
172.25.11.1 dev eth0 lladdr 28:6f:7f:ba:1c:ff REACHABLE
172.25.11.46 dev eth0 lladdr 00:e0:ec:3c:09:9a REACHABLE

admin@lnos-x1-a-asw03:~$ docker exec -it swss bash -c "/usr/bin/arp_update"

admin@lnos-x1-a-asw03:~$ ip neighbor show | grep -v FAILED
fe80::662:73ff:fe8c:fd6a dev Vlan100 lladdr 04:62:73:8c:fd:6a DELAY
fe80::2e0:ecff:fe3b:d6ac dev Ethernet122 lladdr 00:e0:ec:3b:d6:ac router DELAY
172.25.11.1 dev eth0 lladdr 28:6f:7f:ba:1c:ff REACHABLE
172.18.1.7 dev Vlan100 lladdr 04:62:73:8c:fd:6a REACHABLE
172.25.11.46 dev eth0 lladdr 00:e0:ec:3c:09:9a REACHABLE

admin@lnos-x1-a-asw03:~$ ip neighbor show | grep -v FAILED
fe80::662:73ff:fe8c:fd6a dev Vlan100 lladdr 04:62:73:8c:fd:6a REACHABLE
fe80::2e0:ecff:fe3b:d6ac dev Ethernet122 lladdr 00:e0:ec:3b:d6:ac router REACHABLE
172.25.11.1 dev eth0 lladdr 28:6f:7f:ba:1c:ff REACHABLE
172.18.1.7 dev Vlan100 lladdr 04:62:73:8c:fd:6a REACHABLE
172.25.11.46 dev eth0 lladdr 00:e0:ec:3c:09:9a REACHABLE

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

…o avoid CPU hit

In case ipv6 global addresses were configured on L3 interfaces and used for peering,
and routing protocol was using link-local addresses on the same interfaces as prefered nexthops,
the link-local addresses could be aged out after a while due to no activities towards the link-local
addresses themselves. And when we receive new routes with the link-local nexthops, SONiC won't insert
them to the HW, and thus cause looping or blackholing traffic.

Global ipv6 addresses on L3 interfaces between switches are refreshed by BGP keeplive and other messages.

On server facing side, traffic may hit fowarding plane only, and no refresh for the ipv6 neighbor entries regularly.
This could age-out the linux kernel ipv6 neighbor entries, and HW neighbor table entries could be removed,
and thus traffic going to those neighbors would hit CPU, and cause traffic drop and temperary CPU high load.

Also, if link-local addresses were not learned, we may not get them at all later.

It is intended to fix all above issues.

Changes:
Add ndisc6 package in swss docker and use it for ipv6 ndp ping to update the neighbors' state on Vlan interfaces
Change the default ipv6 neighbor reachable timer to 30mins
Add periodical ipv6 multicast ping to ff02::11 to get/refresh link-local neighbor info.
Copy link
Contributor

@prsunny prsunny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking for information, did you analyze using neighbor discovery (ndisc6 ?) or any configuration to send NS to achieve this functionality?

ping6cmd="ping6 -I $intf -n -q -i 0 -c 1 -W 0 ff02::1 >/dev/null"
intf_up=$(ip link show $intf | grep "state UP")
if [[ -n "$intf_up" ]]; then
eval $ping6cmd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not look space aligned. Can you check?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


while /bin/true; do
# find L3 interfaces which are UP, send ipv6 multicast pings
echo "{% for (name, prefix) in INTERFACE %} {{name}} {% endfor %}" > /tmp/intf_tmp.j2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would need to handle PORTCHANNEL_INTERFACE as well

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Zhen. Can we concatenate INTERFACE and PC_INTERFACE and do one for-loop. It will avoid the duplicate code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Add PORTCHANNEL_INTERFACE interface for ipv6 multicast ping
format issue
@zhenggen-xu
Copy link
Collaborator Author

ndisc6 tool is not able to discover link-local addresses if you never learnt them for some reason (timing or network issue etc).

# the neighbors state.
# arp_update:
# Send ipv6 multicast pings to all "UP" L3 interfaces including vlan interfaces to
# refresh link-local addresses from neighbors.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a concern if the vlan have many neighbors? one multicast ping will get reply from all neighbors?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, by standard all neighbors should reply to ff02::1, so it will get reply from all neighbors.
Also verified it and it worked as expected.

@@ -10,6 +10,7 @@ RUN apt-get update

RUN apt-get install -f -y ifupdown arping libdbus-1-3 libdaemon0 libjansson4

RUN apt-get install -f -y ndisc6
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add this into vs docker? we need to replicate same env in the vs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@lguohan
Copy link
Collaborator

lguohan commented Aug 10, 2018

does it make sense to eventually move this function into neighbor mgrd? which is our own version of neighbor manager.

looks like we also have issue with linux kernel neighbor gc function. we cannot really trust it. at certain point, we need to develop our own neighbor manager.

@prsunny
Copy link
Contributor

prsunny commented Aug 10, 2018

@lguohan , We would eventually need a neighbor mgr. Atleast I could think of the following cases that may need some user-space handling. We could prioritize based on use-case.

  1. Issue with gc threshold and flush of ARP entries
  2. Periodic refresh of link-local entries.
  3. Handle oper-down issue, where kernel does not delete the ARP entries but it might not be required to occupy HW resource for ARPs that belongs to a down interface
  4. Handle admin-down scenario. If there are large number of ARP entries learnt on an interface, currently we rely on the the netlink notifications from kernel and this could be considerable depending on the interace. Instead a neigbhor manager could just act on the admin-down
  5. Handle ECMP neighbor refresh. If one or two ECMP neigbhor members are resolved and programmed to ASIC, I don't think there is a mechanism to resolve the remaining members of ECMP. This could be handled by Nbr-mgr.

[Open]
6. FDB age-out in kernel and if ARP is not refreshed, kernel floods the IP packet to all member ports of Vlan.
7. Mgmt-VRF ARP reservation in case data-plane ARP fillsup the kernel threshold.

@zhenggen-xu
Copy link
Collaborator Author

It would be good to have a session in community to discuss the things around "neighbor mgrd". Source of truth should be linux kernel IMO.

Anyway, it probably makes sense to defer those implementation to later PR.

@lguohan
Copy link
Collaborator

lguohan commented Aug 12, 2018

retest this please

@lguohan
Copy link
Collaborator

lguohan commented Aug 12, 2018

@prsunny, good summary. add as an issue. #1917

@lguohan lguohan merged commit d761630 into sonic-net:master Aug 12, 2018
@zhenggen-xu zhenggen-xu deleted the github-fork-ndp-update branch June 7, 2019 19:37
stepanblyschak added a commit to stepanblyschak/sonic-buildimage that referenced this pull request Nov 1, 2021
```
ca728b8 [config] fix interface IPv6 address removal. (sonic-net#1819)
0665d6f VxLAN Tunnel Counters and Rates implementation (sonic-net#1748)
80a10dc Fix log_ssd_health hang issue (sonic-net#1904)
ea4a730 [config][cbf] Added config commands for CBF (sonic-net#1799)
02ce8d6 [sonic-package-manager] update FEATURE entries on upgrade (sonic-net#1803)
9f123c0 [generate_dump] remove secrets from dump files (sonic-net#1886)
3a8ab73 [fwutil] Add `fwutil update all` to support the automatic platform component fw updates (sonic-net#1242)
776fddf [sonic-package-manager] code style fixes and enhancements (sonic-net#1802)
f53baac [watermarkstat] Fix for error in processing empty array from couters db (sonic-net#1810)
0b2536b Generic_upater: Apply JSON change (sonic-net#1856)
```

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
liat-grozovik pushed a commit that referenced this pull request Nov 2, 2021
```
ca728b8 [config] fix interface IPv6 address removal. (#1819)
0665d6f VxLAN Tunnel Counters and Rates implementation (#1748)
80a10dc Fix log_ssd_health hang issue (#1904)
ea4a730 [config][cbf] Added config commands for CBF (#1799)
02ce8d6 [sonic-package-manager] update FEATURE entries on upgrade (#1803)
9f123c0 [generate_dump] remove secrets from dump files (#1886)
3a8ab73 [fwutil] Add `fwutil update all` to support the automatic platform component fw updates (#1242)
776fddf [sonic-package-manager] code style fixes and enhancements (#1802)
f53baac [watermarkstat] Fix for error in processing empty array from couters db (#1810)
0b2536b Generic_upater: Apply JSON change (#1856)
```

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
judyjoseph added a commit that referenced this pull request Nov 6, 2021
swss
73caba3 Allow interface type value none (#1991)

utilities
32e530f Allow interface type value none (#1902)
53f066c Fix log_ssd_health hang issue (#1904)
vivekrnv added a commit to vivekrnv/sonic-buildimage that referenced this pull request Nov 8, 2021
48035d75 [202012] [techsupport] Techsupport Error Reporting pending fixes (sonic-net#1854)
8b2ec09a Fix log_ssd_health hang issue (sonic-net#1904)
ac9c4254 Fix the option missing in kernel config issue (sonic-net#1888)
5cc9417a disk_check: Script updated to run good in 201811 & 201911 (sonic-net#1747)

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
qiluo-msft pushed a commit that referenced this pull request Nov 9, 2021
Submodule update for sonic-utilties

```
48035d75 [202012] [techsupport] Techsupport Error Reporting pending fixes (#1854)
8b2ec09a Fix log_ssd_health hang issue (#1904)
ac9c4254 Fix the option missing in kernel config issue (#1888)
5cc9417a disk_check: Script updated to run good in 201811 & 201911 (#1747)
```
taras-keryk pushed a commit to taras-keryk/sonic-buildimage that referenced this pull request Apr 28, 2022
What I did
Fix sonic-net#9114
The log_ssd_health command hangs due to timeout being used with docker exec -i which also affect warmboot flow.

How I did it
Added foreground option for timeout. This is recommended when not using the command on shell
https://man7.org/linux/man-pages/man1/timeout.1.html

How to verify it
Run log_ssd_health and verify it does not hang

Signed-off-by: Sudharsan Dhamal Gopalarathnam sudharsand@nvidia.com
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants