[orchagent]: Added support of PFC WD for BFN platform #823

vsenchyshyn · 2019-03-29T13:24:04Z

Signed-off-by: Vitaliy Senchyshyn <vsenchyshyn@barefootnetworks.com>

What I did
Added an instance of PfcWdSwOrch in OrchDaemon::init() for the Barefoot platform. Fixed the PFC WD detection logic in the pfc_detect_barefoot.lua script.

Why I did it
PFC WD wasn't working on the BFN platform.

How I verified it

The pfc_wd community test has passed.
Tested manually on the BFN platform by simulating a PFC storm, and verified that the WD detected the storm when it occurred and restored the queue when the storm disappeared. Validated that all the counters used by the PFC WD functionality worked properly.

Details if related

Signed-off-by: Vitaliy Senchyshyn <vsenchyshyn@barefootnetworks.com>

msftclas · 2019-03-29T13:24:18Z

All CLA requirements met.

vsenchyshyn · 2019-03-29T13:25:04Z

@stcheng @marian-pritsak @lguohan Please review

orchagent/orchdaemon.cpp

Signed-off-by: Vitaliy Senchyshyn <vsenchyshyn@barefootnetworks.com>

marian-pritsak · 2019-03-29T15:21:45Z

orchagent/pfc_detect_barefoot.lua

@@ -21,70 +21,81 @@ for i = n, 1, -1 do
    local is_deadlock = false
    local pfc_wd_status = redis.call('HGET', counters_table_name .. ':' .. KEYS[i], 'PFC_WD_STATUS')
    local pfc_wd_action = redis.call('HGET', counters_table_name .. ':' .. KEYS[i], 'PFC_WD_ACTION')
-    if pfc_wd_status == 'operational' or pfc_wd_action == 'alert' then


Is this file identical to Mellanox too? If yes, will the symlink work?

Since it was there before and it's related to a different platform, it's better to leave it as is.

wendani · 2019-04-02T01:03:20Z

Has it passed the pfc watchdog test?

vsenchyshyn · 2019-04-02T12:15:24Z

> Has it passed the pfc watchdog test?

This was tested manually so far. Please take a look at the PR description.

wendani · 2019-04-02T19:55:07Z

orchagent/pfc_detect_barefoot.lua

+                        (occupancy_bytes == 0 and packets - packets_last == 0 and (pfc_duration - pfc_duration_last) > poll_time * 0.8) then
+                        if time_left <= poll_time then
+                            redis.call('HDEL', counters_table_name .. ':' .. port_id, pfc_rx_pkt_key .. '_last')
+                            redis.call('HDEL', counters_table_name .. ':' .. port_id, pfc_duration_key .. '_last')


You do not need to carry over the HDEL of _RX_PKTS_last and _RX_PAUSE_DURATION_last on lines 74 & 75 from the Mellanox script. That was added to solve the double storm signaling observed on Mellanox platforms (#697). I assume Barefoot platforms do not have this issue.

@andriymoroz-mlnx for comments

This is not a Mellanox-specific issue. It is common to this approach. If Barefoot uses the same solution (approach, counters, etc.), they can use the same script.

The issue should not be related to the counter set Mellanox chooses. Setting the counters aside, this should also happen on brcm platforms. However, we have not observed the issue on brcm platforms so far.

I think Broadcom uses a different way to detect a storm and thus does not have this issue.

The difference is only in the counter set. The idea behind it is the same: no packets are going out of the queue and the queue is paused.
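For readers following along, the "no packets going out and the queue is paused" idea can be sketched in Python. This models only the single branch visible in the diff hunk quoted earlier in this thread; the function name is hypothetical, the time units are an assumption, and the real script evaluates this inside a Redis Lua script alongside other conditions not shown here.

```python
def stalled_while_paused(occupancy_bytes, packets, packets_last,
                         pfc_duration, pfc_duration_last, poll_time):
    """Model of the one detection branch visible in the quoted hunk.

    Assumption: pfc_duration and poll_time use the same time unit.
    """
    # No packets left the queue since the previous poll.
    no_progress = (packets - packets_last) == 0
    # The queue was paused for more than 80% of the polling interval.
    paused_most_of_poll = (pfc_duration - pfc_duration_last) > poll_time * 0.8
    return occupancy_bytes == 0 and no_progress and paused_most_of_poll
```

The 0.8 factor is taken from the hunk as is; why that threshold was chosen is not stated in this thread.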

This dual storm signaling should only happen if the drop action is not installed by the orchagent fast enough. pfcactionhandler initCounters() flips PFC_WD_STATUS from operational to stormed as part of its drop actions. If PFC_WD_STATUS is not operational, the detect script exits early on line 26 without running the actual detection state machine, so dual storm signaling should not occur at all. The only cause I can think of is that the control-plane CPU is not fast enough to schedule the orchagent drop actions within a polling interval of 200 ms: https://github.com/Azure/sonic-swss/blob/f22fb80bdfa10ea38e718996235c99233e08c31a/orchagent/pfc_detect_barefoot.lua#L26
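A toy Python model of the race described above, under loud assumptions: the detector signals on every poll where the storm condition holds (the detection-time countdown and the alert action are ignored), and the handler flips the status a fixed delay after the first signal. The function and parameter names are hypothetical; only the 200 ms interval and the operational/stormed states come from the comment.

```python
def count_storm_signals(storm_polls, handler_delay_ms, poll_ms=200):
    """Count 'storm' signals published while a storm persists.

    Toy model: the detector exits early once the status is no longer
    'operational'; the handler flips the status handler_delay_ms after
    the first signal (assumed fixed delay).
    """
    status = "operational"
    signals = 0
    flip_at_ms = None  # time at which the handler will flip the status
    for poll in range(storm_polls):
        now_ms = poll * poll_ms
        if flip_at_ms is not None and now_ms >= flip_at_ms:
            status = "stormed"  # handler ran before this poll
        if status != "operational":
            continue  # early exit: detection logic is skipped
        signals += 1  # storm condition holds, so a signal is published
        if flip_at_ms is None:
            flip_at_ms = now_ms + handler_delay_ms
    return signals
```

In this model a handler that finishes within one polling interval yields a single signal, while one that takes longer than 200 ms yields a second signal, which is the scenario the comment attributes to a slow control-plane CPU.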

Let's keep it as is. If the race condition really happens, it will be very hard to catch and fix.

I suggest we start from a clean implementation rather than patching it here and there. The double storm signaling can be captured by the PFC watchdog test. This is how mlnx found the problem on their platforms. Function-wise, double signaling does not affect proper detection or proper restoration.

Last time, you said bfn still uses manual testing for PFC watchdog validation. If you have proof that the issue also exists on bfn, we can add the patch back later.

@wendani the ct test passed for bfn with these changes. As for whether the CPU is fast enough, I would say it is hard to catch the race condition on all possible CPUs, since the bfn sdk can run on platforms from different vendors and this value could differ. Could we leave this check in to avoid further patching, and see how this feature evolves?

wendani · 2019-04-02T19:56:34Z

orchagent/pfc_detect_barefoot.lua

+            -- Save values for next run
+                redis.call('HSET', counters_table_name .. ':' .. KEYS[i], 'SAI_QUEUE_STAT_PACKETS_last', packets)
+                redis.call('HSET', counters_table_name .. ':' .. KEYS[i], 'PFC_WD_DETECTION_TIME_LEFT', time_left)
+                if is_deadlock == false then


Drop the 'if is_deadlock == false' condition. #697


vsenchyshyn · 2019-05-29T15:01:31Z

@lguohan I've updated the PR to match the baseline changes, but the test log looks quite strange, as does the "No test results found" result. I'm not sure how the PFC WD changes could cause all these failures in mirror and other areas: https://sonic-jenkins.westus2.cloudapp.azure.com/job/vs/job/sonic-swss-build-pr/40/consoleFull Is this something on your side?

vsenchyshyn · 2019-05-31T08:58:28Z

retest this please

vsenchyshyn · 2019-06-06T08:48:02Z

retest this please

wendani · 2019-06-06T20:32:49Z

retest this please

wendani · 2019-06-07T00:53:19Z

retest this please

wendani · 2019-06-07T17:32:41Z

retest this please


[orchagent]: Added support of PFC WD for BFN platform (e5a6217)

Signed-off-by: Vitaliy Senchyshyn <vsenchyshyn@barefootnetworks.com>

marian-pritsak reviewed Mar 29, 2019 (orchagent/orchdaemon.cpp, outdated, resolved)

Fixed review comments (f22fb80)

Signed-off-by: Vitaliy Senchyshyn <vsenchyshyn@barefootnetworks.com>

marian-pritsak reviewed Mar 29, 2019

marian-pritsak approved these changes Apr 1, 2019

wendani self-requested a review April 2, 2019 01:03

wendani suggested changes Apr 2, 2019

Use PFC WD ACL handler for BFN platform (e257b37)

wendani reviewed Apr 27, 2019 (orchagent/orchdaemon.cpp, resolved)

Merge branch 'master' into bfn-pfc-wd-support (0f6afdf)

vsenchyshyn mentioned this pull request May 29, 2019: [orchagent] PFC WD support for BFN platform #916 (Merged)

wendani self-requested a review June 5, 2019 16:28

wendani approved these changes Jun 5, 2019

Merge branch 'master' into bfn-pfc-wd-support (e52721c)

vsenchyshyn force-pushed the bfn-pfc-wd-support branch from 3d4145e to e52721c June 5, 2019 18:01

wendani merged commit cde242b into sonic-net:master Jun 7, 2019
