Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[orchagent]: Added support of PFC WD for BFN platform #823
[orchagent]: Added support of PFC WD for BFN platform #823
Changes from all commits
e5a6217
f22fb80
e257b37
0f6afdf
e52721c
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this file identical to Mellanox too? If yes, will the symlink work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it was there before and it's related to a different platform It's better to leave it as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You do not need to learn hdel _RX_PKTS_last and _RX_PAUSE_DURATION_last on lines 74 & 75 from the Mellanox script. This is to solve the double storm signaling that was observed on Mellanox platforms. I assume barefoot platforms do not have this issue. #697
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andriymoroz-mlnx for comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not Mellanox issue. It is common for this approach. If Barefoot uses the same solution (approach, counters, etc) they can use the same script
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue should not be related to the counters set Mellanox chooses. Excluding the counter factors, this should also happen on brcm platforms. However, we do not observe the issue on brcm platforms so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Broadcom uses different way to detect storm and thus do not have this issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Difference is only in the counter set. The idea behind is the same---no packets going out of the queue and the queue is paused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This dual storm signaling should only happen if the drop action is not installed by the orchagent fast enough. pfcactionhandler initCounters() will flip the PFC_WD_STATUS from operational to stormed as part of its drop actions. If PFC_WD_STATUS is not operational, the detect script logic will exit early on line 26 without running the actual detect state machine. So there should be no occurrence of dual storm signaling at all. The possible cause I can think of is that the control-plane cpu is not fast enough to schedule running the orchagent drop actions within a polling interval of 200 ms https://github.com/Azure/sonic-swss/blob/f22fb80bdfa10ea38e718996235c99233e08c31a/orchagent/pfc_detect_barefoot.lua#L26
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's better keep it as is. In case race condition really happens it will be very hard to catch and fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we start from a clean implementation rather than patch it here and there. The double storming signaling can be captured by PFC watchdog test. This is how mlnx found the problem on their platforms. Function-wise, double signaling does not affect the proper detection and the proper restoration.
Last time, you said bfn still uses manual test for PFC watchdog validation. If you have proof that it does exist also on bfn, we can later add the patch back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wendani the ct test passed for bfn with these changes. If we are talking about fast or not fast cpu I would say it hard to catch the race condition on all possible CPUs, as bfn sdk could run on different platform vendors and this value could differ. Could we leave this check to avoid further patching and would see in context this feature will evolve?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop the 'if is_deadlock == false' condition. #697