Update fabric link monitoring plan. #1013
@arlakshm for awareness
doc/voq/fabric.md
Outdated
@@ -171,6 +172,85 @@ PORT RxCells TxCells Crc Fec Corrected

In a later phase, a `show fabric status` command will be added to show the remote switch ID and link ID for each fabric link of an ASIC. This will be obtained from the SAI_PORT_ATTR_FABRIC_REACHABILITY port attribute of the fabric port. Note that for fabric links that do not have a link partner because of the configuration of the chassis, this will show the status as `down`. The status will also be `down` for fabric links that are down due to some other physical error. To identify links that are down due to error vs links that are not expected to be up because of the chassis connectivity, we need to build up a list of expected fabric connectivity for each ASIC. This can be computed ahead of time based on the vendor configuration and populated in the minigraph. This will be implemented in a later phase.
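As an illustration of the expected-connectivity idea in the paragraph above, here is a minimal Python sketch. It is not taken from the HLD: the function name, the status strings, and the shape of the inputs (a per-link status map and a precomputed set of links expected to be up) are all assumptions made for the example.

```python
from typing import Dict, Set


def classify_fabric_links(oper_status: Dict[int, str],
                          expected_up: Set[int]) -> Dict[int, str]:
    """Label each fabric link from its observed status and the expected connectivity.

    oper_status: hypothetical map of link ID -> "up" or "down".
    expected_up: hypothetical set of link IDs that the chassis wiring says
                 should have a link partner (e.g. precomputed per ASIC).
    """
    labels = {}
    for link_id, status in oper_status.items():
        if status == "up":
            labels[link_id] = "up"
        elif link_id in expected_up:
            # A link that should be connected but is down points at an error.
            labels[link_id] = "down (error)"
        else:
            # No link partner by design, so "down" is the expected state.
            labels[link_id] = "down (not connected by design)"
    return labels
```

For example, a link that is `down` but absent from `expected_up` would be reported as not connected by design rather than as a fault.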
Please add if this is applicable to all line cards and sup, or not.
The commands are for both linecards and fabric cards on sup. I updated the document with this information as well.
doc/voq/fabric.md
Outdated
The following proposed CLI is used to show the traffic among fabric links on both fabric ASICs and forwarding ASICs.

```
> show fabric counters rate
```
Please add units. And add more rows to show the output is per link.
I updated the document with this. Thank you.
doc/voq/fabric.md
Outdated
#### 2.8.1.1 Cli commands

```
> config fabric port set [port_id] isolate
```
Can you change the command to `config fabric port isolate <port_id>`?
I updated the document. Thank you.
doc/voq/fabric.md
Outdated
```
> config fabric port remove [port_id] isolate
```
Can you change the command to `config fabric port unisolate <port_id>`?
I updated the document. Thank you.
```
> config fabric port remove [port_id] isolate
```
Please elaborate on what the isolate operation does on the fabric port.
I updated the document with what isolation means. Thank you.
doc/voq/fabric.md
Outdated
#### 2.8.1.2 Monitoring algorithm

Instead of reacting to the counter changes, Orchagent adds a new poller and periodically polls status of all fabric links. By default, the total number of received cells, cells with crc errors, cells with uncorrectable errors are fetched from all serdes links periodically and the error rates are calculated using these numbers. If any one of the error rates is above the threshold for a number of consecutive polls, the link is identified as an unhealthy link. Then the link is automatically isolated to not distribute traffic.
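The polling logic described in this paragraph can be sketched roughly as follows. This is a hedged Python illustration, not the Orchagent implementation: the class and function names, the default thresholds, the number of consecutive polls, and the handling of an interval with no received cells are all assumptions. Counters are treated as cumulative, so each poll works on the delta since the previous poll.

```python
from dataclasses import dataclass


@dataclass
class LinkMonitorState:
    """Per-link state kept between polls (hypothetical structure)."""
    prev_rx: int = 0
    prev_crc: int = 0
    prev_uncorrectable: int = 0
    bad_polls: int = 0      # consecutive polls above a threshold
    isolated: bool = False


def poll_link(state: LinkMonitorState, rx: int, crc: int, uncorrectable: int,
              crc_threshold: float = 1e-6,
              uncorrectable_threshold: float = 1e-6,
              max_bad_polls: int = 3) -> bool:
    """Process one poll of cumulative counters; return True if the link is isolated."""
    d_rx = rx - state.prev_rx
    d_crc = crc - state.prev_crc
    d_unc = uncorrectable - state.prev_uncorrectable
    state.prev_rx, state.prev_crc, state.prev_uncorrectable = rx, crc, uncorrectable

    if d_rx <= 0:
        # Assumption: with no cells received the rate is undefined,
        # so the consecutive-poll count is left unchanged.
        return state.isolated

    crc_rate = d_crc / d_rx
    unc_rate = d_unc / d_rx
    if crc_rate > crc_threshold or unc_rate > uncorrectable_threshold:
        state.bad_polls += 1
    else:
        state.bad_polls = 0  # one healthy poll resets the streak

    if state.bad_polls >= max_bad_polls:
        # The real system would call the SDK/SAI here; this sketch only sets a flag.
        state.isolated = True
    return state.isolated
```

The sketch never clears `isolated` on its own, which is a simplifying assumption rather than a statement about how the feature recovers links.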
The section above defines a CLI command to isolate a fabric link, while this section says the link is automatically isolated to not distribute traffic.
If links are isolated automatically, can you elaborate on the use case for the isolate CLIs?
Great question. The fabric links are isolated automatically by the algorithm if they are unhealthy.
The two commands are an additional debug tool. They can be used to manually isolate and unisolate a fabric link, which helps us debug the system and lets us force-isolate a fabric link.
@kenneth-arista, @jfeng-arista
According to this document, currently if fabric links are bad, the links are isolated automatically by the SDK. Can you confirm this?
If the monitoring algorithm detects a link and considers it unhealthy, the algorithm calls the SDK API to isolate the link; users do not need to issue any CLI command.
#### 2.8.2.1 Cli command

```
> config fabric monitor capacity threshold <50-100>
```
Please add if the command is applicable to linecard or supervisor or both
I updated the document. Thank you.
doc/voq/fabric.md
Outdated
#### 2.8.2.2 Monitoring algorithm

Orchagent will track the total number of fabric links that are isolated. Once the number of total operational fabric links is below a configured threshold, alert users with a system log.
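A minimal Python sketch of this capacity check follows. It assumes the configured threshold is a percentage of the total fabric links (suggested by the `<50-100>` range in the CLI, but not stated in the text), and the function name and message wording are hypothetical. It returns the alert text instead of writing to syslog so that it stays self-contained.

```python
from typing import Optional, Set


def capacity_alert(total_links: int, isolated_links: Set[int],
                   threshold_percent: int) -> Optional[str]:
    """Return an alert message if operational capacity is below the threshold.

    Assumption: threshold_percent is the minimum percentage of fabric links
    that must remain operational, in the CLI's 50-100 range.
    """
    if not 50 <= threshold_percent <= 100:
        raise ValueError("threshold must be between 50 and 100")
    if total_links <= 0:
        raise ValueError("total_links must be positive")

    operational = total_links - len(isolated_links)
    percent = 100.0 * operational / total_links
    if percent < threshold_percent:
        # The real system would emit a system log here.
        return (f"fabric capacity {percent:.0f}% is below the "
                f"configured threshold {threshold_percent}%")
    return None
```

With 10 links, 2 of them isolated, and a threshold of 90, the function returns a message; with only 1 isolated it returns `None`, because exactly 90% is not below the threshold.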
Will a log message be generated when the number of operational links on fabric asic goes below threshold? This message will be useful to know that the fabric asic is not working properly.
We are focusing on the linecard side for now, generating a syslog when the fabric capacity is not enough to sustain line rate for the forwarding ASIC.
We can extend the checks to fabric cards in the future.
#### 2.8.1.1 Fabric link monitoring criteria

The fabric link monitoring algorithm checks two types of errors on a link: crc errors and uncorrectable errors.
Are these already defined in SAI port stats? Can we list the exact sai_port_stat_t type being used for this?
Yes.
SAI_PORT_STAT_IF_IN_ERRORS is for crc errors.
SAI_PORT_STAT_IF_IN_FEC_NOT_CORRECTABLE_FRAMES is for uncorrectable errors.
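The mapping above could be captured in code along these lines. This is only a sketch: the two stat names come from the reply above, while the dictionary name, the helper function, and the assumption that per-port stats arrive as a string-valued map (as database rows often do) are hypothetical.

```python
from typing import Dict

# Error type -> SAI port stat, as given in the reply above.
ERROR_STAT_IDS = {
    "crc": "SAI_PORT_STAT_IF_IN_ERRORS",
    "uncorrectable": "SAI_PORT_STAT_IF_IN_FEC_NOT_CORRECTABLE_FRAMES",
}


def read_error_counters(port_stats: Dict[str, str]) -> Dict[str, int]:
    """Extract the two monitored error counters from one port's stats.

    port_stats is a hypothetical map of stat name -> value string;
    a missing stat is treated as 0 (an assumption of this sketch).
    """
    return {kind: int(port_stats.get(stat_id, 0))
            for kind, stat_id in ERROR_STAT_IDS.items()}
```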
Thanks. Please add the above mapping to the doc.
Also, from the SONiC point of view, it should monitor SAI counters. Can we confirm (and add to the doc) whether SAI_PORT_STAT_IF_IN_ERRORS is the counter being monitored from the counter DB?
What's the exact relationship between "#crcCells" and "SAI_PORT_STAT_IF_IN_ERRORS"? Is #crcCells a value that can be read from ASIC registers/SDK counters? E.g., would we expect SAI_PORT_STAT_IF_IN_ERRORS to increment by one if #crcCells increments by one?
Besides the fabric link monitoring algorithm, the above two commands are added. The commands can be used to manually isolate and unisolate a fabric link (i.e. take the link out of service and put the link back into service). The two commands can help us debug on the system as well as force isolate a fabric link.
Can this command be run on both the supervisor and line cards? Which SAI attribute will be used for this?
In SAI PR opencomputeproject/SAI#1764, it is applicable to the fabric switch only.
However, if we also need to isolate a link from a line card, can this SAI attribute be used then?
The commands can be used on both the supervisor and linecards to isolate a fabric port. SAI_PORT_ATTR_FABRIC_ISOLATE will be used for this.
opencomputeproject/SAI#1764 is different and is NOT for link-level isolation.
Which SAI attribute is needed to do this fabric port isolation then? Thanks.
SAI_PORT_ATTR_FABRIC_ISOLATE
70894b5 to 57b1dd0
@rlhui can we get an approving review since all comments have been addressed?
@jfeng-arista There are many PRs to be picked up for this feature. Since each PR approval runs on its own timeline, can you share whether there are any dependencies that we need to be aware of? Examples of dependency concerns include a certain PR needing to go in first, or else it may cause:
If there are dependency issues, please specify them clearly in this PR as well as in the corresponding PR for the dependency...
#### 2.8.1.2 Monitoring algorithm

Instead of reacting to the counter changes, Orchagent adds a new poller and periodically polls status of all fabric links. By default, the total number of received cells, cells with crc errors, cells with uncorrectable errors are fetched from all serdes links periodically and the error rates are calculated using these numbers. If any one of the error rates is above the threshold for a number of consecutive polls, the link is identified as an unhealthy link. Then the link is automatically isolated to not distribute traffic.
Please confirm (and add to the doc) whether the automatic isolation of unhealthy links is in the 202205 release.
Software based fabric link/capacity monitoring is not committed for 202205. The support will merge to master. The SAI support for the isolate operation is not present in 202205.
Update "Fabric port support on Sonic" HLD
PRs related to updated sections are: